Yichen Gong (Tsinghua University), Delong Ran (Tsinghua University), Xinlei He (Hong Kong University of Science and Technology (Guangzhou)), Tianshuo Cong (Tsinghua University), Anyu Wang (Tsinghua University), Xiaoyun Wang (Tsinghua University)

The safety alignment of Large Language Models (LLMs) is crucial to prevent unsafe content that violates human values.
To ensure this, it is essential to evaluate the robustness of their alignment against diverse malicious attacks.
However, the lack of a large-scale, unified measurement framework hinders a comprehensive understanding of potential vulnerabilities.
To fill this gap, this paper presents the first comprehensive evaluation of existing and newly proposed safety misalignment methods for LLMs. Specifically, we investigate four research questions: (1) evaluating the robustness of LLMs with different alignment strategies, (2) identifying the most effective misalignment method, (3) determining key factors that influence misalignment effectiveness, and (4) exploring various defenses.
The safety misalignment attacks in our paper include system-prompt modification, model fine-tuning, and model editing.
Our findings show that Supervised Fine-Tuning is the most potent attack but requires harmful model responses.
In contrast, our novel Self-Supervised Representation Attack (SSRA) achieves significant misalignment without harmful responses.
We also examine defensive mechanisms such as safety data filter, model detoxification, and our proposed Self-Supervised Representation Defense (SSRD), demonstrating that SSRD can effectively re-align the model.
In conclusion, our unified safety alignment evaluation framework empirically highlights the fragility of the safety alignment of LLMs.

View More Papers

RadSee: See Your Handwriting Through Walls Using FMCW Radar

Shichen Zhang (Michigan State University), Qijun Wang (Michigan State University), Maolin Gan (Michigan State University), Zhichao Cao (Michigan State University), Huacheng Zeng (Michigan State University)

Read More

Moneta: Ex-Vivo GPU Driver Fuzzing by Recalling In-Vivo Execution...

Joonkyo Jung (Department of Computer Science, Yonsei University), Jisoo Jang (Department of Computer Science, Yonsei University), Yongwan Jo (Department of Computer Science, Yonsei University), Jonas Vinck (DistriNet, KU Leuven), Alexios Voulimeneas (CYS, TU Delft), Stijn Volckaert (DistriNet, KU Leuven), Dokyung Song (Department of Computer Science, Yonsei University)

Read More

Secret Spilling Drive: Leaking User Behavior through SSD Contention

Jonas Juffinger (Graz University of Technology), Fabian Rauscher (Graz University of Technology), Giuseppe La Manna (Amazon), Daniel Gruss (Graz University of Technology)

Read More

Attributing Open-Source Contributions is Critical but Difficult: A Systematic...

Jan-Ulrich Holtgrave (CISPA Helmholtz Center for Information Security), Kay Friedrich (CISPA Helmholtz Center for Information Security), Fabian Fischer (CISPA Helmholtz Center for Information Security), Nicolas Huaman (Leibniz University Hannover), Niklas Busch (CISPA Helmholtz Center for Information Security), Jan H. Klemmer (CISPA Helmholtz Center for Information Security), Marcel Fourné (Paderborn University), Oliver Wiese (CISPA Helmholtz Center…

Read More