Yingjie Zhang (Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences), Tong Liu (Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences), Zhe Zhao (Ant Group), Guozhu Meng (Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences), Kai Chen (Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences)

Large Language Models (LLMs) remain vulnerable to jailbreak attacks that exploit adversarial prompts to circumvent safety measures. Current safety fine-tuning approaches face two critical limitations. First, they often fail to strike a balance between security and utility, where stronger safety measures tend to over-reject harmless user requests. Second, they frequently miss malicious intent concealed within seemingly benign tasks, leaving models exposed to exploitation. Our work identifies a fundamental cause of these issues: during response generation, an LLM's capacity to differentiate harmful from safe outputs deteriorates. Experimental evidence confirms this, revealing that the separability between hidden states for safe and harmful responses diminishes as generation progresses. This weakening discrimination forces models to make compliance judgments earlier in the generation process, restricting their ability to recognize developing harmful intent and contributing to both aforementioned failures. To mitigate this vulnerability, we introduce DEEPALIGN - an inherent defense framework that enhances the safety of LLMs. By applying contrastive hidden-state steering at the midpoint of response generation, DEEPALIGN amplifies the separation between harmful and benign hidden states, enabling continuous intrinsic toxicity detection and intervention throughout the generation process. Moreover, it facilitates contextually appropriate safe responses to harmful queries, thereby expanding the feasible space of safe responses. Evaluations demonstrate DEEPALIGN's efficacy. Across diverse LLMs spanning varying architectures and scales, it reduced attack success rates of nine distinct jailbreak attacks to near-zero or minimal. Crucially, it preserved model capability while reducing over-refusal. Models equipped with DEEPALIGN exhibited up to 3.5% lower error rates in rejecting challenging benign queries and maintained standard task performance with less than 1% decline. This marks a substantial advance in the safety-utility Pareto frontier.

View More Papers

Analysis of the Security Design, Engineering, and Implementation of...

Alan T. Sherman (University of Maryland, Baltimore County (UMBC)), Jeremy J. Romanik Romano (University of Maryland, Baltimore County (UMBC)), Edward Zieglar (University of Maryland, Baltimore County (UMBC)), Enis Golaszewski (University of Maryland, Baltimore County (UMBC)), Jonathan D. Fuchs (University of Maryland, Baltimore County (UMBC)), William E. Byrd (University of Alabama at Birmingham)

Read More

Before the Vicious Cycle Starts: Preventing Burnout Across SOC...

Kashyap Thimmaraju (Technische Universitat Berlin), Duc Anh Hoang (Technische Universitat Berlin), Souradip Nath (Arizona State University), Jaron Mink (Arizona State University), Gail-Joon Ahn (Arizona State University)

Read More

Demystifying the Access Control Mechanism of ESXi VMKernel

Yue Liu (Southeast University), Zexiang Zhang (National University of Defense Technology), Jiaxun Zhu (Zhejiang University), Hao Zheng (Independent Researcher), Jiaqing Huang (Independent Researcher), Wenbo Shen (Zhejiang University), Gaoning Pan (Hangzhou Dianzi University), Yuliang Lu (National University of Defense Technology), Min Zhang (National University of Defense Technology), Zulie Pan (National University of Defense Technology), Guang Cheng…

Read More