Rui Zeng (Zhejiang University), Xi Chen (Zhejiang University), Yuwen Pu (Zhejiang University), Xuhong Zhang (Zhejiang University), Tianyu Du (Zhejiang University), Shouling Ji (Zhejiang University)

Backdoors can be injected into NLP models to induce misbehavior when the input text contains a specific feature, known as a trigger, which the attacker secretly selects. Unlike fixed tokens, words, phrases, or sentences used in the textit{static} text trigger, textit{dynamic} backdoor attacks on NLP models design triggers associated with abstract and latent text features (e.g., style), making them considerably stealthier than traditional static backdoor attacks. However, existing research on NLP backdoor detection primarily focuses on defending against static backdoor attacks, while research on detecting dynamic backdoors in NLP models remains largely unexplored.

This paper presents CLIBE, the first framework to detect dynamic backdoors in Transformer-based NLP models. At a high level, CLIBE injects a textit{"few-shot perturbation"} into the suspect Transformer model by crafting an optimized weight perturbation in the attention layers to make the perturbed model classify a limited number of reference samples as a target label. Subsequently, CLIBE leverages the textit{generalization} capability of this "few-shot perturbation" to determine whether the original suspect model contains a dynamic backdoor. Extensive evaluation on three advanced NLP dynamic backdoor attacks, two widely-used Transformer frameworks, and four real-world classification tasks strongly validates the effectiveness and generality of CLIBE. We also demonstrate the robustness of CLIBE against various adaptive attacks. Furthermore, we employ CLIBE to scrutinize 49 popular Transformer models on Hugging Face and discover one model exhibiting a high probability of containing a dynamic backdoor. We have contacted Hugging Face and provided detailed evidence of the backdoor behavior of this model. Moreover, we show that CLIBE can be easily extended to detect backdoor text generation models (e.g., GPT-Neo-1.3B) that are modified to exhibit toxic behavior. To the best of our knowledge, CLIBE is the first framework capable of detecting backdoors in text generation models without requiring access to trigger input test samples. The code is available at https://github.com/Raytsang123/CLIBE.

View More Papers

Statically Discover Cross-Entry Use-After-Free Vulnerabilities in the Linux Kernel

Hang Zhang (Indiana University Bloomington), Jangha Kim (The Affiliated Institute of ETRI, ROK), Chuhong Yuan (Georgia Institute of Technology), Zhiyun Qian (University of California, Riverside), Taesoo Kim (Georgia Institute of Technology)

Read More

Towards LLM-Assisted Vulnerability Detection and Repair for Open-Source 5G...

Rupam Patir (University at Buffalo), Qiqing Huang (University at Buffalo), Keyan Guo (University at Buffalo), Wanda Guo (Texas A&M University), Guofei Gu (Texas A&M University), Haipeng Cai (University at Buffalo), Hongxin Hu (University at Buffalo)

Read More

Heimdall: Towards Risk-Aware Network Management Outsourcing

Yuejie Wang (Peking University), Qiutong Men (New York University), Yongting Chen (New York University Shanghai), Jiajin Liu (New York University Shanghai), Gengyu Chen (Carnegie Mellon University), Ying Zhang (Meta), Guyue Liu (Peking University), Vyas Sekar (Carnegie Mellon University)

Read More

GAP-Diff: Protecting JPEG-Compressed Images from Diffusion-based Facial Customization

Haotian Zhu (Nanjing University of Science and Technology), Shuchao Pang (Nanjing University of Science and Technology), Zhigang Lu (Western Sydney University), Yongbin Zhou (Nanjing University of Science and Technology), Minhui Xue (CSIRO's Data61)

Read More