Rui Zeng (Zhejiang University), Xi Chen (Zhejiang University), Yuwen Pu (Zhejiang University), Xuhong Zhang (Zhejiang University), Tianyu Du (Zhejiang University), Shouling Ji (Zhejiang University)

Backdoors can be injected into NLP models to induce misbehavior when the input text contains a specific feature, known as a trigger, which the attacker secretly selects. Unlike fixed tokens, words, phrases, or sentences used in the textit{static} text trigger, textit{dynamic} backdoor attacks on NLP models design triggers associated with abstract and latent text features (e.g., style), making them considerably stealthier than traditional static backdoor attacks. However, existing research on NLP backdoor detection primarily focuses on defending against static backdoor attacks, while research on detecting dynamic backdoors in NLP models remains largely unexplored.

This paper presents CLIBE, the first framework to detect dynamic backdoors in Transformer-based NLP models. At a high level, CLIBE injects a textit{"few-shot perturbation"} into the suspect Transformer model by crafting an optimized weight perturbation in the attention layers to make the perturbed model classify a limited number of reference samples as a target label. Subsequently, CLIBE leverages the textit{generalization} capability of this "few-shot perturbation" to determine whether the original suspect model contains a dynamic backdoor. Extensive evaluation on three advanced NLP dynamic backdoor attacks, two widely-used Transformer frameworks, and four real-world classification tasks strongly validates the effectiveness and generality of CLIBE. We also demonstrate the robustness of CLIBE against various adaptive attacks. Furthermore, we employ CLIBE to scrutinize 49 popular Transformer models on Hugging Face and discover one model exhibiting a high probability of containing a dynamic backdoor. We have contacted Hugging Face and provided detailed evidence of the backdoor behavior of this model. Moreover, we show that CLIBE can be easily extended to detect backdoor text generation models (e.g., GPT-Neo-1.3B) that are modified to exhibit toxic behavior. To the best of our knowledge, CLIBE is the first framework capable of detecting backdoors in text generation models without requiring access to trigger input test samples. The code is available at https://github.com/Raytsang123/CLIBE.

View More Papers

From Large to Mammoth: A Comparative Evaluation of Large...

Jie Lin (University of Central Florida), David Mohaisen (University of Central Florida)

Read More

TME-Box: Scalable In-Process Isolation through Intel TME-MK Memory Encryption

Martin Unterguggenberger (Graz University of Technology), Lukas Lamster (Graz University of Technology), David Schrammel (Graz University of Technology), Martin Schwarzl (Cloudflare, Inc.), Stefan Mangard (Graz University of Technology)

Read More

Magmaw: Modality-Agnostic Adversarial Attacks on Machine Learning-Based Wireless Communication...

Jung-Woo Chang (University of California, San Diego), Ke Sun (University of California, San Diego), Nasimeh Heydaribeni (University of California, San Diego), Seira Hidano (KDDI Research, Inc.), Xinyu Zhang (University of California, San Diego), Farinaz Koushanfar (University of California, San Diego)

Read More

Vision: Towards True User-Centric Design for Digital Identity Wallets

Yorick Last (Paderborn University), Patricia Arias Cabarcos (Paderborn University)

Read More