Rui Zeng (Zhejiang University), Xi Chen (Zhejiang University), Yuwen Pu (Zhejiang University), Xuhong Zhang (Zhejiang University), Tianyu Du (Zhejiang University), Shouling Ji (Zhejiang University)

Backdoors can be injected into NLP models to induce misbehavior when the input text contains a specific feature, known as a trigger, that the attacker secretly selects. Unlike the fixed tokens, words, phrases, or sentences used as static text triggers, dynamic backdoor attacks on NLP models craft triggers associated with abstract, latent text features (e.g., style), making them considerably stealthier than traditional static backdoor attacks. However, existing research on NLP backdoor detection primarily focuses on defending against static backdoor attacks; the detection of dynamic backdoors in NLP models remains largely unexplored.

This paper presents CLIBE, the first framework to detect dynamic backdoors in Transformer-based NLP models. At a high level, CLIBE injects a "few-shot perturbation" into the suspect Transformer model by crafting an optimized weight perturbation in the attention layers that makes the perturbed model classify a limited number of reference samples as a target label. CLIBE then leverages the generalization capability of this few-shot perturbation to determine whether the original suspect model contains a dynamic backdoor. Extensive evaluation on three advanced NLP dynamic backdoor attacks, two widely used Transformer frameworks, and four real-world classification tasks strongly validates the effectiveness and generality of CLIBE. We also demonstrate the robustness of CLIBE against various adaptive attacks. Furthermore, we employ CLIBE to scrutinize 49 popular Transformer models on Hugging Face and discover one model exhibiting a high probability of containing a dynamic backdoor; we have contacted Hugging Face and provided detailed evidence of this model's backdoor behavior. Moreover, we show that CLIBE can be easily extended to detect backdoored text generation models (e.g., GPT-Neo-1.3B) that have been modified to exhibit toxic behavior. To the best of our knowledge, CLIBE is the first framework capable of detecting backdoors in text generation models without requiring access to trigger input test samples. The code is available at https://github.com/Raytsang123/CLIBE.
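To make the core idea concrete, below is a minimal, illustrative sketch of the few-shot perturbation step, assuming a PyTorch/Hugging Face sequence classifier. The model name, reference texts, hyperparameters, and function names here are our own placeholder assumptions, not the authors' implementation (see the linked repository for that).

```python
# Illustrative sketch of the "few-shot perturbation" idea (not the authors'
# code). Assumes a Hugging Face sequence classifier; the model name,
# reference texts, and hyperparameters below are placeholders.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer


def few_shot_perturb(model_name, reference_texts, target_label,
                     steps=100, lr=1e-4, reg=1.0):
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    # Perturb only attention-layer weights; freeze everything else.
    originals = {}
    for name, p in model.named_parameters():
        if "attention" in name and name.endswith("weight"):
            originals[name] = p.detach().clone()
        else:
            p.requires_grad_(False)
    opt = torch.optim.Adam(
        [p for n, p in model.named_parameters() if n in originals], lr=lr)
    batch = tok(reference_texts, padding=True, truncation=True,
                return_tensors="pt")
    labels = torch.full((len(reference_texts),), target_label)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.nn.functional.cross_entropy(
            model(**batch).logits, labels)
        # Regularizer keeps the weight perturbation small, so only a few
        # reference samples are needed to steer the model.
        drift = sum(((p - originals[n]) ** 2).sum()
                    for n, p in model.named_parameters() if n in originals)
        (loss + reg * drift).backward()
        opt.step()
    return tok, model


def generalization_rate(tok, model, held_out_texts, target_label):
    # Fraction of unseen clean samples that the perturbed model now assigns
    # to the target label; a high rate suggests a dynamic backdoor.
    batch = tok(held_out_texts, padding=True, truncation=True,
                return_tensors="pt")
    with torch.no_grad():
        preds = model(**batch).logits.argmax(dim=-1)
    return (preds == target_label).float().mean().item()
```

In this sketch, a defender would compare the measured generalization rate against a threshold calibrated on known-clean models; the paper's actual optimization objective and decision rule are more involved.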
