Chasing Shadows: Pitfalls in LLM Security Research

Jonathan Evertz (CISPA Helmholtz Center for Information Security), Niklas Risse (Max Planck Institute for Security and Privacy), Nicolai Neuer (Karlsruhe Institute of Technology), Andreas Müller (Ruhr University Bochum), Philipp Normann (TU Wien), Gaetano Sapia (Max Planck Institute for Security and Privacy), Srishti Gupta (Sapienza University of Rome), David Pape (CISPA Helmholtz Center for Information Security), Soumya Shaw (CISPA Helmholtz Center for Information Security), Devansh Srivastav (CISPA Helmholtz Center for Information Security), Christian Wressnegger (Karlsruhe Institute of Technology), Erwin Quiring (_fbeta), Thorsten Eisenhofer (CISPA Helmholtz Center for Information Security), Daniel Arp (TU Wien), Lea Schönherr (CISPA Helmholtz Center for Information Security)

Large language models (LLMs) are increasingly prevalent in security research. Their unique characteristics, however, introduce challenges that undermine established paradigms of reproducibility, rigor, and evaluation. Prior work has identified common pitfalls in traditional machine learning research, but these studies predate the advent of LLMs. In this paper, we identify nine common pitfalls that have become (more) relevant with the emergence of LLMs and that can compromise the validity of research involving them. These pitfalls span the entire computation process, from data collection, pre-training, and fine-tuning to prompting and evaluation.

We assess the prevalence of these pitfalls across all 72 peer-reviewed papers published at leading Security and Software Engineering venues between 2023 and 2024. We find that every paper contains at least one pitfall, and each pitfall appears in multiple papers. Yet only 15.7% of the present pitfalls were explicitly discussed, suggesting that the majority remain unrecognized. To understand their practical impact, we conduct four empirical case studies showing how individual pitfalls can mislead evaluation, inflate performance, or impair reproducibility. Based on our findings, we offer actionable guidelines to support the community in future work.

Paper

Slides

Video

View More Papers

CHAMELEOSCAN: Demystifying and Detecting iOS Chameleon Apps via LLM-Powered...

Hongyu Lin (Zhejiang University), Yicheng Hu (Zhejiang University), Haitao Xu (Zhejiang University), Yanchen Lu (Zhejiang University), Mengxia Ren (Zhejiang University), Shuai Hao (Old Dominion University), Chuan Yue (Colorado School of Mines), Zhao Li (Hangzhou Yugu Technology), Fan Zhang (Zhejiang University), Yixin Jiang (Electric Power Research Institute, CSG)

Risk Assessment for ML-Based Applications in Satellite Systems

Simon Shigol (Ben Gurion University of the Negev), Roy Peled (Ben Gurion University of the Negev), Avishag Shapira (Ben Gurion University of the Negev), Yuval Elovici (Ben Gurion University of the Negev), Asaf Shabtai (Ben Gurion University of the Negev)

OCCUPY+PROBE: Cross-Privilege Branch Target Buffer Side-Channel Attacks at Instruction...

Kaiyuan Rong (Tsinghua University, Zhongguancun Laboratory), Junqi Fang (Tsinghua University, Zhongguancun Laboratory), Haixia Wang (Tsinghua University), Dapeng Ju (Tsinghua University, Zhongguancun Laboratory), Dongsheng Wang (Tsinghua University, Zhongguancun Laboratory)