Zhaoxi Zhang (University of Technology Sydney), Xiaomei Zhang (Griffith University), Yanjun Zhang (University of Technology Sydney), He Zhang (RMIT University), Shirui Pan (Griffith University), Bo Liu (University of Technology Sydney), Asif Qumer Gill (University of Technology Sydney), Leo Zhang (Griffith University)

Large Language Model (LLM) watermarking has emerged as a promising technique for copyright protection, misuse prevention, and machine-generated content detection. It injects detectable signals into the LLM generation process, allowing a corresponding detector to identify them later. To assess the robustness of watermarking schemes, existing studies typically adopt watermark removal attacks, which aim to erase the embedded signals by modifying the watermarked text. However, we reveal that existing watermark removal attacks are suboptimal, which has led to the misconception that effective watermark removal requires either a large perturbation budget or strong adversary capabilities, such as unlimited queries to the victim LLM or its watermark detector. A systematic scrutiny of removal attack capabilities, as well as the development of more sophisticated techniques, remains largely unexplored. As a result, the robustness of existing watermarking schemes may be overestimated.
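For context on what the detection side of such a scheme can look like, below is a minimal, hedged sketch of a green-list detector in the style of common schemes such as KGW (Kirchenbauer et al.); it is not the specific construction evaluated in this paper. The hash-based vocabulary split, the `GREEN_FRACTION` constant, and the function names are illustrative assumptions; real detectors operate on token ids from the model's tokenizer with a keyed pseudorandom function.

```python
import hashlib
import math

GREEN_FRACTION = 0.5  # assumed fraction of the vocabulary marked "green" at each step


def is_green(prev_token: str, token: str) -> bool:
    """Illustrative green-list membership test: hash the previous token together
    with the candidate token to emulate a keyed pseudorandom vocabulary split."""
    digest = hashlib.sha256(f"{prev_token}|{token}".encode()).digest()
    return digest[0] / 255.0 < GREEN_FRACTION


def detect_zscore(tokens: list[str]) -> float:
    """Count green tokens and return the one-proportion z-score typically used by
    green-list detectors: a large positive z suggests the text is watermarked."""
    n = len(tokens) - 1
    if n <= 0:
        return 0.0
    green = sum(is_green(tokens[i - 1], tokens[i]) for i in range(1, len(tokens)))
    expected = GREEN_FRACTION * n
    variance = GREEN_FRACTION * (1 - GREEN_FRACTION) * n
    return (green - expected) / math.sqrt(variance)
```

A removal attack succeeds against detectors of this kind when its edits push the detection score below the decision threshold while preserving the text's meaning.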

To bridge this gap, we first formalize the system model for LLM watermarking and characterize two realistic threat models constrained by limited access to the watermark detector. We then analyze how different types of perturbation vary in their attack range, i.e., the number of tokens a single edit can affect. We observe that character-level perturbations (e.g., typos, swaps, deletions, homoglyphs) can influence multiple tokens simultaneously by disrupting the tokenization process, and we demonstrate that they are significantly more effective for watermark removal than token-level or sentence-level approaches under the most restrictive threat model. We further propose guided removal attacks based on a Genetic Algorithm (GA) that uses a reference detector to steer the optimization. Under a practical threat model with limited black-box queries to the watermark detector, our method demonstrates strong removal performance. Experiments across five representative watermarking schemes and two widely used LLMs consistently confirm the superiority of character-level perturbations and the effectiveness of the reference-detector-guided GA in removing watermarks under realistic constraints. Additionally, we argue that there is an adversarial dilemma when considering potential defenses: any fixed defense can be bypassed by a suitable perturbation strategy. Motivated by this principle, we propose an adaptive compound character-level attack, and experimental results show that it effectively defeats such defenses. Our findings highlight significant vulnerabilities in existing LLM watermarking schemes and underscore the urgency of developing new, robust mechanisms.
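As a concrete illustration of the two attack ingredients described above, the sketch below (our own minimal example, not the authors' released implementation) applies character-level perturbations, here homoglyph swaps and deletions that can re-segment several subword tokens at once, and wraps them in a small genetic algorithm whose fitness is a black-box reference detector score queried under a fixed budget. `reference_detector_score` is a hypothetical callable returning a detection score (e.g., a z-score) where lower means less watermark evidence.

```python
import random

# Visually similar Unicode homoglyphs; replacing one character can change how a
# subword tokenizer segments the surrounding text, so a single edit may perturb
# several tokens at once.
HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e", "i": "\u0456", "c": "\u0441"}


def mutate(text: str, budget: int) -> str:
    """Apply up to `budget` random character-level edits (homoglyph swap or deletion)."""
    chars = list(text)
    for _ in range(random.randint(1, budget)):
        pos = random.randrange(len(chars))
        if chars[pos].lower() in HOMOGLYPHS and random.random() < 0.7:
            chars[pos] = HOMOGLYPHS[chars[pos].lower()]
        elif len(chars) > 1:
            del chars[pos]
    return "".join(chars)


def crossover(a: str, b: str) -> str:
    """Single-point crossover of two candidate texts."""
    if min(len(a), len(b)) < 2:
        return a
    cut = random.randrange(1, min(len(a), len(b)))
    return a[:cut] + b[cut:]


def ga_remove_watermark(text, reference_detector_score, budget=5,
                        pop_size=20, generations=30, query_limit=200):
    """Hypothetical GA loop: minimize the reference detector's score (less
    watermark evidence) under an edit budget and a cap on detector queries."""
    best = (reference_detector_score(text), text)
    population = [mutate(text, budget) for _ in range(pop_size)]
    queries = 1
    for _ in range(generations):
        scored = []
        for cand in population:
            if queries >= query_limit:
                break
            scored.append((reference_detector_score(cand), cand))
            queries += 1
        if len(scored) < 2:
            break
        scored.sort(key=lambda s: s[0])  # ascending: weakest watermark signal first
        best = min(best, scored[0], key=lambda s: s[0])
        parents = [c for _, c in scored[: max(2, len(scored) // 2)]]
        population = [mutate(crossover(*random.sample(parents, 2)), budget)
                      for _ in range(pop_size)]
    return best[1]
```

The loop simply keeps the candidates that most reduce the reference detector's score in each generation; under the paper's most restrictive threat model, where no detector queries are available, the unguided character-level mutations alone are the relevant baseline.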
