Zhaoxi Zhang (University of Technology Sydney), Xiaomei Zhang (Griffith University), Yanjun Zhang (University of Technology Sydney), He Zhang (RMIT University), Shirui Pan (Griffith University), Bo Liu (University of Technology Sydney), Asif Gill (University of Technology Sydney Australia), Leo Yu Zhang (Griffith University)

Large Language Model (LLM) watermarking has emerged as a promising technique for copyright protection, misuse prevention, and machine-generated content detection. It injects detectable signals during the LLM generation process, allowing later identification by a corresponding detector. To assess the robustness of watermarking schemes, existing studies typically adopt watermark removal attacks, which aim to erase the embedded signals by modifying the watermarked text. However, we reveal that existing watermark removal attacks are suboptimal, which leads to the misconception that effective watermark removal requires either a large perturbation budget or a strong adversary, for example one with unlimited queries to the victim LLM or its watermark detector. Both a systematic examination of removal attack capabilities and the development of more sophisticated techniques remain largely underexplored. As a result, the robustness of existing watermarking schemes may be overestimated.

To bridge this gap, we first formalize the system model for LLM watermarking and characterize two realistic threat models constrained by limited access to the watermark detector. We then analyze how different types of perturbation vary in their attack range, i.e., the number of tokens they can affect with a single edit. We observe that character-level perturbations (e.g., typos, swaps, deletions, homoglyphs) can influence multiple tokens simultaneously by disrupting the tokenization process. We demonstrate that, under the most restrictive threat model, character-level perturbations are significantly more effective for watermark removal than token-level or sentence-level approaches. We further propose guided removal attacks based on a Genetic Algorithm (GA) that uses a reference detector for optimization. Under a practical threat model with limited black-box queries to the watermark detector, our method demonstrates strong removal performance. Experiments across five representative watermarking schemes and two widely used LLMs consistently confirm the superiority of character-level perturbations and the effectiveness of the reference-detector-guided GA in removing watermarks under realistic constraints. Additionally, we argue that potential defenses face an adversarial dilemma: any fixed defense can be bypassed by a suitable perturbation strategy. Motivated by this principle, we propose an adaptive compound character-level attack, and experimental results show that it can effectively defeat these defenses. Our findings highlight significant vulnerabilities in existing LLM watermarking schemes and underline the urgent need for new robust mechanisms.
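
The attack-range observation can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy vocabulary `VOCAB`, the greedy longest-match `tokenize` function (a stand-in for a real subword tokenizer), and the helper names `swap_adjacent` and `token_diff`. It shows how one adjacent-character swap at a token boundary can invalidate two original tokens and replace them with many new ones.

```python
from collections import Counter
from typing import List, Set, Tuple

# Hypothetical toy vocabulary; real LLM tokenizers are far larger.
VOCAB: Set[str] = {"water", "mark", "ing", " ", "detect", "or"}


def tokenize(text: str, vocab: Set[str] = VOCAB) -> List[str]:
    """Greedy longest-match tokenization with single-character fallback."""
    tokens: List[str] = []
    i = 0
    while i < len(text):
        match = text[i]  # fallback: emit the single character
        for j in range(len(text), i, -1):  # try the longest candidate first
            if text[i:j] in vocab:
                match = text[i:j]
                break
        tokens.append(match)
        i += len(match)
    return tokens


def swap_adjacent(text: str, i: int) -> str:
    """Character-level perturbation: swap the characters at i and i + 1."""
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def token_diff(original: str, perturbed: str) -> Tuple[int, int]:
    """Return (original tokens lost, new tokens introduced) as multiset counts."""
    before = Counter(tokenize(original))
    after = Counter(tokenize(perturbed))
    lost = sum((before - after).values())
    introduced = sum((after - before).values())
    return lost, introduced


original = "watermarking detector"
perturbed = swap_adjacent(original, 4)  # swaps the "r" and "m" across a token boundary
lost, introduced = token_diff(original, perturbed)
print(tokenize(original))
print(tokenize(perturbed))
print(lost, introduced)
```

In this toy setting, the single swap removes both `water` and `mark` from the token sequence, whereas a token-level substitution would by construction change exactly one token. Counting lost and introduced tokens as multiset differences is only one possible way to measure the effect of an edit; the paper's own definition of attack range may differ in detail.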
