Zhaoxi Zhang (University of Technology Sydney), Xiaomei Zhang (Griffith University), Yanjun Zhang (University of Technology Sydney), He Zhang (RMIT University), Shirui Pan (Griffith University), Bo Liu (University of Technology Sydney), Asif Gill (University of Technology Sydney), Leo Yu Zhang (Griffith University)

Large Language Model (LLM) watermarking has emerged as a promising technique for copyright protection, misuse prevention, and machine-generated content detection. It injects detectable signals during the LLM generation process, allowing for later identification by a corresponding detector. To assess the robustness of watermarking schemes, existing studies typically adopt watermark removal attacks, which aim to erase embedded signals by modifying the watermarked text. However, we reveal that existing watermark removal attacks are suboptimal, leading to the misconception that effective watermark removal requires either a large perturbation budget or strong adversarial capabilities, such as unlimited queries to the victim LLM or its watermark detector. Systematic scrutiny of removal attack capabilities, as well as the development of more sophisticated techniques, remains largely underexplored. As a result, the robustness of existing watermarking schemes may be overestimated.

To bridge this gap, we first formalize the system model for LLM watermarking and characterize two realistic threat models constrained by limited access to the watermark detector. We then analyze how different types of perturbation vary in their attack range, i.e., the number of tokens they can affect with a single edit. We observe that character-level perturbations (e.g., typos, swaps, deletions, homoglyphs) can influence multiple tokens simultaneously by disrupting the tokenization process. We demonstrate that, under the most restrictive threat model, character-level perturbations are significantly more effective for watermark removal than token-level or sentence-level approaches. We further propose guided removal attacks based on the Genetic Algorithm (GA), which use a reference detector for optimization. Under a practical threat model with limited black-box queries to the watermark detector, our method demonstrates strong removal performance. Experiments across five representative watermarking schemes and two widely used LLMs consistently confirm the superiority of character-level perturbations and the effectiveness of the reference-detector-guided GA in removing watermarks under realistic constraints. Additionally, we identify an adversarial dilemma when considering potential defenses: any fixed defense can be bypassed by a suitable perturbation strategy. Motivated by this principle, we propose an adaptive compound character-level attack. Experimental results show that this approach effectively defeats the evaluated defenses. Our findings highlight significant vulnerabilities in existing LLM watermarking schemes and underscore the urgency of developing new, robust mechanisms.
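To illustrate the attack-range idea above, the following toy sketch (not the paper's implementation) uses a hypothetical vocabulary with a greedy longest-match tokenizer, a simplified stand-in for BPE. Deleting a single character can prevent a long subword match, so one edit fragments a span that previously formed one token into several tokens:

```python
# Hypothetical vocabulary for illustration only; real LLM tokenizers
# (e.g., BPE) have far larger vocabularies but behave analogously.
VOCAB = {"water", "mark", "m", "a", "r", "k", "e", "t", "w"}

def tokenize(text, vocab=VOCab if False else VOCAB, max_len=8):
    """Greedy longest-match tokenization: at each position, take the
    longest vocabulary piece; unknown characters fall back to themselves."""
    tokens, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in vocab:
                tokens.append(piece)
                i += length
                break
        else:
            tokens.append(text[i])
            i += 1
    return tokens

print(tokenize("watermark"))  # ['water', 'mark'] -- two tokens
print(tokenize("watermrk"))   # ['water', 'm', 'r', 'k'] -- one deleted
                              # character turned 'mark' into three tokens
```

Because watermark detectors score texts token by token, a single character-level edit that re-fragments the token stream this way can invalidate several watermarked tokens at once, whereas a token-level substitution affects only one.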
