LLM Watermark Evasion via Bias Inversion

Information in the literature database is collected automatically.

Source

https://arxiv.org/abs/2509.23019

PDF

https://arxiv.org/pdf/2509.23019

Paper Information

Authors: Jeongyeon Hwang, Sangdon Park, Jungseul Ok
Published: 9-27-2025
Updated: 10-2-2025
Affiliation: Pohang University of Science and Technology (POSTECH)
Country: South Korea
Conference: Computing Research Repository (CoRR)

Labels Estimated by AI

Statistical Testing, Disabling Safety Mechanisms of LLM, Model Inversion

These labels were automatically added by AI and may be inaccurate.

Abstract

Watermarking for large language models (LLMs) embeds a statistical signal during generation to enable detection of model-produced text. While watermarking has proven effective in benign settings, its robustness under adversarial evasion remains contested. To advance a rigorous understanding and evaluation of such vulnerabilities, we propose the Bias-Inversion Rewriting Attack (BIRA), which is theoretically motivated and model-agnostic. BIRA weakens the watermark signal by suppressing the logits of likely watermarked tokens during LLM-based rewriting, without any knowledge of the underlying watermarking scheme. Across recent watermarking methods, BIRA achieves over 99% evasion while preserving the semantic content of the original text. Beyond demonstrating an attack, our results reveal a systematic vulnerability, emphasizing the need for stress testing and robust defenses.
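
The core mechanism named in the abstract, suppressing the logits of likely watermarked tokens, can be illustrated with a toy example. The following is a minimal sketch and not the authors' implementation: the function names, the penalty `delta`, and the toy logits are all hypothetical, and the set of suppressed token ids is simply given as input, since the abstract does not state how BIRA selects those tokens.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def suppress_logits(logits, suppressed_ids, delta):
    """Return a copy of `logits` with `delta` subtracted at each suppressed id.

    `suppressed_ids` stands in for the "likely watermarked tokens"; how they
    are chosen is an assumption left outside this sketch.
    """
    return [x - delta if i in suppressed_ids else x
            for i, x in enumerate(logits)]


def suppressed_mass(probs, suppressed_ids):
    """Total probability assigned to the suppressed token ids."""
    return sum(probs[i] for i in suppressed_ids)


# Toy next-token distribution over a 5-token vocabulary (hypothetical values).
logits = [2.0, 1.0, 0.5, 0.0, -1.0]
suppressed_ids = {0, 2}   # assumed "likely watermarked" token ids
delta = 4.0               # assumed penalty strength

before = suppressed_mass(softmax(logits), suppressed_ids)
biased = suppress_logits(logits, suppressed_ids, delta)
after = suppressed_mass(softmax(biased), suppressed_ids)

print(f"mass on suppressed tokens: {before:.3f} -> {after:.3f}")
```

In this sketch the penalty is applied once to a single next-token distribution; in an LLM-based rewriting setting the same adjustment would presumably be applied at every decoding step, shifting sampling away from the suppressed tokens while leaving the relative order of the remaining tokens unchanged.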

External Datasets