These labels were automatically added by AI and may be inaccurate. For details, see About Literature Database.
Abstract
Watermarking for large language models (LLMs) embeds a statistical signal
during generation to enable detection of model-produced text. While
watermarking has proven effective in benign settings, its robustness under
adversarial evasion remains contested. To advance a rigorous understanding and
evaluation of such vulnerabilities, we propose the \emph{Bias-Inversion
Rewriting Attack} (BIRA), which is theoretically motivated and model-agnostic.
BIRA weakens the watermark signal by suppressing the logits of likely
watermarked tokens during LLM-based rewriting, without any knowledge of the
underlying watermarking scheme. Across recent watermarking methods, BIRA
achieves over 99\% evasion while preserving the semantic content of the
original text. Beyond demonstrating an attack, our results reveal a systematic
vulnerability, emphasizing the need for stress testing and robust defenses.