Abstract
Large Language Models (LLMs) can be misused to spread unwanted content at
scale. Content watermarking deters misuse by hiding a message in generated
content, which can later be detected using a secret watermarking key.
Robustness is a core
security property, stating that evading detection requires (significant)
degradation of the content's quality. Many LLM watermarking methods have been
proposed, but their robustness is tested only against non-adaptive attackers
who lack knowledge of the watermarking method and can therefore find only
suboptimal attacks. We
formulate watermark robustness as an objective function and use
preference-based optimization to tune adaptive attacks against the specific
watermarking method. Our evaluation shows that (i) adaptive attacks evade
detection against all surveyed watermarks, (ii) an attack tuned against any
one watermark also evades unseen watermarks, and (iii) optimization-based
attacks are
cost-effective. Our findings underscore the need to test robustness against
adaptively tuned attacks. We release our adaptively optimized paraphrasers at
https://github.com/nilslukas/ada-wm-evasion.
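
The abstract describes tuning a paraphraser with preference-based optimization against a specific watermark detector. As a hedged illustration only, the sketch below shows one way such preference data could be constructed: candidate paraphrases are scored by an attack objective that rewards evasion (high detector p-value) and text quality, and the best and worst candidates form chosen/rejected pairs for a preference-optimization trainer such as DPO. The names `paraphrase`, `detect_pvalue`, and `quality` are hypothetical placeholders, not the paper's released code.

```python
# Hypothetical sketch of preference-pair construction for an adaptive
# watermark-evasion attack; not the authors' implementation.
from dataclasses import dataclass
from typing import Callable


@dataclass
class PreferencePair:
    prompt: str    # watermarked text to be paraphrased
    chosen: str    # candidate that best evades detection at high quality
    rejected: str  # candidate that evades least or degrades quality most


def attack_score(pvalue: float, qual: float, lam: float = 1.0) -> float:
    """Attack objective: reward evasion (high detector p-value) and quality."""
    return pvalue + lam * qual


def build_pairs(
    watermarked_texts: list[str],
    paraphrase: Callable[[str], list[str]],  # samples k candidate paraphrases
    detect_pvalue: Callable[[str], float],   # detector p-value (secret key)
    quality: Callable[[str, str], float],    # quality of paraphrase vs. original
) -> list[PreferencePair]:
    pairs: list[PreferencePair] = []
    for text in watermarked_texts:
        candidates = paraphrase(text)
        # Sort candidates by the attack objective, ascending.
        ranked = sorted(
            candidates,
            key=lambda c: attack_score(detect_pvalue(c), quality(text, c)),
        )
        # Best-scoring candidate becomes "chosen", worst "rejected"; the
        # resulting pairs can be fed to a preference-optimization trainer
        # (e.g., DPO) to adaptively tune the paraphraser.
        pairs.append(PreferencePair(text, ranked[-1], ranked[0]))
    return pairs
```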