Abstract
Large Language Models (LLMs) can be misused to spread unwanted content at
scale. Content watermarking deters misuse by hiding a message in generated
content, which can later be detected using a secret watermarking key.
Robustness is a core
security property, stating that evading detection requires (significant)
degradation of the content's quality. Many LLM watermarking methods have been
proposed, but their robustness is tested only against non-adaptive attackers
who lack knowledge of the watermarking method and can therefore find only
suboptimal attacks. We
formulate watermark robustness as an objective function and use
preference-based optimization to tune adaptive attacks against the specific
watermarking method. Our evaluation shows that (i) adaptive attacks evade
detection against all surveyed watermarks, (ii) an attack tuned against any
one watermark also evades unseen watermarks, and (iii) optimization-based
attacks are
cost-effective. Our findings underscore the need to test robustness against
adaptively tuned attacks. We release our adaptively optimized paraphrasers at
https://github.com/nilslukas/ada-wm-evasion.
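
The abstract describes tuning a paraphraser with preference-based optimization against a specific watermark detector. As a hedged illustration only, the sketch below shows one way such preference data could be constructed: candidate paraphrases are scored by an attack objective that rewards evasion (high detector p-value) and text quality, and the best and worst candidates form chosen/rejected pairs for a preference-optimization trainer such as DPO. The names `paraphrase`, `detect_pvalue`, and `quality` are hypothetical placeholders, not the paper's released code.

```python
# Hypothetical sketch of preference-pair construction for an adaptive
# watermark-evasion attack; not the authors' implementation.
from dataclasses import dataclass
from typing import Callable


@dataclass
class PreferencePair:
    prompt: str    # watermarked text to be paraphrased
    chosen: str    # candidate that best evades detection at high quality
    rejected: str  # candidate that evades least or degrades quality most


def attack_score(pvalue: float, qual: float, lam: float = 1.0) -> float:
    """Attack objective: reward evasion (high detector p-value) and quality."""
    return pvalue + lam * qual


def build_pairs(
    watermarked_texts: list[str],
    paraphrase: Callable[[str], list[str]],  # samples k candidate paraphrases
    detect_pvalue: Callable[[str], float],   # detector p-value (secret key)
    quality: Callable[[str, str], float],    # quality of paraphrase vs. original
) -> list[PreferencePair]:
    pairs: list[PreferencePair] = []
    for text in watermarked_texts:
        candidates = paraphrase(text)
        # Sort candidates by the attack objective, ascending.
        ranked = sorted(
            candidates,
            key=lambda c: attack_score(detect_pvalue(c), quality(text, c)),
        )
        # Best-scoring candidate becomes "chosen", worst "rejected"; the
        # resulting pairs can be fed to a preference-optimization trainer
        # (e.g., DPO) to adaptively tune the paraphraser.
        pairs.append(PreferencePair(text, ranked[-1], ranked[0]))
    return pairs
```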