RLSpoofer: A Lightweight Evaluator for LLM Watermark Spoofing Resilience

arXiv

AI Security Portal bot

Information in the literature database is collected automatically.

Source

https://arxiv.org/abs/2604.11546

PDF

https://arxiv.org/pdf/2604.11546

Bibliographic Information

Authors: Hanbo Huang, Xuan Gong, Yiran Zhang, Hao Zheng, Shiyu Liang
Published: 2026-04-13
Affiliation: Shanghai Jiao Tong University
Country of affiliation: China
Conference:

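AI-Estimated Labels

Adversarial learning, Watermark design, Attack strategy analysis

Note: These labels were added automatically by AI and may therefore be inaccurate.
For details, see "About the Literature Database".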

Abstract

Large language model (LLM) watermarking has emerged as a promising approach for detecting and attributing AI-generated text, yet its robustness to black-box spoofing remains insufficiently evaluated. Existing evaluation methods often demand extensive datasets and white-box access to algorithmic internals, limiting their practical applicability. In this paper, we study watermark resilience against spoofing fundamentally from a distributional perspective. We first establish a local capacity bottleneck, which theoretically characterizes the probability mass that can be reallocated under KL-bounded local updates while preserving semantic fidelity. Building on this, we propose RLSpoofer, a reinforcement learning-based black-box spoofing attack that requires only 100 human-watermarked paraphrase training pairs and zero access to the watermarking internals or detectors. Despite weak supervision, it empowers a 4B model to achieve a 62.0% spoof success rate with minimal semantic shift on PF-marked texts, dwarfing the 6% of baseline models trained on up to 10,000 samples. Our findings expose the fragile spoofing resistance of current LLM watermarking paradigms, providing a lightweight evaluation framework and stressing the urgent need for more robust schemes.

External Datasets

C4-RealNewslike subset

Reddit WritingPrompts

LFQA

BookReport subset of MMW

FakeNews subset of MMW