Abstract
The integrity of peer review is fundamental to scientific progress, but the
rise of large language models (LLMs) has introduced concerns that some
reviewers may rely on these tools to generate reviews rather than writing them
independently. Although some venues have banned LLM-assisted reviewing,
enforcement remains difficult, as existing detection tools cannot reliably
distinguish between fully generated reviews and those merely polished with AI
assistance. In this work, we address the challenge of detecting LLM-generated
reviews. We consider the approach of performing indirect prompt injection via
the paper's PDF, prompting the LLM to embed a covert watermark in the generated
review, and subsequently testing for the presence of the watermark in the review.
We identify and address several pitfalls in naïve implementations of this
approach. Our primary contribution is a rigorous watermarking and detection
framework that offers strong statistical guarantees. Specifically, we introduce
watermarking schemes and hypothesis tests that control the family-wise error
rate across multiple reviews, achieving higher statistical power than standard
corrections such as Bonferroni, while making no assumptions about the nature of
human-written reviews. We explore multiple indirect prompt injection
strategies, including font-based embedding and obfuscated prompts, and evaluate
their effectiveness under various reviewer defense scenarios. Our experiments
show high watermark-embedding success rates across a variety of LLMs. We also
find empirically that our approach is resilient to common reviewer defenses,
and that the error-rate bounds of our statistical tests hold in practice.
In contrast, we find that Bonferroni-style corrections are too conservative to
be useful in this setting.