Abstract
Benchmark contamination poses a significant challenge to the reliability of
Large Language Model (LLM) evaluations, as it is difficult to assess whether
a model has been trained on a test set. We introduce a solution to this problem
by watermarking benchmarks before their release. Embedding the watermark
involves reformulating the original questions with a watermarked LLM, in a way
that does not alter the benchmark's utility. During evaluation, we can detect
``radioactivity'', \ie traces that the text watermarks leave in the model
during training, using a theoretically grounded statistical test. We test our
method by pre-training 1B models from scratch on 10B tokens with controlled
benchmark contamination, and validate its effectiveness in detecting
contamination on ARC-Easy, ARC-Challenge, and MMLU. Results show that
benchmark utility is preserved after watermarking and that contamination is
successfully detected when models are contaminated enough for their performance
to improve, \eg $p$-value $=10^{-3}$ for a +5\% gain on ARC-Easy.
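
To illustrate the flavor of statistical test involved (a minimal sketch, not the paper's exact procedure), suppose the watermark is a green-list token watermark with a fraction $\gamma$ of the vocabulary marked green. Under the null hypothesis that the model was never trained on the watermarked benchmark, each scored token is green with probability $\gamma$, so a one-sided binomial tail gives a $p$-value. The function and parameter names below are illustrative assumptions.

\begin{verbatim}
# Hypothetical sketch of a radioactivity-style detection test,
# assuming a green-list token watermark with green fraction gamma.
# Not the paper's exact statistic; names/parameters are illustrative.
from scipy.stats import binom

def radioactivity_pvalue(num_green: int, num_scored: int,
                         gamma: float = 0.25) -> float:
    """Under H0 (model never saw the watermarked benchmark), each
    scored token is green with probability gamma; a small p-value
    suggests the model absorbed traces of the watermark."""
    # One-sided tail: P[Binomial(num_scored, gamma) >= num_green]
    return binom.sf(num_green - 1, num_scored, gamma)

# Example: 320 green tokens out of 1000 scored, gamma = 0.25
# -> very small p-value, flagging likely contamination.
print(radioactivity_pvalue(320, 1000, 0.25))
\end{verbatim}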