Labels Predicted by AI
Data Contamination Detection, Watermarking, Performance Evaluation
Abstract
Benchmark contamination poses a significant challenge to the reliability of Large Language Model (LLM) evaluations, as it is difficult to ascertain whether a model has been trained on a test set. We introduce a solution to this problem by watermarking benchmarks before their release. The embedding involves reformulating the original questions with a watermarked LLM, in a way that does not alter the benchmark's utility. During evaluation, we can detect "radioactivity", i.e. the traces that the text watermarks leave in a model during training, using a theoretically grounded statistical test. We test our method by pre-training 1B-parameter models from scratch on 10B tokens with controlled benchmark contamination, and validate its effectiveness in detecting contamination on ARC-Easy, ARC-Challenge, and MMLU. Results show comparable benchmark utility after watermarking and successful contamination detection whenever models are contaminated enough to improve their performance, e.g. a p-value of 10⁻³ for a +5% gain on ARC-Easy.
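The abstract does not spell out the statistical test itself. As a rough, non-authoritative illustration of the general idea, the sketch below scores tokens against a keyed pseudo-random "green list" and computes a binomial p-value, in the style of common LLM text-watermark detectors; the names GAMMA, is_green, and radioactivity_pvalue, as well as the specific scoring scheme, are assumptions and may differ from the paper's actual embedding and radioactivity test.

```python
# Hypothetical sketch only: green-list watermark scoring with a binomial
# p-value. The paper's actual watermark and radioactivity test may differ.
import hashlib
from scipy.stats import binomtest

GAMMA = 0.25        # assumed fraction of the vocabulary on the "green" list
VOCAB_SIZE = 32000  # assumed tokenizer vocabulary size


def is_green(prev_token_id: int, token_id: int, key: str) -> bool:
    """Pseudo-randomly assign token_id to the green list, seeded by the
    previous token and a secret watermark key."""
    digest = hashlib.sha256(f"{key}:{prev_token_id}:{token_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % VOCAB_SIZE < GAMMA * VOCAB_SIZE


def radioactivity_pvalue(token_ids: list[int], key: str) -> float:
    """One-sided binomial test: are green tokens over-represented compared to
    the GAMMA baseline expected from an uncontaminated model?"""
    hits = sum(
        is_green(prev, tok, key)
        for prev, tok in zip(token_ids, token_ids[1:])
    )
    n_scored = len(token_ids) - 1
    return binomtest(hits, n_scored, GAMMA, alternative="greater").pvalue
```

In this toy setup, token_ids would come from the suspect model's outputs on watermark-carrying prompts; a small p-value indicates that the model has absorbed the watermark and was therefore likely trained on the watermarked benchmark.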