Abstract
The increasing parameter counts and expansive training datasets of large
language models (LLMs) highlight the urgent demand for a technical solution to
audit the underlying privacy risks and copyright issues associated with LLMs. Existing
studies have partially addressed this need through an exploration of the
pre-training data detection problem, which is an instance of a membership
inference attack (MIA). This problem involves determining whether a given piece
of text has been used during the pre-training phase of the target LLM. Although
existing methods have designed various sophisticated MIA score functions to
achieve considerable detection performance in pre-trained LLMs, how to achieve
high-confidence detection and how to perform MIA on aligned LLMs remain
challenging. In this paper, we propose MIA-Tuner, a novel instruction-based MIA
method that instructs LLMs themselves to serve as a more precise pre-training
data detector internally, rather than designing an external MIA score function.
Furthermore, we design two instruction-based safeguards to mitigate the
privacy risks posed by existing methods and by MIA-Tuner, respectively. To
comprehensively evaluate the most recent state-of-the-art LLMs, we collect a
more up-to-date MIA benchmark dataset, named WIKIMIA-24, to replace the widely
adopted benchmark WIKIMIA. We conduct extensive experiments across various
aligned and unaligned LLMs over the two benchmark datasets. The results
demonstrate that MIA-Tuner raises the AUC of MIAs from 0.7 to a
significantly higher level of 0.9.
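To make the contrast with external MIA score functions concrete, the sketch below illustrates one common score-based baseline in the style of Min-K%-Prob; it is not the MIA-Tuner method, and the per-token log-probabilities (which a real detector would obtain by querying the target LLM) are replaced with illustrative stub values:

```python
# Sketch of a score-based pre-training data detector (Min-K%-Prob style).
# Assumption: token_logprobs are per-token log-probabilities of the text
# under the target LLM; the stub values below are purely illustrative.

def min_k_prob_score(token_logprobs, k=0.2):
    """Average log-probability of the k% least-likely tokens.

    A higher score means even the text's rarest tokens are well-predicted,
    suggesting the text may have appeared in the pre-training data.
    """
    n = max(1, int(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n]
    return sum(lowest) / n

def predict_member(token_logprobs, threshold=-4.0, k=0.2):
    # Membership decision: score above a tuned threshold -> predicted member.
    # The threshold here is a placeholder; in practice it is calibrated
    # on held-out member/non-member examples.
    return min_k_prob_score(token_logprobs, k) > threshold

# Illustrative stub log-probs (not real model outputs):
member_like = [-0.5, -1.2, -0.8, -2.0, -1.0, -0.9]
nonmember_like = [-3.5, -6.2, -5.8, -7.0, -4.4, -5.1]

print(predict_member(member_like))      # → True
print(predict_member(nonmember_like))   # → False
```

Methods of this kind must hand-craft the score and threshold externally; MIA-Tuner instead instructs the aligned LLM itself to act as the detector.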