Abstract
Large Language Models are typically trained on datasets collected from the
web, which may inadvertently contain harmful or sensitive personal information.
To address growing privacy concerns, unlearning methods have been proposed to
remove the influence of specific data from trained models. Of these, exact
unlearning -- which retrains the model from scratch without the target data --
is widely regarded as the gold standard for mitigating privacy risks in
deployment. In this paper, we revisit this assumption in a practical deployment
setting where both the pre- and post-unlearning logits APIs are exposed, such as
in open-weight scenarios. Targeting this setting, we introduce a novel data
extraction attack that leverages signals from the pre-unlearning model to guide
the post-unlearning model, uncovering patterns that reflect the removed data
distribution. Combining model guidance with a token filtering strategy, our
attack significantly improves extraction success rates -- doubling performance
in some cases -- across common benchmarks such as MUSE, TOFU, and WMDP.
Furthermore, we demonstrate our attack's effectiveness on a simulated medical
diagnosis dataset to highlight real-world privacy risks associated with exact
unlearning. Our findings suggest that unlearning may, paradoxically, increase
the risk of privacy leakage in real-world deployments. We therefore advocate
that evaluations of unlearning methods consider broader threat models that
account not only for post-unlearning models but also for adversarial access to
prior checkpoints. Code is publicly available at:
https://github.com/Nicholas0228/unlearned_data_extraction_llm.
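
To make the idea of model guidance with token filtering concrete, the following is a minimal sketch and not the paper's implementation. It assumes that guidance extrapolates the post-unlearning logits along the direction of the pre-unlearning logits, and that filtering discards tokens the pre-unlearning model considers implausible. The function names, the guidance weight `w`, and the threshold `keep_ratio` are all hypothetical, chosen for illustration only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def guided_logits(pre, post, w=2.0, keep_ratio=0.1):
    """Combine next-token logits from the two models (illustrative assumption).

    pre        -- logits from the pre-unlearning model
    post       -- logits from the post-unlearning model
    w          -- hypothetical guidance weight; w = 1 recovers `pre`,
                  w > 1 amplifies whatever `pre` prefers more than `post`
    keep_ratio -- hypothetical filter: a token is kept only if its
                  pre-unlearning probability is at least keep_ratio times
                  the largest pre-unlearning probability
    """
    if len(pre) != len(post):
        raise ValueError("pre and post must cover the same vocabulary")
    pre_probs = softmax(pre)
    cutoff = keep_ratio * max(pre_probs)
    scores = []
    for p_pre, l_pre, l_post in zip(pre_probs, pre, post):
        if p_pre < cutoff:
            # Token filtering: mask tokens the pre-unlearning model deems unlikely.
            scores.append(float("-inf"))
        else:
            # Model guidance: move from post towards (and past) pre.
            scores.append(l_post + w * (l_pre - l_post))
    return scores


def greedy_token(pre, post, w=2.0, keep_ratio=0.1):
    """Index of the highest-scoring token under the guided, filtered logits."""
    scores = guided_logits(pre, post, w, keep_ratio)
    return max(range(len(scores)), key=scores.__getitem__)
```

In a full attack loop, `pre` and `post` would be the next-token logits returned by the two exposed APIs for the same prefix, and the selected token would be appended to the prefix before the next query. The filter is what keeps the amplified difference from promoting tokens that neither model finds plausible.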