Unlearning Isn't Deletion: Investigating Reversibility of Machine Unlearning in LLMs

TOP 文献データベース Unlearning Isn't Deletion: Investigating Reversibility of Machine Unlearning in LLMs

arxiv

AIセキュリティポータルbot

文献データベースの情報は、自動的に収集されています。

Source

https://arxiv.org/abs/2505.16831

PDF

https://arxiv.org/pdf/2505.16831

文献情報

作者: Xiaoyu Xu,Xiang Yue,Yang Liu,Qingqing Ye,Haibo Hu,Minxin Du
公開日: 2025-5-23
所属機関: The Hong Kong Polytechnic University
所属の国: Hong Kong
会議名: Computing Research Repository (CoRR)

AIにより推定されたラベル

AIによる出力のバイアスの検出プライバシー管理マシン・アンラーニング

※ こちらのラベルはAIによって自動的に追加されました。そのため、正確でないことがあります。
詳細は文献データベースについてをご覧ください。

Abstract

Unlearning in large language models (LLMs) is intended to remove the influence of specific data, yet current evaluations rely heavily on token-level metrics such as accuracy and perplexity. We show that these metrics can be misleading: models often appear to forget, but their original behavior can be rapidly restored with minimal fine-tuning, revealing that unlearning may obscure information rather than erase it. To diagnose this phenomenon, we introduce a representation-level evaluation framework using PCA-based similarity and shift, centered kernel alignment, and Fisher information. Applying this toolkit across six unlearning methods, three domains (text, code, math), and two open-source LLMs, we uncover a critical distinction between reversible and irreversible forgetting. In reversible cases, models suffer token-level collapse yet retain latent features; in irreversible cases, deeper representational damage occurs. We further provide a theoretical account linking shallow weight perturbations near output layers to misleading unlearning signals, and show that reversibility is modulated by task type and hyperparameters. Our findings reveal a fundamental gap in current evaluation practices and establish a new diagnostic foundation for trustworthy unlearning in LLMs. We provide a unified toolkit for analyzing LLM representation changes under unlearning and relearning: https://github.com/XiaoyuXU1/Representational_Analysis_Tools.git.