Unlearning in large language models (LLMs) is intended to remove the
influence of specific data, yet current evaluations rely heavily on token-level
metrics such as accuracy and perplexity. We show that these metrics can be
misleading: models often appear to forget, but their original behavior can be
rapidly restored with minimal fine-tuning, revealing that unlearning may
obscure information rather than erase it. To diagnose this phenomenon, we
introduce a representation-level evaluation framework built on PCA-based
similarity and shift, centered kernel alignment (CKA), and Fisher information.
Applying this toolkit across six unlearning methods, three domains (text, code,
math), and two open-source LLMs, we uncover a critical distinction between
reversible and irreversible forgetting. In reversible cases, models suffer
token-level collapse yet retain latent features; in irreversible cases, deeper
representational damage occurs. We further provide a theoretical account
linking shallow weight perturbations near output layers to misleading
unlearning signals, and show that reversibility is modulated by task type and
hyperparameters. Our findings reveal a fundamental gap in current evaluation
practices and establish a new diagnostic foundation for trustworthy unlearning
in LLMs. We provide a unified toolkit for analyzing LLM representation changes
under unlearning and relearning at
https://github.com/XiaoyuXU1/Representational_Analysis_Tools.git
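
To make the representation-level view concrete, the following is a minimal sketch of one of the diagnostics named above, centered kernel alignment, in its common linear form. It is an illustration only and is not taken from the linked toolkit: the function name `linear_cka`, the choice of the linear kernel, and the assumed input layout (one row per example, one column per hidden feature) are all assumptions made here. A score near 1 between a layer's activations before and after unlearning would suggest the latent features are largely retained, while a low score would suggest deeper representational change.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between two activation matrices (hypothetical helper).

    x: array of shape (n_examples, d1), e.g. a layer's activations
       from the original model (assumed layout).
    y: array of shape (n_examples, d2), the same layer's activations
       on the same inputs after unlearning or relearning.
    Returns a float in [0, 1]; returns 0.0 if either input has no variance.
    """
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    if x.ndim != 2 or y.ndim != 2:
        raise ValueError("x and y must be 2-D (examples by features)")
    if x.shape[0] != y.shape[0]:
        raise ValueError("x and y must have the same number of examples")

    # Center each feature so the score depends on covariance structure only.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)

    # Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F).
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    denom = np.linalg.norm(x.T @ x, ord="fro") * np.linalg.norm(y.T @ y, ord="fro")
    if denom == 0.0:
        return 0.0
    return float(cross / denom)
```

In practice one would compute this score layer by layer on a fixed probe set, comparing the original model against the unlearned model and again against the relearned model. Linear CKA is invariant to orthogonal transformations and isotropic scaling of the features, which is why it is often preferred over a raw distance between activations.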