Large Language Models (LLMs) are known to be vulnerable to jailbreak attacks.
An important observation is that, while different types of jailbreak attacks
can generate significantly different queries, they mostly result in similar
responses that are rooted in the same harmful knowledge (e.g., detailed steps
to make a bomb). Consequently, unlearning-based approaches have been proposed
to mitigate jailbreak attacks by directly removing harmful knowledge from the
model. In this paper, we identify a novel ripple effect of unlearning, wherein
LLMs can implicitly unlearn harmful knowledge that was not explicitly
introduced during the unlearning phase (e.g., a model unlearning the steps for
theft may also implicitly unlearn the steps for making a bomb). Through over
100 experimental runs spanning multiple models, attack strategies, and defense
methods, we empirically validate this phenomenon, which enables unlearning-based
methods to reduce the Attack Success Rate (ASR) on unseen data from over
70% to below 10% with only 100 training samples. Further analysis reveals
that the strong generalization ability of unlearning may stem from the
intrinsic relatedness among harmful responses to different harmful questions
(e.g., shared response patterns, common steps and actions across responses, and
similarity among their learned representations within the LLM). We also discuss
the potential
limitations of unlearning and the observed ripple effect. We hope our research
contributes to a deeper understanding of unlearning. Our code is available
at https://github.com/thu-coai/SafeUnlearning.