Large Language Models (LLMs) are known to be vulnerable to jailbreak attacks.
An important observation is that, while different types of jailbreak attacks
can generate significantly different queries, they mostly result in similar
responses that are rooted in the same harmful knowledge (e.g., detailed steps
to make a bomb). Consequently, unlearning-based approaches have been proposed
to mitigate jailbreak attacks by directly removing harmful knowledge from the
model. In this paper, we identify a novel ripple effect of unlearning, wherein
LLMs can implicitly unlearn harmful knowledge that was not explicitly
introduced during the unlearning phase (e.g., a model unlearning the steps for
theft may also implicitly unlearn the steps for making a bomb). Through over
100 experimental runs spanning multiple models, attack strategies, and defense
methods, we empirically validate this phenomenon, which enables unlearning-based
methods to reduce the Attack Success Rate (ASR) on unseen data from over
70% to below 10% with only 100 training samples. Further analysis reveals
that the strong generalization ability of unlearning may stem from the
intrinsic relatedness among harmful responses to different harmful questions
(e.g., shared response patterns, common steps and actions across responses, and
similarity among their learned representations within the LLM). We also discuss
the potential
limitations of unlearning and the observed ripple effect. We hope our research
contributes to a deeper understanding of unlearning. Our code is available
at https://github.com/thu-coai/SafeUnlearning.