An Adversarial Perspective on Machine Unlearning for AI Safety

arxiv

Source

https://arxiv.org/abs/2409.18025

PDF

https://arxiv.org/pdf/2409.18025

Paper Information

Authors: Jakub Łucki, Boyi Wei, Yangsibo Huang, Peter Henderson, Florian Tramèr, Javier Rando
Published: 9-27-2024
Updated: 4-10-2025
Affiliation: ETH Zurich
Country: Switzerland

Labels Estimated by AI

Prompt Injection, Machine Unlearning, Safety Alignment

These labels were automatically added by AI and may be inaccurate.

Abstract

Large language models are finetuned to refuse questions about hazardous knowledge, but these protections can often be bypassed. Unlearning methods aim at completely removing hazardous capabilities from models and make them inaccessible to adversaries. This work challenges the fundamental differences between unlearning and traditional safety post-training from an adversarial perspective. We demonstrate that existing jailbreak methods, previously reported as ineffective against unlearning, can be successful when applied carefully. Furthermore, we develop a variety of adaptive methods that recover most supposedly unlearned capabilities. For instance, we show that finetuning on 10 unrelated examples or removing specific directions in the activation space can recover most hazardous capabilities for models edited with RMU, a state-of-the-art unlearning method. Our findings challenge the robustness of current unlearning approaches and question their advantages over safety training.
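
The abstract mentions "removing specific directions in the activation space" as one recovery method. The listing does not describe how the authors select or remove those directions, so the following is only a minimal sketch of the generic idea, orthogonal projection of activation vectors away from a single direction. The function name `ablate_direction` and the toy vectors are hypothetical and chosen for illustration; they are not taken from the paper.

```python
import math


def ablate_direction(activations, direction):
    """Remove the component of each activation vector along `direction`.

    Generic orthogonal projection; an illustrative assumption, not the
    paper's exact procedure.
    """
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    unit = [d / norm for d in direction]

    ablated = []
    for vec in activations:
        # Scalar projection of this activation onto the unit direction.
        coeff = sum(v * u for v, u in zip(vec, unit))
        # Subtract the projected component, leaving the orthogonal remainder.
        ablated.append([v - coeff * u for v, u in zip(vec, unit)])
    return ablated


# Hypothetical toy data: two activation vectors with hidden size 3.
acts = [[1.0, 2.0, 3.0], [0.5, -1.0, 4.0]]
direction = [0.0, 0.0, 2.0]  # hypothetical direction to remove
ablated = ablate_direction(acts, direction)
```

After the projection, every vector in `ablated` has zero dot product with `direction`, while components orthogonal to it are unchanged. In a real model the same operation would be applied to hidden states at chosen layers, and the direction would have to be estimated from data.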

External Datasets

WMDP benchmark

WikiText

bio-forget-corpus

bio-retain-corpus

cyber-forget-corpus

cyber-retain-corpus