What is in Your Safe Data? Identifying Benign Data that Breaks Safety

Source

https://arxiv.org/abs/2404.01099

PDF

https://arxiv.org/pdf/2404.01099

Paper Information

Authors: Luxi He; Mengzhou Xia; Peter Henderson
Published: 2024-04-01
Updated: 2024-08-21
Affiliation: Princeton Language and Intelligence (PLI), Princeton University
Country: United States of America

Labels Estimated by AI

Data Selection Strategy, Prompt Injection, Psychological Manipulation

These labels were automatically added by AI and may be inaccurate.

Abstract

Current Large Language Models (LLMs), even those tuned for safety and alignment, are susceptible to jailbreaking. Some have found that just further fine-tuning an aligned model with benign data (i.e., data without harmful content) surprisingly leads to substantial degradation in safety. We delve into the data-centric aspects of why benign fine-tuning inadvertently contributes to jailbreaking. First, we represent fine-tuning data through two lenses: representation and gradient spaces. Additionally, we propose a bi-directional anchoring method that, during the selection process, prioritizes data points that are close to harmful examples and far from benign ones. Our approach effectively identifies subsets of benign data that are more likely to degrade the model's safety after fine-tuning. Training on just 100 of these seemingly benign datapoints surprisingly leads to the fine-tuned model affirmatively responding to >70% of tested harmful requests, compared to <20% after fine-tuning on randomly selected data. We also observe that the selected data frequently appear as lists, bullet points, or math questions, indicating a systematic pattern in fine-tuning data that contributes to jailbreaking.
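
The bi-directional anchoring idea in the abstract (prefer data points close to harmful examples and far from benign ones) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes each example has already been mapped to a feature vector (for instance a hidden representation or a gradient feature), and it assumes a simple score: mean cosine similarity to the harmful anchors minus mean cosine similarity to the benign anchors. The function names `bidirectional_anchor_scores` and `select_top_k` are hypothetical.

```python
import numpy as np


def _normalize(x):
    """Scale each row to unit length so dot products become cosine similarities."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def bidirectional_anchor_scores(candidates, harmful_anchors, benign_anchors):
    """Score candidates by closeness to harmful anchors minus closeness to benign anchors.

    Assumption: the score is a difference of mean cosine similarities.
    The paper may define and combine the two directions differently.
    """
    c = _normalize(np.asarray(candidates, dtype=float))
    h = _normalize(np.asarray(harmful_anchors, dtype=float))
    b = _normalize(np.asarray(benign_anchors, dtype=float))
    sim_harmful = (c @ h.T).mean(axis=1)  # higher means closer to harmful anchors
    sim_benign = (c @ b.T).mean(axis=1)   # higher means closer to benign anchors
    return sim_harmful - sim_benign


def select_top_k(candidates, harmful_anchors, benign_anchors, k=100):
    """Return indices of the k highest-scoring candidates, best first."""
    scores = bidirectional_anchor_scores(candidates, harmful_anchors, benign_anchors)
    k = min(k, len(scores))
    return np.argsort(-scores, kind="stable")[:k]
```

Under these assumptions, calling `select_top_k(features, harmful_features, benign_features, k=100)` would return the indices of the 100 candidates this score ranks highest, mirroring the 100-example subsets mentioned in the abstract.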

External Datasets

ALPACA

DOLLY

PURE-BAD

GSM8K