Abstract
Current Large Language Models (LLMs), even those tuned for safety and
alignment, are susceptible to jailbreaking. Recent work has found that further
fine-tuning an aligned model on benign data (i.e., data without harmful
content) can substantially degrade its safety. We delve into
the data-centric aspects of why benign fine-tuning inadvertently contributes to
jailbreaking. First, we represent fine-tuning data through two lenses:
representation and gradient spaces. Additionally, we propose a bi-directional
anchoring method that, during the selection process, prioritizes data points
that are close to harmful examples and far from benign ones. Our approach
effectively identifies subsets of benign data that are more likely to degrade
the model's safety after fine-tuning. Training on just 100 of these seemingly
benign datapoints leads the fine-tuned model to respond affirmatively
to >70% of tested harmful requests, compared to <20% after
fine-tuning on randomly selected data. We also observe that the selected data
frequently appear as lists, bullet points, or math questions, indicating a
systematic pattern in fine-tuning data that contributes to jailbreaking.
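
The bi-directional anchoring idea can be illustrated with a minimal sketch. The abstract does not give the exact scoring rule, so everything below is an assumption made for illustration: each data point is taken to be already embedded as a feature vector (from either representation or gradient space), similarity is measured with cosine similarity, and the score is the mean similarity to harmful anchors minus the mean similarity to benign anchors. The function names (`cosine`, `bidirectional_scores`, `select_top_k`) and the toy vectors are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_similarity(x: Vector, anchors: Sequence[Vector]) -> float:
    """Average cosine similarity between x and a non-empty set of anchors."""
    return sum(cosine(x, a) for a in anchors) / len(anchors)


def bidirectional_scores(
    candidates: Sequence[Vector],
    harmful_anchors: Sequence[Vector],
    benign_anchors: Sequence[Vector],
) -> List[float]:
    """Assumed scoring rule: high when close to harmful anchors and far
    from benign anchors. Not necessarily the rule used in the paper."""
    return [
        mean_similarity(x, harmful_anchors) - mean_similarity(x, benign_anchors)
        for x in candidates
    ]


def select_top_k(
    candidates: Sequence[Vector],
    harmful_anchors: Sequence[Vector],
    benign_anchors: Sequence[Vector],
    k: int,
) -> List[int]:
    """Return indices of the k candidates with the highest scores."""
    scores = bidirectional_scores(candidates, harmful_anchors, benign_anchors)
    ranked = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy 2-D features; real features would come from model activations or gradients.
candidates = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
harmful_anchors = [[1.0, 0.1], [0.9, 0.0]]
benign_anchors = [[0.0, 1.0], [0.1, 0.9]]

scores = bidirectional_scores(candidates, harmful_anchors, benign_anchors)
selected = select_top_k(candidates, harmful_anchors, benign_anchors, k=2)
print(selected)
```

In this toy setup the candidate aligned with the harmful anchors ranks first and the one aligned with the benign anchors ranks last. Subtracting the benign term is what makes the selection bi-directional: a point that is similar to both anchor sets scores near zero rather than being selected.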