Abstract
As large language models (LLMs) become increasingly integrated
into various applications, ensuring that they generate safe responses is a pressing
need. Previous studies on alignment have largely focused on general
instruction-following but have often overlooked the distinct properties of
safety alignment, such as the brittleness of safety mechanisms. To bridge this
gap, we propose the Superficial Safety Alignment Hypothesis (SSAH), which
posits that safety alignment teaches an otherwise unsafe model to choose the
correct reasoning direction, i.e., to fulfill or refuse users' requests, interpreted
as an implicit binary classification task. Through SSAH, we hypothesize that
only a few essential components can establish safety guardrails in LLMs. We
successfully identify four types of attribute-critical components: Safety
Critical Unit (SCU), Utility Critical Unit (UCU), Complex Unit (CU), and
Redundant Unit (RU). Our findings show that freezing certain safety-critical
components during fine-tuning allows the model to retain its safety attributes
while adapting to new tasks. Similarly, we show that leveraging redundant units
in the pre-trained model as an "alignment budget" can effectively minimize the
alignment tax while achieving the alignment goal. Taken together, this paper
concludes that the atomic functional unit for safety in LLMs lies at the neuron
level and underscores that safety alignment need not be complicated.
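Since the abstract's central mechanism is freezing identified safety-critical components during fine-tuning, the sketch below illustrates one way such freezing could be wired up in PyTorch. It is a minimal illustration, not the authors' code: the function name freeze_safety_critical_rows, the toy model, and the particular row indices are assumptions for demonstration only.

```python
# A minimal sketch (not the authors' implementation): freezing hypothetical
# safety-critical neurons during fine-tuning with PyTorch. The mapping
# `safety_critical` from parameter names to row indices is an illustrative
# stand-in for an identified Safety Critical Unit (SCU).
import torch
from torch import nn


def freeze_safety_critical_rows(model: nn.Module,
                                safety_critical: dict[str, list[int]]) -> None:
    """Zero the gradients of selected weight rows so fine-tuning cannot update them."""
    for name, param in model.named_parameters():
        rows = safety_critical.get(name)
        if not rows:
            continue
        idx = torch.tensor(rows, dtype=torch.long)

        def zero_scu_grad(grad, idx=idx):
            grad = grad.clone()      # avoid modifying the gradient in place
            grad[idx] = 0.0          # SCU rows receive no update signal
            return grad

        param.register_hook(zero_scu_grad)


# Toy usage: rows 0, 3, and 7 of the first layer play the role of the SCU.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
freeze_safety_critical_rows(model, {"0.weight": [0, 3, 7]})

optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
x, y = torch.randn(8, 16), torch.randint(0, 4, (8,))
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
optimizer.step()  # the frozen rows remain unchanged by this step
```

Because the hook zeroes the gradient and the optimizer here applies no weight decay, the selected rows are left untouched while the rest of the model adapts to the new task, mirroring the freezing idea described above at a much smaller scale.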