Safety Alignment Should Be Made More Than Just A Few Attention Heads

arxiv

AI Security Portal bot

Information in the literature database is collected automatically.

Source

https://arxiv.org/abs/2508.19697

PDF

https://arxiv.org/pdf/2508.19697

Bibliographic information

Authors: Chao Huang, Zefeng Zhang, Juewei Yue, Quangang Li, Chuang Zhang, Tingwen Liu
Published: 2025-08-27
Affiliation: Institute of Information Engineering, Chinese Academy of Sciences
Country of affiliation: China
Venue: Computing Research Repository (CoRR)

Labels estimated by AI

Prompt injection, Attention mechanism, Large language models

※ These labels were added automatically by AI, so they may be inaccurate.
For details, see "About the literature database".

Abstract

Current safety alignment for large language models (LLMs) continues to present vulnerabilities, given that adversarial prompting can effectively bypass their safety measures. Our investigation shows that these safety mechanisms predominantly depend on a limited subset of attention heads: removing or ablating these heads can severely compromise model safety. To identify and evaluate these safety-critical components, we introduce RDSHA, a targeted ablation method that leverages the model's refusal direction to pinpoint attention heads mostly responsible for safety behaviors. Further analysis shows that existing jailbreak attacks exploit this concentration by selectively bypassing or manipulating these critical attention heads. To address this issue, we propose AHD, a novel training strategy designed to promote the distributed encoding of safety-related behaviors across numerous attention heads. Experimental results demonstrate that AHD successfully distributes safety-related capabilities across more attention heads. Moreover, evaluations under several mainstream jailbreak attacks show that models trained with AHD exhibit considerably stronger safety robustness, while maintaining overall functional utility.
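
The abstract does not spell out how RDSHA scores heads, so the following is only a toy sketch of the general idea of ranking attention heads by their alignment with a refusal direction and then ablating the top ones. It is not the authors' implementation. It assumes the refusal direction is a normalized difference of mean activations between harmful and harmless prompts (a common construction, assumed here), and that each head's contribution to the residual stream is available as a plain vector. All function names and the toy numbers are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Vector = List[float]
HeadId = Tuple[int, int]  # (layer index, head index); hypothetical labelling


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def refusal_direction(harmful: Sequence[Vector], harmless: Sequence[Vector]) -> Vector:
    """Assumed construction: unit vector along mean(harmful) - mean(harmless)."""
    diff = [a - b for a, b in zip(mean_vector(harmful), mean_vector(harmless))]
    norm = dot(diff, diff) ** 0.5
    if norm == 0.0:
        raise ValueError("harmful and harmless means coincide")
    return [x / norm for x in diff]


def rank_heads(head_outputs: Dict[HeadId, Vector], direction: Vector) -> List[HeadId]:
    """Sort heads by the projection of their output onto the direction, largest first."""
    return sorted(head_outputs, key=lambda h: dot(head_outputs[h], direction), reverse=True)


def residual_without(head_outputs: Dict[HeadId, Vector], ablated: Sequence[HeadId]) -> Vector:
    """Sum the head contributions, treating ablated heads as zero."""
    dim = len(next(iter(head_outputs.values())))
    total = [0.0] * dim
    for head, out in head_outputs.items():
        if head not in ablated:
            total = [t + o for t, o in zip(total, out)]
    return total


# Toy data (made up for illustration only).
harmful_acts = [[2.0, 0.0], [4.0, 0.0]]
harmless_acts = [[0.0, 0.0], [0.0, 0.0]]
heads = {(0, 0): [0.1, 5.0], (0, 1): [2.0, 0.0], (1, 0): [-1.0, 1.0]}

direction = refusal_direction(harmful_acts, harmless_acts)
ranking = rank_heads(heads, direction)
before = dot(residual_without(heads, []), direction)
after = dot(residual_without(heads, ranking[:1]), direction)
```

In this toy setup, zeroing the single top-ranked head lowers the projection of the summed output onto the refusal direction, which mirrors the abstract's claim that safety behavior can hinge on a small number of heads.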

External datasets

AdvBench

256 harmful instructions

Alpaca