Robust LLM safeguarding via refusal feature adversarial training

arXiv

AI Security Portal bot

Information in the literature database is collected automatically.

Source

https://arxiv.org/abs/2409.20089

PDF

https://arxiv.org/pdf/2409.20089

Bibliographic information

Authors: Lei Yu, Virginie Do, Karen Hambardzumyan, Nicola Cancedda
Published: 2024-09-30
Updated: 2025-03-21
Affiliations: University of Toronto, Meta FAIR
Country of affiliation: Canada
Conference: International Conference on Learning Representations (ICLR)

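Labels estimated by AI

Prompt injection, Model robustness, Adversarial training

Note: These labels were added automatically by AI and may therefore be inaccurate.
For details, see "About the literature database".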

Abstract

Large language models (LLMs) are vulnerable to adversarial attacks that can elicit harmful responses. Defending against such attacks remains challenging due to the opacity of jailbreaking mechanisms and the high computational cost of training LLMs robustly. We demonstrate that adversarial attacks share a universal mechanism for circumventing LLM safeguards that works by ablating a dimension in the residual stream embedding space called the refusal feature. We further show that the operation of refusal feature ablation (RFA) approximates the worst-case perturbation of offsetting model safety. Based on these findings, we propose Refusal Feature Adversarial Training (ReFAT), a novel algorithm that efficiently performs LLM adversarial training by simulating the effect of input-level attacks via RFA. Experiment results show that ReFAT significantly improves the robustness of three popular LLMs against a wide range of adversarial attacks, with considerably less computational overhead compared to existing adversarial training methods.
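
The ablation step named in the abstract can be illustrated with a small sketch. The abstract does not say how the refusal feature is estimated or exactly how it is ablated, so the code below makes two assumptions: the direction is taken as the normalized difference between mean activations on harmful and harmless prompts, and ablation is a plain orthogonal projection that removes the component along that direction. The function names (`refusal_direction`, `ablate`) and the toy activations are hypothetical and are not taken from the paper.

```python
import math
import random


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector along (mean harmful - mean harmless).

    Assumption: a difference-of-means estimator; the abstract does not
    specify how the refusal feature is obtained.
    """
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    diff = [a - b for a, b in zip(mu_harmful, mu_harmless)]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("harmful and harmless means coincide; no direction")
    return [x / norm for x in diff]


def ablate(hidden, direction):
    """Remove the component of `hidden` along the unit vector `direction`."""
    coeff = sum(h * d for h, d in zip(hidden, direction))
    return [h - coeff * d for h, d in zip(hidden, direction)]


# Toy stand-ins for residual-stream activations (not real model data):
# "harmful" activations are shifted along coordinate 0.
rng = random.Random(0)
dim = 8
harmless = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(32)]
harmful = [
    [rng.gauss(0, 1) + (3.0 if i == 0 else 0.0) for i in range(dim)]
    for _ in range(32)
]

r_hat = refusal_direction(harmful, harmless)
h_ablated = ablate(harmful[0], r_hat)

# After ablation the activation has (numerically) no component along r_hat.
residual = sum(a * b for a, b in zip(h_ablated, r_hat))
```

In a real model this projection would be applied to hidden states inside the forward pass. As the abstract describes it, ReFAT would then train the model to keep refusing harmful requests while the ablation is active, which stands in for running input-level attacks during training.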

External datasets

AdvBench

Alpaca

UltraChat

XSTest

HarmBench