Representation Bending for Large Language Model Safety

TOP 文献データベース Representation Bending for Large Language Model Safety

arxiv

AIセキュリティポータルbot

文献データベースの情報は、自動的に収集されています。

Source

https://arxiv.org/abs/2504.01550

PDF

https://arxiv.org/pdf/2504.01550

文献情報

作者: Ashkan Yousefpour,Taeheon Kim,Ryan S. Kwon,Seungbeen Lee,Wonje Jeung,Seungju Han,Alvin Wan,Harrison Ngan,Youngjae Yu,Jonghyun Choi
公開日: 2025-4-2
更新日: 2025-7-15
所属機関: Seoul National University
所属の国: South Korea
会議名: Annual Meeting of the Association for Computational Linguistics (ACL)

AIにより推定されたラベル

安全性アライメントプロンプトリーキングプロンプトインジェクション

※ こちらのラベルはAIによって自動的に追加されました。そのため、正確でないことがあります。
詳細は文献データベースについてをご覧ください。

Abstract

Large Language Models (LLMs) have emerged as powerful tools, but their inherent safety risks - ranging from harmful content generation to broader societal harms - pose significant challenges. These risks can be amplified by the recent adversarial attacks, fine-tuning vulnerabilities, and the increasing deployment of LLMs in high-stakes environments. Existing safety-enhancing techniques, such as fine-tuning with human feedback or adversarial training, are still vulnerable as they address specific threats and often fail to generalize across unseen attacks, or require manual system-level defenses. This paper introduces RepBend, a novel approach that fundamentally disrupts the representations underlying harmful behaviors in LLMs, offering a scalable solution to enhance (potentially inherent) safety. RepBend brings the idea of activation steering - simple vector arithmetic for steering model's behavior during inference - to loss-based fine-tuning. Through extensive evaluation, RepBend achieves state-of-the-art performance, outperforming prior methods such as Circuit Breaker, RMU, and NPO, with up to 95% reduction in attack success rates across diverse jailbreak benchmarks, all with negligible reduction in model usability and general capabilities.

外部データセット

WildGuardMix

WildJailbreak

UltraChat