ARMOR: Aligning Secure and Safe Large Language Models via Meticulous Reasoning

arxiv

AI Security Portal bot

Information in the literature database is collected automatically.

Source

https://arxiv.org/abs/2507.11500

PDF

https://arxiv.org/pdf/2507.11500

Bibliographic information

Authors: Zhengyue Zhao, Yingzi Ma, Somesh Jha, Marco Pavone, Patrick McDaniel, Chaowei Xiao
Published: 2025-07-14
Updated: 2025-10-20
Affiliation: University of Wisconsin-Madison
Country of affiliation: United States of America
Venue: Computing Research Repository (CoRR)

Labels estimated by AI

Large language models, safety analysis, evaluation criteria

Note: These labels were added automatically by AI and may therefore be inaccurate.
For details, see "About the Literature Database".

Abstract

Large Language Models have shown impressive generative capabilities across diverse tasks, but their safety remains a critical concern. Existing post-training alignment methods, such as SFT and RLHF, reduce harmful outputs yet leave LLMs vulnerable to jailbreak attacks, especially advanced optimization-based ones. Recent system-2 approaches enhance safety by adding inference-time reasoning, where models assess potential risks before producing responses. However, we find these methods fail against powerful out-of-distribution jailbreaks, such as AutoDAN-Turbo and Adversarial Reasoning, which conceal malicious goals behind seemingly benign prompts. We observe that all jailbreaks ultimately aim to embed a core malicious intent, suggesting that extracting this intent is key to defense. To this end, we propose ARMOR, which introduces a structured three-step reasoning pipeline: (1) analyze jailbreak strategies from an external, updatable strategy library, (2) extract the core intent, and (3) apply policy-based safety verification. We further develop ARMOR-Think, which decouples safety reasoning from general reasoning to improve both robustness and utility. Evaluations on advanced optimization-based jailbreaks and safety benchmarks show that ARMOR achieves state-of-the-art safety performance, with an average harmful rate of 0.002 and an attack success rate of 0.06 against advanced optimization-based jailbreaks, far below other reasoning-based models. Moreover, ARMOR demonstrates strong generalization to unseen jailbreak strategies, reducing their success rate to zero. These highlight ARMOR's effectiveness in defending against OOD jailbreak attacks, offering a practical path toward secure and reliable LLMs.
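
The three-step pipeline named in the abstract (strategy analysis, core-intent extraction, policy-based verification) can be illustrated with a toy sketch. This is not the authors' implementation: ARMOR performs these steps through model reasoning, whereas the code below substitutes simple string matching purely to show the control flow. The strategy library entries, the blocked-topic list, and all function names are hypothetical and invented for illustration.

```python
from dataclasses import dataclass

# Hypothetical strategy library: strategy name -> surface cues.
# The abstract describes an external, updatable library; these entries are invented.
STRATEGY_LIBRARY = {
    "role_play": ("pretend you are", "act as"),
    "fictional_framing": ("write a story where", "in a novel"),
}

# Hypothetical safety policy: intents mentioning these topics are refused.
BLOCKED_TOPICS = ("build a bomb", "steal credentials")


@dataclass
class Verdict:
    strategies: list[str]  # step 1 result
    core_intent: str       # step 2 result
    allowed: bool          # step 3 result


def analyze_strategies(prompt: str) -> list[str]:
    """Step 1: list library strategies whose cues appear in the prompt."""
    text = prompt.lower()
    return [name for name, cues in STRATEGY_LIBRARY.items()
            if any(cue in text for cue in cues)]


def extract_core_intent(prompt: str) -> str:
    """Step 2: strip known wrapper cues to expose the underlying request."""
    text = prompt.lower()
    for cues in STRATEGY_LIBRARY.values():
        for cue in cues:
            text = text.replace(cue, " ")
    return " ".join(text.split())  # collapse leftover whitespace


def verify_policy(intent: str) -> bool:
    """Step 3: allow the request only if no blocked topic is present."""
    return not any(topic in intent for topic in BLOCKED_TOPICS)


def armor_style_check(prompt: str) -> Verdict:
    """Run the three steps in order and collect their results."""
    strategies = analyze_strategies(prompt)
    intent = extract_core_intent(prompt)
    return Verdict(strategies, intent, verify_policy(intent))
```

Because the strategy library is plain data in this sketch, adding a newly observed strategy is a dictionary update rather than a code change, which loosely mirrors the "updatable" property the abstract attributes to the library.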

External datasets

Alert

BeaverTail-unsafe

WildJailbreak-vanilla

SaladBench-base

Alert-adversarial

JailbreakPair

WildJailbreak-adversarial

UltraSafety

SaladBench-attackEnhanced

BeaverTail-safe

WildJailbreak-benign