ARMOR: Aligning Secure and Safe Large Language Models via Meticulous Reasoning

TOP Literature Database ARMOR: Aligning Secure and Safe Large Language Models via Meticulous Reasoning

arxiv

AI Security Portal bot

Information in the literature database is collected automatically.

Source

https://arxiv.org/abs/2507.11500

PDF

https://arxiv.org/pdf/2507.11500

Paper Information

Author: Zhengyue Zhao,Yingzi Ma,Somesh Jha,Marco Pavone,Patrick McDaniel,Chaowei Xiao
Published: 7-14-2025
Updated: 10-20-2025
Affiliation: University of Wisconsin-Madison
Country: United States of America
Conference: Computing Research Repository (CoRR)

Labels Estimated by AI

Large Language Model 安全性分析(Fail to translate) 評価基準(Fail to translate)

These labels were automatically added by AI and may be inaccurate.
For details, see About Literature Database.

Abstract

Large Language Models have shown impressive generative capabilities across diverse tasks, but their safety remains a critical concern. Existing post-training alignment methods, such as SFT and RLHF, reduce harmful outputs yet leave LLMs vulnerable to jailbreak attacks, especially advanced optimization-based ones. Recent system-2 approaches enhance safety by adding inference-time reasoning, where models assess potential risks before producing responses. However, we find these methods fail against powerful out-of-distribution jailbreaks, such as AutoDAN-Turbo and Adversarial Reasoning, which conceal malicious goals behind seemingly benign prompts. We observe that all jailbreaks ultimately aim to embed a core malicious intent, suggesting that extracting this intent is key to defense. To this end, we propose ARMOR, which introduces a structured three-step reasoning pipeline: (1) analyze jailbreak strategies from an external, updatable strategy library, (2) extract the core intent, and (3) apply policy-based safety verification. We further develop ARMOR-Think, which decouples safety reasoning from general reasoning to improve both robustness and utility. Evaluations on advanced optimization-based jailbreaks and safety benchmarks show that ARMOR achieves state-of-the-art safety performance, with an average harmful rate of 0.002 and an attack success rate of 0.06 against advanced optimization-based jailbreaks, far below other reasoning-based models. Moreover, ARMOR demonstrates strong generalization to unseen jailbreak strategies, reducing their success rate to zero. These highlight ARMOR's effectiveness in defending against OOD jailbreak attacks, offering a practical path toward secure and reliable LLMs.

External Datasets

Alert

BeaverTail-unsafe

WildJailbreak-vanilla

SaladBench-base

Alert-adversarial

JailbreakPair

WildJailbreak-adversarial

UltraSafety

SaladBench-attackEnhanced

BeaverTail-safe

WildJailbreak-benign