Abstract
Large Language Models have shown impressive generative capabilities across
diverse tasks, but their safety remains a critical concern. Existing
post-training alignment methods, such as SFT and RLHF, reduce harmful outputs
yet leave LLMs vulnerable to jailbreak attacks, especially advanced
optimization-based ones. Recent system-2 approaches enhance safety by adding
inference-time reasoning, where models assess potential risks before producing
responses. However, we find these methods fail against powerful
out-of-distribution (OOD) jailbreaks, such as AutoDAN-Turbo and Adversarial
Reasoning, which conceal malicious goals behind seemingly benign prompts. We
observe that all jailbreaks ultimately aim to embed a core malicious intent,
suggesting that extracting this intent is key to defense. To this end, we
propose ARMOR, which introduces a structured three-step reasoning pipeline: (1)
analyze jailbreak strategies from an external, updatable strategy library, (2)
extract the core intent, and (3) apply policy-based safety verification. We
further develop ARMOR-Think, which decouples safety reasoning from general
reasoning to improve both robustness and utility. Evaluations on advanced
optimization-based jailbreaks and safety benchmarks show that ARMOR achieves
state-of-the-art safety performance, with an average harmful rate of 0.002 and
an attack success rate of 0.06 against advanced optimization-based jailbreaks,
far below other reasoning-based models. Moreover, ARMOR demonstrates strong
generalization to unseen jailbreak strategies, reducing their success rate to
zero. These results highlight ARMOR's effectiveness in defending against OOD jailbreak
attacks, offering a practical path toward secure and reliable LLMs.
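
The three-step pipeline described above can be illustrated with a minimal sketch. This is not the authors' implementation: in ARMOR the steps are carried out by model reasoning, whereas here plain substring matching stands in for each step so the control flow is visible. All names (`STRATEGY_LIBRARY`, `BLOCKED_TOPICS`, `armor_style_check`, and the helper functions) and all example phrases are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical strategy library: strategy name -> wrapper phrases it tends to use.
# It is a plain dict so that new strategies can be added at run time,
# loosely mirroring the "external, updatable" library in the abstract.
STRATEGY_LIBRARY = {
    "role_play": ["pretend you are", "act as"],
    "fictional_framing": ["for a novel,", "in a story,"],
}

# Hypothetical safety policy: intents containing these phrases are refused.
BLOCKED_TOPICS = ["build a bomb", "steal credentials"]


@dataclass
class Verdict:
    strategies: list  # names of the strategies detected in the prompt
    intent: str       # prompt with the detected wrapper phrases removed
    allowed: bool     # result of the policy check on the extracted intent


def analyze_strategies(prompt, library):
    """Step 1: list the library strategies whose phrases occur in the prompt."""
    lowered = prompt.lower()
    return [
        name
        for name, phrases in library.items()
        if any(phrase in lowered for phrase in phrases)
    ]


def extract_intent(prompt, library, strategies):
    """Step 2: strip the wrapper phrases of the detected strategies."""
    intent = prompt.lower()
    for name in strategies:
        for phrase in library[name]:
            intent = intent.replace(phrase, " ")
    return " ".join(intent.split())  # collapse leftover whitespace


def verify_policy(intent, blocked):
    """Step 3: allow the request only if no blocked topic appears in the intent."""
    return not any(topic in intent for topic in blocked)


def armor_style_check(prompt, library=STRATEGY_LIBRARY, blocked=BLOCKED_TOPICS):
    """Run the three steps in order and return a Verdict."""
    strategies = analyze_strategies(prompt, library)
    intent = extract_intent(prompt, library, strategies)
    return Verdict(strategies, intent, verify_policy(intent, blocked))
```

Because the library is ordinary data, handling a newly observed strategy in this sketch only requires adding an entry such as `STRATEGY_LIBRARY["hypothetical"] = ["hypothetically speaking,"]`; no function needs to change.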