Large language models (LLMs) have shown impressive generative capabilities across
diverse tasks, but their safety remains a critical concern. Existing
post-training alignment methods, such as supervised fine-tuning (SFT) and
reinforcement learning from human feedback (RLHF), reduce harmful outputs
yet leave LLMs vulnerable to jailbreak attacks, especially advanced
optimization-based ones. Recent System-2 approaches enhance safety by adding
inference-time reasoning, where models assess potential risks before producing
responses. However, we find that these methods fail against powerful
out-of-distribution (OOD) jailbreaks, such as AutoDAN-Turbo and Adversarial
Reasoning, which conceal malicious goals behind seemingly benign prompts. We
observe that all jailbreaks ultimately aim to embed a core malicious intent,
suggesting that extracting this intent is key to defense. To this end, we
propose ARMOR, which introduces a structured three-step reasoning pipeline: (1)
analyze jailbreak strategies from an external, updatable strategy library, (2)
extract the core intent, and (3) apply policy-based safety verification. We
further develop ARMOR-Think, which decouples safety reasoning from general
reasoning to improve both robustness and utility. Evaluations on advanced
optimization-based jailbreaks and safety benchmarks show that ARMOR achieves
state-of-the-art safety performance, with an average harmful rate of 0.002 and
an attack success rate of 0.06 against advanced optimization-based jailbreaks,
far below those of other reasoning-based models. Moreover, ARMOR demonstrates strong
generalization to unseen jailbreak strategies, reducing their success rate to
zero. These results highlight ARMOR's effectiveness in defending against
out-of-distribution jailbreak attacks, offering a practical path toward secure
and reliable LLMs.
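
The three-step pipeline described above (strategy analysis against an updatable library, core-intent extraction, policy-based verification) can be illustrated with a toy sketch. This is not ARMOR's implementation: ARMOR performs these steps through model reasoning, whereas the sketch below substitutes simple string matching so that the control flow is visible. The names `StrategyLibrary`, `extract_intent`, `three_step_check`, and the `banned_topics` policy list are all hypothetical.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class StrategyLibrary:
    """Hypothetical external store mapping a strategy name to cue phrases."""
    strategies: Dict[str, List[str]] = field(default_factory=dict)

    def update(self, name: str, cues: List[str]) -> None:
        # New strategies can be added without touching the rest of the pipeline.
        self.strategies[name] = [c.lower() for c in cues]

    def match(self, prompt: str) -> List[str]:
        # Step 1 (toy stand-in): flag strategies whose cue phrases appear.
        text = prompt.lower()
        return [name for name, cues in self.strategies.items()
                if any(cue in text for cue in cues)]


def extract_intent(prompt: str, library: StrategyLibrary,
                   matched: List[str]) -> str:
    # Step 2 (toy stand-in): strip the wrapper phrases of matched strategies,
    # leaving the underlying request.
    intent = prompt
    for name in matched:
        for cue in library.strategies[name]:
            intent = re.sub(re.escape(cue), " ", intent, flags=re.IGNORECASE)
    return " ".join(intent.split())


def three_step_check(prompt: str, library: StrategyLibrary,
                     banned_topics: List[str]) -> dict:
    matched = library.match(prompt)
    intent = extract_intent(prompt, library, matched)
    # Step 3 (toy stand-in): verify the extracted intent against a policy list.
    allowed = not any(topic.lower() in intent.lower() for topic in banned_topics)
    return {"strategies": matched, "intent": intent, "allowed": allowed}
```

The design point this sketch tries to capture is that the policy check runs on the extracted intent rather than on the raw prompt, and that the strategy library is a separate object that can be updated independently of the checking logic.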