Large Reasoning Models (LRMs) have significantly advanced beyond traditional
Large Language Models (LLMs) with their exceptional logical reasoning
capabilities, yet these improvements introduce heightened safety risks. When
subjected to jailbreak attacks, their ability to generate more targeted and
organized content can lead to greater harm. Although some studies claim that
reasoning makes LRMs safer against existing attacks designed for LLMs, they
overlook the inherent flaws within the reasoning process itself. To address
this gap, we propose the first jailbreak attack targeting LRMs, exploiting the
unique vulnerabilities that stem from their advanced reasoning capabilities.
Specifically, we introduce the Chaos Machine, a novel component that transforms
attack prompts through diverse one-to-one mappings. The chaos mappings
iteratively generated by the machine are embedded into the reasoning chain,
increasing its variability and complexity and thereby yielding a more robust
attack.
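The abstract does not specify how the mappings are realized; the following is a minimal sketch, assuming character-level substitution ciphers as the one-to-one mappings and a fixed number of iterative rounds. The function names and parameters here are illustrative, not the paper's implementation.

```python
import random
import string

ALPHABET = string.ascii_lowercase

def sample_mapping(rng: random.Random) -> dict[str, str]:
    """Sample a random one-to-one (bijective) mapping over the alphabet."""
    shuffled = list(ALPHABET)
    rng.shuffle(shuffled)
    return dict(zip(ALPHABET, shuffled))

def apply_mapping(text: str, mapping: dict[str, str]) -> str:
    """Apply a mapping character-wise; other characters pass through."""
    return "".join(mapping.get(ch, ch) for ch in text.lower())

def invert(mapping: dict[str, str]) -> dict[str, str]:
    """One-to-one mappings are invertible, so decoding rules always exist."""
    return {v: k for k, v in mapping.items()}

def chaos_machine(prompt: str, rounds: int = 3, seed: int = 0):
    """Iteratively compose sampled mappings; return the transformed prompt
    and the inverse mappings (in decoding order) for the reasoning chain."""
    rng = random.Random(seed)
    mappings = [sample_mapping(rng) for _ in range(rounds)]
    encoded = prompt
    for m in mappings:
        encoded = apply_mapping(encoded, m)
    # The model must undo the mappings in reverse order to recover the prompt.
    decoding_chain = [invert(m) for m in reversed(mappings)]
    return encoded, decoding_chain

if __name__ == "__main__":
    encoded, chain = chaos_machine("example attack prompt", rounds=3)
    print(encoded)  # obfuscated prompt, different for every seed
    # Sanity check: composing the inverse chain recovers the original text.
    decoded = encoded
    for inv in chain:
        decoded = apply_mapping(decoded, inv)
    assert decoded == "example attack prompt"
```

Because each mapping is bijective, the inverse chain can be handed to the model as step-by-step decoding instructions, so the request is reconstructed only inside the model's own reasoning chain.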
Building on this machine, we construct the Mousetrap framework, which projects
attacks into nonlinear-like, low-sample spaces where mismatched generalization
is amplified. Moreover, faced with a growing number of competing objectives,
LRMs gradually maintain the inertia of unpredictable iterative reasoning and
fall into our trap. Mousetrap achieves attack success rates as high as 96%,
86%, and 98% against o1-mini, Claude-Sonnet, and Gemini-Thinking, respectively,
on our toxic dataset Trotter. Remarkably, against Claude-Sonnet, a model well
known for its safety, Mousetrap attains success rates of 87.5%, 86.58%, and
93.13% on the AdvBench, StrongREJECT, and HarmBench benchmarks, respectively.
Warning: this paper contains inappropriate, offensive, and harmful content.