Large Reasoning Models (LRMs) have significantly advanced beyond traditional
Large Language Models (LLMs) with their exceptional logical reasoning
capabilities, yet these improvements introduce heightened safety risks. When
subjected to jailbreak attacks, their ability to generate more targeted and
organized content can lead to greater harm. Although some studies claim that
reasoning makes LRMs safer against existing attacks designed for LLMs, they
overlook the inherent flaws within the reasoning process itself. To address
this gap, we propose the first jailbreak attack targeting LRMs, exploiting the
unique vulnerabilities that stem from their advanced reasoning capabilities.
Specifically, we introduce the Chaos Machine, a novel component that transforms
attack prompts through diverse one-to-one mappings. The chaos mappings
iteratively generated by the machine are embedded into the reasoning chain,
increasing its variability and complexity and thereby yielding a more robust
attack.
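The abstract does not specify how the mappings are realized; the following is a minimal sketch, assuming character-level substitution ciphers as the one-to-one mappings and a fixed number of iterative rounds. The function names and parameters here are illustrative, not the paper's implementation.

```python
import random
import string

ALPHABET = string.ascii_lowercase

def sample_mapping(rng: random.Random) -> dict[str, str]:
    """Sample a random one-to-one (bijective) mapping over the alphabet."""
    shuffled = list(ALPHABET)
    rng.shuffle(shuffled)
    return dict(zip(ALPHABET, shuffled))

def apply_mapping(text: str, mapping: dict[str, str]) -> str:
    """Apply a mapping character-wise; other characters pass through."""
    return "".join(mapping.get(ch, ch) for ch in text.lower())

def invert(mapping: dict[str, str]) -> dict[str, str]:
    """One-to-one mappings are invertible, so decoding rules always exist."""
    return {v: k for k, v in mapping.items()}

def chaos_machine(prompt: str, rounds: int = 3, seed: int = 0):
    """Iteratively compose sampled mappings; return the transformed prompt
    and the inverse mappings (in decoding order) for the reasoning chain."""
    rng = random.Random(seed)
    mappings = [sample_mapping(rng) for _ in range(rounds)]
    encoded = prompt
    for m in mappings:
        encoded = apply_mapping(encoded, m)
    # The model must undo the mappings in reverse order to recover the prompt.
    decoding_chain = [invert(m) for m in reversed(mappings)]
    return encoded, decoding_chain

if __name__ == "__main__":
    encoded, chain = chaos_machine("example attack prompt", rounds=3)
    print(encoded)  # obfuscated prompt, different for every seed
    # Sanity check: composing the inverse chain recovers the original text.
    decoded = encoded
    for inv in chain:
        decoded = apply_mapping(decoded, inv)
    assert decoded == "example attack prompt"
```

Because each mapping is bijective, the inverse chain can be handed to the model as step-by-step decoding instructions, so the request is reconstructed only inside the model's own reasoning chain.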
Building on this machine, we construct the Mousetrap framework, which projects
attacks into nonlinear-like, low-sample spaces where mismatched generalization
is amplified. Moreover, faced with a growing number of competing objectives,
LRMs gradually maintain the inertia of unpredictable iterative reasoning and
fall into our trap. Mousetrap achieves attack success rates as high as 96%,
86%, and 98% against o1-mini, Claude-Sonnet, and Gemini-Thinking, respectively,
on our toxic dataset Trotter. Remarkably, against Claude-Sonnet, a model well
known for its safety, Mousetrap attains success rates of 87.5%, 86.58%, and
93.13% on the AdvBench, StrongREJECT, and HarmBench benchmarks, respectively.
Warning: this paper contains inappropriate, offensive, and harmful content.