Defending large language models (LLMs) against jailbreak attacks is crucial
for ensuring their safe deployment. Existing defense strategies typically rely
on predefined static criteria to differentiate between harmful and benign
prompts. However, such rigid rules fail to accommodate the inherent complexity
and dynamic nature of real-world jailbreak attacks. In this paper, we focus on
the novel challenge of universal defense against diverse jailbreaks. We propose
a new concept, the ``mirror'': a dynamically generated prompt that reflects
the syntactic structure of the input while ensuring semantic safety. The
discrepancies between an input prompt and its corresponding mirror then serve as
the guiding signal for defense. A new defense model, MirrorShield, is further
proposed to detect and calibrate risky inputs based on the crafted mirrors.
Evaluated on multiple benchmark datasets against ten state-of-the-art attack
methods, MirrorShield demonstrates superior defense performance and promising
generalization capabilities.
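
To make the detect-and-calibrate pipeline concrete, here is a minimal Python sketch of the mirror-based defense loop described above. The helper names (generate_mirror, discrepancy, defend), the embedding-distance score, and the 0.5 threshold are illustrative assumptions for exposition, not MirrorShield's actual implementation.

```python
# Minimal sketch of a mirror-based defense loop (illustrative only).
# All names and the scoring/threshold choices below are assumptions.
from typing import Callable

def generate_mirror(prompt: str, llm: Callable[[str], str]) -> str:
    """Ask an LLM to rewrite the prompt with the same syntactic
    structure but guaranteed-safe semantics (the 'mirror')."""
    instruction = (
        "Rewrite the following prompt, keeping its sentence structure "
        "but replacing any harmful content with benign content:\n"
    )
    return llm(instruction + prompt)

def discrepancy(prompt: str, mirror: str,
                embed: Callable[[str], list[float]]) -> float:
    """Score how far the prompt drifts from its safe mirror
    (cosine distance between embeddings; one possible choice)."""
    a, b = embed(prompt), embed(mirror)
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return 1.0 - dot / norm if norm else 0.0

def defend(prompt: str, llm, embed, threshold: float = 0.5) -> str:
    """Answer low-discrepancy prompts directly; route high-discrepancy
    (likely jailbreak) prompts through their safe mirror instead."""
    mirror = generate_mirror(prompt, llm)
    if discrepancy(prompt, mirror, embed) < threshold:
        return llm(prompt)   # judged benign: answer as-is
    return llm(mirror)       # judged risky: calibrate via the mirror
```

The key design point the sketch captures is that the defense criterion is not a static rule: the mirror is regenerated per input, so the comparison adapts to each prompt's structure.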