Defending large language models (LLMs) against jailbreak attacks is crucial
for ensuring their safe deployment. Existing defense strategies typically rely
on predefined static criteria to differentiate between harmful and benign
prompts. However, such rigid rules fail to accommodate the inherent complexity
and dynamic nature of real-world jailbreak attacks. In this paper, we focus on
the novel challenge of universal defense against diverse jailbreaks. We propose
a new concept, the ``mirror'': a dynamically generated prompt that reflects
the syntactic structure of the input while ensuring semantic safety. The
discrepancies between an input prompt and its corresponding mirror then serve as
the guiding signal for defense. A new defense model, MirrorShield, is further
proposed to detect and calibrate risky inputs based on the crafted mirrors.
Evaluated on multiple benchmark datasets against ten state-of-the-art attack
methods, MirrorShield demonstrates superior defense performance and promising
generalization capabilities.
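
To make the detect-and-calibrate pipeline concrete, here is a minimal Python sketch of the mirror-based defense loop described above. The helper names (generate_mirror, discrepancy, defend), the embedding-distance score, and the 0.5 threshold are illustrative assumptions for exposition, not MirrorShield's actual implementation.

```python
# Minimal sketch of a mirror-based defense loop (illustrative only).
# All names and the scoring/threshold choices below are assumptions.
from typing import Callable

def generate_mirror(prompt: str, llm: Callable[[str], str]) -> str:
    """Ask an LLM to rewrite the prompt with the same syntactic
    structure but guaranteed-safe semantics (the 'mirror')."""
    instruction = (
        "Rewrite the following prompt, keeping its sentence structure "
        "but replacing any harmful content with benign content:\n"
    )
    return llm(instruction + prompt)

def discrepancy(prompt: str, mirror: str,
                embed: Callable[[str], list[float]]) -> float:
    """Score how far the prompt drifts from its safe mirror
    (cosine distance between embeddings; one possible choice)."""
    a, b = embed(prompt), embed(mirror)
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return 1.0 - dot / norm if norm else 0.0

def defend(prompt: str, llm, embed, threshold: float = 0.5) -> str:
    """Answer low-discrepancy prompts directly; route high-discrepancy
    (likely jailbreak) prompts through their safe mirror instead."""
    mirror = generate_mirror(prompt, llm)
    if discrepancy(prompt, mirror, embed) < threshold:
        return llm(prompt)   # judged benign: answer as-is
    return llm(mirror)       # judged risky: calibrate via the mirror
```

The key design point the sketch captures is that the defense criterion is not a static rule: the mirror is regenerated per input, so the comparison adapts to each prompt's structure.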