Despite the outstanding performance of Large Language Models (LLMs) in
diverse tasks, they are vulnerable to jailbreak attacks, wherein adversarial
prompts are crafted to bypass their security mechanisms and elicit unexpected
responses. Although jailbreak attacks are prevalent, the understanding of their
underlying mechanisms remains limited. Recent studies have explained the
typical jailbreaking behavior of LLMs (e.g., the degree to which the model
refuses to respond) by analyzing representation shifts in the latent space
caused by jailbreak prompts or by identifying key neurons that contribute to
the success of jailbreak attacks. However, these studies neither explore
diverse jailbreak patterns nor provide a fine-grained explanation that traces
circuit-level failures to representation-level changes, leaving significant
gaps in uncovering the jailbreak mechanism. In this paper, we propose
JailbreakLens, an interpretation
framework that analyzes jailbreak mechanisms from both the representation
perspective~(which reveals how jailbreak prompts alter the model's perception
of harmfulness) and the circuit perspective~(which uncovers the causes of
these deceptions by identifying the key circuits that contribute to the
vulnerability), and tracks their evolution throughout the entire response
generation process. We then conduct an in-depth
evaluation of jailbreak behavior on five mainstream LLMs under seven jailbreak
strategies. Our evaluation reveals that jailbreak prompts amplify components
that reinforce affirmative responses while suppressing those that produce
refusals. This manipulation shifts the model's representations toward safe clusters to
deceive the LLM, leading it to provide detailed responses instead of refusals.
Notably, we find a strong and consistent correlation between representation
deception and the activation shifts of key circuits across diverse jailbreak
methods and multiple LLMs.