These labels were automatically added by AI and may be inaccurate. For details, see About Literature Database.
Abstract
In deployment and application, large language models (LLMs) typically undergo
safety alignment to prevent illegal and unethical outputs. However, the
continuous advancement of jailbreak attack techniques, designed to bypass
safety mechanisms with adversarial prompts, has placed increasing pressure on
the security defenses of LLMs. Strengthening resistance to jailbreak attacks
requires an in-depth understanding of the security mechanisms and
vulnerabilities of LLMs. However, the vast number of parameters and complex
structure of LLMs make analyzing security weaknesses from an internal
perspective a challenging task. This paper presents NeuroBreak, a top-down
jailbreak analysis system designed to analyze neuron-level safety mechanisms
and mitigate vulnerabilities. We carefully design system requirements through
collaboration with three experts in the field of AI security. The system
provides a comprehensive analysis of various jailbreak attack methods. By
incorporating layer-wise representation probing analysis, NeuroBreak offers a
novel perspective on the model's decision-making process throughout its
generation steps. Furthermore, the system supports the analysis of critical
neurons from both semantic and functional perspectives, facilitating a deeper
exploration of security mechanisms. We conduct quantitative evaluations and
case studies to verify the effectiveness of our system, offering mechanistic
insights for developing next-generation defense strategies against evolving
jailbreak attacks.