XBreaking: Explainable Artificial Intelligence for Jailbreaking LLMs

Source

https://arxiv.org/abs/2504.21700

PDF

https://arxiv.org/pdf/2504.21700

Paper Information

Authors: Marco Arazzi, Vignesh Kumar Kembu, Antonino Nocera, Vinod P
Published: 4-30-2025
Affiliation: Department of Electrical, Computer and Biomedical Engineering, University of Pavia
Country: Italy
Conference: Computing Research Repository (CoRR)

Labels Estimated by AI

Prompt Injection, Explanation Method, Disabling Safety Mechanisms of LLM

These labels were automatically added by AI and may be inaccurate.

Abstract

Large Language Models are fundamental actors in the modern IT landscape dominated by AI solutions. However, security threats associated with them might prevent their reliable adoption in critical application scenarios such as government organizations and medical institutions. For this reason, commercial LLMs typically undergo a sophisticated censoring mechanism to eliminate any harmful output they could possibly produce. In response to this, LLM Jailbreaking is a significant threat to such protections, and many previous approaches have already demonstrated its effectiveness across diverse domains. Existing jailbreak proposals mostly adopt a generate-and-test strategy to craft malicious input. To improve the comprehension of censoring mechanisms and design a targeted jailbreak attack, we propose an Explainable-AI solution that comparatively analyzes the behavior of censored and uncensored models to derive unique exploitable alignment patterns. Then, we propose XBreaking, a novel jailbreak attack that exploits these unique patterns to break the security constraints of LLMs by targeted noise injection. Our thorough experimental campaign returns important insights about the censoring mechanisms and demonstrates the effectiveness and performance of our attack.

External Datasets

JBB-Behaviors