These labels were automatically added by AI and may be inaccurate. For details, see About Literature Database.
Abstract
Large Language Models are fundamental actors in the modern IT landscape
dominated by AI solutions. However, security threats associated with them might
prevent their reliable adoption in critical application scenarios such as
government organizations and medical institutions. For this reason, commercial
LLMs typically undergo a sophisticated censoring mechanism to eliminate any
harmful output they could possibly produce. In response to this, LLM
Jailbreaking is a significant threat to such protections, and many previous
approaches have already demonstrated its effectiveness across diverse domains.
Existing jailbreak proposals mostly adopt a generate-and-test strategy to craft
malicious input. To improve the comprehension of censoring mechanisms and
design a targeted jailbreak attack, we propose an Explainable-AI solution that
comparatively analyzes the behavior of censored and uncensored models to derive
unique exploitable alignment patterns. Then, we propose XBreaking, a novel
jailbreak attack that exploits these unique patterns to break the security
constraints of LLMs by targeted noise injection. Our thorough experimental
campaign returns important insights about the censoring mechanisms and
demonstrates the effectiveness and performance of our attack.