Breaking the Ceiling: Exploring the Potential of Jailbreak Attacks through Expanding Strategy Space

Source

https://arxiv.org/abs/2505.21277

PDF

https://arxiv.org/pdf/2505.21277

Paper Information

Authors: Yao Huang, Yitong Sun, Shouwei Ruan, Yichi Zhang, Yinpeng Dong, Xingxing Wei
Published: 5-27-2025
Updated: 5-28-2025
Affiliation: Institute of Artificial Intelligence, Beihang University
Country: China
Conference: Annual Meeting of the Association for Computational Linguistics (ACL)

Labels Estimated by AI

Disabling Safety Mechanisms of LLM

Prompt Injection

Attack Evaluation

These labels were automatically added by AI and may be inaccurate.

Abstract

Large Language Models (LLMs), despite advanced general capabilities, still suffer from numerous safety risks, especially jailbreak attacks that bypass safety protocols. Understanding these vulnerabilities through black-box jailbreak attacks, which better reflect real-world scenarios, offers critical insights into model robustness. While existing methods have shown improvements through various prompt engineering techniques, their success remains limited against safety-aligned models, and they overlook a more fundamental problem: their effectiveness is inherently bounded by predefined strategy spaces. Expanding this space, however, presents significant challenges in both systematically capturing essential attack patterns and efficiently navigating the increased complexity. To better explore the potential of expanding the strategy space, we address these challenges through a novel framework that decomposes jailbreak strategies into essential components based on the Elaboration Likelihood Model (ELM) theory and develops genetic-based optimization with intention evaluation mechanisms. Strikingly, our experiments reveal unprecedented jailbreak capabilities from expanding the strategy space: we achieve a success rate of over 90% on Claude-3.5, where prior methods completely fail, while demonstrating strong cross-model transferability and surpassing specialized safeguard models in evaluation accuracy. The code is open-sourced at: https://github.com/Aries-iai/CL-GSO.

External Datasets

AdvBench Subset

AdvBench Original Set

Competition for LLM and Agent Safety (CLAS) 2024 Dataset