Abstract
Large Language Models (LLMs), despite advanced general capabilities, still
suffer from numerous safety risks, especially jailbreak attacks that bypass
safety protocols. Understanding these vulnerabilities through black-box
jailbreak attacks, which better reflect real-world scenarios, offers critical
insights into model robustness. While existing methods have shown improvements
through various prompt engineering techniques, their success remains limited
against safety-aligned models, and they overlook a more fundamental problem:
their effectiveness is inherently bounded by predefined strategy spaces. However,
expanding this space presents significant challenges in both systematically
capturing essential attack patterns and efficiently navigating the increased
complexity. To better explore the potential of expanding the strategy space, we
address these challenges through a novel framework that decomposes jailbreak
strategies into essential components based on the Elaboration Likelihood Model
(ELM) theory and develops genetic-based optimization with intention evaluation
mechanisms. Strikingly, our experiments reveal unprecedented jailbreak
capabilities by expanding the strategy space: we achieve a success rate of over
90% on Claude-3.5, where prior methods completely fail, while demonstrating strong
cross-model transferability and surpassing specialized safeguard models in
evaluation accuracy. The code is open-sourced at:
https://github.com/Aries-iai/CL-GSO.
External Datasets
AdvBench Subset
AdvBench Original Set
Competition for LLM and Agent Safety (CLAS) 2024 Dataset