Abstract
Multimodal large language models (MLLMs) have demonstrated significant
utility across diverse real-world applications. However, they remain vulnerable to
jailbreaks, in which adversarial inputs break down their safety constraints and
trigger unethical responses. In this work, we investigate jailbreaks in the
text-vision multimodal setting and pioneer the observation that visual
alignment imposes uneven safety constraints across modalities in MLLMs, thereby
giving rise to multimodal safety asymmetry. We then develop PolyJailbreak, a
black-box jailbreak method grounded in reinforcement learning. We first
probe the model's attention dynamics and latent representation space, assessing
how visual inputs reshape cross-modal information flow and diminish the model's
ability to separate harmful from benign inputs, thereby exposing exploitable
vulnerabilities. On this basis, we systematize our findings into generalizable and
reusable operational rules that constitute a structured library of Atomic
Strategy Primitives, which translate harmful intents into jailbreak inputs
through step-wise transformations. Guided by the primitives, PolyJailbreak
employs a multi-agent optimization process that automatically adapts inputs
against the target models. We conduct comprehensive evaluations on a variety of
open-source and closed-source MLLMs, demonstrating that PolyJailbreak
outperforms state-of-the-art baselines.