These labels were automatically added by AI and may be inaccurate. For details, see About Literature Database.
Abstract
With the emergence of strong vision language capabilities, multimodal large
language models (MLLMs) have demonstrated tremendous potential for real-world
applications. However, the security vulnerabilities exhibited by the visual
modality pose significant challenges to deploying such models in open-world
environments. Recent studies have successfully induced harmful responses from
target MLLMs by encoding harmful textual semantics directly into visual inputs.
However, in these approaches, the visual modality primarily serves as a trigger
for unsafe behavior, often exhibiting semantic ambiguity and lacking grounding
in realistic scenarios. In this work, we define a novel setting: vision-centric
jailbreak, where visual information serves as a necessary component in
constructing a complete and realistic jailbreak context. Building on this
setting, we propose the VisCo (Visual Contextual) Attack. VisCo fabricates
contextual dialogue using four distinct vision-focused strategies, dynamically
generating auxiliary images when necessary to construct a vision-centric
jailbreak scenario. To maximize attack effectiveness, it incorporates automatic
toxicity obfuscation and semantic refinement to produce a final attack prompt
that reliably triggers harmful responses from the target black-box MLLMs.
Specifically, VisCo achieves a toxicity score of 4.78 and an Attack Success
Rate (ASR) of 85% on MM-SafetyBench against GPT-4o, significantly outperforming
the baseline, which achieves a toxicity score of 2.48 and an ASR of 22.2%.
Code: https://github.com/Dtc7w3PQ/Visco-Attack.