VERA-V: Variational Inference Framework for Jailbreaking Vision-Language Models

Authors: Qilin Liao, Anamika Lochab, Ruqi Zhang
Published: 2025-10-20

Source: https://arxiv.org/abs/2510.17759

Labels Predicted by AI

Untargeted Toxicity Attack Model DoS Large Language Model

Please note that these labels were automatically added by AI. Therefore, they may not be entirely accurate.
For more details, please see the About the Literature Database page.

Abstract

Vision-Language Models (VLMs) extend large language models with visual reasoning, but their multimodal design also introduces new, underexplored vulnerabilities. Existing multimodal red-teaming methods largely rely on brittle templates, focus on single-attack settings, and expose only a narrow subset of vulnerabilities. To address these limitations, we introduce VERA-V, a variational inference framework that recasts multimodal jailbreak discovery as learning a joint posterior distribution over paired text-image prompts. This probabilistic view enables the generation of stealthy, coupled adversarial inputs that bypass model guardrails. We train a lightweight attacker to approximate the posterior, allowing efficient sampling of diverse jailbreaks and providing distributional insights into vulnerabilities. VERA-V further integrates three complementary strategies: (i) typography-based text prompts that embed harmful cues, (ii) diffusion-based image synthesis that introduces adversarial signals, and (iii) structured distractors to fragment VLM attention. Experiments on HarmBench and HADES benchmarks show that VERA-V consistently outperforms state-of-the-art baselines on both open-source and frontier VLMs, achieving up to 53.75 best baseline on GPT-4o.