JPS: Jailbreak Multimodal Large Language Models with Collaborative Visual Perturbation and Textual Steering

TOP 文献データベース JPS: Jailbreak Multimodal Large Language Models with Collaborative Visual Perturbation and Textual Steering

ACM Multimedia

AIセキュリティポータルbot

文献データベースの情報は、自動的に収集されています。

Source

https://arxiv.org/abs/2508.05087

PDF

https://arxiv.org/pdf/2508.05087

文献情報

作者: Renmiao Chen,Shiyao Cui,Xuancheng Huang,Chengwei Pan,Victor Shea-Jay Huang,QingLin Zhang,Xuan Ouyang,Zhexin Zhang,Hongning Wang,Minlie Huang
公開日: 2025-8-9
所属機関: CoAI, DCST, Tsinghua Univ.
所属の国: China
会議名: ACM Multimedia

AIにより推定されたラベル

攻撃戦略分析プロンプトインジェクション不適切コンテンツ生成

Abstract

Jailbreak attacks against multimodal large language Models (MLLMs) are a significant research focus. Current research predominantly focuses on maximizing attack success rate (ASR), often overlooking whether the generated responses actually fulfill the attacker's malicious intent. This oversight frequently leads to low-quality outputs that bypass safety filters but lack substantial harmful content. To address this gap, we propose JPS, \underline{J}ailbreak MLLMs with collaborative visual \underline{P}erturbation and textual \underline{S}teering, which achieves jailbreaks via corporation of visual image and textually steering prompt. Specifically, JPS utilizes target-guided adversarial image perturbations for effective safety bypass, complemented by "steering prompt" optimized via a multi-agent system to specifically guide LLM responses fulfilling the attackers' intent. These visual and textual components undergo iterative co-optimization for enhanced performance. To evaluate the quality of attack outcomes, we propose the Malicious Intent Fulfillment Rate (MIFR) metric, assessed using a Reasoning-LLM-based evaluator. Our experiments show JPS sets a new state-of-the-art in both ASR and MIFR across various MLLMs and benchmarks, with analyses confirming its efficacy. Codes are available at \href{https://github.com/thu-coai/JPS}{https://github.com/thu-coai/JPS}. \color{warningcolor}{Warning: This paper contains potentially sensitive contents.}