Recently, Multimodal Large Language Models (MLLMs) have demonstrated superior
ability in understanding multimodal content. However, they remain
vulnerable to jailbreak attacks, which exploit weaknesses in their safety
alignment to generate harmful responses. Previous studies categorize jailbreaks
as successful or failed based on whether responses contain malicious content.
However, given the stochastic nature of MLLM responses, such a binary
classification of an input's ability to jailbreak MLLMs is inadequate.
Motivated by this observation, we introduce jailbreak probability to quantify
the jailbreak potential of an input: the likelihood that an MLLM generates a
malicious response when prompted with that input. We approximate
this probability through multiple queries to MLLMs. After modeling the
relationship between input hidden states and the corresponding jailbreak
probability with a Jailbreak Probability Prediction Network (JPPN), we use the
continuous jailbreak probability for optimization. Specifically, we propose the
Jailbreak-Probability-based Attack (JPA), which optimizes adversarial
perturbations on the input image to maximize jailbreak probability, and further
enhance it as Multimodal JPA (MJPA) by incorporating monotonic text rephrasing. To
counteract attacks, we also propose Jailbreak-Probability-based Finetuning
(JPF), which minimizes jailbreak probability through MLLM parameter updates.
Extensive experiments show that (1) (M)JPA yields significant improvements when
attacking a wide range of models under both white-box and black-box settings,
and (2) JPF reduces the number of jailbreaks by as much as over 60\%. Both
results demonstrate the value of introducing jailbreak probability to draw
nuanced distinctions among inputs' jailbreak abilities.
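The central quantity above, jailbreak probability approximated by repeated queries, amounts to a Monte Carlo estimate: sample several responses to the same input and take the fraction judged malicious. The following is a minimal sketch of that idea, not the paper's implementation; `toy_query`, which stands in for an MLLM plus a harmfulness judge, and all parameter names are hypothetical.

```python
import random

def estimate_jailbreak_probability(query_fn, prompt, n_queries=20):
    """Monte Carlo estimate: fraction of sampled responses judged harmful.

    query_fn(prompt) should return True if the sampled response is judged
    malicious, False otherwise. More queries give a tighter estimate.
    """
    harmful = sum(query_fn(prompt) for _ in range(n_queries))
    return harmful / n_queries

# Toy stand-in for "query MLLM, then judge the response": here the response
# is harmful with probability 0.3, independent of the prompt.
random.seed(0)
def toy_query(prompt):
    return random.random() < 0.3

p = estimate_jailbreak_probability(toy_query, "some adversarial prompt",
                                   n_queries=1000)
```

With enough queries, `p` concentrates near the true jailbreak probability (here 0.3); an attack such as JPA would then perturb the input to push this continuous score upward, while JPF would update model parameters to push it downward.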