Recently, Multimodal Large Language Models (MLLMs) have demonstrated superior
ability in understanding multimodal content. However, they remain
vulnerable to jailbreak attacks, which exploit weaknesses in their safety
alignment to generate harmful responses. Previous studies categorize jailbreaks
as successful or failed based on whether responses contain malicious content.
However, given the stochastic nature of MLLM responses, such a binary
classification of an input's ability to jailbreak MLLMs is inadequate.
Motivated by this observation, we introduce jailbreak probability to quantify
the jailbreak potential of an input: the likelihood that an MLLM generates a
malicious response when prompted with that input. We approximate
this probability through multiple queries to MLLMs. After modeling the
relationship between input hidden states and the corresponding jailbreak
probability with a Jailbreak Probability Prediction Network (JPPN), we use the
continuous jailbreak probability for optimization. Specifically, we propose the
Jailbreak-Probability-based Attack (JPA), which optimizes adversarial
perturbations on the input image to maximize jailbreak probability, and further
enhance it as Multimodal JPA (MJPA) by incorporating monotonic text rephrasing. To
counteract attacks, we also propose Jailbreak-Probability-based Finetuning
(JPF), which minimizes jailbreak probability through MLLM parameter updates.
Extensive experiments show that (1) (M)JPA yields significant improvements when
attacking a wide range of models under both white-box and black-box settings,
and (2) JPF reduces the number of jailbreaks by as much as over 60\%. Both
results demonstrate the value of introducing jailbreak probability to draw
nuanced distinctions among inputs' jailbreak abilities.
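The central quantity above, jailbreak probability approximated by repeated queries, amounts to a Monte Carlo estimate: sample several responses to the same input and take the fraction judged malicious. The following is a minimal sketch of that idea, not the paper's implementation; `toy_query`, which stands in for an MLLM plus a harmfulness judge, and all parameter names are hypothetical.

```python
import random

def estimate_jailbreak_probability(query_fn, prompt, n_queries=20):
    """Monte Carlo estimate: fraction of sampled responses judged harmful.

    query_fn(prompt) should return True if the sampled response is judged
    malicious, False otherwise. More queries give a tighter estimate.
    """
    harmful = sum(query_fn(prompt) for _ in range(n_queries))
    return harmful / n_queries

# Toy stand-in for "query MLLM, then judge the response": here the response
# is harmful with probability 0.3, independent of the prompt.
random.seed(0)
def toy_query(prompt):
    return random.random() < 0.3

p = estimate_jailbreak_probability(toy_query, "some adversarial prompt",
                                   n_queries=1000)
```

With enough queries, `p` concentrates near the true jailbreak probability (here 0.3); an attack such as JPA would then perturb the input to push this continuous score upward, while JPF would update model parameters to push it downward.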