These labels were automatically added by AI and may be inaccurate. For details, see About Literature Database.
Abstract
By incorporating visual inputs, Multimodal Large Language Models (MLLMs)
extend LLMs to support visual reasoning. However, this integration also
introduces new vulnerabilities, making MLLMs susceptible to multimodal
jailbreak attacks and hindering their safe deployment.Existing defense methods,
including Image-to-Text Translation, Safe Prompting, and Multimodal Safety
Tuning, attempt to address this by aligning multimodal inputs with LLMs'
built-in safeguards.Yet, they fall short in uncovering root causes of
multimodal vulnerabilities, particularly how harmful multimodal tokens trigger
jailbreak in MLLMs? Consequently, they remain vulnerable to text-driven
multimodal jailbreaks, often exhibiting overdefensive behaviors and imposing
heavy training overhead.To bridge this gap, we present an comprehensive
analysis of where, how and which harmful multimodal tokens bypass safeguards in
MLLMs. Surprisingly, we find that less than 1% tokens in early-middle layers
are responsible for inducing unsafe behaviors, highlighting the potential of
precisely removing a small subset of harmful tokens, without requiring safety
tuning, can still effectively improve safety against jailbreaks. Motivated by
this, we propose Safe Prune-then-Restore (SafePTR), an training-free defense
framework that selectively prunes harmful tokens at vulnerable layers while
restoring benign features at subsequent layers.Without incurring additional
computational overhead, SafePTR significantly enhances the safety of MLLMs
while preserving efficiency. Extensive evaluations across three MLLMs and five
benchmarks demonstrate SafePTR's state-of-the-art performance in mitigating
jailbreak risks without compromising utility.