Abstract
With the widespread deployment of Multimodal Large Language Models (MLLMs)
for visual-reasoning tasks, improving their safety has become crucial. Recent
research indicates that despite training-time safety alignment, these models
remain vulnerable to jailbreak attacks. In this work, we first highlight an
important safety gap: alignment achieved solely through safety training may
be insufficient against jailbreak attacks. To address this
vulnerability, we propose Immune, an inference-time defense framework that
leverages a safe reward model through controlled decoding to defend against
jailbreak attacks. Additionally, we provide a mathematical characterization of
Immune, offering insight into why it improves safety against jailbreaks.
Extensive evaluations on diverse jailbreak benchmarks using recent MLLMs reveal
that Immune effectively enhances model safety while preserving the model's
original capabilities. For instance, against text-based jailbreak attacks on
LLaVA-1.6, Immune reduces the attack success rate by 57.82% and 16.78% compared
to the base MLLM and the state-of-the-art defense strategy, respectively.
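
To make the controlled-decoding idea concrete, below is a minimal sketch of one plausible reward-guided decoding step, based only on the abstract's high-level description: at each step, the base MLLM's top-k candidate tokens are re-ranked by their log-probability plus a weighted safety reward. All names here (base_logits, safety_reward, alpha, top_k) are hypothetical stand-ins, not the paper's actual implementation.

```python
import torch

VOCAB_SIZE = 100  # toy vocabulary size for the stand-in models

def base_logits(prefix_ids):
    """Stand-in for the base MLLM's next-token logits (hypothetical)."""
    torch.manual_seed(len(prefix_ids))
    return torch.randn(VOCAB_SIZE)

def safety_reward(prefix_ids, candidate_id):
    """Stand-in for the safe reward model's score of a candidate
    continuation (hypothetical; higher = safer)."""
    torch.manual_seed(candidate_id)
    return torch.randn(()).item()

def controlled_decode_step(prefix_ids, top_k=8, alpha=1.0):
    """One reward-guided decoding step: re-rank the base model's top-k
    candidates by log-probability plus a weighted safety reward."""
    log_probs = torch.log_softmax(base_logits(prefix_ids), dim=-1)
    topk = torch.topk(log_probs, top_k)
    best_id, best_score = None, float("-inf")
    for log_p, tok in zip(topk.values.tolist(), topk.indices.tolist()):
        score = log_p + alpha * safety_reward(prefix_ids, tok)
        if score > best_score:
            best_id, best_score = tok, score
    return best_id

# Greedy generation loop using the guided step.
ids = [0]
for _ in range(10):
    ids.append(controlled_decode_step(ids))
print(ids)
```

In this sketch, the weight alpha trades off fidelity to the base model's distribution against safety: alpha = 0 recovers ordinary greedy decoding, while larger values bias generation toward tokens the reward model scores as safer.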