Abstract
This paper studies jailbreaking attacks against multi-modal large language
models (MLLMs), which seek to elicit objectionable responses from MLLMs to
harmful user queries. We propose a maximum likelihood-based algorithm to find
an \emph{image Jailbreaking Prompt} (imgJP) that enables jailbreaks against
MLLMs across multiple unseen prompts and images (i.e., a data-universal
property). Our
approach exhibits strong model-transferability, as the generated imgJP can be
transferred to jailbreak various models, including MiniGPT-v2, LLaVA,
InstructBLIP, and mPLUG-Owl2, in a black-box manner. Moreover, we reveal a
connection between MLLM-jailbreaks and LLM-jailbreaks. Building on this
connection, we introduce a construction-based method that harnesses our
approach for LLM-jailbreaks, demonstrating greater efficiency than current
state-of-the-art
methods. The code is available here. \textbf{Warning: some content generated by
language models may be offensive to some readers.}
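As a rough formalization (our notation, not the paper's): given a target MLLM with likelihood $p_{\theta}$, harmful queries $q_1, \dots, q_N$, and corresponding target affirmative responses $a_1, \dots, a_N$, the maximum likelihood-based search for the imgJP $x_{jp}$ can be read as
\[
x_{jp}^{*} \;=\; \arg\max_{x_{jp}} \sum_{i=1}^{N} \log p_{\theta}\!\left(a_i \mid q_i,\, x_{jp}\right),
\]
where summing over many query-response pairs is what yields the data-universal property described above.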