Abusing Images and Sounds for Indirect Instruction Injection in Multi-Modal LLMs

Authors: Eugene Bagdasaryan, Tsung-Yin Hsieh, Ben Nassi, Vitaly Shmatikov | Published: 2023-07-19 | Updated: 2023-10-03

2023.07.192025.05.28

Authors: Eugene Bagdasaryan, Tsung-Yin Hsieh, Ben Nassi, Vitaly Shmatikov
Published: 2023-07-19 | Updated: 2023-10-03

Source: https://arxiv.org/abs/2307.10490

PDF: https://arxiv.org/pdf/2307.10490

Labels Predicted by AI

Malicious Prompt Indirect Prompt Injection Adversarial Example

Please note that these labels were automatically added by AI. Therefore, they may not be entirely accurate.
For more details, please see the About the Literature Database page.

Abstract

We demonstrate how images and sounds can be used for indirect prompt and instruction injection in multi-modal LLMs. An attacker generates an adversarial perturbation corresponding to the prompt and blends it into an image or audio recording. When the user asks the (unmodified, benign) model about the perturbed image or audio, the perturbation steers the model to output the attacker-chosen text and/or make the subsequent dialog follow the attacker’s instruction. We illustrate this attack with several proof-of-concept examples targeting LLaVa and PandaGPT.