Image-based Multimodal Models as Intruders: Transferable Multimodal Attacks on Video-based MLLMs


arXiv

AI Security Portal bot

Information in the literature database is collected automatically.

Source

https://arxiv.org/abs/2501.01042

PDF

https://arxiv.org/pdf/2501.01042

Paper Information

Authors: Linhao Huang, Xue Jiang, Zhiqiang Wang, Wentao Mo, Xi Xiao, Bo Han, Yongjie Yin, Feng Zheng
Published: 2025-01-02
Updated: 2025-01-10
Affiliation: Tsinghua University
Country: China
Conference: Computing Research Repository (CoRR)

Labels Estimated by AI

Attack Method, Adversarial Example, Attack Evaluation

These labels were automatically added by AI and may be inaccurate.
For details, see About Literature Database.

Abstract

Video-based multimodal large language models (V-MLLMs) have shown vulnerability to adversarial examples in video-text multimodal tasks. However, the transferability of adversarial videos to unseen models, a common and practical real-world scenario, remains unexplored. In this paper, we pioneer an investigation into the transferability of adversarial video samples across V-MLLMs. We find that existing adversarial attack methods face significant limitations when applied in black-box settings for V-MLLMs, which we attribute to the following shortcomings: (1) lacking generalization in perturbing video features, (2) focusing only on sparse key-frames, and (3) failing to integrate multimodal information. To address these limitations and deepen the understanding of V-MLLM vulnerabilities in black-box scenarios, we introduce the Image-to-Video MLLM (I2V-MLLM) attack. In I2V-MLLM, we utilize an image-based multimodal model (IMM) as a surrogate model to craft adversarial video samples. Multimodal interactions and temporal information are integrated to disrupt video representations within the latent space, improving adversarial transferability. In addition, a perturbation propagation technique is introduced to handle different unknown frame sampling strategies. Experimental results demonstrate that our method can generate adversarial examples that exhibit strong transferability across different V-MLLMs on multiple video-text multimodal tasks. Compared to white-box attacks on these models, our black-box attacks (using BLIP-2 as the surrogate model) achieve competitive performance, with average attack success rates of 55.48% on MSVD-QA and 58.26% on MSRVTT-QA for VideoQA tasks. Our code will be released upon acceptance.
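The abstract mentions a perturbation propagation technique for coping with unknown frame sampling strategies but does not describe it. The Python sketch below illustrates one simple way such propagation could work and is not the authors' implementation. It assumes that perturbations are crafted only on a few key frames and that every other frame copies the perturbation of its nearest key frame, so whichever frames a target model samples are perturbed. The function names, the nearest-key-frame rule, and the L-infinity budget `epsilon` are all hypothetical choices made for illustration.

```python
from typing import List, Sequence

Frame = List[float]  # a flattened frame with pixel values in [0, 1]


def propagate_perturbations(
    num_frames: int,
    key_indices: Sequence[int],
    key_deltas: Sequence[Frame],
) -> List[Frame]:
    """Assign every frame the perturbation of its nearest key frame.

    Assumption for illustration: nearest-neighbour copying in time.
    Ties go to the earlier key frame.
    """
    if not key_indices or len(key_indices) != len(key_deltas):
        raise ValueError("need exactly one perturbation per key frame")
    deltas: List[Frame] = []
    for t in range(num_frames):
        nearest = min(
            range(len(key_indices)),
            key=lambda k: abs(key_indices[k] - t),
        )
        deltas.append(list(key_deltas[nearest]))
    return deltas


def apply_perturbations(
    video: Sequence[Frame],
    deltas: Sequence[Frame],
    epsilon: float,
) -> List[Frame]:
    """Add perturbations clipped to [-epsilon, epsilon]; keep pixels in [0, 1]."""
    adversarial: List[Frame] = []
    for frame, delta in zip(video, deltas):
        adv_frame = []
        for pixel, d in zip(frame, delta):
            d = max(-epsilon, min(epsilon, d))      # enforce the L-inf budget
            adv_frame.append(max(0.0, min(1.0, pixel + d)))  # valid pixel range
        adversarial.append(adv_frame)
    return adversarial


# Toy usage: a 6-frame "video" with 2 pixels per frame, key frames 0 and 4.
video = [[0.5, 0.5] for _ in range(6)]
deltas = propagate_perturbations(6, [0, 4], [[0.1, -0.1], [-0.2, 0.2]])
adv_video = apply_perturbations(video, deltas, epsilon=0.1)
```

In this toy setup, frames 0 to 2 inherit the perturbation of key frame 0 and frames 3 to 5 inherit that of key frame 4, so any subset of frames a black-box model samples carries a bounded perturbation. In an actual attack the key-frame perturbations would come from gradient steps against a surrogate model, which this sketch leaves out.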

External Datasets

MSVD-QA

MSRVTT-QA

ActivityNet-200