Inducing High Energy-Latency of Large Vision-Language Models with Verbose Images


arXiv

AI Security Portal bot

Information in the literature database is collected automatically.

Source

https://arxiv.org/abs/2401.11170

PDF

https://arxiv.org/pdf/2401.11170

Paper Information

Authors: Kuofeng Gao, Yang Bai, Jindong Gu, Shu-Tao Xia, Philip Torr, Zhifeng Li, Wei Liu
Published: 2024-01-20
Updated: 2024-03-23
Affiliation: Tsinghua University
Country: China
Conference: International Conference on Learning Representations (ICLR)

Labels Estimated by AI

Model Evaluation, Model DoS, Resource Scarcity Issues

These labels were automatically added by AI and may be inaccurate.
For details, see About Literature Database.

Abstract

Large vision-language models (VLMs) such as GPT-4 have achieved exceptional performance across various multi-modal tasks. However, deploying VLMs requires substantial energy and computational resources. If attackers maliciously induce high energy consumption and latency (energy-latency cost) during VLM inference, they can exhaust computational resources. In this paper, we explore this attack surface on the availability of VLMs and aim to induce high energy-latency cost during inference. We find that the energy-latency cost of VLM inference can be manipulated by maximizing the length of generated sequences. To this end, we propose verbose images, with the goal of crafting an imperceptible perturbation that induces VLMs to generate long sentences during inference. Concretely, we design three loss objectives. First, a loss is proposed to delay the occurrence of the end-of-sequence (EOS) token, which signals the VLM to stop generating further tokens. Moreover, an uncertainty loss and a token diversity loss are proposed to increase the uncertainty over each generated token and the diversity among all tokens of the generated sequence, respectively, which break output dependency at the token level and the sequence level. Furthermore, a temporal weight adjustment algorithm is proposed to balance these losses effectively. Extensive experiments demonstrate that our verbose images increase the length of generated sequences by 7.87 times and 8.56 times compared to original images on the MS-COCO and ImageNet datasets, respectively, which presents potential challenges for various applications. Our code is available at https://github.com/KuofengGao/Verbose_Images.
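
To make the three loss objectives named in the abstract more concrete, the following is a minimal, illustrative sketch in pure Python. It is not the authors' implementation (see the linked repository for that). The abstract does not give the formulas, so everything here is an assumption: the EOS loss is taken as the mean probability assigned to the EOS token, the uncertainty loss as the negative mean entropy of the per-step token distributions, and the diversity loss as the mean pairwise cosine similarity between token feature vectors, used as a simple stand-in. The function names, the fixed weights, and the toy inputs are hypothetical, and the temporal weight adjustment algorithm is not modeled.

```python
import math


def eos_loss(step_probs, eos_id):
    """Mean probability assigned to EOS across generation steps.

    Assumed form: minimizing this makes EOS less likely at every step.
    """
    return sum(p[eos_id] for p in step_probs) / len(step_probs)


def uncertainty_loss(step_probs):
    """Negative mean Shannon entropy of the per-step distributions.

    Assumed form: minimizing this raises the uncertainty over each token.
    """
    def entropy(p):
        return -sum(q * math.log(q) for q in p if q > 0.0)

    return -sum(entropy(p) for p in step_probs) / len(step_probs)


def diversity_loss(hidden):
    """Mean pairwise cosine similarity between token feature vectors.

    Stand-in measure (an assumption): lower similarity means the tokens
    of the sequence are more diverse. Requires at least two tokens.
    """
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b)

    pairs = [(i, j) for i in range(len(hidden)) for j in range(i + 1, len(hidden))]
    return sum(cosine(hidden[i], hidden[j]) for i, j in pairs) / len(pairs)


def total_loss(step_probs, hidden, eos_id, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three terms; fixed weights are a placeholder."""
    w_eos, w_unc, w_div = weights
    return (
        w_eos * eos_loss(step_probs, eos_id)
        + w_unc * uncertainty_loss(step_probs)
        + w_div * diversity_loss(hidden)
    )


# Toy data: vocabulary of 3 tokens, with index 2 playing the role of EOS.
EOS_ID = 2
peaked_probs = [[0.1, 0.1, 0.8], [0.05, 0.05, 0.9]]   # EOS likely, low entropy
flat_probs = [[0.4, 0.4, 0.2], [0.45, 0.45, 0.1]]     # EOS unlikely, higher entropy
similar_hidden = [[1.0, 0.0], [1.0, 0.0]]             # identical token features
diverse_hidden = [[1.0, 0.0], [0.0, 1.0]]             # orthogonal token features

short_case = total_loss(peaked_probs, similar_hidden, EOS_ID)
verbose_case = total_loss(flat_probs, diverse_hidden, EOS_ID)
```

In this toy setup, `verbose_case` is smaller than `short_case`, which matches the intuition in the abstract: an attacker who minimizes such a combined objective with respect to an image perturbation pushes the model toward outputs where EOS is delayed and tokens are uncertain and diverse.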

External Datasets

MS-COCO

ImageNet

VQAv2

GQA