Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks

arxiv

Source

https://arxiv.org/abs/2310.10844

PDF

https://arxiv.org/pdf/2310.10844

Paper Information

Authors: Erfan Shayegani; Md Abdullah Al Mamun; Yu Fu; Pedram Zaree; Yue Dong; Nael Abu-Ghazaleh
Published: 10-17-2023
Affiliation: CSE Department, UC Riverside
Country: United States of America
Conference: Computing Research Repository (CoRR)

Labels Estimated by AI

Prompt Injection, Adversarial Training, Adversarial Example

These labels were automatically added by AI and may be inaccurate.

Abstract

Large Language Models (LLMs) are swiftly advancing in architecture and capability, and as they integrate more deeply into complex systems, the urgency to scrutinize their security properties grows. This paper surveys research in the emerging interdisciplinary field of adversarial attacks on LLMs, a subfield of trustworthy ML, combining the perspectives of Natural Language Processing and Security. Prior work has shown that even safety-aligned LLMs (via instruction tuning and reinforcement learning through human feedback) can be susceptible to adversarial attacks, which exploit weaknesses and mislead AI systems, as evidenced by the prevalence of "jailbreak" attacks on models like ChatGPT and Bard. In this survey, we first provide an overview of large language models, describe their safety alignment, and categorize existing research based on various learning structures: textual-only attacks, multi-modal attacks, and additional attack methods specifically targeting complex systems, such as federated learning or multi-agent systems. We also offer comprehensive remarks on works that focus on the fundamental sources of vulnerabilities and potential defenses. To make this field more accessible to newcomers, we present a systematic review of existing works, a structured typology of adversarial attack concepts, and additional resources, including slides for presentations on related topics at the 62nd Annual Meeting of the Association for Computational Linguistics (ACL'24).

External Datasets

Colossal Clean Crawled Corpus (C4)

Reddit corpus

The Pile

Multilingual text data used by BLOOM

Multilingual text data used by PaLM

Stack Exchange

GitHub

Codex

AlphaCode

Code Llama

StarCoder

Self-Instruct

Flan

InstructWild