Abstract
Large Language Models (LLMs) are swiftly advancing in architecture and
capability, and as they integrate more deeply into complex systems, the urgency
to scrutinize their security properties grows. This paper surveys research in
the emerging interdisciplinary field of adversarial attacks on LLMs, a subfield
of trustworthy ML, combining the perspectives of Natural Language Processing
and Security. Prior work has shown that even safety-aligned LLMs (via
instruction tuning and reinforcement learning from human feedback) can be
susceptible to adversarial attacks, which exploit weaknesses and mislead AI
systems, as evidenced by the prevalence of "jailbreak" attacks on models like
ChatGPT and Bard. In this survey, we first provide an overview of large
language models, describe their safety alignment, and categorize existing
research based on various learning structures: textual-only attacks,
multi-modal attacks, and additional attack methods specifically targeting
complex systems, such as federated learning or multi-agent systems. We also
offer comprehensive remarks on works that focus on the fundamental sources of
vulnerabilities and potential defenses. To make this field more accessible to
newcomers, we present a systematic review of existing works, a structured
typology of adversarial attack concepts, and additional resources, including
slides for presentations on related topics at the 62nd Annual Meeting of the
Association for Computational Linguistics (ACL'24).