Abstract
Because "out-of-the-box" large language models are capable of generating a
great deal of objectionable content, recent work has focused on aligning these
models in an attempt to prevent undesirable generation. While there has been
some success at circumventing these measures -- so-called "jailbreaks" against
LLMs -- these attacks have required significant human ingenuity and are brittle
in practice. In this paper, we propose a simple and effective attack method
that causes aligned language models to generate objectionable behaviors.
Specifically, our approach finds a suffix that, when attached to a wide range
of queries for an LLM to produce objectionable content, aims to maximize the
probability that the model produces an affirmative response (rather than
refusing to answer). However, instead of relying on manual engineering, our
approach automatically produces these adversarial suffixes by a combination of
greedy and gradient-based search techniques, and also improves over past
automatic prompt generation methods.
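To make the search concrete, the following is a minimal, self-contained sketch of a greedy plus gradient-based suffix search of the kind described above. Everything in it is a placeholder chosen for illustration: the tiny embedding-plus-linear "model", the vocabulary size, the suffix length, and the top-k and batch settings are not the paper's configuration, and in the actual attack the loss is the negative log-likelihood of an affirmative target response (e.g., "Sure, here is ...") under the aligned LLM being attacked.

```python
# Minimal sketch: gradient-guided candidate selection + greedy exact evaluation.
# The toy model below is a stand-in for an aligned chat LLM; all sizes are
# illustrative placeholders, not the paper's settings.

import torch
import torch.nn.functional as F

torch.manual_seed(0)

VOCAB, DIM, SUFFIX_LEN = 500, 32, 8
embed = torch.nn.Embedding(VOCAB, DIM)          # toy embedding table
lm_head = torch.nn.Linear(DIM, VOCAB)           # toy next-token predictor

prompt = torch.randint(0, VOCAB, (12,))         # user query tokens (fixed)
target = torch.randint(0, VOCAB, (6,))          # affirmative-response tokens (fixed)
suffix = torch.randint(0, VOCAB, (SUFFIX_LEN,)) # adversarial suffix (optimized)

def target_loss(suffix_onehot):
    """Cross-entropy of the target continuation given prompt + suffix."""
    # Differentiable w.r.t. the one-hot suffix so token gradients are available.
    suffix_emb = suffix_onehot @ embed.weight
    inputs = torch.cat([embed(prompt), suffix_emb, embed(target)], dim=0)
    logits = lm_head(inputs)                    # toy "model": position-wise scores
    # Predict each target token from the position immediately before it.
    pred = logits[len(prompt) + SUFFIX_LEN - 1 : -1]
    return F.cross_entropy(pred, target)

for step in range(50):
    # Gradient of the loss w.r.t. a one-hot encoding of the current suffix.
    onehot = F.one_hot(suffix, VOCAB).float().requires_grad_(True)
    loss = target_loss(onehot)
    loss.backward()
    # Top-k promising token substitutions per suffix position.
    top_k = (-onehot.grad).topk(16, dim=1).indices
    # Sample a batch of single-token swaps, evaluate exactly, keep the best (greedy).
    best_loss, best_suffix = loss.item(), suffix
    for _ in range(64):
        pos = torch.randint(0, SUFFIX_LEN, (1,)).item()
        cand = suffix.clone()
        cand[pos] = top_k[pos][torch.randint(0, 16, (1,)).item()]
        with torch.no_grad():
            cand_loss = target_loss(F.one_hot(cand, VOCAB).float()).item()
        if cand_loss < best_loss:
            best_loss, best_suffix = cand_loss, cand
    suffix = best_suffix

print("final suffix token ids:", suffix.tolist(), "loss:", best_loss)
```

The design intuition is that the gradient through the one-hot token encoding is only used to shortlist promising substitutions; the actual swap is chosen by exact forward-pass evaluation, which keeps the discrete search greedy but gradient-informed.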
Surprisingly, we find that the adversarial prompts generated by our approach
are quite transferable, including to black-box, publicly released LLMs.
Specifically, we train an adversarial attack suffix on multiple prompts (i.e.,
queries asking for many different types of objectionable content), as well as
multiple models (in our case, Vicuna-7B and 13B). When doing so, the resulting
attack suffix is able to induce objectionable content in the public interfaces
to ChatGPT, Bard, and Claude, as well as open source LLMs such as LLaMA-2-Chat,
Pythia, Falcon, and others. In total, this work significantly advances the
state-of-the-art in adversarial attacks against aligned language models,
raising important questions about how such systems can be prevented from
producing objectionable information. Code is available at
github.com/llm-attacks/llm-attacks.
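As a schematic illustration of the multi-prompt, multi-model training described above, the sketch below scores one shared suffix by the summed target loss over several behavior prompts and several models; the same greedy plus gradient search is then run on this aggregate objective. All models, prompts, and losses here are random stand-ins (not Vicuna-7B/13B); the repository above contains the actual implementation.

```python
# Schematic sketch of the aggregate objective used for universal, transferable
# suffixes: one suffix, many behaviors, many models. Everything below is a
# random stand-in chosen only to show the shape of the objective.

import torch

torch.manual_seed(0)
VOCAB, SUFFIX_LEN = 500, 8

def make_toy_model():
    """Stand-in for a per-model loss: maps (prompt, suffix, target) -> scalar."""
    table = torch.randn(VOCAB)                       # random per-token "penalty"
    def loss_fn(prompt, suffix, target):
        return table[torch.cat([prompt, suffix, target])].mean()
    return loss_fn

models = [make_toy_model() for _ in range(2)]        # e.g. two chat models
behaviors = [(torch.randint(0, VOCAB, (12,)),        # (query, affirmative target)
              torch.randint(0, VOCAB, (6,))) for _ in range(5)]

def aggregate_loss(suffix):
    """Sum of target losses over all models and all behavior prompts."""
    return sum(m(p, suffix, t) for m in models for p, t in behaviors)

suffix = torch.randint(0, VOCAB, (SUFFIX_LEN,))
print("aggregate loss for one shared suffix:", aggregate_loss(suffix).item())
```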