When large language model (LLM) systems interact with external data to
perform complex tasks, a new attack, namely prompt injection, becomes a
significant threat. By injecting instructions into the data accessed by the
system, an attacker can override the user's original task with an
arbitrary task directed by the attacker. To secure the system, test-time
defenses, e.g., defensive prompting, have been proposed so that system developers
can flexibly enable security only when it is needed. However, they are
much less effective than training-time defenses that change the model
parameters. Motivated by this, we propose DefensiveToken, a test-time defense
with prompt injection robustness comparable to training-time alternatives.
DefensiveTokens are newly inserted special tokens whose embeddings are
optimized for security. In security-sensitive cases, system developers can
prepend a few DefensiveTokens to the LLM input to achieve security with a
minimal utility drop. In scenarios where security is less of a concern,
developers can simply skip DefensiveTokens; the LLM system then behaves exactly as
it would with no defense, generating high-quality responses. Thus, DefensiveTokens,
if released alongside the model, allow a flexible switch between the
state-of-the-art (SOTA) utility and almost-SOTA security at test time. The code
is available at https://github.com/Sizhe-Chen/DefensiveToken.
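
The test-time switch described above can be illustrated with a minimal sketch.
It is not taken from the released code: the embedding table, the two defensive
embedding vectors, and the function name build_input_embeddings are all
hypothetical placeholders, and the sketch assumes that the defense amounts to
placing a few extra embedding vectors in front of the unchanged input
embeddings.

```python
from typing import List

Vector = List[float]

# Hypothetical embedding table of the base model (4 tokens, dimension 3).
# It is never modified by the defense in this sketch.
EMBEDDING_TABLE: List[Vector] = [
    [0.1, 0.0, 0.2],
    [0.0, 0.3, 0.1],
    [0.5, 0.5, 0.0],
    [0.2, 0.1, 0.4],
]

# Hypothetical embeddings for two DefensiveTokens. In practice these would be
# optimized for security; the values here are arbitrary placeholders.
DEFENSIVE_EMBEDDINGS: List[Vector] = [
    [0.9, -0.4, 0.3],
    [-0.2, 0.8, 0.6],
]


def build_input_embeddings(token_ids: List[int], use_defense: bool) -> List[Vector]:
    """Look up input embeddings, optionally prefixed by defensive embeddings."""
    embedded = [list(EMBEDDING_TABLE[t]) for t in token_ids]
    if use_defense:
        # Security-sensitive case: defensive vectors go before the input.
        return [list(v) for v in DEFENSIVE_EMBEDDINGS] + embedded
    # Otherwise the input is passed through exactly as without any defense.
    return embedded


plain = build_input_embeddings([2, 0], use_defense=False)
defended = build_input_embeddings([2, 0], use_defense=True)
print(len(plain), len(defended))
```

Because the defended sequence differs from the undefended one only by its
prefix, skipping the prefix leaves the original input untouched, which is what
makes the switch between the two modes cheap at test time.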