Defending Against Prompt Injection With a Few DefensiveTokens

arxiv

AI Security Portal bot

The information in this literature database is collected automatically.

Source

https://arxiv.org/abs/2507.07974

PDF

https://arxiv.org/pdf/2507.07974

Paper information

Authors: Sizhe Chen, Yizhu Wang, Nicholas Carlini, Chawin Sitawarin, David Wagner
Published: 2025-7-11
Updated: 2025-8-26
Affiliation: UC Berkeley
Country of affiliation: United States of America
Conference: AISec@CCS

Labels estimated by AI

Defense methods, Prompt leaking, Indirect prompt injection

Note: These labels were added automatically by AI and may therefore be inaccurate.
For details, see "About the literature database".

Abstract

When large language model (LLM) systems interact with external data to perform complex tasks, a new attack, namely prompt injection, becomes a significant threat. By injecting instructions into the data accessed by the system, the attacker is able to override the initial user task with an arbitrary task directed by the attacker. To secure the system, test-time defenses, e.g., defensive prompting, have been proposed for system developers to attain security only when needed in a flexible manner. However, they are much less effective than training-time defenses that change the model parameters. Motivated by this, we propose DefensiveToken, a test-time defense with prompt injection robustness comparable to training-time alternatives. DefensiveTokens are newly inserted as special tokens, whose embeddings are optimized for security. In security-sensitive cases, system developers can prepend a few DefensiveTokens to the LLM input to achieve security with a minimal utility drop. In scenarios where security is less of a concern, developers can simply skip DefensiveTokens; the LLM system then behaves exactly as it would with no defense, generating high-quality responses. Thus, DefensiveTokens, if released alongside the model, allow a flexible switch between the state-of-the-art (SOTA) utility and almost-SOTA security at test time. The code is available at https://github.com/Sizhe-Chen/DefensiveToken.
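
The switch described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the embedding table, the defensive vectors, and the names `EMBEDDING_TABLE`, `DEFENSIVE_EMBEDDINGS`, `embed`, and `build_model_input` are all hypothetical placeholders. It only shows the idea that a few extra embeddings are placed in front of the input embeddings when the defense is enabled, and that the input is left unchanged otherwise.

```python
from typing import Dict, List

Vector = List[float]

# Hypothetical toy embedding table (token id -> embedding vector).
# A real LLM would use its own learned embedding matrix instead.
EMBEDDING_TABLE: Dict[int, Vector] = {
    0: [0.1, 0.2],
    1: [0.3, 0.4],
    2: [0.5, 0.6],
}

# Hypothetical defensive embeddings. According to the abstract these are
# optimized for security; the values here are placeholders, not trained.
DEFENSIVE_EMBEDDINGS: List[Vector] = [
    [9.0, 9.0],
    [8.0, 8.0],
]


def embed(token_ids: List[int]) -> List[Vector]:
    """Look up the embedding of each input token."""
    return [list(EMBEDDING_TABLE[t]) for t in token_ids]


def build_model_input(token_ids: List[int], use_defense: bool) -> List[Vector]:
    """Return the embedding sequence that would be fed to the model.

    With use_defense=True the defensive embeddings come first, followed by
    the ordinary input embeddings. With use_defense=False the sequence is
    exactly the undefended input.
    """
    inputs = embed(token_ids)
    if use_defense:
        return [list(v) for v in DEFENSIVE_EMBEDDINGS] + inputs
    return inputs


if __name__ == "__main__":
    print(build_model_input([0, 1], use_defense=False))
    print(build_model_input([0, 1], use_defense=True))
```

Because the model parameters are untouched in this sketch, turning the flag off reproduces the undefended input exactly, which mirrors the flexibility the abstract claims for a test-time defense.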

External datasets

Cleaned Alpaca instruction tuning dataset

AlpacaFarm

SEP dataset