These labels were automatically added by AI and may be inaccurate. For details, see About Literature Database.
Abstract
Large language models (LLMs) are becoming increasingly prevalent in modern
software systems, interfacing between the user and the Internet to assist with
tasks that require advanced language understanding. To accomplish these tasks,
the LLM often uses external data sources such as user documents, web retrieval,
results from API calls, etc. This opens up new avenues for attackers to
manipulate the LLM via prompt injection. Adversarial prompts can be injected
into external data sources to override the system's intended instruction and
instead execute a malicious instruction. To mitigate this vulnerability, we
propose a new defense called SecAlign based on the technique of preference
optimization. Our defense first constructs a preference dataset with
prompt-injected inputs, secure outputs (ones that respond to the legitimate
instruction), and insecure outputs (ones that respond to the injection). We
then perform preference optimization on this dataset to teach the LLM to prefer
the secure output over the insecure one. This provides the first known method
that reduces the success rates of various prompt injections to <10%, even
against attacks much more sophisticated than ones seen during training. This
indicates our defense generalizes well against unknown and yet-to-come attacks.
Also, SecAlign models are still practical with similar utility to the one
before defensive training in our evaluations. Our code is at
https://github.com/facebookresearch/SecAlign