These labels were automatically added by AI and may be inaccurate. For details, see About Literature Database.
Abstract
Prompt injection attacks manipulate large language models (LLMs) by
misleading them to deviate from the original input instructions and execute
maliciously injected instructions, because of their instruction-following
capabilities and inability to distinguish between the original input
instructions and maliciously injected instructions. To defend against such
attacks, recent studies have developed various detection mechanisms. If we
restrict ourselves specifically to works which perform detection rather than
direct defense, most of them focus on direct prompt injection attacks, while
there are few works for the indirect scenario, where injected instructions are
indirectly from external tools, such as a search engine. Moreover, current
works mainly investigate injection detection methods and pay less attention to
the post-processing method that aims to mitigate the injection after detection.
In this paper, we investigate the feasibility of detecting and removing
indirect prompt injection attacks, and we construct a benchmark dataset for
evaluation. For detection, we assess the performance of existing LLMs and
open-source detection models, and we further train detection models using our
crafted training datasets. For removal, we evaluate two intuitive methods: (1)
the segmentation removal method, which segments the injected document and
removes parts containing injected instructions, and (2) the extraction removal
method, which trains an extraction model to identify and remove injected
instructions.