Abstract
The integration of Large Language Models (LLMs) with external sources is
becoming increasingly common, with Retrieval-Augmented Generation (RAG) being a
prominent example. However, this integration exposes LLMs to
Indirect Prompt Injection (IPI) attacks, in which hidden instructions embedded in
external data can manipulate LLMs into executing unintended or harmful actions.
We recognize that IPI attacks fundamentally rely on the presence of
instructions embedded within external content, which can alter the behavioral
states of LLMs. Can the effective detection of such state changes help us
defend against IPI attacks? In this paper, we propose InstructDetector, a novel
detection-based approach that leverages the behavioral states of LLMs to
identify potential IPI attacks. Specifically, we demonstrate that the hidden states
and gradients from intermediate layers provide highly discriminative features
for instruction detection. By effectively combining these features,
InstructDetector achieves a detection accuracy of 99.60% in the in-domain
setting and 96.90% in the out-of-domain setting, and reduces the attack success
rate to just 0.03% on the BIPIA benchmark. The code is publicly available at
https://github.com/MYVAE/Instruction-detection.
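
As a rough illustration of the idea of combining two feature sources for instruction detection, the following toy sketch may help. It is not the authors' implementation: the feature vectors are synthetic stand-ins for hidden-state and gradient features, and the names `make_sample`, `combine_features`, `train_logistic`, and `predict` are hypothetical. The sketch assumes only that each feature type is normalized separately, concatenated, and passed to a simple binary classifier.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit L2 norm (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def combine_features(hidden_feat, grad_feat):
    """Assumed fusion rule: normalize each feature type, then concatenate."""
    return l2_normalize(hidden_feat) + l2_normalize(grad_feat)


def make_sample(has_instruction, rng):
    """Synthetic stand-in for features extracted from an LLM (not real data)."""
    hot, cold = (0, 1) if has_instruction else (1, 0)
    hidden = [0.0, 0.0]
    grad = [0.0, 0.0]
    hidden[hot] = rng.gauss(1.0, 0.1)
    hidden[cold] = rng.gauss(0.0, 0.1)
    grad[hot] = rng.gauss(1.0, 0.1)
    grad[cold] = rng.gauss(0.0, 0.1)
    return hidden, grad, 1 if has_instruction else 0


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, lr=0.5, epochs=50):
    """Plain stochastic gradient descent on the logistic loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b


def predict(w, b, x):
    """Return True if the combined features are classified as 'contains instruction'."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [make_sample(i % 2 == 0, rng) for i in range(40)]
    X = [combine_features(h, g) for h, g, _ in samples]
    y = [label for _, _, label in samples]
    w, b = train_logistic(X, y)
    h, g, label = make_sample(True, rng)
    print("flagged as instruction:", predict(w, b, combine_features(h, g)))
```

In a real pipeline, the two input vectors would come from the intermediate layers of the model under study, and external content flagged by the classifier would be filtered out before it reaches the LLM.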