Defending against Indirect Prompt Injection by Instruction Detection

Source

https://arxiv.org/abs/2505.06311

PDF

https://arxiv.org/pdf/2505.06311

Paper Information

Authors: Tongyu Wen, Chenglong Wang, Xiyuan Yang, Haoyu Tang, Yueqi Xie, Lingjuan Lyu, Zhicheng Dou, Fangzhao Wu
Published: 2025-05-08
Updated: 2025-09-17
Affiliation: Renmin University of China
Country: China
Conference: Conference on Empirical Methods in Natural Language Processing (EMNLP)

Labels Estimated by AI

Prompt validation, Evaluation Method, Watermarking Technology

These labels were automatically added by AI and may be inaccurate.

Abstract

The integration of Large Language Models (LLMs) with external sources is becoming increasingly common, with Retrieval-Augmented Generation (RAG) being a prominent example. However, this integration introduces vulnerability to Indirect Prompt Injection (IPI) attacks, in which hidden instructions embedded in external data can manipulate LLMs into executing unintended or harmful actions. We recognize that IPI attacks fundamentally rely on the presence of instructions embedded within external content, which can alter the behavioral states of LLMs. Can the effective detection of such state changes help us defend against IPI attacks? In this paper, we propose InstructDetector, a novel detection-based approach that leverages the behavioral states of LLMs to identify potential IPI attacks. Specifically, we demonstrate that the hidden states and gradients from intermediate layers provide highly discriminative features for instruction detection. By effectively combining these features, InstructDetector achieves a detection accuracy of 99.60% in the in-domain setting and 96.90% in the out-of-domain setting, and reduces the attack success rate to just 0.03% on the BIPIA benchmark. The code is publicly available at https://github.com/MYVAE/Instruction-detection.
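
The idea of combining hidden-state and gradient features into a single instruction detector can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how the features are combined or which classifier is used, so the sketch assumes plain concatenation and a small logistic-regression classifier. The feature vectors are synthetic stand-ins (random numbers) for what would really be extracted from intermediate layers of an LLM, and every name in it (combine_features, train_detector, contains_instruction, synthetic_features) is hypothetical.

```python
import math
import random


def combine_features(hidden_state, gradient):
    """Concatenate the two feature vectors into one classifier input (an assumed combination rule)."""
    return list(hidden_state) + list(gradient)


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_detector(samples, labels, epochs=50, lr=0.1):
    """Fit a logistic-regression detector with plain stochastic gradient descent."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            err = p - y  # gradient of the log loss with respect to the logit
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias


def contains_instruction(weights, bias, features):
    """Flag the input as instruction-bearing when the logit is positive."""
    score = sum(w * xi for w, xi in zip(weights, features)) + bias
    return score > 0.0


def synthetic_features(rng, is_instruction, dim=3):
    """Hypothetical stand-in for real features: two Gaussian vectors per sample."""
    center = 1.0 if is_instruction else -1.0
    hidden = [rng.gauss(center, 0.3) for _ in range(dim)]
    grad = [rng.gauss(center, 0.3) for _ in range(dim)]
    return combine_features(hidden, grad)


rng = random.Random(0)
labels = [i % 2 for i in range(200)]  # 1 = contains an instruction, 0 = clean
samples = [synthetic_features(rng, bool(y)) for y in labels]
weights, bias = train_detector(samples, labels)

correct = sum(
    contains_instruction(weights, bias, x) == bool(y)
    for x, y in zip(samples, labels)
)
accuracy = correct / len(labels)
print(f"training accuracy on synthetic data: {accuracy:.2f}")
```

In a defense pipeline of this kind, external content flagged by such a detector would be withheld or sanitized before reaching the LLM. The accuracy printed here only reflects the toy data and says nothing about the figures reported in the abstract.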

External Datasets

Wikipedia

News Articles

LaMini-instruction

BIPIA