AgentVisor: Defending LLM Agents Against Prompt Injection via Semantic Virtualization

TOP Literature Database AgentVisor: Defending LLM Agents Against Prompt Injection via Semantic Virtualization

arxiv

AI Security Portal bot

Information in the literature database is collected automatically.

Source

https://arxiv.org/abs/2604.24118

PDF

https://arxiv.org/pdf/2604.24118

Paper Information

Author: Zonghao Ying,Haozheng Wang,Jiangfan Liu,Quanchen Zou,Aishan Liu,Jian Yang,Yaodong Yang,Xianglong Liu
Published: 4-27-2026
Affiliation: Beihang University
Country: 中国
Conference

Labels Estimated by AI

Data Protection Method Indirect Prompt Injection LLM Performance Evaluation

These labels were automatically added by AI and may be inaccurate.
For details, see About Literature Database.

Abstract

Large Language Model (LLM) agents are increasingly used to automate complex workflows, but integrating untrusted external data with privileged execution exposes them to severe security risks, particularly direct and indirect prompt injection. Existing defenses face significant challenges in balancing security with utility, often encountering a trade-off where rigorous protection leads to over-defense, or where subtle indirect injections bypass detection. Drawing inspiration from operating system virtualization, we propose AgentVisor, a novel defense framework that enforces semantic privilege separation. AgentVisor treats the target agent as an untrusted guest and intercepts tool calls via a trusted semantic visor. Central to our approach is a rigorous audit protocol grounded in classic OS security primitives, designed to systematically mitigate both direct and indirect injection attacks. Furthermore, we introduce a one-shot self-correction mechanism that transforms security violations into constructive feedback, enabling agents to recover from attacks. Extensive experiments show that AgentVisor reduces the attack success rate to 0.65%, achieving this strong defense while incurring only a 1.45% average decrease in utility relative to the No Defense scenario, demonstrating superior performance compared to existing defense methods.

External Datasets

OpenPromptInjection

AgentDojo

References

Secure computer system: Unified exposition and multics interpretation

David E Bell, Leonard J La Padula

Published: 1976

Don’t you (forget nlp): Prompt injection with control characters in chatgpt

Mark Breitenbach, Adrian Wood, Win Suen, Po-Ning Tseng

Published: 2023

Struq: Defending against prompt injection with structured queries

Sizhe Chen, Julien Piet, Chawin Sitawarin, David Wagner

Defending Against Prompt Injection With a Few DefensiveTokens

Sizhe Chen, Yizhu Wang, Nicholas Carlini, Chawin Sitawarin, David Wagner

Published: 7.11.2025

When large language model (LLM) systems interact with external data to perform complex tasks, a new attack, namely prompt injection, becomes a significant threat. By injecting instructions into the data accessed by the system, the attacker is able to override the initial user task with an arbitrary task directed by the attacker. To secure the system, test-time defenses, e.g., defensive prompting, have been proposed for system developers to attain security only when needed in a flexible manner. However, they are much less effective than training-time defenses that change the model parameters. Motivated by this, we propose DefensiveToken, a test-time defense with prompt injection robustness comparable to training-time alternatives. DefensiveTokens are newly inserted as special tokens, whose embeddings are optimized for security. In security-sensitive cases, system developers can append a few DefensiveTokens before the LLM input to achieve security with a minimal utility drop. In scenarios where security is less of a concern, developers can simply skip DefensiveTokens; the LLM system remains the same as there is no defense, generating high-quality responses. Thus, DefensiveTokens, if released alongside the model, allow a flexible switch between the state-of-the-art (SOTA) utility and almost-SOTA security at test time. The code is available at https://github.com/Sizhe-Chen/DefensiveToken.

Defense Method Prompt leaking Indirect Prompt Injection

arxiv

Cited by 2

Annual ACM Conference on Computer and Communications Security (CCS)

SecAlign: Defending Against Prompt Injection with Preference Optimization

Sizhe Chen, Arman Zharmagambetov, Saeed Mahloujifar, Kamalika Chaudhuri, David Wagner, Chuan Guo

Published: 10.8.2024

Large language models (LLMs) are becoming increasingly prevalent in modern software systems, interfacing between the user and the Internet to assist with tasks that require advanced language understanding. To accomplish these tasks, the LLM often uses external data sources such as user documents, web retrieval, results from API calls, etc. This opens up new avenues for attackers to manipulate the LLM via prompt injection. Adversarial prompts can be injected into external data sources to override the system's intended instruction and instead execute a malicious instruction. To mitigate this vulnerability, we propose a new defense called SecAlign based on the technique of preference optimization. Our defense first constructs a preference dataset with prompt-injected inputs, secure outputs (ones that respond to the legitimate instruction), and insecure outputs (ones that respond to the injection). We then perform preference optimization on this dataset to teach the LLM to prefer the secure output over the insecure one. This provides the first known method that reduces the success rates of various prompt injections to <10%, even against attacks much more sophisticated than ones seen during training. This indicates our defense generalizes well against unknown and yet-to-come attacks. Also, SecAlign models are still practical with similar utility to the one before defensive training in our evaluations. Our code is at https://github.com/facebookresearch/SecAlign

Prompt Injection Defense Method LLM Security

Advances in Neural Information Processing Systems

Agentdojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents

E. Debenedetti, J. Zhang, M. Balunovic, L. Beurer-Kellner, M. Fischer, F. Tramèr

Published: 2024

Gemini 2.5 flash | generative ai on vertex ai

Conference on Applied Machine Learning in Information Security (CAMLIS)

Defending Against Indirect Prompt Injection Attacks With Spotlighting

Keegan Hines, Gary Lopez, Matthew Hall, Federico Zarfati, Yonatan Zunger, Emre Kiciman

Published: 3.21.2024

Large Language Models (LLMs), while powerful, are built and trained to process a single text input. In common applications, multiple inputs can be processed by concatenating them together into a single stream of text. However, the LLM is unable to distinguish which sections of prompt belong to various input sources. Indirect prompt injection attacks take advantage of this vulnerability by embedding adversarial instructions into untrusted data being processed alongside user commands. Often, the LLM will mistake the adversarial instructions as user commands to be followed, creating a security vulnerability in the larger system. We introduce spotlighting, a family of prompt engineering techniques that can be used to improve LLMs' ability to distinguish among multiple sources of input. The key insight is to utilize transformations of an input to provide a reliable and continuous signal of its provenance. We evaluate spotlighting as a defense against indirect prompt injection attacks, and find that it is a robust defense that has minimal detrimental impact to underlying NLP tasks. Using GPT-family models, we find that spotlighting reduces the attack success rate from greater than {50}\% to below {2}\% in our experiments with minimal impact on task efficacy.

Indirect Prompt Injection Malicious Prompt Prompt Injection

arXiv

Gpt-4o system card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford

Published: 2024

arxiv

Cited by 1

Computing Research Repository (CoRR)

PromptLocate: Localizing Prompt Injection Attacks

Yuqi Jia, Yupei Liu, Zedian Shao, Jinyuan Jia, Neil Gong

Published: 10.14.2025

Prompt injection attacks deceive a large language model into completing an attacker-specified task instead of its intended task by contaminating its input data with an injected prompt, which consists of injected instruction(s) and data. Localizing the injected prompt within contaminated data is crucial for post-attack forensic analysis and data recovery. Despite its growing importance, prompt injection localization remains largely unexplored. In this work, we bridge this gap by proposing PromptLocate, the first method for localizing injected prompts. PromptLocate comprises three steps: (1) splitting the contaminated data into semantically coherent segments, (2) identifying segments contaminated by injected instructions, and (3) pinpointing segments contaminated by injected data. We show PromptLocate accurately localizes injected prompts across eight existing and eight adaptive attacks.

Prompt validation Large Language Model evaluation metrics

Proceedings of the 2024 conference on empirical methods in natural language processing: industry track

Tptu v2: Boosting task planning and tool usage of large language model-based agents in real-world industry systems

Yilun Kong, Jingqing Ruan, Yihong Chen, Bin Zhang, Tianpeng Bao, Shi Shiwei, Xiaoru Hu, Hangyu Mao, Ziyue Li, Xingyu Zeng, 1 others

Published: 2024

28th USENIX Security Symposium (USENIX Security 19)

Protecting cloud virtual machines from hypervisor and host operating system exploits

Shih-Wei Li, John S Koh, Jason Nieh

Published: 2019

arXiv

Automatic and universal prompt injection attacks against large language models

X. Liu, et al.

Published: 2024

33rd USENIX Security Symposium (USENIX Security 24)

Formalizing and benchmarking prompt injection attacks and defenses

Yupei Liu, Yuqi Jia, Runpeng Geng, Jinyuan Jia, Neil Zhenqiang Gong

Published: 2024

2025 IEEE Symposium on Security and Privacy (SP)

Datasentinel: A game-theoretic detection of prompt injection attacks

Yupei Liu, Yuqi Jia, Jinyuan Jia, Dawn Song, Neil Zhenqiang Gong

Published: 2025

The llama 3 herd of models

LLaMa-Team

Published: 2024

cellmate: Sandboxing browser ai agents

Luoxi Meng, Henry Feng, Ilia Shumailov, Earlence Fernandes

Published: 2025

Ignore previous prompt: Attack techniques for language models

F. Perez, I. Ribeiro

Published: 2022

Agentbay: A hybrid interaction sandbox for seamless human-ai intervention in agentic systems

Yun Piao, Hongbo Min, Hang Su, Leilei Zhang, Lei Wang, Yue Yin, Xiao Wu, Zhejing Xu, Liwei Qu, Hang Li, 1 others

Published: 2025

Communications of the ACM

Formal requirements for virtualizable third generation architectures

Gerald J Popek, Robert P Goldberg

Published: 1974

Fine-tuned deberta-v3-base for prompt injection detection

ProtectAI.com

Published: 2024

Proceedings of the IEEE

The protection of information in computer systems

Jerome H Saltzer, Michael D Schroeder

Published: 1975

Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security

Optimization-based prompt injection attack to llm-as-a-judge

Jiawen Shi, Zenghui Yuan, Yinuo Liu, Yue Huang, Pan Zhou, Lichao Sun, Neil Zhenqiang Gong

Published: 2024

Promptarmor: Simple yet effective prompt injection defenses

Tianneng Shi, Kaijie Zhu, Zhun Wang, Yuqi Jia, Will Cai, Weida Liang, Haonan Wang, Hend Alzahrani, Joshua Lu, Kenji Kawaguchi

Published: 2025

Proceedings of the 2009 ACM SIGPLAN/SIGOPS international conference on Virtual execution environments

Bitvisor: a thin hypervisor for enforcing i/o device security

Takahiro Shinagawa, Hideki Eiraku, Kouichi Tanimoto, Kazumasa Omote, Shoichi Hasegawa, Takashi Horie, Manabu Hirano, Kenichi Kourai, Yoshihiro Oyama, Eiji Kawai, 1 others

Published: 2009

Glm-4.5: Agentic, reasoning, and coding (arc) foundation models

GLM Team, Aohan Zeng, Xin Lv, Qinkai Zheng, 1 others

Published: 2025

Proceedings of the 33rd ACM International Conference on Multimedia

Manipulating multimodal agents via cross-modal prompt injection

Le Wang, Zonghao Ying, Tianyuan Zhang, Siyuan Liang, Shengshan Hu, Mingchuan Zhang, Aishan Liu, Xianglong Liu

Published: 2025

Delimiters won’t save you from prompt injection

Simon Willison

Published: 2023

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Llm agents making agent tools

Georg Wölflein, Dyke Ferber, Daniel Truhn, Ognjen Arandjelovic, Jakob Nikolas Kather

Published: 2025

arxiv

Cited by 2

ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD)

Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models

Jingwei Yi, Yueqi Xie, Bin Zhu, Emre Kiciman, Guangzhong Sun, Xing Xie, Fangzhao Wu

Published: 12.21.2023

The integration of large language models with external content has enabled applications such as Microsoft Copilot but also introduced vulnerabilities to indirect prompt injection attacks. In these attacks, malicious instructions embedded within external content can manipulate LLM outputs, causing deviations from user expectations. To address this critical yet under-explored issue, we introduce the first benchmark for indirect prompt injection attacks, named BIPIA, to assess the risk of such vulnerabilities. Using BIPIA, we evaluate existing LLMs and find them universally vulnerable. Our analysis identifies two key factors contributing to their success: LLMs' inability to distinguish between informational context and actionable instructions, and their lack of awareness in avoiding the execution of instructions within external content. Based on these findings, we propose two novel defense mechanisms-boundary awareness and explicit reminder-to address these vulnerabilities in both black-box and white-box settings. Extensive experiments demonstrate that our black-box defense provides substantial mitigation, while our white-box defense reduces the attack success rate to near-zero levels, all while preserving the output quality of LLMs. We hope this work inspires further research into securing LLM applications and fostering their safe and reliable use.

Indirect Prompt Injection Vulnerability Analysis Malicious Prompt

Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Easytool: Enhancing llm-based agents with concise tool instruction

Siyu Yuan, Kaitao Song, Jiangjie Chen, Xu Tan, Yongliang Shen, Kan Ren, Dongsheng Li, Deqing Yang

Published: 2025

arXiv

MELON: Provable Defense Against Indirect Prompt Injection Attacks in AI Agents

Computing Research Repository (CoRR)

Universal and Transferable Adversarial Attacks on Aligned Language Models

Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, Matt Fredrikson

Published: 7.28.2023

Because "out-of-the-box" large language models are capable of generating a great deal of objectionable content, recent work has focused on aligning these models in an attempt to prevent undesirable generation. While there has been some success at circumventing these measures -- so-called "jailbreaks" against LLMs -- these attacks have required significant human ingenuity and are brittle in practice. In this paper, we propose a simple and effective attack method that causes aligned language models to generate objectionable behaviors. Specifically, our approach finds a suffix that, when attached to a wide range of queries for an LLM to produce objectionable content, aims to maximize the probability that the model produces an affirmative response (rather than refusing to answer). However, instead of relying on manual engineering, our approach automatically produces these adversarial suffixes by a combination of greedy and gradient-based search techniques, and also improves over past automatic prompt generation methods. Surprisingly, we find that the adversarial prompts generated by our approach are quite transferable, including to black-box, publicly released LLMs. Specifically, we train an adversarial attack suffix on multiple prompts (i.e., queries asking for many different types of objectionable content), as well as multiple models (in our case, Vicuna-7B and 13B). When doing so, the resulting attack suffix is able to induce objectionable content in the public interfaces to ChatGPT, Bard, and Claude, as well as open source LLMs such as LLaMA-2-Chat, Pythia, Falcon, and others. In total, this work significantly advances the state-of-the-art in adversarial attacks against aligned language models, raising important questions about how such systems can be prevented from producing objectionable information. Code is available at github.com/llm-attacks/llm-attacks.

LLM Security Prompt Injection Inappropriate Content Generation