
Who Pays the Price? Stakeholder-Centric Prompt Injection Benchmarking for Real-world Web Agents


Source

https://arxiv.org/abs/2606.13385

PDF

https://arxiv.org/pdf/2606.13385

Paper information

Authors: Zihao Wang, Yiming Li, Yutong Wu, Zheyu Liu, Kangjie Chen, Fok Kar Wai, Pin-Yu Chen, Vrizlynn L. L. Thing, Bo Li, Dacheng Tao, Tianwei Zhang
Published: 2026-06-11
Affiliation: Nanyang Technological University, Singapore
Country: Singapore
Conference:

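AI-estimated labels

Autonomous agent security, indirect prompt injection, data-driven vulnerability assessment

Note: These labels were added automatically by AI and may therefore be inaccurate.
For details, see "About the literature database".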

Abstract

Web agents driven by large language models (LLMs) are increasingly deployed in real-world environments, where they operate over untrusted web content and execute actions with direct consequences. This makes them vulnerable to prompt-injection attacks, in which seemingly benign content embeds adversarial instructions that manipulate agent behaviour. Existing security benchmarks adopt an attack-centric perspective, focusing on the technical feasibility of injections while overlooking the nuanced distribution of resulting harms. In practice, however, prompt-injection risk is victim-dependent: a single exploit can produce asymmetric consequences for different stakeholders, and the same attack pattern may exhibit substantially different effectiveness depending on whom it targets. To capture these properties, we introduce \sysname, a stakeholder-centric benchmark to systematically categorize and attribute harm in real-world web agent systems. It distinguishes between affected entities (e.g., user, seller, platform), decomposes the attacks into concrete objectives, and evaluates each case with complementary outcome- and process-level metrics. Our results reveal substantial and heterogeneous vulnerabilities: not a single attack objective is reliably resisted by current agents, and failures distribute across qualitatively distinct modes ranging from stealthy parasitism (attack succeeds without disrupting the user's delegated task) to misaligned disruption (task disrupted without attack success) and compounded failure (both adversarial objective and task integrity simultaneously violated). These patterns are missed by conventional evaluation, highlighting the need for stakeholder-aware assessment of LLM-based agents in real-world deployments. The benchmark is available at https://github.com/StakeBench/SBC.
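
The three failure modes named in the abstract can be read as combinations of two binary outcomes: whether the attack objective was achieved and whether the user's delegated task stayed intact. The Python sketch below illustrates that reading only. It is not the benchmark's code: the names Episode, FailureMode, classify, and tally_by_stakeholder are hypothetical, the RESISTED label for the fourth combination is an assumption, and the actual outcome- and process-level metrics are likely richer than two booleans.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class FailureMode(Enum):
    RESISTED = "resisted"  # assumed label; the abstract does not name this case
    STEALTHY_PARASITISM = "stealthy parasitism"
    MISALIGNED_DISRUPTION = "misaligned disruption"
    COMPOUNDED_FAILURE = "compounded failure"


@dataclass(frozen=True)
class Episode:
    """One evaluated run (hypothetical record layout)."""
    stakeholder: str        # e.g. "user", "seller", "platform"
    attack_succeeded: bool  # adversarial objective achieved
    task_completed: bool    # user's delegated task left intact


def classify(ep: Episode) -> FailureMode:
    """Map the two binary outcomes onto the modes described in the abstract."""
    if ep.attack_succeeded and ep.task_completed:
        return FailureMode.STEALTHY_PARASITISM
    if ep.attack_succeeded:
        return FailureMode.COMPOUNDED_FAILURE
    if not ep.task_completed:
        return FailureMode.MISALIGNED_DISRUPTION
    return FailureMode.RESISTED


def tally_by_stakeholder(episodes):
    """Count failure modes separately for each affected stakeholder."""
    counts = {}
    for ep in episodes:
        counts.setdefault(ep.stakeholder, Counter())[classify(ep)] += 1
    return counts


if __name__ == "__main__":
    # Made-up episodes for illustration; these are not results from the paper.
    demo = [
        Episode("user", attack_succeeded=True, task_completed=True),
        Episode("seller", attack_succeeded=False, task_completed=False),
        Episode("platform", attack_succeeded=True, task_completed=False),
    ]
    for stakeholder, modes in tally_by_stakeholder(demo).items():
        print(stakeholder, {mode.value: n for mode, n in modes.items()})
```

Tallying per stakeholder rather than overall mirrors the abstract's point that the same attack pattern can land differently depending on whom it targets. A single aggregate attack success rate would merge stealthy parasitism with compounded failure and would not count misaligned disruption at all.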
