Abstract
Large language model (LLM) agents are increasingly capable of autonomously
conducting cyberattacks, posing significant threats to existing applications.
This growing risk highlights the urgent need for a real-world benchmark to
evaluate the ability of LLM agents to exploit web application vulnerabilities.
However, existing benchmarks fall short as they are limited to abstracted
Capture the Flag competitions or lack comprehensive coverage. Building a
benchmark for real-world vulnerabilities involves both specialized expertise to
reproduce exploits and a systematic approach to evaluating unpredictable
threats. To address this challenge, we introduce CVE-Bench, a real-world
cybersecurity benchmark based on critical-severity Common Vulnerabilities and
Exposures (CVEs). In CVE-Bench, we design a sandbox framework that enables LLM agents
to exploit vulnerable web applications in scenarios that mimic real-world
conditions, while also providing effective evaluation of their exploits. Our
evaluation shows that state-of-the-art agent frameworks can exploit up to
13% of the vulnerabilities.