Abstract
Large language model (LLM) agents are increasingly capable of autonomously
conducting cyberattacks, posing significant threats to existing applications.
This growing risk highlights the urgent need for a real-world benchmark to
evaluate the ability of LLM agents to exploit web application vulnerabilities.
However, existing benchmarks fall short as they are limited to abstracted
Capture the Flag competitions or lack comprehensive coverage. Building a
benchmark for real-world vulnerabilities involves both specialized expertise to
reproduce exploits and a systematic approach to evaluating unpredictable
threats. To address this challenge, we introduce CVE-Bench, a real-world
cybersecurity benchmark based on critical-severity Common Vulnerabilities and
Exposures (CVEs). In CVE-Bench, we design a sandbox framework that enables LLM agents
to exploit vulnerable web applications in scenarios that mimic real-world
conditions, while also providing effective evaluation of their exploits. Our
evaluation shows that state-of-the-art agent frameworks can exploit up to
13% of the vulnerabilities.