Abstract
Large language model (LLM) agents are becoming increasingly skilled at
handling cybersecurity tasks autonomously. Thoroughly assessing their
cybersecurity capabilities is critical and urgent, given the high stakes in
this domain. However, existing benchmarks fall short, often failing to capture
real-world scenarios or being limited in scope. To address this gap, we
introduce CyberGym, a large-scale and high-quality cybersecurity evaluation
framework featuring 1,507 real-world vulnerabilities found and patched across
188 large software projects. While it includes tasks in various settings,
CyberGym primarily focuses on the generation of proof-of-concept (PoC) tests
for vulnerability reproduction, based on text descriptions and corresponding
source repositories. Solving this task is particularly challenging, as it
requires comprehensive reasoning across entire codebases to locate relevant
code fragments and produce effective PoCs that accurately trigger the target
vulnerability starting from the program's entry point. Our evaluation across 4
state-of-the-art agent frameworks and 9 LLMs reveals that even the best
combination (OpenHands and Claude-3.7-Sonnet) achieves only an 11.9%
reproduction success rate, mainly on simpler cases. Beyond reproducing
historical vulnerabilities, we find that PoCs generated by LLM agents can
reveal new vulnerabilities, identifying 15 zero-days affecting the latest
versions of the software projects.