These labels were automatically added by AI and may be inaccurate. For details, see About Literature Database.
Abstract
Large language model (LLM) agents have demonstrated remarkable capabilities
in software engineering and cybersecurity tasks, including code generation,
vulnerability discovery, and automated testing. One critical but underexplored
application is automated web vulnerability reproduction, which transforms
vulnerability reports into working exploits. Although recent advances suggest
promising potential, challenges remain in applying LLM agents to real-world web
vulnerability reproduction scenarios. In this paper, we present the first
comprehensive evaluation of state-of-the-art LLM agents for automated web
vulnerability reproduction. We systematically assess 20 agents from software
engineering, cybersecurity, and general domains across 16 dimensions, including
technical capabilities, environment adaptability, and user experience factors,
on 3 representative web vulnerabilities. Based on the results, we select three
top-performing agents (OpenHands, SWE-agent, and CAI) for in-depth evaluation
on our benchmark dataset of 80 real-world CVEs spanning 7 vulnerability types
and 6 web technologies. Our results reveal that while LLM agents achieve
reasonable success on simple library-based vulnerabilities, they consistently
fail on complex service-based vulnerabilities requiring multi-component
environments. Complex environment configurations and authentication barriers
create a gap where agents can execute exploit code but fail to trigger actual
vulnerabilities. We observe high sensitivity to input guidance, with
performance degrading by over 33% under incomplete authentication information.
Our findings highlight the significant gap between current LLM agent
capabilities and the demands of reliable automated vulnerability reproduction,
emphasizing the need for advances in environmental adaptation and autonomous
problem-solving capabilities.