Abstract
The increasing autonomy of Large Language Models (LLMs) necessitates a
rigorous evaluation of their potential to aid in cyber offense. Existing
benchmarks often lack real-world complexity and are thus unable to accurately
assess LLMs' cybersecurity capabilities. To address this gap, we introduce
PACEbench, a practical AI cyber-exploitation benchmark built on the principles
of realistic vulnerability difficulty, environmental complexity, and cyber
defenses. Specifically, PACEbench comprises four scenarios spanning single,
blended, chained, and defense vulnerability exploitations. To handle these
complex challenges, we propose PACEagent, a novel agent that emulates human
penetration testers by supporting multi-phase reconnaissance, analysis, and
exploitation. Extensive experiments with seven frontier LLMs demonstrate that
current models struggle with complex cyber scenarios, and none can bypass
defenses. These findings suggest that current models do not yet pose a
generalized cyber offense threat. Nonetheless, our work provides a robust
benchmark to guide the trustworthy development of future models.