Abstract
Large Language Models (LLMs) have emerged as a powerful approach for driving
offensive penetration-testing tooling. Because LLMs are opaque, their efficacy
is typically analyzed through empirical methods. The quality of this analysis
depends heavily on the chosen testbed, the metrics captured, and the analysis
methods employed.
This paper analyzes the methodology and benchmarking practices used for
evaluating LLM-driven attacks, focusing on offensive uses of LLMs in
cybersecurity. We review 19 research papers detailing 18 prototypes and their
respective testbeds.
We detail our findings and provide actionable recommendations for future
research, emphasizing the importance of extending existing testbeds, creating
baselines, and including comprehensive metrics and qualitative analysis. We
also highlight the distinction between security research and security
practice, noting that CTF-based challenges may not fully represent real-world
penetration-testing scenarios.