Abstract
Enterprise penetration testing is often limited by high operational costs and
the scarcity of human expertise. This paper investigates the feasibility and
effectiveness of using Large Language Model (LLM)-driven autonomous systems to
address these challenges in real-world Active Directory (AD) enterprise
networks.
We introduce a novel prototype that employs LLMs to autonomously perform
Assumed Breach penetration testing against enterprise networks. Our
system represents the first demonstration of a fully autonomous, LLM-driven
framework capable of compromising accounts within a real-life Microsoft Active
Directory testbed, GOAD.
We perform our empirical evaluation using five LLMs, comparing reasoning and
non-reasoning models and including open-weight models. Through
quantitative and qualitative analysis, incorporating insights from
cybersecurity experts, we demonstrate that autonomous LLMs can effectively
conduct Assumed Breach simulations. Key findings highlight their ability to
dynamically adapt attack strategies, perform inter-context attacks (e.g.,
web-app audits, social engineering, and unstructured data analysis for
credentials), and generate scenario-specific attack parameters like realistic
password candidates. The prototype also exhibits robust self-correction
mechanisms, installing missing tools and rectifying the invalid commands it
generates.
We find that the associated costs are competitive with, and often
significantly lower than, those incurred by professional human pen-testers,
suggesting a path toward democratizing access to essential security testing for
organizations with budgetary constraints. However, our research also
illuminates existing limitations, including instances of LLMs ``going down
rabbit holes'', challenges in comprehensive information transfer between
planning and execution modules, and critical safety concerns that necessitate
human oversight.