Abstract
Ethical hacking today relies on highly skilled practitioners executing
complex sequences of commands, which is inherently time-consuming, difficult to
scale, and prone to human error. To help mitigate these limitations, we
previously introduced 'PenTest++', an AI-augmented system combining automation
with generative AI supporting ethical hacking workflows. However, a key
limitation of PenTest++ was its lack of support for privilege escalation, a
crucial element of ethical hacking. In this paper we present 'PenTest2.0', a
substantial evolution of PenTest++ supporting automated privilege escalation
driven entirely by Large Language Model reasoning. It also incorporates several
significant enhancements: 'Retrieval-Augmented Generation', supporting both
online and offline modes; 'Chain-of-Thought' prompting for intermediate
reasoning; persistent 'PenTest Task Trees' to track goal progression across
turns; and the optional integration of human-authored hints. We describe how it
operates, present a proof-of-concept prototype, and discuss its benefits and
limitations. We also describe the application of the system to a controlled
Linux target, showing that it can carry out multi-turn, adaptive privilege
escalation. We
explain the rationale behind its core design choices, and provide comprehensive
testing results and cost analysis. Our findings indicate that 'PenTest2.0'
represents a meaningful step toward practical, scalable, AI-automated
penetration testing. At the same time, our results highlight the shortcomings
of generative AI systems, particularly their sensitivity to prompt structure,
execution context, and semantic drift, reinforcing the need for further
research and refinement in this emerging space.
Keywords: AI, Ethical Hacking, Privilege Escalation, GenAI, ChatGPT, LLM
(Large Language Model), HITL (Human-in-the-Loop)