State machines are essential to protocol analysis for identifying
vulnerabilities. However, inferring state machines from network protocol
implementations is challenging due to complex code syntax and semantics.
Traditional dynamic analysis methods often miss critical state transitions due
to limited coverage, while static analysis faces path explosion issues. To
overcome these challenges, we introduce a novel state machine inference
approach, named ProtocolGPT, that uses Large Language Models (LLMs). This method
employs retrieval-augmented generation (RAG) to enhance a pre-trained
model with specific knowledge from protocol implementations. Through effective
prompt engineering, we accurately identify and infer state machines. To the
best of our knowledge, this is the first state machine inference approach
that leverages the source code of protocol implementations. Our
evaluation of six protocol implementations shows that our method achieves a
precision of over 90%, outperforming the baselines by more than 30%.
Furthermore, integrating our approach with protocol fuzzing improves coverage
by more than 20% over baseline methods and uncovers two 0-day
vulnerabilities.
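
The retrieval-augmented prompting idea described above can be illustrated with a minimal sketch. This is not ProtocolGPT's implementation: it assumes a simple keyword-overlap ranking in place of whatever retriever the method actually uses, makes no LLM call, and every name in it (the sample C source, `chunk_source`, `retrieve`, `build_prompt`, the prompt wording) is hypothetical.

```python
import re
from collections import Counter

# Hypothetical protocol implementation fragment, used only for illustration.
SAMPLE_SOURCE = """\
static int log_level = 2;
void set_log_level(int v) { log_level = v; }

int handle_user(session *s) {
    if (s->state == STATE_INIT) s->state = STATE_USER_OK;
    return 0;
}

int handle_pass(session *s) {
    if (s->state == STATE_USER_OK) s->state = STATE_AUTHED;
    return 0;
}
"""


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def chunk_source(source):
    """Split source code into chunks at blank lines (roughly one per function)."""
    return [c for c in re.split(r"\n\s*\n", source.strip()) if c.strip()]


def retrieve(chunks, query, k=2):
    """Rank chunks by token overlap with the query and return the top k.

    Assumption: keyword overlap stands in for a real retriever here.
    """
    query_counts = Counter(tokenize(query))

    def score(chunk):
        chunk_counts = Counter(tokenize(chunk))
        return sum(min(n, chunk_counts[tok]) for tok, n in query_counts.items())

    return sorted(chunks, key=score, reverse=True)[:k]


def build_prompt(context_chunks, question):
    """Assemble a prompt that places the retrieved code before the question."""
    context = "\n\n".join(context_chunks)
    return (
        "You are analyzing a network protocol implementation.\n"
        "Relevant source code:\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "List each transition as (source state, message, target state)."
    )


QUESTION = "Which state transitions does the session state machine perform?"
chunks = chunk_source(SAMPLE_SOURCE)
top_chunks = retrieve(chunks, QUESTION, k=2)
prompt = build_prompt(top_chunks, QUESTION)
```

In this sketch the logging helper scores zero against the question and is left out of the prompt, so a model would see only the two handlers that modify the session state.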