Benchmarking LLM-Assisted Blue Teaming via Standardized Threat Hunting

Information in the literature database is collected automatically.

Source

https://arxiv.org/abs/2509.23571

PDF

https://arxiv.org/pdf/2509.23571

Paper Information

Authors: Yuqiao Meng, Luoxi Tang, Feiyang Yu, Xi Li, Guanhua Yan, Ping Yang, Zhaohan Xi
Published: 9-28-2025
Updated: 10-2-2025
Affiliation: Binghamton University
Country: United States of America
Conference: Computing Research Repository (CoRR)

Labels Estimated by AI

Security Strategy Generation, RAG, Efficient Resolution of Learning Tasks

These labels were automatically added by AI and may be inaccurate.
For details, see About Literature Database.

Abstract

As cyber threats continue to grow in scale and sophistication, blue team defenders increasingly require advanced tools to proactively detect and mitigate risks. Large Language Models (LLMs) offer promising capabilities for enhancing threat analysis. However, their effectiveness in real-world blue team threat-hunting scenarios remains insufficiently explored. This paper presents CyberTeam, a benchmark designed to guide LLMs in blue teaming practice. CyberTeam constructs a standardized workflow in two stages. First, it models realistic threat-hunting workflows by capturing the dependencies among analytical tasks from threat attribution to incident response. Next, each task is addressed through a set of operational modules tailored to its specific analytical requirements. This transforms threat hunting into a structured sequence of reasoning steps, with each step grounded in a discrete operation and ordered according to task-specific dependencies. Guided by this framework, LLMs are directed to perform threat-hunting tasks through modularized steps. Overall, CyberTeam integrates 30 tasks and 9 operational modules to guide LLMs through standardized threat analysis. We evaluate both leading LLMs and state-of-the-art cybersecurity agents, comparing CyberTeam against open-ended reasoning strategies. Our results highlight the improvements enabled by standardized design, while also revealing the limitations of open-ended reasoning in real-world threat hunting.
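
The two-stage structure described in the abstract (tasks ordered by their dependencies, each task resolved through operational modules) can be illustrated with a minimal sketch. This is not the CyberTeam implementation. Only the endpoints "threat attribution" and "incident response" come from the abstract; the intermediate tasks, every module name, and the `build_plan` helper are hypothetical and chosen purely for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical task graph: each task maps to the set of tasks it depends on.
# Only the first and last task names are taken from the abstract; the
# intermediate tasks are assumptions made for this sketch.
TASK_DEPENDENCIES = {
    "threat_attribution": set(),
    "behavior_analysis": {"threat_attribution"},
    "risk_prioritization": {"behavior_analysis"},
    "incident_response": {"risk_prioritization"},
}

# Hypothetical operational modules assigned to each task.
TASK_MODULES = {
    "threat_attribution": ["extract_entities", "map_to_known_actors"],
    "behavior_analysis": ["extract_entities", "summarize_behavior"],
    "risk_prioritization": ["score_severity"],
    "incident_response": ["recommend_mitigation"],
}


def build_plan(dependencies, modules):
    """Return (task, module) steps, with every task placed after its dependencies."""
    order = TopologicalSorter(dependencies).static_order()
    return [(task, module) for task in order for module in modules[task]]


plan = build_plan(TASK_DEPENDENCIES, TASK_MODULES)
for step, (task, module) in enumerate(plan, start=1):
    print(f"{step}. {task}: {module}")
```

Each entry in `plan` corresponds to one discrete reasoning step that could be handed to an LLM in turn. A topological sort is used here only because it is the simplest way to respect dependencies; the paper may order tasks differently.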

External Datasets

MITRE CVE database

NVD (National Vulnerability Database)

Exploit-DB

D3FEND

Oracle Security Alerts

Red Hat Bugzilla

RHSA (Red Hat Security Advisories)

IBM X-Force Exchange

CISE (Cybersecurity Information Sharing Environment)

VulDB (Vulnerability Database)

Apache Security Advisories

Mandiant Threat Intelligence Reports

Recorded Future Threat Intelligence Reports

Unit 42 Threat Research

Microsoft Security Update Guide

CVSS (Common Vulnerability Scoring System)

EPSS (Exploit Prediction Scoring System)

MISP (Malware Information Sharing Platform)

VirusTotal

AlienVault Open Threat Exchange (OTX)