Abstract
Advanced Persistent Threats (APTs) have caused significant losses across a
wide range of sectors, including the theft of sensitive data and harm to system
integrity. As attack techniques grow increasingly sophisticated and stealthy,
the arms race between cyber defenders and attackers continues to intensify. The
revolutionary impact of Large Language Models (LLMs) has opened up numerous
opportunities in various fields, including cybersecurity. An intriguing
question arises: can the extensive knowledge embedded in LLMs be harnessed for
provenance analysis and play a positive role in identifying previously unknown
malicious events? To seek a deeper understanding of this issue, we propose a
new strategy for taking advantage of LLMs in provenance-based threat detection.
In our design, a state-of-the-art LLM supplies additional detail for
interpreting provenance data, drawing on its knowledge of system calls, software
identity, and the high-level execution context of applications. The
LLM's contextualized embedding capability is then used to capture the rich
semantics of event descriptions. We comprehensively examine the quality of
the resulting embeddings and find them a promising basis for threat detection.
Machine learning models built upon these embeddings demonstrate
outstanding performance on real-world data. In our evaluation, supervised
threat detection achieves a precision of 99.0%, and semi-supervised anomaly
detection attains a precision of 96.9%.