Abstract
Existing malicious code detection techniques require integrating multiple
tools to detect different malware patterns and often suffer from high
misclassification rates. Such techniques could therefore be enhanced by
adopting more advanced, automated approaches that achieve high accuracy and
low misclassification rates. The goal of this study is to aid
security analysts in detecting malicious packages by empirically studying the
effectiveness of Large Language Models (LLMs) in detecting malicious code. We
present SocketAI, a malicious code review workflow. To
evaluate the effectiveness of SocketAI, we leverage a benchmark dataset of
5,115 npm packages, of which 2,180 packages have malicious code. We conducted a
baseline comparison of GPT-3 and GPT-4 models with the state-of-the-art CodeQL
static analysis tool, using 39 custom CodeQL rules developed in prior research
to detect malicious JavaScript code. We also compare the effectiveness of
static analysis as a pre-screener with the SocketAI workflow, measuring the
number of files that need to be analyzed and the associated costs.
Additionally, we
performed a qualitative study to understand the types of malicious activities
detected or missed by our workflow. Our baseline comparison demonstrates a 16%
and 9% improvement over static analysis in precision and F1 scores,
respectively. GPT-4 achieves higher accuracy with 99% precision and 97% F1
scores, while GPT-3 offers a more cost-effective balance at 91% precision and
94% F1 scores. Pre-screening files with a static analyzer reduces the number of
files requiring LLM analysis by 77.9% and decreases costs by 60.9% for GPT-3
and 76.1% for GPT-4. Our qualitative analysis identified data theft,
arbitrary code execution, and suspicious domains as the top categories of
detected malicious packages.