Paper Information
- Author
- Feijiang Han, Jiaming Zhang, Chuyi Deng, Jianheng Tang, Yunhuai Liu
- Published
- 4-15-2025
- Updated
- 10-13-2025
- Affiliation
- University of Pennsylvania
- Country
- United States of America
- Conference
- Computing Research Repository (CoRR)
Abstract
WebShell attacks, where malicious scripts are injected into web servers, pose
a significant cybersecurity threat. Traditional machine learning (ML) and deep
learning (DL) methods are often hampered by the need for extensive training
data, catastrophic forgetting, and poor generalization. Recently, Large
Language Models (LLMs) have emerged as powerful alternatives for code-related
tasks, but their
potential in WebShell detection remains underexplored. In this paper, we make
two contributions: (1) a comprehensive evaluation of seven LLMs, including
GPT-4, LLaMA 3.1 70B, and Qwen 2.5 variants, benchmarked against traditional
sequence- and graph-based methods using a dataset of 26.59K PHP scripts, and
(2) the Behavioral Function-Aware Detection (BFAD) framework, designed to
address the specific challenges of applying LLMs to this domain. Our framework
integrates three components: a Critical Function Filter that isolates malicious
PHP function calls, a Context-Aware Code Extraction strategy that captures the
most behaviorally indicative code segments, and Weighted Behavioral Function
Profiling that enhances in-context learning by prioritizing the most relevant
demonstrations based on discriminative function-level profiles. Our results
show that, stemming from their distinct analytical strategies, larger LLMs
achieve near-perfect precision but lower recall, while smaller models exhibit
the opposite trade-off. However, without BFAD, all of the evaluated LLMs lag
behind previous state-of-the-art (SOTA) methods. With BFAD applied, the
performance of every LLM improves significantly, with an average F1 score
increase of 13.82%. Notably, larger models then outperform the SOTA methods,
while smaller models such as Qwen-2.5-Coder-3B achieve performance competitive
with traditional methods.
This work is the first to explore the feasibility and limitations of LLMs for
WebShell detection and provides solutions to address the challenges in this
task.
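
To make the three BFAD components more concrete, the following is a minimal Python sketch of how such a pipeline might be wired together. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the function list, the extraction window, or the scoring formula, so `CRITICAL_FUNCTIONS`, the line-based `window`, the `weights` dictionary, and every function name below are hypothetical.

```python
import re
from collections import Counter

# Hypothetical list of risky PHP functions; the paper's actual list
# is not given in the abstract.
CRITICAL_FUNCTIONS = {
    "eval", "assert", "system", "exec",
    "shell_exec", "passthru", "base64_decode",
}

# Matches an identifier followed by "(", a rough stand-in for a PHP call.
CALL_RE = re.compile(r"\b([A-Za-z_][A-Za-z0-9_]*)\s*\(")


def critical_calls(php_source):
    """Critical Function Filter (sketch): list (line_index, name) hits."""
    hits = []
    for i, line in enumerate(php_source.splitlines()):
        for name in CALL_RE.findall(line):
            if name.lower() in CRITICAL_FUNCTIONS:
                hits.append((i, name.lower()))
    return hits


def extract_context(php_source, window=1):
    """Context-Aware Code Extraction (sketch): keep only the lines
    within `window` lines of a critical call."""
    lines = php_source.splitlines()
    keep = set()
    for i, _ in critical_calls(php_source):
        keep.update(range(max(0, i - window), min(len(lines), i + window + 1)))
    return "\n".join(lines[i] for i in sorted(keep))


def profile(php_source):
    """Count how often each critical function is called."""
    return Counter(name for _, name in critical_calls(php_source))


def weighted_similarity(p, q, weights):
    """Weighted overlap of two profiles (an assumed scoring rule).
    Functions missing from `weights` default to a weight of 1.0."""
    return sum(weights.get(f, 1.0) * min(p[f], q[f]) for f in p.keys() & q.keys())


def select_demonstrations(query, pool, weights, k=1):
    """Weighted Behavioral Function Profiling (sketch): rank candidate
    in-context demonstrations by profile similarity to the query."""
    qp = profile(query)
    ranked = sorted(
        pool,
        key=lambda d: weighted_similarity(qp, profile(d), weights),
        reverse=True,
    )
    return ranked[:k]
```

In this sketch, the text returned by `extract_context` and the examples returned by `select_demonstrations` would be placed in the LLM prompt in place of the full script and a random set of demonstrations. A real system would likely need a proper PHP parser rather than a regular expression, since obfuscated WebShells often hide calls behind variable functions or string concatenation.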