Android malware detection has been extensively studied using both traditional
machine learning (ML) and deep learning (DL) approaches. While many
state-of-the-art detection models, particularly those based on DL, claim
superior performance, they often rely on limited comparisons, lacking
comprehensive benchmarking against traditional ML models across diverse
datasets. This raises concerns about how robust the reported performance of
DL-based approaches is, and about whether simpler, more efficient ML models are
being overlooked.
In this paper, we conduct a systematic evaluation of Android malware detection
models across four datasets: three recently published, publicly available
datasets and a large-scale dataset we systematically collected. We implement a
range of traditional ML models, including Random Forests (RF) and CatBoost,
alongside advanced DL models such as Capsule Graph Neural Networks (CapsGNN),
BERT-based models, and ExcelFormer-based models. Our results reveal that, in
many cases, simpler and more computationally efficient ML models achieve
performance comparable to, or even better than, that of DL models. These findings
highlight the need for rigorous benchmarking in Android malware detection
research. We encourage future studies to benchmark traditional and advanced
models against each other more comprehensively, so that detection capabilities
are assessed more accurately. To facilitate further
research, we provide access to our dataset, including app IDs, hash values, and
labels.