AIセキュリティポータル K Program
Empirical Evaluation of SMOTE in Android Malware Detection with Machine Learning: Challenges and Performance in CICMalDroid 2020
Share
Abstract
Malware, malicious software designed to damage computer systems and perpetrate scams, is proliferating at an alarming rate, with thousands of new threats emerging daily. Android devices, prevalent in smartphones, smartwatches, tablets, and IoTs, represent a vast attack surface, making malware detection crucial. Although advanced analysis techniques exist, Machine Learning (ML) emerges as a promising tool to automate and accelerate the discovery of these threats. This work tests ML algorithms in detecting malicious code from dynamic execution characteristics. For this purpose, the CICMalDroid2020 dataset, composed of dynamically obtained Android malware behavior samples, was used with the algorithms XGBoost, Naıve Bayes (NB), Support Vector Classifier (SVC), and Random Forest (RF). The study focused on empirically evaluating the impact of the SMOTE technique, used to mitigate class imbalance in the data, on the performance of these models. The results indicate that, in 75% of the tested configurations, the application of SMOTE led to performance degradation or only marginal improvements, with an average loss of 6.14 percentage points. Tree-based algorithms, such as XGBoost and Random Forest, consistently outperformed the others, achieving weighted recall above 94%. It is inferred that SMOTE, although widely used, did not prove beneficial for Android malware detection in the CICMalDroid2020 dataset, possibly due to the complexity and sparsity of dynamic characteristics or the nature of malicious relationships. This work highlights the robustness of tree-ensemble models, such as XGBoost, and suggests that algorithmic data balancing approaches may be more effective than generating synthetic instances in certain cybersecurity scenarios
Share