Empirical Evaluation of SMOTE in Android Malware Detection with Machine Learning: Challenges and Performance in CICMalDroid 2020

TOP 文献データベース Empirical Evaluation of SMOTE in Android Malware Detection with Machine Learning: Challenges and Performance in CICMalDroid 2020

AIセキュリティポータルbot

文献データベースの情報は、自動的に収集されています。

Source

https://arxiv.org/abs/2602.08744

PDF

https://arxiv.org/pdf/2602.08744

文献情報

作者: Diego Ferreira Duarte,Andre Augusto Bortoli
公開日: 2026-2-9
所属機関: Instituto de Informatica – Universidade Federal do Rio Grande do Sul (UFRGS)
所属の国: Brazil
会議名

AIにより推定されたラベル

機械学習によるマルウェア分類不均衡データセットデータ前処理

※ こちらのラベルはAIによって自動的に追加されました。そのため、正確でないことがあります。
詳細は文献データベースについてをご覧ください。

Abstract

Malware, malicious software designed to damage computer systems and perpetrate scams, is proliferating at an alarming rate, with thousands of new threats emerging daily. Android devices, prevalent in smartphones, smartwatches, tablets, and IoTs, represent a vast attack surface, making malware detection crucial. Although advanced analysis techniques exist, Machine Learning (ML) emerges as a promising tool to automate and accelerate the discovery of these threats. This work tests ML algorithms in detecting malicious code from dynamic execution characteristics. For this purpose, the CICMalDroid2020 dataset, composed of dynamically obtained Android malware behavior samples, was used with the algorithms XGBoost, Naıve Bayes (NB), Support Vector Classifier (SVC), and Random Forest (RF). The study focused on empirically evaluating the impact of the SMOTE technique, used to mitigate class imbalance in the data, on the performance of these models. The results indicate that, in 75% of the tested configurations, the application of SMOTE led to performance degradation or only marginal improvements, with an average loss of 6.14 percentage points. Tree-based algorithms, such as XGBoost and Random Forest, consistently outperformed the others, achieving weighted recall above 94%. It is inferred that SMOTE, although widely used, did not prove beneficial for Android malware detection in the CICMalDroid2020 dataset, possibly due to the complexity and sparsity of dynamic characteristics or the nature of malicious relationships. This work highlights the robustness of tree-ensemble models, such as XGBoost, and suggests that algorithmic data balancing approaches may be more effective than generating synthetic instances in certain cybersecurity scenarios