Labels Predicted by AI
Hyperparameter Optimization, Poisoning, Model Performance Evaluation
Please note that these labels were automatically added by AI and therefore may not be entirely accurate.
Abstract
Background: Most existing machine learning models for security tasks, such as spam detection, malware detection, or network intrusion detection, are built on supervised machine learning algorithms. In such a paradigm, models need a large amount of labeled data to learn the useful relationships between the selected features and the target class. However, such labeled data can be scarce and expensive to acquire.

Goal: To help security practitioners train useful security classification models when few labeled training data and many unlabeled training data are available.

Method: We propose an adaptive framework called Dapper, which optimizes 1) semi-supervised learning algorithms that assign pseudo-labels to unlabeled data in a propagation paradigm and 2) the machine learning classifier (i.e., random forest). When the dataset classes are highly imbalanced, Dapper additionally integrates and optimizes a data oversampling method called SMOTE. We use Bayesian Optimization to search the large hyperparameter space of these tuning targets.

Result: We evaluate Dapper with three security datasets, i.e., the Twitter spam dataset, the malware URLs dataset, and the CIC-IDS-2017 dataset. Experimental results indicate that we can use as low as 10% of the original labeled data and still achieve similar or even better classification performance than using 100% of the labeled data.

Conclusion: Based on these results, we would recommend using hyperparameter optimization with semi-supervised learning when dealing with shortages of labeled security data.
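The Method section describes a pipeline of label propagation over unlabeled data, SMOTE oversampling for imbalanced classes, a random forest classifier, and Bayesian Optimization over their hyperparameters. The sketch below is a minimal illustration of that pipeline, not the authors' Dapper implementation: the toy dataset, the 10% labeling ratio, the scikit-learn LabelSpreading / imbalanced-learn SMOTE / RandomForestClassifier components, and all parameter values are assumptions made for this example.

```python
# Minimal sketch of the pipeline described in the abstract (assumed components,
# not the authors' Dapper code): pseudo-label unlabeled data via label
# propagation, rebalance with SMOTE, then train a random forest.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.semi_supervised import LabelSpreading
from imblearn.over_sampling import SMOTE

# Toy imbalanced binary dataset; in the paper this would be e.g. Twitter spam data.
X, y = make_classification(n_samples=2000, n_classes=2, weights=[0.9, 0.1],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=0)

# Keep only 10% of the training labels; mark the rest as unlabeled (-1).
rng = np.random.default_rng(0)
labeled_mask = rng.random(len(y_train)) < 0.10
y_partial = np.where(labeled_mask, y_train, -1)

# Step 1: propagate labels from the few labeled points to the unlabeled ones.
propagator = LabelSpreading(kernel="knn", n_neighbors=7)
propagator.fit(X_train, y_partial)
y_pseudo = propagator.transduction_

# Step 2: oversample the minority class on the pseudo-labeled training set.
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_train, y_pseudo)

# Step 3: train the classifier. In a Dapper-style run, a Bayesian optimizer
# would choose n_estimators, max_depth, SMOTE neighbors, etc. jointly.
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_bal, y_bal)
print("F1 on held-out test set:", f1_score(y_test, clf.predict(X_test)))
```

The block prints an F1 score on a held-out test set; in Dapper the same three stages would be wrapped in a Bayesian Optimization loop (e.g., via scikit-optimize or Optuna, both assumptions here) that tunes the semi-supervised learner, the oversampler, and the classifier jointly.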