Abstract
Background: Most of the existing machine learning models for security tasks,
such as spam detection, malware detection, or network intrusion detection, are
built on supervised machine learning algorithms. In such a paradigm, models
need a large amount of labeled data to learn the useful relationships between
selected features and the target class. However, such labeled data can be
scarce and expensive to acquire. Goal: To help security practitioners train
useful security classification models when only a small amount of labeled
training data and a large amount of unlabeled training data are available.
Method: We propose an adaptive framework
called Dapper, which optimizes 1) semi-supervised learning algorithms to assign
pseudo-labels to unlabeled data in a propagation paradigm and 2) the machine
learning classifier (i.e., random forest). When the dataset classes are highly
imbalanced, Dapper adaptively integrates and optimizes a data oversampling
method called SMOTE. We use Bayesian Optimization to search the large
hyperparameter space of these tuning targets. Result: We evaluate Dapper on
three security datasets, i.e., the Twitter spam dataset, the malware URLs
dataset, and the CIC-IDS-2017 dataset. Experimental results indicate that
Dapper can use as little as 10% of the original labeled data yet achieve
classification performance close to, or even better than, using 100% of the
labeled data in a fully supervised manner. Conclusion: Based on these results,
we recommend combining hyperparameter
optimization with semi-supervised learning when dealing with shortages of
labeled security data.
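
The sketch below illustrates the kind of pipeline the abstract describes: propagating pseudo-labels from a small labeled subset to unlabeled data, oversampling the minority class with SMOTE, and tuning a random forest with Bayesian optimization. It is a minimal illustration, not the authors' Dapper implementation; the library choices (scikit-learn, imbalanced-learn, scikit-optimize), the synthetic dataset, and the hyperparameter ranges are assumptions made for the example.

```python
# Minimal sketch of a semi-supervised + SMOTE + Bayesian-optimized pipeline.
# NOT the Dapper implementation; libraries and parameters are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.semi_supervised import LabelSpreading
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE
from skopt import BayesSearchCV

# Synthetic imbalanced data; hide ~90% of the training labels (marked -1),
# mimicking the "10% labeled data" setting from the abstract.
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)
rng = np.random.default_rng(0)
y_semi = y_train.copy()
y_semi[rng.random(len(y_semi)) < 0.9] = -1   # -1 = unlabeled

# 1) Assign pseudo-labels to unlabeled points via label propagation.
propagator = LabelSpreading(kernel="knn", n_neighbors=7)
propagator.fit(X_train, y_semi)
pseudo_labels = propagator.transduction_

# 2) Oversample the (pseudo-)labeled minority class with SMOTE.
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_train, pseudo_labels)

# 3) Tune a random forest with Bayesian optimization over its hyperparameters.
search = BayesSearchCV(
    RandomForestClassifier(random_state=0),
    search_spaces={"n_estimators": (50, 300),
                   "max_depth": (3, 20),
                   "min_samples_leaf": (1, 10)},
    n_iter=20, cv=3, scoring="f1", random_state=0)
search.fit(X_bal, y_bal)
print("held-out F1 on true labels:", search.score(X_test, y_test))
```

In this sketch the three tuning targets (the label-propagation step, SMOTE, and the random forest) are optimized separately or with fixed settings for brevity; a Dapper-style framework would search their hyperparameters jointly.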