Abstract
Background: Most of the existing machine learning models for security tasks,
such as spam detection, malware detection, or network intrusion detection, are
built on supervised machine learning algorithms. In such a paradigm, models
need a large amount of labeled data to learn the useful relationships between
selected features and the target class. However, such labeled data can be
scarce and expensive to acquire. Goal: To help security practitioners train
useful security classification models when only a small amount of labeled
training data and a large amount of unlabeled training data are available.
Method: We propose an adaptive framework
called Dapper, which optimizes 1) semi-supervised learning algorithms to assign
pseudo-labels to unlabeled data in a propagation paradigm and 2) the machine
learning classifier (i.e., random forest). When the dataset classes are highly
imbalanced, Dapper adaptively integrates and optimizes a data oversampling
method called SMOTE. We use Bayesian Optimization to search the large
hyperparameter space of these tuning targets. Result: We evaluate Dapper on
three security datasets, i.e., the Twitter spam dataset, the malware URLs
dataset, and the CIC-IDS-2017 dataset. Experimental results indicate that
Dapper can use as little as 10% of the original labeled data yet achieve
classification performance close to, or even better than, using 100% of the
labeled data in a fully supervised manner. Conclusion: Based on these results,
we recommend combining hyperparameter
optimization with semi-supervised learning when dealing with shortages of
labeled security data.
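
The sketch below illustrates the kind of pipeline the abstract describes: propagating pseudo-labels from a small labeled subset to unlabeled data, oversampling the minority class with SMOTE, and tuning a random forest with Bayesian optimization. It is a minimal illustration, not the authors' Dapper implementation; the library choices (scikit-learn, imbalanced-learn, scikit-optimize), the synthetic dataset, and the hyperparameter ranges are assumptions made for the example.

```python
# Minimal sketch of a semi-supervised + SMOTE + Bayesian-optimized pipeline.
# NOT the Dapper implementation; libraries and parameters are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.semi_supervised import LabelSpreading
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE
from skopt import BayesSearchCV

# Synthetic imbalanced data; hide ~90% of the training labels (marked -1),
# mimicking the "10% labeled data" setting from the abstract.
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)
rng = np.random.default_rng(0)
y_semi = y_train.copy()
y_semi[rng.random(len(y_semi)) < 0.9] = -1   # -1 = unlabeled

# 1) Assign pseudo-labels to unlabeled points via label propagation.
propagator = LabelSpreading(kernel="knn", n_neighbors=7)
propagator.fit(X_train, y_semi)
pseudo_labels = propagator.transduction_

# 2) Oversample the (pseudo-)labeled minority class with SMOTE.
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_train, pseudo_labels)

# 3) Tune a random forest with Bayesian optimization over its hyperparameters.
search = BayesSearchCV(
    RandomForestClassifier(random_state=0),
    search_spaces={"n_estimators": (50, 300),
                   "max_depth": (3, 20),
                   "min_samples_leaf": (1, 10)},
    n_iter=20, cv=3, scoring="f1", random_state=0)
search.fit(X_bal, y_bal)
print("held-out F1 on true labels:", search.score(X_test, y_test))
```

In this sketch the three tuning targets (the label-propagation step, SMOTE, and the random forest) are optimized separately or with fixed settings for brevity; a Dapper-style framework would search their hyperparameters jointly.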