Abstract
When training a machine learning model, there is likely to be a tradeoff
between accuracy and the diversity of the dataset. Previous research has shown
that if we train a model to detect one specific malware family, we generally
obtain stronger results than when we train a single model on
multiple diverse families. However, during the detection phase, it would be
more efficient to have a single model that can reliably detect multiple
families, rather than having to score each sample against multiple models. In
this research, we conduct experiments based on byte $n$-gram features to
quantify the relationship between the generality of the training dataset and
the accuracy of the corresponding machine learning models, all within the
context of the malware detection problem. We find that neighborhood-based
algorithms generalize surprisingly well, far outperforming the other machine
learning techniques considered.
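
As an illustration of the byte $n$-gram features mentioned above, the following is a minimal sketch of how such features might be extracted from a raw binary. The abstract does not specify the value of $n$, the vocabulary, or any normalization, so the function names, the default `n = 2`, and the relative-frequency scaling are all assumptions made here for illustration, not the authors' pipeline.

```python
from collections import Counter


def byte_ngram_counts(data: bytes, n: int = 2) -> Counter:
    """Count every contiguous run of n bytes in data (sliding window, step 1).

    Hypothetical helper: the choice n = 2 is an assumption, not from the paper.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    # Slicing a bytes object yields bytes, which is hashable and usable as a key.
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def ngram_vector(data: bytes, vocabulary: list, n: int = 2) -> list:
    """Map a sample to a fixed-length vector of relative n-gram frequencies.

    vocabulary is an assumed, pre-selected list of n-grams (for example, the
    most frequent ones in a training set); n-grams outside it are ignored.
    """
    counts = byte_ngram_counts(data, n)
    total = sum(counts.values()) or 1  # avoid division by zero on short inputs
    return [counts[gram] / total for gram in vocabulary]


if __name__ == "__main__":
    sample = b"MZMZ"  # toy input, not a real executable
    print(byte_ngram_counts(sample, 2))
    print(ngram_vector(sample, [b"MZ", b"ZM", b"AA"], 2))
```

Vectors of this kind could then be passed to any classifier, including a neighborhood-based one; which models and distance measures the paper actually uses is not stated in the abstract.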