Abstract
Malware developers use combinations of techniques such as compression,
encryption, and obfuscation to bypass anti-virus software. Malware with
anti-analysis technologies can bypass AI-based anti-virus software and malware
analysis tools. Therefore, classifying packed files is a major challenge.
Problems arise if malware classifiers learn the features of packers rather than
those of malware. Training models on such unintentionally erroneous data can
have the same effect as poisoning attacks, adversarial attacks, and evasion
attacks. Therefore, researchers
should account for packing when building malware classifier models. In this
paper, we propose a multi-step framework for classifying and identifying packed
samples, which consists of pseudo-optimal feature selection, machine
learning-based classifiers, and packer identification steps. In the first step,
we use the CART algorithm and permutation importance to preselect the 20 most
important features. In the second step, each model is trained on the 20
preselected features to classify packed files with the highest performance. As
a result, XGBoost trained on the features preselected by XGBoost with
permutation importance achieved the highest performance of all experimental
scenarios, with an accuracy of 99.67%, an F1-score of 99.46%, and an area under
the curve (AUC) of 99.98%. In the third step, we propose a new approach that can identify
packers only for samples classified as Well-Known Packed.
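
The preselection idea in the first step can be illustrated with a minimal, pure-Python sketch of permutation importance. It is not the framework described in the abstract: the toy data, the hypothetical entropy feature, the fixed threshold rule standing in for a trained CART or XGBoost model, and k = 1 (instead of 20) are all assumptions made for illustration.

```python
import random

def accuracy(model, X, y):
    """Fraction of rows for which model(row) equals the label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(model, X, y, n_repeats=5, seed=0):
    """Mean drop in accuracy when each feature column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle only column j, leaving all other columns intact.
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances

def top_k_features(importances, k):
    """Indices of the k features with the largest importance."""
    order = sorted(range(len(importances)),
                   key=lambda j: importances[j], reverse=True)
    return order[:k]

# Toy data (assumed): column 0 is a hypothetical section-entropy feature,
# columns 1 and 2 are noise. Label 1 means "packed".
noise = random.Random(1)
X = [[7.5 if i % 2 == 0 else 5.0, noise.random(), noise.random()]
     for i in range(40)]
y = [1 if i % 2 == 0 else 0 for i in range(40)]

# Stand-in for a trained classifier: a fixed threshold on column 0.
def model(row):
    return int(row[0] > 7.0)

importances = permutation_importance(model, X, y)
selected = top_k_features(importances, k=1)
print(selected, importances)
```

Shuffling a column the model relies on lowers accuracy, while shuffling a column it ignores changes nothing, so ranking by the mean drop and keeping the top k gives a preselected feature set. With a real classifier the same ranking would be computed on held-out data.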