Network and system security are critical concerns. Because malware is
proliferating rapidly, traditional analysis methods struggle to cope with the
enormous number of samples.
In this paper, we propose four easy-to-extract, small-scale features for
classifying malware families: the sizes and permissions of Windows PE sections,
content complexity, and import libraries. We use automated machine learning to
search for the best model and hyper-parameters for each feature and for their
combinations. Compared with detailed behavior-related features such as API
sequences, the proposed features provide macroscopic information about malware.
The analysis is based on static disassembly scripts and hexadecimal machine code.
Unlike dynamic behavior analysis, static analysis is resource-efficient and
offers complete code coverage, but is vulnerable to code obfuscation and
encryption.
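
To make the "content complexity" feature concrete, the following is a minimal
sketch that assumes complexity is measured as the Shannon entropy of a
section's raw bytes. The text above does not say which measure is used, so
both the choice of entropy and the function name `shannon_entropy` are
illustrative assumptions rather than the method of this paper.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of `data` in bits per byte (0.0 to 8.0).

    Assumption: entropy stands in for "content complexity"; the paper's
    actual measure may differ.
    """
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # frequency of each byte value
    # H = sum over byte values of p * log2(1 / p), with p = count / total
    return sum((n / total) * math.log2(total / n) for n in counts.values())
```

Under this assumption, packed or encrypted sections would tend to score close
to 8 bits per byte, while padding and sparse data would score close to 0.
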
The results demonstrate that features which work well in dynamic analysis are
not necessarily effective in static analysis. For instance, API 4-grams achieve
only 57.96% accuracy and require a relatively high-dimensional feature set
(5000 dimensions). In contrast, the proposed features combined with a classical
machine learning algorithm (Random Forest) achieve 99.40% accuracy with a much
smaller feature vector (40 dimensions). We demonstrate the effectiveness of
this approach by integrating it into IDA Pro, which also facilitates the
collection of new training samples and subsequent model retraining.
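
For comparison, the API 4-gram baseline mentioned above can be sketched as
follows. This is an illustration under stated assumptions, not the pipeline
used in this work. It assumes each sample has already been reduced to an
ordered list of API names, and that the vocabulary keeps the most frequent
4-grams up to a fixed limit. The helper names `ngrams`, `build_vocabulary`,
and `vectorize` are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Gram = Tuple[str, ...]


def ngrams(seq: Sequence[str], n: int = 4) -> List[Gram]:
    """Return all contiguous n-grams of `seq` (empty if `seq` is too short)."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


def build_vocabulary(samples: Sequence[Sequence[str]], n: int = 4,
                     max_features: int = 5000) -> List[Gram]:
    """Keep the `max_features` most frequent n-grams across all samples.

    Assumption: frequency-based selection; the source only states the
    final dimensionality (5000), not how the n-grams were chosen.
    """
    counts: Counter = Counter()
    for seq in samples:
        counts.update(ngrams(seq, n))
    return [gram for gram, _ in counts.most_common(max_features)]


def vectorize(seq: Sequence[str], vocabulary: Sequence[Gram],
              n: int = 4) -> List[int]:
    """Map one API sequence to a count vector aligned with `vocabulary`."""
    counts = Counter(ngrams(seq, n))
    return [counts[gram] for gram in vocabulary]
```

The resulting vectors have one dimension per retained 4-gram, which is why
this representation grows to thousands of dimensions while the proposed
features need only 40.
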