Abstract
Binary analysis of software is a critical step in cyber forensics
applications such as program vulnerability assessment and malware detection.
This involves interpreting instructions executed by software and often
necessitates converting the software's binary file data to assembly language.
The conversion process requires information about the binary file's target
instruction set architecture (ISA). However, ISA information might not be
included in binary files due to compilation errors, partial downloads, or
adversarial corruption of file metadata. Machine learning (ML) is a promising
methodology that can be used to identify the target ISA using binary data in
the object code section of binary files. In this paper, we propose a binary code
feature extraction model to improve the accuracy and scalability of ML-based
ISA identification methods. Our feature extraction model can be used in the
absence of domain knowledge about the ISAs. Specifically, we adapt models from
natural language processing (NLP) to i) identify successive byte patterns
commonly observed in binary codes, ii) estimate the significance of each byte
pattern to a binary file, and iii) estimate the relevance of each byte pattern
in distinguishing between ISAs. We introduce character-level features of
encoded binaries to identify fine-grained bit patterns inherent to each ISA. We
use a dataset with binaries from 12 different ISAs to evaluate our approach.
Empirical evaluations show that using our byte-level features in ML-based ISA
identification yields 8% higher accuracy than state-of-the-art features based
on byte histograms and byte pattern signatures. We also observe that
character-level features reduce the size of the feature set by up to 16x while
maintaining accuracy above 97%.
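
The three steps listed above (finding successive byte patterns, weighting each pattern within a file, and weighting it across files) resemble n-gram extraction with TF-IDF weighting from NLP. The sketch below illustrates that reading only; the abstract does not name the exact model, so the TF-IDF formulation, the function names `byte_ngrams` and `tfidf_features`, and the default n-gram length are all assumptions made for illustration.

```python
import math
from collections import Counter


def byte_ngrams(data: bytes, n: int):
    """Yield successive overlapping n-byte patterns from a byte string."""
    for i in range(len(data) - n + 1):
        yield data[i:i + n]


def tfidf_features(binaries, n=2):
    """Return one {pattern: weight} dict per input byte string.

    Assumed weighting (not taken from the paper):
      tf  = share of the file's n-grams that equal the pattern
      idf = log(number of files / number of files containing the pattern)
    """
    # Step i: count byte patterns in each file.
    counts = [Counter(byte_ngrams(b, n)) for b in binaries]

    # Number of files in which each pattern occurs at least once.
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())

    num_docs = len(binaries)
    features = []
    for c in counts:
        total = sum(c.values())
        weights = {}
        for pattern, k in c.items():
            tf = k / total                                # step ii: significance within the file
            idf = math.log(num_docs / doc_freq[pattern])  # step iii: discriminative relevance
            weights[pattern] = tf * idf
        features.append(weights)
    return features
```

Calling `tfidf_features([code_a, code_b])` on the object code bytes of two files returns one weight dictionary per file. A pattern that occurs in every file receives weight zero under this formulation, so only patterns that help separate files (and, by extension, ISAs) carry weight into a downstream classifier.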