Based on API call sequences, semantics-aware and machine learning (ML) based
malware classifiers can be built for malware detection or classification.
Previous work concentrates on hand-crafting and extracting features from
malware binaries, disassembled binaries, or API calls via static or dynamic
analysis, and then resorts to ML to build classifiers. However, such approaches
tend to involve heavy feature engineering and offer little interpretability. We
address both problems using recent advances in deep learning: 1)
RNN-based autoencoders (RNN-AEs) can automatically learn a low-dimensional
representation of a malware sample from its raw API call sequence. 2) Multiple
decoders can be trained under different supervision signals to provide
information beyond the class or family label of a malware sample. Inspired by
work on document classification and automatic sentence summarization, we regard
each API call sequence as a sentence. In this paper, we make the first
attempt to build a multi-task malware learning model based on API call
sequences. The model has two decoders, one for malware classification and one
for $\emph{file access pattern}$ (FAP) generation from the API call sequence of
a malware sample. We base our model on the general seq2seq framework.
Experiments show that our model achieves competitive classification results and
produces insightful FAP information.
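
To make the "API call sequence as a sentence" view concrete, the following minimal sketch shows one way such sequences could be turned into fixed-length integer token sequences, the usual input format for an RNN encoder. It is an illustration under stated assumptions and not the preprocessing used in the paper: the helper names `build_vocab` and `encode`, the special tokens, and the two example traces are all hypothetical.

```python
from collections import Counter

# Hypothetical special tokens; the paper does not specify these.
PAD, UNK = "<pad>", "<unk>"


def build_vocab(sequences, min_count=1):
    """Map each API call name (a "word") to an integer ID."""
    counts = Counter(call for seq in sequences for call in seq)
    vocab = {PAD: 0, UNK: 1}
    # Sort by name so the ID assignment is deterministic.
    for call, n in sorted(counts.items()):
        if n >= min_count:
            vocab[call] = len(vocab)
    return vocab


def encode(seq, vocab, max_len):
    """Truncate or pad one API call sequence (a "sentence") to max_len IDs."""
    ids = [vocab.get(call, vocab[UNK]) for call in seq[:max_len]]
    return ids + [vocab[PAD]] * (max_len - len(ids))


# Illustrative traces only, not data from the paper.
traces = [
    ["CreateFileW", "WriteFile", "CloseHandle"],
    ["RegOpenKeyExW", "CreateFileW", "ReadFile", "CloseHandle"],
]
vocab = build_vocab(traces)
batch = [encode(t, vocab, max_len=4) for t in traces]
print(batch)  # [[3, 6, 2, 0], [5, 3, 4, 2]]
```

In a seq2seq setup, a batch like this would be fed to a shared encoder, and each decoder would be trained against its own target (a class label or an FAP sequence).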