A growing number of threats to Android phones creates challenges for malware
detection. Manually labeling samples as benign or as members of particular
malware families requires tremendous human effort, whereas it is comparatively
easy and cheap to obtain a large number of unlabeled APKs from various sources.
Moreover, the fast-paced evolution of Android malware continuously generates
derivative malware families. These families often contain new signatures, which
can escape detection by static analysis. These practical challenges can
also cause traditional supervised machine learning algorithms to degrade in
performance.
In this paper, we propose a framework that applies a model-based semi-supervised
(MBSS) classification scheme to dynamic Android API call logs. The
semi-supervised approach makes efficient use of both labeled and unlabeled APKs
to estimate a finite mixture model of Gaussian distributions via conditional
expectation-maximization, and it efficiently detects malware during
out-of-sample testing. We compare MBSS with popular malware detection
classifiers such as support vector machine (SVM), $k$-nearest neighbor ($k$NN),
and linear
discriminant analysis (LDA). Under the ideal classification setting, MBSS has
competitive performance, with 98\% accuracy and a very low false positive rate
for in-sample classification. In out-of-sample testing, the test data share
with the in-sample training set the behavior of retrieving phone information
and sending it to the network. When this similarity is
strong, MBSS and SVM with a linear kernel maintain a 90\% detection rate, while
$k$NN and LDA suffer severe performance degradation. When this similarity is
slightly weaker, all classifiers degrade in performance, but MBSS still
performs significantly better than the other classifiers.
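
To make the semi-supervised mixture idea concrete, the following is a minimal sketch, not the implementation used in this paper. It assumes one diagonal-covariance Gaussian per class and uses plain expectation-maximization rather than the conditional variant: labeled points keep fixed one-hot responsibilities, while unlabeled points receive soft responsibilities at each E-step. All function names and the synthetic two-dimensional data (standing in for API-call features) are hypothetical.

```python
import math
import random


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at point x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def posteriors(x, weights, means, variances):
    """Class posterior probabilities for x under the current mixture."""
    logs = [
        math.log(w) + log_gauss_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(logs)  # subtract the maximum for numerical stability
    exps = [math.exp(value - top) for value in logs]
    total = sum(exps)
    return [e / total for e in exps]


def m_step(points, resp, n_classes, var_floor=1e-3):
    """Weighted maximum-likelihood update of weights, means, variances."""
    dim = len(points[0])
    weights, means, variances = [], [], []
    for k in range(n_classes):
        nk = sum(r[k] for r in resp)
        mean = [
            sum(r[k] * p[d] for r, p in zip(resp, points)) / nk
            for d in range(dim)
        ]
        var = [
            max(
                sum(r[k] * (p[d] - mean[d]) ** 2 for r, p in zip(resp, points)) / nk,
                var_floor,  # assumed floor to keep variances positive
            )
            for d in range(dim)
        ]
        weights.append(nk / len(points))
        means.append(mean)
        variances.append(var)
    return weights, means, variances


def fit_semi_supervised(labeled, labels, unlabeled, n_classes=2, n_iter=50):
    """EM in which labeled points keep fixed one-hot responsibilities."""
    one_hot = [[1.0 if k == y else 0.0 for k in range(n_classes)] for y in labels]
    # Initialise the mixture from the labeled data only.
    params = m_step(labeled, one_hot, n_classes)
    points = labeled + unlabeled
    for _ in range(n_iter):
        # E-step: soft responsibilities for the unlabeled points only.
        soft = [posteriors(x, *params) for x in unlabeled]
        # M-step: refit on labeled and unlabeled points together.
        params = m_step(points, one_hot + soft, n_classes)
    return params


def predict(x, params):
    """Assign x to the class with the largest posterior probability."""
    post = posteriors(x, *params)
    return max(range(len(post)), key=post.__getitem__)


def cluster(rng, cx, cy, n):
    """Draw n synthetic 2-D points around (cx, cy) with unit variance."""
    return [[rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)] for _ in range(n)]


# Toy data: class 0 plays the role of benign, class 1 of malicious samples.
rng = random.Random(0)
labeled = cluster(rng, 0.0, 0.0, 5) + cluster(rng, 6.0, 6.0, 5)
labels = [0] * 5 + [1] * 5
unlabeled = cluster(rng, 0.0, 0.0, 100) + cluster(rng, 6.0, 6.0, 100)
params = fit_semi_supervised(labeled, labels, unlabeled)
```

Here `predict(x, params)` classifies a new, out-of-sample feature vector by its largest posterior probability. Only ten points are labeled, so the fitted means and variances are driven mostly by the 200 unlabeled points, which is the property the semi-supervised setting relies on.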