Mobile applications (apps) often transmit sensitive data over the network for
various purposes. Some transmissions are needed to fulfill the app's
functionality. However, transmissions to malicious receivers may leak private
data, and they tend to behave stealthily to evade detection. The problem
is twofold: how does one unveil sensitive transmissions in mobile apps, and
given a sensitive transmission, how does one determine if it is legitimate?
In this paper, we propose LeakSemantic, a framework that automatically
locates abnormal sensitive network transmissions in mobile apps. LeakSemantic
consists of a hybrid program analysis component and a machine learning
component. Our program analysis component combines static analysis and dynamic
analysis to precisely identify sensitive transmissions. Compared to existing
taint analysis approaches, LeakSemantic achieves better accuracy with fewer
false positives and is able to collect runtime data such as network traffic for
each transmission. Based on features derived from the runtime data, machine
learning classifiers are built to further differentiate between legal and
illegal disclosures. Experiments show that LeakSemantic achieves 91% accuracy
on 2279 sensitive connections from 1404 apps.
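
To make the classification step concrete, the following is a minimal sketch of
how features derived from a captured transmission could feed a classifier. It
is not LeakSemantic's implementation: the Transmission record, the three
features (destination host matching the package name, use of HTTPS, and a
preceding user action), and the nearest-centroid classifier are all
assumptions chosen for illustration, since the abstract does not specify the
features or the learning algorithm.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# Generic package-name parts that say nothing about the app's owner.
GENERIC_PARTS = {"com", "org", "net", "android"}


@dataclass
class Transmission:
    """Hypothetical record of one sensitive connection observed at runtime."""
    package: str          # app package name, e.g. "com.example.maps"
    url: str              # destination URL seen in the captured traffic
    user_triggered: bool  # whether a UI action preceded the request


def features(t):
    """Map a transmission to a numeric vector (illustrative features only)."""
    host = urlparse(t.url).hostname or ""
    tokens = [p for p in t.package.split(".") if p not in GENERIC_PARTS]
    host_matches_app = any(tok in host for tok in tokens)
    return [
        float(host_matches_app),
        float(t.url.startswith("https://")),
        float(t.user_triggered),
    ]


def train(samples):
    """Compute one centroid per label from (Transmission, label) pairs."""
    by_label = {}
    for t, label in samples:
        by_label.setdefault(label, []).append(features(t))
    return {
        label: [sum(col) / len(rows) for col in zip(*rows)]
        for label, rows in by_label.items()
    }


def classify(model, t):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    x = features(t)
    return min(
        model,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(x, model[label])),
    )
```

A model trained on a handful of labeled transmissions can then be queried with
classify(model, transmission). A real system would need a far richer feature
set and a properly validated classifier; the sketch only shows how runtime
data can be turned into a feature vector and a label.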