Wei Wang, Guozhu Meng, Haoyu Wang, Kai Chen, Weimin Ge, Xiaohong Li
Published
2020-9-1
Affiliation
Tianjin Key Laboratory of Advanced Networking (TANK), School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University
Authorship identification is the process of identifying and classifying
authors from given code. It is useful across a wide range of software
domains, e.g., resolving code authorship disputes, detecting plagiarism,
and exposing attackers' identities. Beyond the inherent challenges of
legacy software development, the framework programming and crowdsourcing model
of Android make authorship identification significantly harder.
More specifically, widespread third-party libraries and inherited components
(e.g., classes, methods, and variables) dilute the primary code within the
entire Android app and blur the boundaries between code written by different
authors. Prior research has not adequately addressed these challenges.
To this end, we design a two-phase approach to attribute the primary code of
an Android app to a specific developer. In the first phase, we propose
three types of strategies to identify the relationships between Java packages
in an app: context, semantic, and structural relationships. A package
aggregation algorithm then clusters all packages that were, with high
probability, written by the same author. In the second phase, we develop
three types of features to capture authors' coding habits and code stylometry.
Based on these features, we generate a fingerprint for each author from the
Android apps they developed and employ several machine learning algorithms for
authorship classification. We evaluate our approach on three datasets that
contain 15,666 apps from 257 distinct developers and achieve an average
accuracy of 92.5%. Additionally, on 2,900 obfuscated apps, our approach
classifies apps with an accuracy of 80.4%.
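
To make the first phase more concrete, the following is a minimal sketch of how pairwise package relationships could drive package aggregation. It is not the authors' algorithm: the abstract does not define the relationship scores, so `context_score` (shared package-name prefix), the call-edge stand-in for a structural relationship, the equal weights, and the `threshold` value are all hypothetical, and the semantic relationship is omitted entirely. Packages whose combined score reaches the threshold are merged with a union-find structure.

```python
from itertools import combinations


def context_score(a, b):
    """Hypothetical context relationship: share of leading name segments in common."""
    pa, pb = a.split("."), b.split(".")
    shared = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        shared += 1
    return shared / max(len(pa), len(pb))


def aggregate_packages(packages, calls, threshold=0.7):
    """Cluster packages that are likely to share an author.

    `calls` is a set of (caller, callee) package pairs, used here as an
    assumed stand-in for a structural relationship.
    """
    parent = {p: p for p in packages}

    def find(p):
        # Follow parent links to the cluster root, compressing the path.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b in combinations(packages, 2):
        structural = 1.0 if (a, b) in calls or (b, a) in calls else 0.0
        # Equal weighting is an assumption made for illustration only.
        score = 0.5 * context_score(a, b) + 0.5 * structural
        if score >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for p in packages:
        clusters.setdefault(find(p), set()).add(p)
    return sorted(sorted(c) for c in clusters.values())


packages = [
    "com.example.app.ui",
    "com.example.app.net",
    "com.google.gson",
    "com.google.gson.internal",
]
calls = {
    ("com.example.app.ui", "com.example.app.net"),
    ("com.example.app.net", "com.google.gson"),
    ("com.google.gson", "com.google.gson.internal"),
}
clusters = aggregate_packages(packages, calls)
print(clusters)
```

In this toy input, the two `com.example.app` packages form one cluster and the two `com.google.gson` packages form another. The call from the app code into the library is not enough on its own to merge them, which reflects the goal of separating primary code from third-party libraries.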