Numerous tools have been developed to aggressively block the execution of
popular JavaScript programs (JS) in Web browsers. Such blocking also affects
functionality of webpages and impairs user experience. As a consequence, many
privacy preserving tools (PP-Tools) that have been developed to limit online
tracking, often executed via JS, may suffer from poor performance and limited
uptake. A mechanism that can isolate JS necessary for proper functioning of the
website from tracking JS would thus be useful. Through the use of a manually
labelled dataset composed of 2,612 JS, we show how current PP-Tools are
ineffective in finding the right balance between blocking tracking JS and
allowing functional JS. To the best of our knowledge, this is the first study
to assess the performance of current web PP-Tools.
To improve this balance, we examine the two classes of JS and hypothesize
that tracking JS share structural similarities that can be used to
differentiate them from functional JS. The rationale of our approach is that
web developers often borrow and customize existing pieces of code in order to
embed tracking (resp. functional) JS into their webpages. We then propose
one-class machine learning classifiers using syntactic and semantic features
extracted from JS. When trained only on samples of tracking JS, our classifiers
achieve an accuracy of 99%, where the best of the PP-Tools achieved an accuracy
of 78%.
We further test our classifiers and several popular PP-Tools on a corpus of
4K websites with 135K JS. The output of our best classifier on this data is
between 20 to 64% different from the PP-Tools. We manually analyse a sample of
the JS for which our classifier is in disagreement with all other PP-Tools, and
show that our approach is not only able to enhance user web experience by
correctly classifying more functional JS, but also discovers previously unknown
tracking services.