Metadata are associated to most of the information we produce in our daily
interactions and communication in the digital world. Yet, surprisingly,
metadata are often still catergorized as non-sensitive. Indeed, in the past,
researchers and practitioners have mainly focused on the problem of the
identification of a user from the content of a message.
In this paper, we use Twitter as a case study to quantify the uniqueness of
the association between metadata and user identity and to understand the
effectiveness of potential obfuscation strategies. More specifically, we
analyze atomic fields in the metadata and systematically combine them in an
effort to classify new tweets as belonging to an account using different
machine learning algorithms of increasing complexity. We demonstrate that
through the application of a supervised learning algorithm, we are able to
identify any user in a group of 10,000 with approximately 96.7% accuracy.
Moreover, if we broaden the scope of our search and consider the 10 most likely
candidates we increase the accuracy of the model to 99.22%. We also found that
data obfuscation is hard and ineffective for this type of data: even after
perturbing 60% of the training data, it is still possible to classify users
with an accuracy higher than 95%. These results have strong implications in
terms of the design of metadata obfuscation strategies, for example for data
set release, not only for Twitter, but, more generally, for most social media
platforms.