Abstract
The source code of a program not only defines its semantics but also contains
subtle clues that can identify its author. Several studies have shown that
these clues can be automatically extracted using machine learning and allow for
determining a program's author among hundreds of programmers. This attribution
poses a significant threat to developers of anti-censorship and
privacy-enhancing technologies, as they become identifiable and may be
prosecuted. An ideal protection from this threat would be the anonymization of
source code. However, neither the theoretical nor the practical principles of
such anonymization have been explored so far.
In this paper, we tackle this problem and develop a framework for reasoning
about code anonymization. We prove that the task of generating a $k$-anonymous
program -- a program that cannot be attributed to one of $k$ authors -- is not
computable in the general case. As a remedy, we introduce a relaxed concept
called $k$-uncertainty, which enables us to measure the protection of
developers. Based on this concept, we empirically study candidate techniques
for anonymization, such as code normalization, coding style imitation, and code
obfuscation. We find that none of the techniques provides sufficient protection
when the attacker is aware of the anonymization. While we observe a notable
reduction in attribution performance on real-world code, reliable protection
is not achieved for all developers. We conclude that code anonymization is a
hard problem that requires further attention from the research community.