Abstract
The expansion of the open source community and the rise of large language
models have raised ethical and security concerns about the distribution of source
code, such as misuse of copyrighted code, distribution without proper
licenses, or use of the code for malicious purposes. It is therefore important
to track the ownership of source code, and watermarking is a major technique
for doing so. Yet, in sharp contrast to natural language watermarking, source code
watermarking requires far stricter and more complicated rules to ensure the
readability as well as the functionality of the source code. We therefore introduce
SrcMarker, a watermarking system to unobtrusively encode ID bitstrings into
source code, without affecting the usage and semantics of the code. To this
end, SrcMarker performs transformations on an AST-based intermediate
representation that enables unified transformations across different
programming languages. The core of the system utilizes learning-based embedding
and extraction modules to select rule-based transformations for watermarking.
In addition, a novel feature-approximation technique is designed to tackle the
inherent non-differentiability of rule selection, thus seamlessly integrating
the rule-based transformations and learning-based networks into an
interconnected system to enable end-to-end training. Extensive experiments
show that SrcMarker outperforms existing methods across various
watermarking requirements.
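
To make the idea of encoding bits through semantics-preserving, AST-level transformations concrete, the following is a minimal sketch and not SrcMarker's implementation. It assumes Python's standard `ast` module in place of the paper's intermediate representation, a single hand-picked rule (flipping a comparison such as `a < b` to `b > a`), and a fixed bit-to-rule mapping instead of learned embedding and extraction modules. The names `embed`, `extract`, `FLIP`, and `BIT_OF` are hypothetical. The rule is only applied when both operands are plain names or constants, so that swapping them cannot reorder side effects; it also assumes operands whose `<` and `>` are mirror images, as for built-in numbers and strings.

```python
import ast

# Hypothetical rule: a comparison written with < or <= carries bit 0,
# the mirrored form written with > or >= carries bit 1.
FLIP = {ast.Lt: ast.Gt, ast.Gt: ast.Lt, ast.LtE: ast.GtE, ast.GtE: ast.LtE}
BIT_OF = {ast.Lt: 0, ast.LtE: 0, ast.Gt: 1, ast.GtE: 1}


def _is_simple(node):
    # Names and constants have no side effects, so swapping them is safe.
    return isinstance(node, (ast.Name, ast.Constant))


def _carriers(tree):
    """Return comparisons that can carry one bit each, in traversal order."""
    return [
        n for n in ast.walk(tree)
        if isinstance(n, ast.Compare)
        and len(n.ops) == 1
        and type(n.ops[0]) in FLIP
        and _is_simple(n.left)
        and _is_simple(n.comparators[0])
    ]


def embed(source, bits):
    """Rewrite `source` so that its first len(bits) carriers encode `bits`."""
    tree = ast.parse(source)
    carriers = _carriers(tree)
    if len(carriers) < len(bits):
        raise ValueError("not enough transformable comparisons for the bitstring")
    for node, bit in zip(carriers, bits):
        op = type(node.ops[0])
        if BIT_OF[op] != bit:
            # Swap the operands and mirror the operator: a < b becomes b > a.
            node.left, node.comparators[0] = node.comparators[0], node.left
            node.ops[0] = FLIP[op]()
    return ast.unparse(tree)  # requires Python 3.9+


def extract(source, n_bits):
    """Read back the bits encoded by the first n_bits carriers."""
    tree = ast.parse(source)
    return [BIT_OF[type(n.ops[0])] for n in _carriers(tree)[:n_bits]]


src = (
    "def clamp(x, lo, hi):\n"
    "    if x < lo:\n"
    "        return lo\n"
    "    if x > hi:\n"
    "        return hi\n"
    "    return x\n"
)
marked = embed(src, [1, 0])
print(marked)
print(extract(marked, 2))
```

In this sketch the rule applied at each site is dictated directly by the bitstring. In the system described above, that choice is instead made by a learned embedding module and recovered by a learned extraction module, which is where the non-differentiability of rule selection arises.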