The growth of the open source community and the rise of large language
models have raised ethical and security concerns about the distribution of source
code, such as infringement of copyrighted code, distribution without proper
licenses, and misuse of code for malicious purposes. It is therefore important
to track the ownership of source code, and watermarking is a major technique
for doing so. Yet, unlike natural language watermarking, source code
watermarking must follow far stricter and more complicated rules to preserve
both the readability and the functionality of the code. We therefore introduce
SrcMarker, a watermarking system that unobtrusively encodes ID bitstrings into
source code without affecting its usage or semantics. To this
end, SrcMarker performs transformations on an AST-based intermediate
representation that enables unified transformations across different
programming languages. The core of the system utilizes learning-based embedding
and extraction modules to select rule-based transformations for watermarking.
In addition, a novel feature-approximation technique is designed to tackle the
inherent non-differentiability of rule selection, thus seamlessly integrating
the rule-based transformations and learning-based networks into an
interconnected system to enable end-to-end training. Extensive experiments
demonstrate the superiority of SrcMarker over existing methods across various
watermarking requirements.
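
To make the idea of encoding a bit through a semantics-preserving, AST-level transformation concrete, the following is a minimal sketch and not SrcMarker's actual rule set or learned modules. It uses Python's standard `ast` module (Python 3.9+ for `ast.unparse`) and a single hypothetical rule: bit 1 expands `x += e` into `x = x + e`, and bit 0 leaves the code unchanged. The names `AugAssignRule`, `embed_bit`, and `extract_bit` are illustrative assumptions. Here the bit is fixed by hand, whereas the system described above selects transformations with learned embedding and extraction modules. The rewrite is assumed to be applied only to variables holding immutable values such as numbers, since `+=` on a mutable object can behave differently from `+`.

```python
import ast


class AugAssignRule(ast.NodeTransformer):
    """Hypothetical toy rule: bit 1 rewrites `x += e` as `x = x + e`."""

    def __init__(self, bit):
        self.bit = bit

    def visit_AugAssign(self, node):
        self.generic_visit(node)
        # Only rewrite simple names; attributes and subscripts are left alone.
        if self.bit == 1 and isinstance(node.target, ast.Name):
            name = node.target.id
            expanded = ast.Assign(
                targets=[ast.Name(id=name, ctx=ast.Store())],
                value=ast.BinOp(
                    left=ast.Name(id=name, ctx=ast.Load()),
                    op=node.op,
                    right=node.value,
                ),
                type_comment=None,
            )
            return ast.copy_location(expanded, node)
        return node


def embed_bit(source, bit):
    """Return source text carrying `bit` in its choice of assignment form."""
    tree = AugAssignRule(bit).visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


def extract_bit(source):
    """Read the bit back: an augmented assignment present means 0, else 1.

    Assumes the original code contained at least one augmented assignment
    on a simple name, so the carrier for the bit exists.
    """
    tree = ast.parse(source)
    has_aug = any(isinstance(n, ast.AugAssign) for n in ast.walk(tree))
    return 0 if has_aug else 1


if __name__ == "__main__":
    original = (
        "def f(n):\n"
        "    total = 0\n"
        "    for i in range(n):\n"
        "        total += i\n"
        "    return total\n"
    )
    marked = embed_bit(original, 1)
    print(marked)
    print(extract_bit(marked))
```

A single rule carries only one bit and is trivially reversible by an adversary. Encoding a longer ID bitstring would require combining many such rules.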
External datasets
MBXP
CodeSearchNet
GitHub-C
GitHub-Java