These labels were automatically added by AI and may be inaccurate. For details, see About Literature Database.
Abstract
The rapid development of Large Language Models (LLMs) for code generation has
transformed software development by automating coding tasks with unprecedented
efficiency.
However, the training of these models on open-source code repositories (e.g.,
from GitHub) raises critical ethical and legal concerns, particularly regarding
data authorization and open-source license compliance. Developers are
increasingly questioning whether model trainers have obtained proper
authorization before using repositories for training, especially given the lack
of transparency in data collection.
To address these concerns, we propose a novel data marking framework RepoMark
to audit the data usage of code LLMs. Our method enables auditors to verify
whether their code has been used in training, while ensuring semantic
preservation, imperceptibility, and theoretical false detection rate (FDR)
guarantees. By generating multiple semantically equivalent code variants,
RepoMark introduces data marks into the code files, and during detection,
RepoMark leverages a novel ranking-based hypothesis test to detect model
behavior difference on trained data. Compared to prior data auditing
approaches, RepoMark significantly enhances data efficiency, allowing effective
auditing even when the user's repository possesses only a small number of code
files.
Experiments demonstrate that RepoMark achieves a detection success rate over
90\% on small code repositories under a strict FDR guarantee of 5\%. This
represents a significant advancement over existing data marking techniques, all
of which only achieve accuracy below 55\% under identical settings. This
further validates RepoMark as a robust, theoretically sound, and promising
solution for enhancing transparency in code LLM training, which can safeguard
the rights of code authors.