Digger: Detecting Copyright Content Mis-usage in Large Language Model Training

arxiv

Information in the literature database is collected automatically.

Source

https://arxiv.org/abs/2401.00676

PDF

https://arxiv.org/pdf/2401.00676

Paper Information

Authors: Haodong Li; Gelei Deng; Yi Liu; Kailong Wang; Yuekang Li; Tianwei Zhang; Yang Liu; Guoai Xu; Guosheng Xu; Haoyu Wang
Published: 2024-01-01
Affiliation: Beijing University of Posts and Telecommunications
Country: China
Conference

Labels Estimated by AI

Prompt Injection; LLM Performance Evaluation; Dataset Generation

These labels were automatically added by AI and may be inaccurate.

Abstract

Pre-training, which utilizes extensive and varied datasets, is a critical factor in the success of Large Language Models (LLMs) across numerous applications. However, the detailed makeup of these datasets is often not disclosed, leading to concerns about data security and potential misuse. This is particularly relevant when copyrighted material, still under legal protection, is used inappropriately, either intentionally or unintentionally, infringing on the rights of the authors. In this paper, we introduce a detailed framework designed to detect and assess the presence of content from potentially copyrighted books within the training datasets of LLMs. This framework also provides a confidence estimation for the likelihood of each content sample's inclusion. To validate our approach, we conduct a series of simulated experiments, the results of which affirm the framework's effectiveness in identifying and addressing instances of content misuse in LLM training processes. Furthermore, we investigate the presence of recognizable quotes from famous literary works within these datasets. The outcomes of our study have significant implications for ensuring the ethical use of copyrighted materials in the development of LLMs, highlighting the need for more transparent and responsible data management practices in this field.

External Datasets

WebText

Baseline Dataset

Unlearned Dataset I

Target Dataset

Unlearned Dataset II