Bibliographic Information
- Authors
  - Shuchao Pang, Zhigang Lu, Haichen Wang, Peng Fu, Yongbin Zhou, Minhui Xue
- Publication date
  - 2025-09-20
- Affiliation
  - Nanjing University of Science and Technology
- Country
  - China
- Conference
  - International Symposium on Research in Attacks, Intrusions and Defenses (RAID)
Abstract
Differential privacy (DP) is the de facto privacy standard against privacy
leakage attacks, including many recently discovered attacks against large
language models (LLMs). However, we discovered that LLMs can reconstruct the
altered or removed private information from DP-sanitized prompts. We propose
two attacks (black-box and white-box) based on the level of access to the LLM
and show that LLMs can connect a DP-sanitized text with the corresponding
private training data when given sample text pairs as instructions (in the
black-box attacks) or as fine-tuning data (in the white-box attacks). To
illustrate our findings, we conduct comprehensive experiments on modern LLMs
(e.g., LLaMA-2, LLaMA-3, ChatGPT-3.5, ChatGPT-4, ChatGPT-4o, Claude-3,
Claude-3.5, OPT, GPT-Neo, GPT-J, Gemma-2, and Pythia) using commonly used
datasets (such as WikiMIA, Pile-CC, and Pile-Wiki) against both word-level and
sentence-level DP. The experimental results show promising recovery rates; for
example, the black-box attacks against word-level DP on the WikiMIA dataset
achieved 72.18% on LLaMA-2 (70B), 82.39% on LLaMA-3 (70B), 75.35% on Gemma-2,
91.2% on ChatGPT-4o, and 94.01% on Claude-3.5 (Sonnet). More urgently, this
study indicates that these well-known LLMs have emerged as a new security risk
for existing DP text sanitization approaches.
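
The black-box attack described in the abstract amounts to few-shot prompting: the adversary supplies example pairs of (DP-sanitized text, original text) as instructions and asks the target LLM to recover the original form of a new sanitized input. Below is a minimal sketch of that setup, assuming the official `openai` Python client (v1+); the demo pairs, prompt wording, and query sentence are hypothetical placeholders, not the paper's actual prompt template or data.

```python
# Minimal sketch of the black-box reconstruction attack: few-shot prompting
# with (sanitized, original) example pairs. All concrete strings below are
# hypothetical illustrations.

from openai import OpenAI  # assumes the official openai>=1.0 client

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical instruction pairs mapping a word-level DP-sanitized sentence
# back to its original form (known to the attacker for the demo pairs only).
demo_pairs = [
    ("the patient visited berlin hospital in 1998",
     "the patient visited munich hospital in 1995"),
    ("alice transferred 300 dollars to bob on friday",
     "alice transferred 250 dollars to bob on monday"),
]

def build_messages(sanitized_query: str) -> list[dict]:
    """Assemble a few-shot chat prompt from sanitized/original example pairs."""
    messages = [{
        "role": "system",
        "content": ("Each noisy sentence below was produced by randomly "
                    "replacing words in an original sentence. Recover the "
                    "most likely original sentence."),
    }]
    for sanitized, original in demo_pairs:
        messages.append({"role": "user", "content": f"Noisy: {sanitized}"})
        messages.append({"role": "assistant", "content": f"Original: {original}"})
    messages.append({"role": "user", "content": f"Noisy: {sanitized_query}"})
    return messages

response = client.chat.completions.create(
    model="gpt-4o",  # one of the black-box targets named in the abstract
    messages=build_messages("the doctor saw the patient in paris in 2003"),
    temperature=0.0,  # deterministic decoding for reproducible recovery rates
)
print(response.choices[0].message.content)
```

In the white-box setting, the same sanitized/original pairs would instead serve as supervised fine-tuning data, after which the fine-tuned model is queried directly with the sanitized text.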