Abstract
The responses generated by Large Language Models (LLMs) can include sensitive
information from individuals and organizations, leading to potential privacy
leakage. This work applies Influence Functions (IFs) to trace privacy
leakage back to the training data, thereby helping to mitigate the privacy
concerns of Language Models (LMs). However, we observe that current IFs
struggle to accurately estimate the influence of tokens with large gradient
norms and tend to overestimate it. When tracing the most influential samples,
this bias causes the trace to return repeatedly to samples containing
large-gradient-norm tokens, overshadowing the truly most influential samples
even when their influences are well estimated. To address this issue, we propose Heuristically
Adjusted IF (HAIF), which reduces the weight of tokens with large gradient
norms, thereby significantly improving the accuracy of tracing the most
influential samples. To establish an easily obtainable ground truth for tracing
privacy leakage, we construct two datasets, PII-E and PII-CR, representing two
distinct scenarios: one with identical text in the model outputs and
pre-training data, and the other where models leverage their reasoning
abilities to generate text divergent from pre-training data. HAIF significantly
improves tracing accuracy, enhancing it by 20.96% to 73.71% on the PII-E
dataset and 3.21% to 45.93% on the PII-CR dataset, compared to the best SOTA
IFs across various GPT-2 and QWen-1.5 models. HAIF also outperforms SOTA IFs
on the real-world pre-training corpus CLUECorpus2020, demonstrating strong
robustness regardless of prompt and response lengths.
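
The abstract states only that HAIF reduces the weight of tokens with large gradient norms before influence is aggregated; the exact weighting scheme is not given here. The following is a minimal, hypothetical sketch of that idea for a simple gradient-dot-product influence score. The function names (dot_product_influence, haif_influence) and the inverse-norm weight 1/(||g_t|| + eps) are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: the exact HAIF weighting is not specified in the
# abstract, so the inverse-gradient-norm weight below is an assumption.
import numpy as np

def dot_product_influence(train_tok_grads, test_grad):
    """Plain gradient-dot-product influence: per-token gradients weighted equally."""
    g_train = train_tok_grads.sum(axis=0)          # aggregate token gradients
    return float(g_train @ test_grad)

def haif_influence(train_tok_grads, test_grad, eps=1e-8):
    """Heuristically adjusted influence: down-weight large-gradient-norm tokens.

    train_tok_grads: (T, D) per-token loss gradients of one training sample
    test_grad:       (D,)   gradient of the traced response's loss
    """
    norms = np.linalg.norm(train_tok_grads, axis=1, keepdims=True)  # (T, 1)
    weights = 1.0 / (norms + eps)                  # heuristic: damp big-norm tokens
    g_train = (weights * train_tok_grads).sum(axis=0)
    return float(g_train @ test_grad)

# Toy usage: one token with an inflated gradient norm dominates the plain score.
rng = np.random.default_rng(0)
grads = rng.normal(size=(5, 16))
grads[0] *= 50.0                                   # token with a large gradient norm
test_g = rng.normal(size=16)
print(dot_product_influence(grads, test_g))        # dominated by the big-norm token
print(haif_influence(grads, test_g))               # its contribution is damped
```

In this toy example the plain dot-product score is dominated by the single large-gradient-norm token, while the heuristic weighting damps its contribution, mirroring the failure mode and remedy the abstract describes.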