These labels were automatically added by AI and may be inaccurate. For details, see About Literature Database.
Abstract
Machine learning models are known to memorize samples from their training
data, raising concerns around privacy and generalization. Counterfactual
self-influence is a popular metric to study memorization, quantifying how the
model's prediction for a sample changes depending on the sample's inclusion in
the training dataset. However, recent work has shown memorization to be
affected by factors beyond self-influence, with other training samples, in
particular (near-)duplicates, having a large impact. We here study memorization
treating counterfactual influence as a distributional quantity, taking into
account how all training samples influence how a sample is memorized. For a
small language model, we compute the full influence distribution of training
samples on each other and analyze its properties. We find that solely looking
at self-influence can severely underestimate tangible risks associated with
memorization: the presence of (near-)duplicates seriously reduces
self-influence, while we find these samples to be (near-)extractable. We
observe similar patterns for image classification, where simply looking at the
influence distributions reveals the presence of near-duplicates in CIFAR-10.
Our findings highlight that memorization stems from complex interactions across
training data and is better captured by the full influence distribution than by
self-influence alone.