Abstract
Model memorization has implications for both the generalization capacity of
machine learning models and the privacy of their training data. This paper
investigates label memorization in binary classification models through two
novel passive label inference attacks (BLIA). These attacks operate passively,
relying solely on the outputs of pre-trained models, such as confidence scores
and log-loss values, without interacting with or modifying the training
process. By intentionally flipping 50% of the labels in controlled subsets,
termed "canaries," we evaluate the extent of label memorization under two
conditions: models trained without label differential privacy (Label-DP) and
those trained with randomized response-based Label-DP. Despite the application
of varying degrees of Label-DP, the proposed attacks consistently achieve
success rates above the 50% random-guessing baseline, conclusively
demonstrating that models memorize training labels, even when these labels
are deliberately uncorrelated with the features.
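
The setup described above can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not specify the attacks, so the thresholding rule below is only an assumed stand-in for a confidence-based label guess, and all function names are hypothetical. The randomized-response step uses the standard binary mechanism, which keeps each label with probability e^eps / (1 + e^eps).

```python
import math
import random


def flip_canary_labels(labels, flip_fraction=0.5, seed=0):
    """Flip a fixed fraction of binary labels in a canary subset.

    Returns the new labels and a mask marking which positions were flipped.
    """
    rng = random.Random(seed)
    n = len(labels)
    flipped = set(rng.sample(range(n), int(n * flip_fraction)))
    new_labels = [1 - y if i in flipped else y for i, y in enumerate(labels)]
    mask = [i in flipped for i in range(n)]
    return new_labels, mask


def randomized_response(labels, epsilon, seed=0):
    """Standard binary randomized response for label privacy.

    Each label is kept with probability e^eps / (1 + e^eps), else flipped.
    Assumes epsilon >= 0.
    """
    rng = random.Random(seed)
    keep_prob = 1.0 / (1.0 + math.exp(-epsilon))  # equals e^eps / (1 + e^eps)
    return [y if rng.random() < keep_prob else 1 - y for y in labels]


def infer_label_from_confidence(p_positive):
    """Assumed attack rule: guess the class the model scores higher."""
    return 1 if p_positive >= 0.5 else 0


def attack_success_rate(confidences, training_labels):
    """Fraction of canaries whose training label is guessed correctly."""
    guesses = [infer_label_from_confidence(p) for p in confidences]
    correct = sum(g == y for g, y in zip(guesses, training_labels))
    return correct / len(training_labels)
```

Because the canary labels are flipped at random, a model that had memorized nothing would leave such an attack at about 50% success, so any consistent excess over that baseline is evidence of label memorization.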