Abstract
Synthetic data generators and machine learning models can memorize their
training data, raising privacy concerns. Membership inference attacks (MIAs) are
a standard method of estimating the privacy risk of these systems. The risk of
individual records is typically computed by evaluating MIAs in a
record-specific privacy game. We analyze the record-specific privacy game
commonly used for evaluating attackers under realistic assumptions (the
\textit{traditional} game) -- particularly for synthetic tabular data -- and
show that it averages a record's privacy risk across datasets. We show that
this implicitly assumes the dataset a record belongs to has no impact on its
risk, yielding a misleading risk estimate when a specific model or synthetic
dataset is released. Instead, we propose a novel use of the leave-one-out game,
previously used exclusively to audit differential privacy guarantees, which we
call the \textit{model-seeded} game. We formalize
it and show that it provides an accurate estimate of the privacy risk posed by
a given adversary for a record in its specific dataset. We instantiate and
evaluate the state-of-the-art MIA for synthetic data generators in the
traditional and model-seeded privacy games, and show across multiple datasets
and models that the two privacy games indeed result in different risk scores,
with up to 94\% of high-risk records being overlooked by the traditional game.
We further show that the gap between the two risk estimates tends to be larger
for records in smaller datasets and for models trained without strong
differential privacy guarantees. Taken together, our results show that the
model-seeded setup yields a risk estimate that is specific to the released
model or synthetic dataset, in line with the standard notion of privacy
leakage from prior work, and meaningfully different from the dataset-averaged
risk provided by the traditional privacy game.
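
To make the difference concrete, the following is a minimal sketch of the two
games, not the paper's actual experimental setup: `train_generator` (the
synthetic data generator) and `attack` (an MIA returning a membership guess in
{0, 1}) are assumed placeholder callables, and sampling details vary across
works. The only difference between the two functions is whether the records
surrounding the target are re-sampled each trial (traditional,
dataset-averaged) or held fixed at the released dataset (model-seeded).

```python
import random

def traditional_game(population, target, train_generator, attack,
                     dataset_size=1000, n_trials=100, seed=0):
    """Traditional record-specific game: the rest of the training data
    is re-sampled from the population every trial, so the attacker's
    accuracy (the risk estimate) is averaged across datasets."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_trials):
        # Details such as excluding the target from the pool vary across works.
        rest = rng.sample(population, k=dataset_size - 1)
        b = rng.randint(0, 1)                       # secret membership bit
        data = rest + [target] if b else rest
        synthetic = train_generator(data)           # the released artifact
        correct += int(attack(synthetic, target) == b)
    return correct / n_trials


def model_seeded_game(dataset, target, train_generator, attack,
                      n_trials=100, seed=0):
    """Model-seeded (leave-one-out) game: the dataset containing the
    target is fixed to the one actually used for the release; only the
    target's membership varies across trials."""
    rng = random.Random(seed)
    rest = [r for r in dataset if r != target]      # leave the target out
    correct = 0
    for _ in range(n_trials):
        b = rng.randint(0, 1)
        data = rest + [target] if b else rest
        synthetic = train_generator(data)
        correct += int(attack(synthetic, target) == b)
    return correct / n_trials
```

Because `rest` is held fixed in the model-seeded game, the resulting accuracy
estimates the attacker's success against the specific released dataset rather
than an average over hypothetical datasets the record might have been part of.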