Abstract
Synthetic data generators and machine learning models can memorize their
training data, raising privacy concerns. Membership inference attacks (MIAs) are
a standard method of estimating the privacy risk of these systems. The risk of
individual records is typically computed by evaluating MIAs in a
record-specific privacy game. We analyze the record-specific privacy game
commonly used for evaluating attackers under realistic assumptions (the
\textit{traditional} game) -- particularly for synthetic tabular data -- and
show that it averages a record's privacy risk across datasets. We show that
this implicitly assumes the dataset a record belongs to has no impact on its
risk, yielding a misleading risk estimate when a specific model or synthetic
dataset is released. Instead, we propose a novel use of the leave-one-out game,
previously used exclusively to audit differential privacy guarantees, which we
call the \textit{model-seeded} game. We formalize
it and show that it provides an accurate estimate of the privacy risk posed by
a given adversary for a record in its specific dataset. We instantiate and
evaluate the state-of-the-art MIA for synthetic data generators in the
traditional and model-seeded privacy games, and show across multiple datasets
and models that the two privacy games indeed result in different risk scores,
with up to 94\% of high-risk records being overlooked by the traditional game.
We further show that the gap between the two risk estimates tends to be larger
for records in smaller datasets and for models trained without strong
differential privacy guarantees. Taken together, our results show that the
model-seeded setup yields a risk estimate that is specific to the released
model or synthetic dataset, in line with the standard notion of privacy
leakage from prior work, and meaningfully different from the dataset-averaged
risk provided by the traditional privacy game.
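
To make the difference concrete, the following is a minimal sketch of the two
games, not the paper's actual experimental setup: `train_generator` (the
synthetic data generator) and `attack` (an MIA returning a membership guess in
{0, 1}) are assumed placeholder callables, and sampling details vary across
works. The only difference between the two functions is whether the records
surrounding the target are re-sampled each trial (traditional,
dataset-averaged) or held fixed at the released dataset (model-seeded).

```python
import random

def traditional_game(population, target, train_generator, attack,
                     dataset_size=1000, n_trials=100, seed=0):
    """Traditional record-specific game: the rest of the training data
    is re-sampled from the population every trial, so the attacker's
    accuracy (the risk estimate) is averaged across datasets."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_trials):
        # Details such as excluding the target from the pool vary across works.
        rest = rng.sample(population, k=dataset_size - 1)
        b = rng.randint(0, 1)                       # secret membership bit
        data = rest + [target] if b else rest
        synthetic = train_generator(data)           # the released artifact
        correct += int(attack(synthetic, target) == b)
    return correct / n_trials


def model_seeded_game(dataset, target, train_generator, attack,
                      n_trials=100, seed=0):
    """Model-seeded (leave-one-out) game: the dataset containing the
    target is fixed to the one actually used for the release; only the
    target's membership varies across trials."""
    rng = random.Random(seed)
    rest = [r for r in dataset if r != target]      # leave the target out
    correct = 0
    for _ in range(n_trials):
        b = rng.randint(0, 1)
        data = rest + [target] if b else rest
        synthetic = train_generator(data)
        correct += int(attack(synthetic, target) == b)
    return correct / n_trials
```

Because `rest` is held fixed in the model-seeded game, the resulting accuracy
estimates the attacker's success against the specific released dataset rather
than an average over hypothetical datasets the record might have been part of.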