arXiv
Assessment of Differentially Private Synthetic Data for Utility and Fairness in End-to-End Machine Learning Pipelines for Tabular Data
Abstract
Differentially private (DP) synthetic data sets are a solution for sharing
data while preserving the privacy of individual data providers. Understanding
the effects of using DP synthetic data in end-to-end machine learning
pipelines matters for areas such as health care and humanitarian action, where
data is scarce and regulated by restrictive privacy laws. In this work, we
investigate the extent to which synthetic data can replace real, tabular data
in machine learning pipelines and identify the most effective synthetic data
generation techniques for training and evaluating machine learning models. We
investigate the impacts of differentially private synthetic data on downstream
classification tasks from the point of view of utility as well as fairness. Our
analysis is comprehensive and includes representatives of the two main types of
synthetic data generation algorithms: marginal-based and GAN-based. To the best
of our knowledge, our work is the first that: (i) proposes a training and
evaluation framework that does not assume that real data is available for
testing the utility and fairness of machine learning models trained on
synthetic data; (ii) presents the most extensive analysis of synthetic data set
generation algorithms in terms of utility and fairness when used for training
machine learning models; and (iii) encompasses several different definitions of
fairness. Our findings demonstrate that marginal-based synthetic data
generators surpass GAN-based ones regarding model training utility for tabular
data. Indeed, we show that models trained using data generated by
marginal-based algorithms can exhibit similar utility to models trained using
real data. Our analysis also reveals that the marginal-based synthetic data
generator MWEM PGM yields models that simultaneously achieve utility and
fairness characteristics close to those obtained by models trained with real
data.
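
As a rough illustration of what measuring utility together with fairness can look like for a binary classifier, the following minimal sketch computes accuracy and two common group-fairness gaps on a held-out evaluation set. The abstract does not say which fairness definitions the paper uses, so demographic parity and equal opportunity are assumptions chosen for illustration. The function names, the toy labels, and the binary protected attribute `group` are likewise hypothetical. In a setting without real test data, the evaluation set would itself be synthetic.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the labels (a simple utility measure)."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def positive_rate(y_pred: Sequence[int], mask: Sequence[bool]) -> float:
    """Share of positive predictions among the rows selected by mask."""
    selected = [p for p, m in zip(y_pred, mask) if m]
    return sum(selected) / len(selected)  # assumes at least one selected row


def demographic_parity_difference(y_pred: Sequence[int], group: Sequence[int]) -> float:
    """Absolute gap in positive-prediction rate between group 0 and group 1."""
    rate_0 = positive_rate(y_pred, [g == 0 for g in group])
    rate_1 = positive_rate(y_pred, [g == 1 for g in group])
    return abs(rate_0 - rate_1)


def equal_opportunity_difference(
    y_true: Sequence[int], y_pred: Sequence[int], group: Sequence[int]
) -> float:
    """Absolute gap in true-positive rate between group 0 and group 1."""
    tpr_0 = positive_rate(y_pred, [g == 0 and t == 1 for g, t in zip(group, y_true)])
    tpr_1 = positive_rate(y_pred, [g == 1 and t == 1 for g, t in zip(group, y_true)])
    return abs(tpr_0 - tpr_1)


# Hypothetical toy evaluation set: labels, model predictions, protected attribute.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
group = [0, 0, 0, 0, 1, 1, 1, 1]

print("accuracy:", accuracy(y_true, y_pred))
print("demographic parity gap:", demographic_parity_difference(y_pred, group))
print("equal opportunity gap:", equal_opportunity_difference(y_true, y_pred, group))
```

On this toy data the two fairness gaps disagree: both groups receive positive predictions at the same rate, yet their true-positive rates differ. This is one reason an analysis may consider several fairness definitions rather than one.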