Abstract
The availability of large amounts of informative data is crucial for
successful machine learning. However, in domains with sensitive information,
the release of high-utility data which protects the privacy of individuals has
proven challenging. Despite progress in differential privacy and generative
modeling for privacy-preserving data release, few approaches optimize for
machine learning utility: most consider only statistical metrics on the data
itself and fail to explicitly preserve the loss of the machine learning models
that will subsequently be trained on the generated data. In this paper, we
introduce a data
release framework, 3A (Approximate, Adapt, Anonymize), to maximize data utility
for machine learning, while preserving differential privacy. We also describe a
specific implementation of this framework that leverages mixture models to
approximate, kernel-inducing points to adapt, and Gaussian differential privacy
to anonymize a dataset, ensuring that the resulting data is both
privacy-preserving and of high utility. We present experimental evidence showing
minimal discrepancy between performance metrics of models trained on real
versus privatized datasets, when evaluated on held-out real data. We also
compare our results with several privacy-preserving synthetic data generation
models (such as differentially private generative adversarial networks), and
report significant increases in classification performance metrics compared to
state-of-the-art models. These favorable comparisons show that the presented
framework is a promising research direction for increasing the utility of
low-risk synthetic data release for machine learning.
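To make the pipeline concrete, below is a minimal, illustrative sketch of the Approximate and Anonymize stages on one-dimensional data. It is not the paper's implementation: the single-Gaussian approximation (a one-component mixture) stands in for the mixture models, the Adapt step (kernel-inducing points) is omitted, and the classical (ε, δ) Gaussian mechanism stands in for the paper's Gaussian differential privacy analysis. All function names and parameter values are assumptions for illustration.

```python
import math
import random
import statistics

def approximate(data):
    """Approximate the data with a single Gaussian (a 1-component mixture).
    Illustrative stand-in for fitting a mixture model."""
    return statistics.fmean(data), statistics.stdev(data)

def anonymize(mu, sigma, epsilon, delta, sensitivity):
    """Privatize the fitted mean with the classical Gaussian mechanism:
    add zero-mean Gaussian noise with standard deviation calibrated
    to the (epsilon, delta) privacy budget and the query sensitivity."""
    noise_sd = sensitivity * math.sqrt(2 * math.log(1.25 / delta)) / epsilon
    return mu + random.gauss(0.0, noise_sd), sigma

def synthesize(mu, sigma, n):
    """Sample a synthetic dataset from the privatized generative model."""
    return [random.gauss(mu, sigma) for _ in range(n)]

random.seed(0)
real = [random.gauss(5.0, 2.0) for _ in range(1000)]      # sensitive "real" data
mu, sigma = approximate(real)                              # Approximate
priv_mu, priv_sigma = anonymize(mu, sigma,                 # Anonymize
                                epsilon=1.0, delta=1e-5,
                                sensitivity=0.01)
synthetic = synthesize(priv_mu, priv_sigma, n=1000)        # release synthetic data
```

A downstream model would then be trained on `synthetic` and evaluated on held-out real data; the 3A framework's Adapt step would additionally distill the synthetic set so that this downstream loss is explicitly preserved.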