Data synthesis has been advocated as an important approach for utilizing data
while protecting data privacy. A large number of tabular data synthesis
algorithms (which we call synthesizers) have been proposed. Some synthesizers
satisfy Differential Privacy, while others aim to provide privacy in a
heuristic fashion. A comprehensive understanding of the strengths and
weaknesses of these synthesizers remains elusive for two reasons: existing
evaluation metrics have drawbacks, and newly developed synthesizers built on
diffusion models and large language models have not been compared head-to-head
with state-of-the-art marginal-based synthesizers.
In this paper, we present a systematic evaluation framework for assessing
tabular data synthesis algorithms. Specifically, we examine and critique
existing evaluation metrics, and introduce a set of new metrics in terms of
fidelity, privacy, and utility to address their limitations. Based on the
proposed metrics, we also devise a unified objective for tuning, which can
consistently improve the quality of synthetic data for all methods. We
conduct extensive evaluations of 8 different types of synthesizers on 12
real-world datasets and report findings that offer new directions for
privacy-preserving data synthesis.
External datasets
Adult
Shoppers
Phishing
Magic
Faults
Bean
Obesity
Robot
Abalone
News
Insurance
Wine