Systematic Assessment of Tabular Data Synthesis Algorithms | AI Security Portal

arxiv

Systematic Assessment of Tabular Data Synthesis Algorithms

AI Security Portal bot

Information in the literature database is collected automatically.

Source

https://arxiv.org/abs/2402.06806

PDF

https://arxiv.org/pdf/2402.06806

Bibliographic Information

Authors: Yuntao Du, Ninghui Li
Published: 2024-02-10
Updated: 2024-04-13
Affiliation: Purdue University
Country of affiliation: United States of America
Conference name:

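Labels estimated by AI

Data privacy evaluation, Data generation, Privacy protection methods

Note: These labels were added automatically by AI and may therefore be inaccurate.
For details, see "About the Literature Database".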

Abstract

Data synthesis has been advocated as an important approach for utilizing data while protecting data privacy. A large number of tabular data synthesis algorithms (which we call synthesizers) have been proposed. Some synthesizers satisfy Differential Privacy, while others aim to provide privacy in a heuristic fashion. A comprehensive understanding of the strengths and weaknesses of these synthesizers remains elusive due to drawbacks in evaluation metrics and missing head-to-head comparisons of newly developed synthesizers that take advantage of diffusion models and large language models with state-of-the-art marginal-based synthesizers. In this paper, we present a systematic evaluation framework for assessing tabular data synthesis algorithms. Specifically, we examine and critique existing evaluation metrics, and introduce a set of new metrics in terms of fidelity, privacy, and utility to address their limitations. Based on the proposed metrics, we also devise a unified objective for tuning, which can consistently improve the quality of synthetic data for all methods. We conducted extensive evaluations of 8 different types of synthesizers on 12 real-world datasets and identified some interesting findings, which offer new directions for privacy-preserving data synthesis.

External Datasets

Adult

Shoppers

Phishing

Magic

Faults

Bean

Obesity

Robot

Abalone

News

Insurance

Wine

References

T. Akiba, S. Sano, T. Yanase, T. Ohta, M. Koyama. Optuna: A next-generation hyperparameter optimization framework. Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2019.

Ahmed Alaa, Boris Van Breugel, Evgeny S Saveliev, Mihaela van der Schaar. How faithful is your synthetic data? Sample-level metrics for evaluating and auditing generative models. International Conference on Machine Learning, 2022.

Christian Arnold, Marcel Neunhoeffer. Really Useful Synthetic Data–A Framework to Evaluate the Quality of Differentially Private Synthetic Data. 2020.

How to evaluate the quality of the synthetic data: measuring from the perspective of fidelity, utility, and privacy. 2022.

Gary Benedetto, Martha Stinson, John M Abowd. The creation and use of the SIPP Synthetic. 2018.

Nicolas Bonneel, Julien Rabin, Gabriel Peyré, Hanspeter Pfister. Sliced and Radon Wasserstein barycenters of measures. Journal of Mathematical Imaging and Vision, 2015.

V. Borisov, K. Seßler, T. Leemann, M. Pawelczyk, G. Kasneci. Language models are realistic tabular data generators. 2022.

Olivier Bousquet, André Elisseeff. Stability and generalization. The Journal of Machine Learning Research, 2002.

Claire McKay Bowen, Joshua Snoke. Comparative study of differentially private synthetic data algorithms from the NIST PSCR differential privacy synthetic data challenge. 2019.

Nicholas Carlini, Steve Chien, Milad Nasr, Shuang Song, Andreas Terzis, Florian Tramer. Membership inference attacks from first principles. 2022 IEEE Symposium on Security and Privacy (SP), 2022.