Abstract
Synthetic tabular data generation with differential privacy is a crucial
problem for enabling data sharing with formal privacy guarantees. Despite a
rich history of methodological research, building differentially private
tabular data generators that produce realistic synthetic datasets remains
challenging. This paper introduces DP-LLMTGen, a novel framework for
differentially private tabular data synthesis that leverages pretrained large
language models (LLMs). DP-LLMTGen models sensitive datasets using a two-stage
fine-tuning procedure with a novel loss function designed specifically for
tabular data. It then generates synthetic data by sampling from the
fine-tuned LLMs. Our empirical evaluation demonstrates that DP-LLMTGen
outperforms a variety of existing mechanisms across multiple datasets and
privacy settings. Additionally, we conduct an ablation study and several
experimental analyses to deepen our understanding of how LLMs can address this
important problem. Finally, we demonstrate the controllable generation ability
of DP-LLMTGen in a fairness-constrained generation setting.
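
As context for readers, the sketch below illustrates the general kind of
pipeline the abstract describes: tabular rows serialized as text and a model
fine-tuned under differential privacy via DP-SGD (per-example gradient
clipping plus Gaussian noise). Everything here is an illustrative assumption,
including the serialization format, the function names, and the interface of
model; it does not reproduce the paper's two-stage procedure or its
tabular-specific loss.

    import torch

    def serialize_row(row: dict) -> str:
        # Encode one tabular record as text, e.g. "age is 39, income is 50K".
        # This "col is val" format is an assumed convention, not necessarily
        # the paper's exact serialization scheme.
        return ", ".join(f"{col} is {val}" for col, val in row.items())

    def dp_sgd_step(model, batch, optimizer, clip_norm=1.0, noise_multiplier=1.0):
        """One DP-SGD step: clip each example's gradient to clip_norm,
        sum the clipped gradients, add Gaussian noise, then update."""
        params = [p for p in model.parameters() if p.requires_grad]
        summed = [torch.zeros_like(p) for p in params]
        for example in batch:
            model.zero_grad()
            # Assumption: model maps one serialized example to a scalar loss
            # (e.g., the language-modeling loss on that row's text).
            loss = model(example)
            loss.backward()
            grads = [p.grad.detach().clone() for p in params]
            # Clip this example's gradient to bound its influence (sensitivity).
            total_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
            scale = (clip_norm / (total_norm + 1e-6)).clamp(max=1.0)
            for acc, g in zip(summed, grads):
                acc.add_(g * scale)
        # Add Gaussian noise calibrated to the clipping norm, then average.
        for p, acc in zip(params, summed):
            noise = torch.randn_like(acc) * (noise_multiplier * clip_norm)
            p.grad = (acc + noise) / len(batch)
        optimizer.step()

After fine-tuning, sampled text would be decoded back into tabular form; the
privacy accounting (mapping noise_multiplier and the number of steps to an
(epsilon, delta) guarantee) is likewise omitted from this sketch.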