KnowledgeSG: Privacy-Preserving Synthetic Text Generation with Knowledge Distillation from Server

Information in the literature database is collected automatically.

Source

https://arxiv.org/abs/2410.05725

PDF

https://arxiv.org/pdf/2410.05725

Paper Information

Author: Wenhao Wang; Xiaoyu Liang; Rui Ye; Jingyi Chai; Siheng Chen; Yanfeng Wang
Published: 10-8-2024
Updated: 10-10-2024
Affiliation: Zhejiang University
Country: China

Labels Estimated by AI

Privacy Protection

Privacy Protection Method

These labels were automatically added by AI and may be inaccurate.

Abstract

The success of large language models (LLMs) has led many parties to fine-tune LLMs on their own private data. However, this practice raises privacy concerns because LLMs can memorize their training data. Existing solutions, such as substituting synthetic data for the private data, struggle to improve performance and preserve privacy at the same time. They either rely on a local model for generation, which degrades performance, or use external APIs, which directly exposes the data to API servers. To address this issue, we propose KnowledgeSG, a novel client-server framework that enhances synthetic data quality and improves model performance while ensuring privacy. We achieve this by learning local knowledge from the private data with differential privacy (DP) and distilling professional knowledge from the server. Additionally, inspired by federated learning, we transmit models rather than data between the client and server to prevent privacy leakage. Extensive experiments in the medical and financial domains demonstrate the effectiveness of KnowledgeSG. Our code is publicly available at https://github.com/wwh0411/KnowledgeSG.
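The abstract says local knowledge is learned from private data with differential privacy but does not name the mechanism. As a rough illustration only, the sketch below assumes DP-SGD-style training, one common choice: each per-example gradient is clipped to a fixed norm, Gaussian noise scaled to that norm is added to the sum, and the result is averaged. The function names and parameters (`clip_gradient`, `dp_average_gradient`, `max_norm`, `noise_multiplier`) are hypothetical and are not taken from the KnowledgeSG code.

```python
import math
import random


def clip_gradient(grad, max_norm):
    """Scale a gradient vector down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_average_gradient(per_example_grads, max_norm, noise_multiplier, rng):
    """One DP-SGD-style aggregation step (assumed mechanism, not the paper's code).

    1. Clip every per-example gradient to L2 norm <= max_norm.
    2. Sum the clipped gradients.
    3. Add Gaussian noise with std = noise_multiplier * max_norm per coordinate.
    4. Divide by the batch size.
    """
    clipped = [clip_gradient(g, max_norm) for g in per_example_grads]
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    sigma = noise_multiplier * max_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(clipped) for v in noisy]


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy per-example gradients; the first has norm 5 and will be clipped.
    grads = [[3.0, 4.0], [0.3, 0.4]]
    print(dp_average_gradient(grads, max_norm=1.0, noise_multiplier=1.0, rng=rng))
```

Clipping bounds how much any single private record can move the model, and the noise masks what remains of that influence. In practice the noise multiplier would be chosen from a target privacy budget using a privacy accountant, which this sketch omits.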

External Datasets

HealthCareMagic-100k

financial sentiment analysis dataset

FinGPT