These labels were automatically added by AI and may be inaccurate. For details, see About Literature Database.
Abstract
Introduction: The amount of data generated by original research is growing
exponentially. Publicly releasing them is recommended to comply with the Open
Science principles. However, data collected from human participants cannot be
released as-is without raising privacy concerns. Fully synthetic data represent
a promising answer to this challenge. This approach is explored by the French
Centre de Recherche en {\'E}pid{\'e}miologie et Sant{\'e} des Populations in
the form of a synthetic data generation framework based on Classification and
Regression Trees and an original distance-based filtering. The goal of this
work was to develop a refined version of this framework and to assess its
risk-utility profile with empirical and formal tools, including novel ones
developed for the purpose of this evaluation.Materials and Methods: Our
synthesis framework consists of four successive steps, each of which is
designed to prevent specific risks of disclosure. We assessed its performance
by applying two or more of these steps to a rich epidemiological dataset.
Privacy and utility metrics were computed for each of the resulting synthetic
datasets, which were further assessed using machine learning
approaches.Results: Computed metrics showed a satisfactory level of protection
against attribute disclosure attacks for each synthetic dataset, especially
when the full framework was used. Membership disclosure attacks were formally
prevented without significantly altering the data. Machine learning approaches
showed a low risk of success for simulated singling out and linkability
attacks. Distributional and inferential similarity with the original data were
high with all datasets.Discussion: This work showed the technical feasibility
of generating publicly releasable synthetic data using a multi-step framework.
Formal and empirical tools specifically developed for this demonstration are a
valuable contribution to this field. Further research should focus on the
extension and validation of these tools, in an effort to specify the intrinsic
qualities of alternative data synthesis methods.Conclusion: By successfully
assessing the quality of data produced using a novel multi-step synthetic data
generation framework, we showed the technical and conceptual soundness of the
Open-CESP initiative, which seems ripe for full-scale implementation.
External Datasets
Mental Health In Prison study dataset
synthetic versions of the French National Health Data System (SNDS)
NIH National COVID Cohort Collaborative
cancer data provided by the National Disease Registration Service