When machine learning models are trained on synthetic data and then deployed
on real data, there is often a performance drop due to the distribution shift
between synthetic and real data. In this paper, we introduce a new ensemble
strategy for training downstream models, with the goal of enhancing their
performance when used on real data. We generate multiple synthetic datasets by
applying a differential privacy (DP) mechanism several times in parallel and
then ensemble the downstream models trained on these datasets. While each
synthetic dataset may individually deviate more from the real data
distribution, together the datasets increase sample diversity, which can make
downstream models more robust to the distribution shift. Our extensive
experiments reveal that for synthetic data generated by marginal-based or
workload-based DP mechanisms, ensembling does not improve downstream
performance compared with training a single model; for GAN-based DP
mechanisms, however, our proposed ensemble strategy improves both the accuracy
and the calibration of downstream models.
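
To make the strategy above concrete, the sketch below generates several synthetic
datasets in parallel, trains one downstream classifier per dataset, and averages
the classifiers' predicted probabilities at test time. The dp_synthesize helper
(a toy noisy resampler that is not actually differentially private), the even
split of the privacy budget across runs, the choice of logistic regression as the
downstream model, and soft-voting of probabilities are all illustrative
assumptions rather than details specified in this paper.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def dp_synthesize(X, y, epsilon, rng):
    """Stand-in for a DP synthetic-data mechanism (marginal-, workload-, or
    GAN-based). This toy version resamples the private data and perturbs the
    features with noise scaled by 1/epsilon; it provides NO privacy guarantee
    and exists only to keep the example runnable."""
    idx = rng.integers(0, len(X), size=len(X))
    X_syn = X[idx] + rng.normal(scale=1.0 / epsilon, size=X[idx].shape)
    return X_syn, y[idx]


def train_parallel_ensemble(X_priv, y_priv, total_epsilon, n_runs, seed=0):
    """Run the mechanism n_runs times in parallel (splitting the privacy
    budget evenly, an assumed composition choice) and train one downstream
    model per synthetic dataset."""
    rng = np.random.default_rng(seed)
    eps_per_run = total_epsilon / n_runs
    models = []
    for _ in range(n_runs):
        X_syn, y_syn = dp_synthesize(X_priv, y_priv, eps_per_run, rng)
        models.append(LogisticRegression(max_iter=1000).fit(X_syn, y_syn))
    return models


def ensemble_predict_proba(models, X):
    """Average the per-model class probabilities (soft voting)."""
    return np.mean([m.predict_proba(X) for m in models], axis=0)


if __name__ == "__main__":
    # Simulated "private" training data and held-out "real" evaluation data.
    X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
    X_priv, X_real, y_priv, y_real = train_test_split(
        X, y, test_size=0.5, random_state=0)

    models = train_parallel_ensemble(X_priv, y_priv, total_epsilon=1.0, n_runs=5)
    proba = ensemble_predict_proba(models, X_real)
    print("ensemble accuracy on held-out data:",
          accuracy_score(y_real, proba.argmax(axis=1)))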