Objective: To enable privacy-preserving learning of high-quality generative
and discriminative machine learning models from distributed electronic health
records.
Methods and Results: We describe a general and scalable strategy to build
machine learning models in a provably privacy-preserving way. Compared to the
standard approaches using, e.g., differential privacy, our method does not
require alteration of the input biomedical data, works with completely or
partially distributed datasets, and is resilient as long as the majority of the
sites participating in data processing are trusted not to collude. We show how
the proposed strategy can be applied to distributed medical records to solve
the variables assignment problem, the key task in exact feature selection and
Bayesian network learning.
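The abstract does not name the underlying protocol, so the following is only an illustrative sketch of how resilience under a non-colluding majority can arise, not the authors' method. It assumes Shamir secret sharing over a prime field (a swapped-in, standard technique) and uses it to sum per-site record counts, the kind of aggregate statistic that scoring a candidate variable set typically needs, without any site revealing its own count. The names `share`, `reconstruct`, `secure_sum` and the modulus `PRIME` are hypothetical.

```python
import random

PRIME = 2**61 - 1  # assumed field modulus (a Mersenne prime), illustrative only


def share(secret, n_sites, threshold, rng):
    """Split `secret` into n_sites Shamir shares; any `threshold` of them reconstruct it."""
    # Random polynomial of degree threshold-1 whose constant term is the secret.
    coeffs = [secret % PRIME] + [rng.randrange(PRIME) for _ in range(threshold - 1)]
    shares = []
    for x in range(1, n_sites + 1):
        y = 0
        for c in reversed(coeffs):  # Horner evaluation modulo PRIME
            y = (y * x + c) % PRIME
        shares.append((x, y))
    return shares


def reconstruct(shares):
    """Recover the secret by Lagrange interpolation at x = 0."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret


def secure_sum(local_counts, rng=None):
    """Sum one private count per site; only the total is ever reconstructed."""
    rng = rng or random.SystemRandom()
    n = len(local_counts)
    threshold = n // 2 + 1  # a strict majority of shares is needed to reconstruct
    # Each site shares its count; site k receives the k-th share of every count.
    all_shares = [share(c, n, threshold, rng) for c in local_counts]
    # Each site adds the shares it holds locally (shares are additively homomorphic).
    summed = [
        (k + 1, sum(all_shares[i][k][1] for i in range(n)) % PRIME)
        for k in range(n)
    ]
    return reconstruct(summed[:threshold])


if __name__ == "__main__":
    print(secure_sum([12, 7, 30, 5, 9]))  # prints 63
```

In this sketch the input records are never perturbed, in contrast to noise-based approaches, and a coalition smaller than `threshold` sees only uniformly random field elements. A real deployment would additionally need authenticated channels and protection against sites that deviate from the protocol, which this example does not cover.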
Conclusions: Our proposed architecture can be used by health care
organizations, spanning providers, insurers, researchers and computational
service providers, to build robust and high-quality predictive models in cases
where distributed data has to be combined without being disclosed, altered or
otherwise compromised.