With the advancement of machine learning (ML) and growing awareness of its
value, many organizations that own data but lack ML expertise (data owners)
would like to pool their data and collaborate with those who have the
expertise but need data from diverse sources to train truly generalizable
models (model owners). In such
collaborative ML, the data owner wants to protect the privacy of its training
data, while the model owner desires the confidentiality of the model and the
training method, which may contain intellectual property. However, existing
private ML solutions, such as federated learning and split learning, cannot
meet the privacy requirements of both data and model owners at the same time.
This paper presents Citadel, a scalable collaborative ML system that protects
the privacy of both data owner and model owner in untrusted infrastructures
with the help of Intel SGX. Citadel performs distributed training across
multiple training enclaves running on behalf of data owners and an aggregator
enclave on behalf of the model owner. Citadel further establishes a strong
information barrier between these enclaves by means of zero-sum masking and
hierarchical aggregation to prevent data/model leakage during collaborative
training. Compared with existing SGX-protected training systems, Citadel
enables better scalability and stronger privacy guarantees for collaborative
ML. Our cloud deployment with various ML models shows that Citadel scales to a
large number of enclaves with less than a 1.73X slowdown caused by SGX.
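To illustrate the zero-sum masking idea mentioned above, here is a minimal sketch: each training enclave perturbs its local update with a random mask, the masks are constructed to sum to zero, and the aggregator therefore recovers only the aggregate while learning nothing about any individual update. This is an illustrative toy (function names, mask generation, and the absence of hierarchical aggregation are all simplifications, not Citadel's actual protocol):

```python
import random

def zero_sum_masks(n, dim, rng):
    """Generate n random masks whose elementwise sum is zero."""
    masks = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n - 1)]
    # The last mask cancels the sum of all the others.
    masks.append([-sum(col) for col in zip(*masks)])
    return masks

rng = random.Random(0)
n, dim = 4, 3  # four training enclaves, toy 3-dimensional updates

# Each training enclave holds a private local model update.
updates = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n)]
masks = zero_sum_masks(n, dim, rng)

# Each enclave releases only its masked update to the aggregator enclave.
masked = [[u + m for u, m in zip(upd, msk)]
          for upd, msk in zip(updates, masks)]

# The aggregator sums the masked updates; the masks cancel out,
# so only the aggregate is revealed -- never an individual update.
agg = [sum(col) for col in zip(*masked)]
true_agg = [sum(col) for col in zip(*updates)]
assert all(abs(a - b) < 1e-9 for a, b in zip(agg, true_agg))
```

A single masked update looks like random noise to the aggregator; only the sum across all enclaves is meaningful, which is what makes the information barrier between training and aggregator enclaves possible.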