Distributed collaborative learning (DCL) paradigms enable multiple mutually
distrusting parties to jointly build machine learning models. Data
confidentiality is guaranteed by retaining private training data on each
participant's local infrastructure. However, this approach to achieving data
confidentiality makes today's DCL designs fundamentally vulnerable to data
poisoning and backdoor attacks. It also limits DCL's model accountability,
which is key to tracing misbehavior back to the responsible "bad" training data
instances and contributors. In this paper, we introduce CALTRAIN, a Trusted
Execution Environment (TEE)-based centralized multi-party collaborative
learning system that simultaneously achieves data confidentiality and model
accountability. CALTRAIN enforces isolated computation on centrally aggregated
training data to guarantee data confidentiality. To support building
accountable learning models, we securely maintain the links between training
instances and their corresponding contributors. Our evaluation shows that
models produced by CALTRAIN achieve the same prediction accuracy as models
trained in non-protected environments. We also
demonstrate that when malicious training participants attempt to implant
backdoors during model training, CALTRAIN can accurately and precisely identify
the poisoned and mislabeled training data that lead to runtime mispredictions.