Computing the area under the curve (AUC) as a performance measure to compare the
quality of different machine learning models is one of the final steps of many
research projects. Many of these models are trained on privacy-sensitive data,
and several approaches, such as $\epsilon$-differential privacy, federated
machine learning, and cryptography, exist for cases in which the datasets cannot
be shared or pooled at one site for training and/or testing. In this setting,
computing the global AUC is also problematic, since the labels themselves may
contain privacy-sensitive information. Approaches based on
$\epsilon$-differential privacy have been proposed to address this problem, but
to the best of our knowledge, no exact privacy-preserving solution has been
introduced. In this
paper, we propose a solution based on secure multi-party computation (MPC),
called ppAURORA, which privately merges individually sorted lists from multiple
sources to compute the exact AUC, as one could obtain on the pooled original
test samples. With ppAURORA, the
computation of the exact area under precision-recall and receiver operating
characteristic curves is possible even when ties between prediction confidence
values exist. We use ppAURORA to evaluate two different models predicting acute
myeloid leukemia therapy response and heart disease, respectively. We also
assess its scalability via experiments on synthetic data. All these experiments
show that, in the semi-honest adversary setting, ppAURORA efficiently and
privately computes exactly the same AUC for both evaluation metrics as one
would obtain on the pooled test samples in plaintext.