Modern machine learning methods, including deep learning, have achieved great
success in predictive accuracy for supervised learning tasks, but may still
fall short in giving useful estimates of their predictive {\em uncertainty}.
Quantifying uncertainty is especially critical in real-world settings, which
often involve input distributions that are shifted from the training
distribution due to a variety of factors including sample bias and
non-stationarity. In such settings, well-calibrated uncertainty estimates
convey information about when a model's output should (or should not) be
trusted. Many probabilistic deep learning methods, including Bayesian and
non-Bayesian methods, have been proposed in the literature for quantifying
predictive uncertainty, but to our knowledge there has not previously been a
rigorous large-scale empirical comparison of these methods under dataset shift.
We present a large-scale benchmark of existing state-of-the-art methods on
classification problems and investigate the effect of dataset shift on accuracy
and calibration. We find that traditional post-hoc calibration does indeed fall
short, as do several other previous methods. However, some methods that
marginalize over models give surprisingly strong results across a broad
spectrum of tasks.