We present a novel adversarial detection and correction method for machine
learning classifiers. The detector consists of an autoencoder trained with a
custom loss function based on the Kullback-Leibler divergence between the
classifier predictions on the original and reconstructed instances (see the
sketch below). The method is unsupervised, easy to train, and does not require
any knowledge of the
underlying attack. The detector almost completely neutralises powerful attacks
like Carlini-Wagner or SLIDE on MNIST and Fashion-MNIST, and remains very
effective on CIFAR-10 when the attack is granted full access to the
classification model but not the defence. We show that our method can still
detect adversarial examples in the white-box setting, where the attacker has
full knowledge of both the model and the defence, and we investigate the
robustness of the attack. The method is flexible and can also be used
to detect common data corruptions and perturbations that negatively impact
model performance. We illustrate this capability on the CIFAR-10-C dataset.
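
To make the training objective concrete, the following is a minimal sketch of how such a detector could be trained, assuming a frozen, pretrained Keras classifier `clf` with softmax outputs and an autoencoder `ae`. The names, architecture, and TensorFlow framing are illustrative assumptions, not the authors' exact implementation.

```python
# A minimal sketch of the objective described in the abstract: the
# autoencoder is fit to minimise the KL divergence between the frozen
# classifier's predictions on the original and reconstructed instances.
# `ae`, `clf`, and the softmax-output assumption are illustrative.
import tensorflow as tf

kld = tf.keras.losses.KLDivergence(reduction=tf.keras.losses.Reduction.NONE)

def detector_loss(clf: tf.keras.Model, x: tf.Tensor,
                  x_recon: tf.Tensor) -> tf.Tensor:
    """KL(clf(x) || clf(x_recon)), averaged over the batch."""
    p = clf(x, training=False)        # predictions on original instances
    q = clf(x_recon, training=False)  # predictions on reconstructions
    return tf.reduce_mean(kld(p, q))  # no pixel-space reconstruction term

def train_step(ae: tf.keras.Model, clf: tf.keras.Model,
               opt: tf.keras.optimizers.Optimizer, x: tf.Tensor) -> tf.Tensor:
    # Only the autoencoder's weights are updated; the classifier stays frozen.
    with tf.GradientTape() as tape:
        loss = detector_loss(clf, x, ae(x, training=True))
    grads = tape.gradient(loss, ae.trainable_variables)
    opt.apply_gradients(zip(grads, ae.trainable_variables))
    return loss
```

At test time, the same divergence KL(clf(x) || clf(ae(x))) can serve as the adversarial score: instances whose score exceeds a threshold are flagged as adversarial, and the prediction on the reconstruction ae(x) provides the correction.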