Deep neural networks are vulnerable to adversarial examples. Prior defenses
attempted to make deep networks more robust by either changing the network
architecture or augmenting the training set with adversarial examples, but both
approaches have inherent limitations. Motivated by recent research showing that
outliers in the training set have a strong negative influence on the trained
model, we study the relationship between model robustness and the quality of the
training set. We first show that outliers give the model better generalization
ability but weaker robustness. Next, we propose an adversarial example detection
framework in which we design two methods for removing outliers from the training
set to obtain a sanitized model, and then detect adversarial examples by
measuring the difference between the outputs of the original and sanitized
models. We evaluated the framework on both MNIST and SVHN. Using the
Kullback-Leibler divergence to measure this difference, we detect adversarial
examples with accuracy ranging from 94.67% to 99.89%.
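As a concrete sketch of the detection criterion (the notation here is ours, not
taken from the abstract): let $f(x)$ and $f_s(x)$ denote the softmax output
vectors of the original and sanitized models on an input $x$. The difference is
measured by the Kullback-Leibler divergence

$$D_{\mathrm{KL}}\!\left(f(x) \,\|\, f_s(x)\right) = \sum_{i} f(x)_i \log \frac{f(x)_i}{f_s(x)_i},$$

the intuition being that this divergence is larger for adversarial inputs than
for benign ones; presumably an input is flagged as adversarial when the
divergence exceeds some threshold, though the exact decision rule is an
assumption on our part.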