We propose a scheme for defending against adversarial attacks by suppressing
the largest eigenvalue of the Fisher information matrix (FIM). Our starting
point is an explanation of the rationale behind adversarial examples: when the
difference between a benign sample and its adversarial example is measured by
the Euclidean norm, while the difference between their classification
probability densities at the last (softmax) layer of the network is measured by
the Kullback-Leibler (KL) divergence, the explanation shows that the output
difference is a quadratic form of the input difference.
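Concretely, in notation we introduce here only for illustration (the paper's own symbols may differ), for an input $x$ with predicted class distribution $p(y \mid x)$ and a small perturbation $\eta$, this relation can be written as
\[
D_{\mathrm{KL}}\!\big(p(y \mid x) \,\|\, p(y \mid x+\eta)\big) \approx \tfrac{1}{2}\, \eta^{\top} G_x\, \eta,
\qquad
G_x = \mathbb{E}_{y \sim p(y \mid x)}\!\left[ \nabla_x \log p(y \mid x)\, \nabla_x \log p(y \mid x)^{\top} \right],
\]
where $G_x$ is the FIM of the output distribution with respect to the input.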
If the eigenvalues of this quadratic form (i.e., of the FIM) are large, the
output difference becomes large even when the input difference is small, which
explains the adversarial phenomenon. This makes adversarial defense possible by
controlling the eigenvalues of the FIM. Our solution is to add a term
representing the trace of the FIM to the loss function of the original network,
since the largest eigenvalue is bounded by the trace.
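A minimal PyTorch sketch of how such a regularizer could be estimated is given below; the Monte-Carlo estimator of the FIM trace, the penalty weight `lam`, and the function names are illustrative assumptions on our part, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def fim_trace_penalty(model, x, n_samples=1):
    """Monte-Carlo estimate of tr(G_x), the trace of the FIM of the softmax
    output with respect to the input x:
        tr(G_x) = E_{y ~ p(y|x)} || grad_x log p(y|x) ||^2.
    Labels are sampled from the predicted distribution, and the squared norm
    of the input gradient of the log-likelihood is averaged over the batch.
    """
    x = x.clone().requires_grad_(True)
    log_probs = F.log_softmax(model(x), dim=1)            # (batch, classes)
    probs = log_probs.detach().exp()
    estimate = x.new_zeros(())
    for _ in range(n_samples):
        y = torch.multinomial(probs, 1)                   # sample y ~ p(y|x)
        log_lik = log_probs.gather(1, y).sum()            # sum_i log p(y_i|x_i)
        (grad_x,) = torch.autograd.grad(log_lik, x, create_graph=True)
        estimate = estimate + grad_x.pow(2).flatten(1).sum(1).mean()
    return estimate / n_samples

def total_loss(model, x, target, lam=0.01):
    """Original cross-entropy loss plus the FIM-trace penalty (weight lam)."""
    return F.cross_entropy(model(x), target) + lam * fim_trace_penalty(model, x)
```

Because the input gradient is taken with `create_graph=True`, the penalty itself remains differentiable with respect to the network parameters and can be minimized during retraining.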
Our defensive scheme is verified by experiments using a variety of common
attack methods on typical deep neural networks, e.g., LeNet, VGG, and ResNet,
with the MNIST, CIFAR-10, and German Traffic Sign Recognition Benchmark (GTSRB)
datasets. Our new
network, after adopting the novel loss function and retraining, has an
effective and robust defensive capability, as it decreases the fooling ratio of
the generated adversarial examples while retaining the classification accuracy
of the original network.