Deep neural networks (DNNs) have had many successes, but they suffer from two
major issues: (1) a vulnerability to adversarial examples and (2) a tendency to
elude human interpretation. Interestingly, recent empirical and theoretical
evidence suggests these two seemingly disparate issues are actually connected.
In particular, robust models tend to provide more interpretable gradients than
non-robust models. However, whether this relationship also holds in the opposite
direction remains unclear. In this paper, we seek empirical answers to the
following question: can models acquire adversarial robustness when they are
trained to have interpretable gradients? We introduce a theoretically inspired
technique called Interpretation Regularization (IR), which encourages a model's
gradients to (1) match the direction of interpretable target salience maps and
(2) have small magnitude. To assess model performance and tease apart factors
that contribute to adversarial robustness, we conduct extensive experiments on
MNIST and CIFAR-10 with both $\ell_2$ and $\ell_\infty$ attacks. We demonstrate
that training networks to have interpretable gradients improves their
robustness to adversarial perturbations. Applying the network interpretation
technique SmoothGrad yields additional performance gains, especially in
cross-norm attacks and under heavy perturbations. The results indicate that the
interpretability of a model's gradients is a crucial factor in its adversarial
robustness. Code for the experiments can be found at
https://github.com/a1noack/interp_regularization.
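
As a rough illustration of the regularizer described above, the following
PyTorch-style sketch combines a standard cross-entropy loss with (1) a cosine
penalty that pushes the input gradient toward the direction of a target
salience map and (2) an L2 penalty on the gradient's magnitude. This is a
minimal sketch of the idea, not the implementation from the repository above;
the function and parameter names (interpretation_regularized_loss, lambda_dir,
lambda_mag) are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def interpretation_regularized_loss(model, x, y, target_maps,
                                    lambda_dir=1.0, lambda_mag=0.1):
    """Cross-entropy plus two gradient penalties: (1) a cosine term pushing
    the input gradient toward the target salience map's direction, and
    (2) an L2 term shrinking the gradient's magnitude.
    NOTE: loss form and default weights are illustrative assumptions."""
    x = x.clone().requires_grad_(True)
    logits = model(x)
    ce = F.cross_entropy(logits, y)

    # Gradient of the classification loss w.r.t. the input pixels.
    grad, = torch.autograd.grad(ce, x, create_graph=True)

    g = grad.flatten(1)
    t = target_maps.flatten(1)

    # (1) Direction mismatch: 1 - cosine similarity with the target map.
    dir_penalty = (1.0 - F.cosine_similarity(g, t, dim=1)).mean()
    # (2) Magnitude penalty: squared L2 norm of the input gradient.
    mag_penalty = g.pow(2).sum(dim=1).mean()

    return ce + lambda_dir * dir_penalty + lambda_mag * mag_penalty
```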