Adversarial training is a training scheme designed to counter adversarial
attacks by augmenting the training dataset with adversarial examples.
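For concreteness, a minimal sketch of this data-augmentation scheme, assuming a PyTorch-style setup with a single FGSM step (the `model`, `loader`, and `eps` names are placeholders, not the exact configuration studied in this paper):

```python
import torch
import torch.nn.functional as F

def fgsm_example(model, x, y, eps=8 / 255):
    """Craft an adversarial example with one signed-gradient (FGSM) step."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    grad = torch.autograd.grad(loss, x_adv)[0]
    return (x_adv + eps * grad.sign()).clamp(0, 1).detach()

def adversarial_training_epoch(model, loader, optimizer, eps=8 / 255):
    """Augment each batch with adversarial examples and train on both."""
    model.train()
    for x, y in loader:
        x_adv = fgsm_example(model, x, y, eps)
        optimizer.zero_grad()
        # Loss on clean and adversarial inputs (one common augmentation variant).
        loss = F.cross_entropy(model(x), y) + F.cross_entropy(model(x_adv), y)
        loss.backward()
        optimizer.step()
```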
Surprisingly, several studies have observed that loss gradients from
adversarially trained DNNs are visually more interpretable than those from
standard DNNs. Although this phenomenon is interesting, only a few works have
offered an explanation. In this paper, we attempted to bridge
this gap between adversarial robustness and gradient interpretability. To this
end, we identified that loss gradients from adversarially trained DNNs align
better with human perception because adversarial training constrains loss
gradients to lie closer to the image manifold. We then demonstrated that adversarial training
causes loss gradients to be quantitatively meaningful. Finally, we showed that
under the adversarial training framework, there exists an empirical trade-off
between test accuracy and loss gradient interpretability, and we proposed two
potential approaches to resolving it.