Deep learning interpretation is essential to explain the reasoning behind
model predictions. Understanding the robustness of interpretation methods is
especially important in sensitive domains such as medical applications, since
interpretation results are often used in downstream tasks. Although
gradient-based saliency maps are popular methods for deep learning
interpretation, recent work has shown that they can be vulnerable to adversarial
attacks. In this paper, we address this problem and provide a certifiable
defense method for deep learning interpretation. We show that a sparsified
version of the popular SmoothGrad method, which averages saliency
maps over random perturbations of the input, is certifiably robust against
adversarial perturbations. We obtain this result by extending recent bounds for
certifiably robust smoothed classifiers to the interpretation setting.
Experiments on ImageNet samples validate our theory.
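For a concrete picture of the procedure the abstract refers to, the snippet below is a minimal PyTorch sketch of a sparsified SmoothGrad saliency map: gradients are averaged over Gaussian perturbations of the input, and the result is sparsified. It assumes a standard classifier `model` that accepts a batched image tensor; the function name, the top-k sparsification rule, and all hyperparameter defaults (`n_samples`, `sigma`, `top_k_frac`) are illustrative assumptions, not the paper's exact settings or certificate.

```python
import torch

def sparsified_smoothgrad(model, x, target_class,
                          n_samples=64, sigma=0.15, top_k_frac=0.05):
    """Hypothetical sketch: SmoothGrad (gradient averaging over Gaussian
    perturbations) followed by top-k sparsification. Defaults are illustrative."""
    x = x.detach()
    accum = torch.zeros_like(x)
    for _ in range(n_samples):
        # Perturb the input with Gaussian noise and take the input gradient
        # of the target-class score (the usual saliency map).
        noisy = (x + sigma * torch.randn_like(x)).requires_grad_(True)
        score = model(noisy.unsqueeze(0))[0, target_class]
        grad, = torch.autograd.grad(score, noisy)
        accum += grad
    saliency = (accum / n_samples).abs()

    # Sparsify: keep only the k largest-magnitude entries, zero out the rest.
    k = max(1, int(top_k_frac * saliency.numel()))
    threshold = saliency.flatten().topk(k).values.min()
    return torch.where(saliency >= threshold, saliency, torch.zeros_like(saliency))
```

The top-k step here is just one plausible reading of "sparsified"; the robustness certificate itself is not reproduced in this sketch.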