Adversarial training is a training scheme designed to counter adversarial
attacks by augmenting the training dataset with adversarial examples.
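For concreteness, a minimal sketch of this data-augmentation scheme, assuming a PyTorch-style setup with a single FGSM step (the `model`, `loader`, and `eps` names are placeholders, not the exact configuration studied in this paper):

```python
import torch
import torch.nn.functional as F

def fgsm_example(model, x, y, eps=8 / 255):
    """Craft an adversarial example with one signed-gradient (FGSM) step."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    grad = torch.autograd.grad(loss, x_adv)[0]
    return (x_adv + eps * grad.sign()).clamp(0, 1).detach()

def adversarial_training_epoch(model, loader, optimizer, eps=8 / 255):
    """Augment each batch with adversarial examples and train on both."""
    model.train()
    for x, y in loader:
        x_adv = fgsm_example(model, x, y, eps)
        optimizer.zero_grad()
        # Loss on clean and adversarial inputs (one common augmentation variant).
        loss = F.cross_entropy(model(x), y) + F.cross_entropy(model(x_adv), y)
        loss.backward()
        optimizer.step()
```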
Surprisingly, several studies have observed that loss gradients from
adversarially trained DNNs are visually more interpretable than those from
standard DNNs. Although this phenomenon is interesting, only a few works have
offered an explanation. In this paper, we attempted to bridge
this gap between adversarial robustness and gradient interpretability. To this
end, we identified that loss gradients from adversarially trained DNNs align
better with human perception because adversarial training constrains loss
gradients to lie closer to the image manifold. We then demonstrated that adversarial training
causes loss gradients to be quantitatively meaningful. Finally, we showed that
under the adversarial training framework, there exists an empirical trade-off
between test accuracy and loss gradient interpretability, and we proposed two
potential approaches to resolving it.