As machine learning models are used for increasingly sensitive applications, we
rely on interpretability methods to verify that no discriminatory attributes
were used for classification. A potential concern is so-called "fairwashing":
manipulating a model such that the features it actually relies on are hidden
and more innocuous features are presented as important instead.
In our work we present an effective defence against such adversarial attacks
on neural networks. By simply aggregating multiple explanation methods, the
network becomes robust against manipulation. This holds even when the attacker
has exact knowledge of the model weights and of the explanation methods used.
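
To make the aggregation idea concrete, below is a minimal PyTorch sketch. The specific explanation methods (plain gradient, gradient times input, SmoothGrad) and the min-max normalization before averaging are illustrative assumptions, not necessarily the exact choices made in the paper; the point is only that an attacker who manipulates any single method has limited influence on the combined map.

```python
# Minimal sketch: aggregating several saliency-style explanation methods.
# Method choices and normalization are illustrative assumptions.
import torch

def gradient_map(model, x, target):
    """Saliency: absolute gradient of the target logit w.r.t. the input."""
    x = x.clone().requires_grad_(True)
    model(x)[0, target].backward()
    return x.grad.detach().abs()

def grad_times_input(model, x, target):
    """Gradient * input attribution."""
    return gradient_map(model, x, target) * x.detach().abs()

def smoothgrad(model, x, target, n=25, sigma=0.1):
    """Average gradient maps over Gaussian-perturbed copies of the input."""
    maps = [gradient_map(model, x + sigma * torch.randn_like(x), target)
            for _ in range(n)]
    return torch.stack(maps).mean(dim=0)

def aggregate_explanations(model, x, target):
    """Normalize each explanation map to [0, 1], then average them,
    so no single (possibly manipulated) method dominates the result."""
    maps = []
    for method in (gradient_map, grad_times_input, smoothgrad):
        e = method(model, x, target)
        e = (e - e.min()) / (e.max() - e.min() + 1e-12)
        maps.append(e)
    return torch.stack(maps).mean(dim=0)

# Usage (hypothetical classifier and input batch of one image):
#   heatmap = aggregate_explanations(model, image.unsqueeze(0), target=pred)
```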