Reliable deployment of machine learning models such as neural networks
continues to be challenging due to several limitations. Some of the main
shortcomings are the lack of interpretability and the lack of robustness
against adversarial examples or out-of-distribution inputs. In this exploratory
review, we examine the possibilities and limits of adversarial attacks on
explainable machine learning models. First, we extend the notion of adversarial
examples to fit explainable machine learning scenarios, in which the inputs,
the output classifications and the explanations of the model's decisions are
assessed by humans. Next, we propose a comprehensive framework to study whether
(and how) adversarial examples can be generated for explainable models under
human assessment, introducing and illustrating novel attack paradigms. In
particular, our framework considers a wide range of relevant yet often ignored
factors, such as the type of problem, the user's expertise, or the objective of
the explanations, in order to identify the attack strategies that should be
adopted in each scenario to successfully deceive the model (and the human). The
These contributions are intended to serve as a basis for a more rigorous and
realistic study of adversarial examples in the field of explainable machine
learning.