Why You Should Not Trust Interpretations in Machine Learning: Adversarial Attacks on Partial Dependence Plots


arxiv

AI Security Portal bot

The information in the literature database is collected automatically.

Source

https://arxiv.org/abs/2404.18702

PDF

https://arxiv.org/pdf/2404.18702

Bibliographic information

Authors: Xi Xin; Giles Hooker; Fei Huang
Published: 2024-4-29
Updated: 2024-5-1
Affiliation: UNSW Sydney, School of Risk and Actuarial Studies
Country of affiliation: Australia
Venue: Computing Research Repository (CoRR)

Labels estimated by AI

Model interpretability; Adversarial training; Watermark evaluation

* These labels were added automatically by AI and may therefore be inaccurate.
For details, see "About the literature database".

Abstract

The adoption of artificial intelligence (AI) across industries has led to the widespread use of complex black-box models and interpretation tools for decision making. This paper proposes an adversarial framework to uncover the vulnerability of permutation-based interpretation methods for machine learning tasks, with a particular focus on partial dependence (PD) plots. This adversarial framework modifies the original black box model to manipulate its predictions for instances in the extrapolation domain. As a result, it produces deceptive PD plots that can conceal discriminatory behaviors while preserving most of the original model's predictions. This framework can produce multiple fooled PD plots via a single model. By using real-world datasets including an auto insurance claims dataset and COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) dataset, our results show that it is possible to intentionally hide the discriminatory behavior of a predictor and make the black-box model appear neutral through interpretation tools like PD plots while retaining almost all the predictions of the original black-box model. Managerial insights for regulators and practitioners are provided based on the findings.
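
The mechanism the abstract describes can be illustrated with a toy example. The sketch below is not the authors' framework; it is a hand-built illustration under simplifying assumptions: two strongly correlated synthetic features, an "original" model that depends only on the first feature, and a hypothetical `in_domain` rule with an arbitrary tolerance `tol` that marks where real data lives. The modified model returns the original prediction inside that region and a constant outside it. Because a PD plot forces one feature to a grid value for every row, most of the points it evaluates fall outside the data region, so the curve flattens even though predictions on the actual data are unchanged.

```python
import numpy as np

def partial_dependence(model, X, feature, grid):
    """Average prediction when column `feature` is forced to each grid value."""
    pd_values = []
    for v in grid:
        X_mod = X.copy()
        X_mod[:, feature] = v          # creates many off-distribution rows
        pd_values.append(model(X_mod).mean())
    return np.array(pd_values)

# Synthetic data (assumption): x2 is a noisy copy of x1, so real rows
# always satisfy x1 ~ x2.
rng = np.random.default_rng(0)
n = 2000
x1 = rng.uniform(0.0, 1.0, n)
x2 = x1 + rng.normal(0.0, 0.02, n)
X = np.column_stack([x1, x2])

def original_model(X):
    # Toy "black box" whose output depends directly on feature 0.
    return X[:, 0]

baseline = original_model(X).mean()

def in_domain(X, tol=0.1):
    # Hypothetical rule for "looks like real data": x1 and x2 nearly agree.
    return np.abs(X[:, 0] - X[:, 1]) <= tol

def adversarial_model(X):
    # Same predictions on realistic rows, a constant in the extrapolation region.
    return np.where(in_domain(X), original_model(X), baseline)

grid = np.linspace(0.1, 0.9, 9)
pd_original = partial_dependence(original_model, X, 0, grid)
pd_fooled = partial_dependence(adversarial_model, X, 0, grid)

# Share of real rows on which the two models give the same prediction.
agreement = np.mean(np.isclose(original_model(X), adversarial_model(X)))

print("PD range, original model:", pd_original.max() - pd_original.min())
print("PD range, modified model:", pd_fooled.max() - pd_fooled.min())
print("agreement on real data:  ", agreement)
```

In this toy setting the modified model's PD curve is much flatter than the original one while the two models agree on nearly every observed row. The paper's framework pursues the same goal with a learned modification rather than a fixed rule like the one assumed here.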