AI Security Portal K Program
Why You Should Not Trust Interpretations in Machine Learning: Adversarial Attacks on Partial Dependence Plots
Abstract
The adoption of artificial intelligence (AI) across industries has led to the widespread use of complex black-box models, together with interpretation tools, for decision making. This paper proposes an adversarial framework that uncovers the vulnerability of permutation-based interpretation methods for machine learning, with a particular focus on partial dependence (PD) plots. The framework modifies the original black-box model so as to manipulate its predictions for instances in the extrapolation domain. As a result, it produces deceptive PD plots that conceal discriminatory behavior while preserving most of the original model's predictions; a single modified model can yield multiple fooled PD plots. Using real-world datasets, including an auto insurance claims dataset and the COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) dataset, our results show that it is possible to intentionally hide the discriminatory behavior of a predictor and make the black-box model appear neutral through interpretation tools such as PD plots, while retaining almost all of the original model's predictions. Managerial insights for regulators and practitioners are provided based on these findings.
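The attack described above exploits the fact that PD plots average predictions over points created by forcing one feature to a grid value while leaving the others untouched; when features are correlated, many of these points lie off the data distribution. The following is a minimal illustrative sketch (not the paper's actual method) of the idea: a wrapper model behaves like the discriminatory black box on-distribution, but returns a neutral constant on extrapolation points, flattening the PD curve while agreeing with the original model on virtually all real data. All names (`black_box`, `adversarial`, `pd_curve`) and the synthetic data are invented for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: the sensitive feature x0 is strongly correlated with x1.
n = 2000
x0 = rng.normal(size=n)
x1 = x0 + 0.1 * rng.normal(size=n)  # near-duplicate of x0
X = np.column_stack([x0, x1])

def black_box(X):
    """Discriminatory model: predictions driven entirely by the sensitive x0."""
    return 2.0 * X[:, 0]

def adversarial(X, tol=0.5):
    """Wrapper: matches black_box on-distribution, but outputs a neutral
    constant on extrapolation points where |x0 - x1| is large -- exactly
    the off-distribution points that PD's marginal substitution creates."""
    preds = black_box(X)
    ood = np.abs(X[:, 0] - X[:, 1]) > tol
    preds[ood] = 0.0  # neutral output off-distribution
    return preds

def pd_curve(model, X, feature, grid):
    """Partial dependence: average prediction with `feature` forced to each grid value."""
    curve = []
    for v in grid:
        Xv = X.copy()
        Xv[:, feature] = v
        curve.append(model(Xv).mean())
    return np.array(curve)

grid = np.linspace(-2, 2, 9)
honest = pd_curve(black_box, X, 0, grid)    # steep: reveals reliance on x0
fooled = pd_curve(adversarial, X, 0, grid)  # much flatter: reliance hidden

# On the real data the two models agree almost everywhere.
agreement = np.mean(adversarial(X) == black_box(X))
```

Because real rows satisfy `x1 ≈ x0`, the out-of-distribution test almost never fires on genuine data, so `agreement` stays near 1.0; but forcing `x0` to a grid value far from each row's `x1` trips the test for most rows, so the fooled PD curve is far flatter than the honest one.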