AI Security Portal K Program
Why You Should Not Trust Interpretations in Machine Learning: Adversarial Attacks on Partial Dependence Plots
Abstract
The adoption of artificial intelligence (AI) across industries has led to the widespread use of complex black-box models, together with interpretation tools, for decision making. This paper proposes an adversarial framework that uncovers the vulnerability of permutation-based interpretation methods for machine learning, with a particular focus on partial dependence (PD) plots. The framework modifies the original black-box model so as to manipulate its predictions for instances in the extrapolation domain. As a result, it produces deceptive PD plots that conceal discriminatory behavior while preserving most of the original model's predictions; a single modified model can yield multiple fooled PD plots. Using real-world datasets, including an auto insurance claims dataset and the COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) dataset, our results show that it is possible to intentionally hide the discriminatory behavior of a predictor and make the black-box model appear neutral through interpretation tools such as PD plots, while retaining almost all of the original model's predictions. Managerial insights for regulators and practitioners are provided based on these findings.
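The attack described above exploits the fact that PD plots average predictions over points created by forcing one feature to a grid value while leaving the others untouched; when features are correlated, many of these points lie off the data distribution. The following is a minimal illustrative sketch (not the paper's actual method) of the idea: a wrapper model behaves like the discriminatory black box on-distribution, but returns a neutral constant on extrapolation points, flattening the PD curve while agreeing with the original model on virtually all real data. All names (`black_box`, `adversarial`, `pd_curve`) and the synthetic data are invented for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: the sensitive feature x0 is strongly correlated with x1.
n = 2000
x0 = rng.normal(size=n)
x1 = x0 + 0.1 * rng.normal(size=n)  # near-duplicate of x0
X = np.column_stack([x0, x1])

def black_box(X):
    """Discriminatory model: predictions driven entirely by the sensitive x0."""
    return 2.0 * X[:, 0]

def adversarial(X, tol=0.5):
    """Wrapper: matches black_box on-distribution, but outputs a neutral
    constant on extrapolation points where |x0 - x1| is large -- exactly
    the off-distribution points that PD's marginal substitution creates."""
    preds = black_box(X)
    ood = np.abs(X[:, 0] - X[:, 1]) > tol
    preds[ood] = 0.0  # neutral output off-distribution
    return preds

def pd_curve(model, X, feature, grid):
    """Partial dependence: average prediction with `feature` forced to each grid value."""
    curve = []
    for v in grid:
        Xv = X.copy()
        Xv[:, feature] = v
        curve.append(model(Xv).mean())
    return np.array(curve)

grid = np.linspace(-2, 2, 9)
honest = pd_curve(black_box, X, 0, grid)    # steep: reveals reliance on x0
fooled = pd_curve(adversarial, X, 0, grid)  # much flatter: reliance hidden

# On the real data the two models agree almost everywhere.
agreement = np.mean(adversarial(X) == black_box(X))
```

Because real rows satisfy `x1 ≈ x0`, the out-of-distribution test almost never fires on genuine data, so `agreement` stays near 1.0; but forcing `x0` to a grid value far from each row's `x1` trips the test for most rows, so the fooled PD curve is far flatter than the honest one.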