AI Security Portal K Program
Fooling SHAP with Output Shuffling Attacks
Abstract
Explainable AI (XAI) methods such as SHAP can help discover feature attributions in black-box models. If the method reveals a significant attribution from a "protected feature" (e.g., gender, race) on the model output, the model is considered unfair. However, adversarial attacks can subvert detection by XAI methods. Previous approaches to constructing such an adversarial model require access to the underlying data distribution, which may not be available in many practical scenarios. We relax this constraint and propose a novel family of attacks, called shuffling attacks, that are data-agnostic. The proposed attack strategies can adapt any trained machine learning model to fool Shapley value-based explanations. We prove that Shapley values cannot detect shuffling attacks. However, algorithms that estimate Shapley values, such as linear SHAP and SHAP, can detect these attacks with varying degrees of effectiveness. We demonstrate the efficacy of the attack strategies by comparing the performance of linear SHAP and SHAP on real-world datasets.
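To make the notion of a feature attribution concrete, the following is a minimal sketch of exact Shapley values for a toy linear scorer. It does not implement the shuffling attack or the authors' method. The names `exact_shapley`, `linear_model`, `weights`, and `background` are hypothetical, the third feature is treated as the "protected" one purely for illustration, and the coalition value is assumed to be the mean model output with the remaining features drawn from a background sample.

```python
from itertools import combinations
from math import factorial

import numpy as np


def exact_shapley(model, x, background):
    """Exact Shapley values for a single instance x.

    Assumed value function: the mean model output when the features in a
    coalition are fixed to x and all other features come from `background`.
    The cost is exponential in the number of features, so this is only
    practical for a handful of features.
    """
    n = len(x)

    def value(subset):
        data = background.copy()
        cols = list(subset)
        if cols:
            data[:, cols] = x[cols]
        return model(data).mean()

    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size that exclude feature i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                phi[i] += weight * (value(subset + (i,)) - value(subset))
    return phi


# Hypothetical toy scorer; feature index 2 plays the "protected" feature.
weights = np.array([2.0, -1.0, 0.5])


def linear_model(X):
    return X @ weights


rng = np.random.default_rng(0)
background = rng.normal(size=(50, 3))
x = np.array([1.0, 0.0, 1.0])

phi = exact_shapley(linear_model, x, background)
print("attributions:", phi)
print("protected-feature attribution:", phi[2])
```

In this setting a fairness audit would inspect `phi[2]`: a large magnitude signals that the protected feature drives the score. The attributions also sum to the difference between the model output at `x` and the mean output over the background sample.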
A comparison of regression models for prediction of graduate admissions
M. S. Acharya, A. Armaan, A. S. Antony
Published: 2019
Fooling SHAP with Stealthily Biased Sampling
U. Aïvodji, S. Hara, M. Marchand, F. Khomh
Published: 2022
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
A. B. Arrieta, N. Díaz-Rodríguez, J. Del Ser, A. Bennetot, S. Tabik, A. Barbado, S. García, S. Gil-Lopez, D. Molina, R. Benjamins
Published: 2020
Adversarial attacks and defenses in explainable artificial intelligence: A survey
Hubert Baniecki, Przemyslaw Biecek
Published: 2023.6.6
AI Fairness 360: An Extensible Toolkit for Detecting, Understanding, and Mitigating Unwanted Algorithmic Bias
R. K. E. Bellamy, K. Dey, M. Hind, S. C. Hoffman, S. Houde, K. Kannan, P. Lohia, J. Martino, S. Mehta, A. Mojsilovic, S. Nagar, K. N. Ramamurthy, J. Richards, D. Saha, P. Sattigeri, M. Singh, K. R. Varshney, Y. Zhang
Published: 2018
From Shapley values to generalized additive models and back
S. Bordt, U. von Luxburg
Published: 2023
You shouldn’t trust me: Learning models which conceal unfairness from multiple explanation methods
B. Dimanov, U. Bhatt, M. Jamnik, A. Weller
Published: 2020
Likelihood prediction of diabetes at early stage using data mining techniques
M. Islam, R. Ferdousi, S. Rahman, H. Y. Bushra
Published: 2020
Fool SHAP with Stealthily Biased Sampling
G. Laberge, U. Aïvodji, S. Hara, M. Marchand, F. Khomh
Published: 2023
A Unified Approach to Interpreting Model Predictions
Scott Lundberg, Su-In Lee
Published: 2017.5.23
Disguising attacks with explanation-aware backdoors
M. Noppel, L. Peter, C. Wressnegger
Published: 2023
"Why Should I Trust You?": Explaining the Predictions of Any Classifier
Marco Tulio Ribeiro, Sameer Singh, Carlos Guestrin
Published: 2016.2.16
Reliable post hoc explanations: Modeling uncertainty in explainability
D. Slack, A. Hilgard, S. Singh, H. Lakkaraju
Published: 2021
Fooling LIME and SHAP: Adversarial Attacks on Post hoc Explanation Methods
Dylan Slack, Sophie Hilgard, Emily Jia, Sameer Singh, Himabindu Lakkaraju
Published: 2019.11.7
A unified approach to quantifying algorithmic unfairness: Measuring individual & group unfairness via inequality indices
T. Speicher, H. Heidari, N. Grgic-Hlaca, K. P. Gummadi, A. Singla, A. Weller, M. B. Zafar
Published: 2018
Measuring Fairness in Ranked Outputs
K. Yang, J. Stoyanovich
Published: 2017
TRIVEA: transparent ranking interpretation using visual explanation of black-box algorithmic rankers
J. Yuan, K. Bhattacharjee, A. Z. Islam, A. Dasgupta
Published: 2023
A Human-in-the-loop Workflow for Multi-Factorial Sensitivity Analysis of Algorithmic Rankers
J. Yuan, A. Dasgupta
Published: 2023