Model Reconstruction Using Counterfactual Explanations: A Perspective From Polytope Theory
Abstract
Counterfactual explanations provide ways of achieving a favorable model outcome with minimum input perturbation. However, they can also be leveraged to reconstruct the model by strategically training a surrogate model to give predictions similar to those of the original (target) model. In this work, we analyze how model reconstruction using counterfactuals can be improved by further leveraging the fact that counterfactuals also lie quite close to the decision boundary. Our main contribution is to derive novel theoretical relationships between the error in model reconstruction and the number of counterfactual queries required, using polytope theory. This analysis leads us to propose a model reconstruction strategy that we call Counterfactual Clamping Attack (CCA), which trains a surrogate model using a unique loss function that treats counterfactuals differently from ordinary instances. Our approach also alleviates the related problem of decision-boundary shift that arises in existing model reconstruction approaches when counterfactuals are treated as ordinary instances. Experimental results demonstrate that our strategy improves fidelity between the target and surrogate model predictions on several datasets.
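To make the core idea concrete, the following is a minimal sketch (not the paper's exact formulation) of a "clamping"-style surrogate loss: ordinary queried instances are fit with standard binary cross-entropy, while counterfactuals, which lie just on the favorable side of the target's boundary, incur loss only when the surrogate's score falls below a threshold. The names `clamping_loss` and `tau` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def bce(p, y):
    # Standard binary cross-entropy for probabilities p and labels y.
    eps = 1e-12
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def clamping_loss(p, y, is_cf, tau=0.5):
    """Illustrative clamping-style loss (assumed form, not the paper's).

    Ordinary instances: usual BCE toward their observed label y.
    Counterfactuals (is_cf True): penalized toward the favorable
    class only while p < tau; once p >= tau the loss is clamped to
    zero, so training does not keep pushing the surrogate boundary
    past the counterfactuals (the decision-boundary-shift problem).
    """
    ordinary = bce(p, y)
    clamped = np.where(p >= tau, 0.0, bce(p, np.ones_like(p)))
    return np.where(is_cf, clamped, ordinary)
```

A counterfactual already scored above `tau` contributes zero loss, whereas treating it as an ordinary positive instance would keep pulling its score toward 1 and shift the learned boundary away from the target's.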