The adoption of artificial intelligence (AI) across industries has led to the
widespread use of complex black-box models and interpretation tools for
decision making. This paper proposes an adversarial framework to uncover the
vulnerability of permutation-based interpretation methods for machine learning
tasks, with a particular focus on partial dependence (PD) plots. This
adversarial framework modifies the original black-box model to manipulate its
predictions for instances in the extrapolation domain. As a result, it produces
deceptive PD plots that can conceal discriminatory behaviors while preserving
most of the original model's predictions. A single modified model can produce
multiple manipulated PD plots. Experiments on real-world datasets, including an
auto insurance claims dataset and the COMPAS (Correctional Offender Management
Profiling for Alternative Sanctions) dataset, show that it is
possible to intentionally hide the discriminatory behavior of a predictor and
make the black-box model appear neutral through interpretation tools like PD
plots while retaining almost all the predictions of the original black-box
model. Managerial insights for regulators and practitioners are provided based
on the findings.
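
The mechanism summarized above can be illustrated with a minimal toy sketch. It is not the paper's framework: the data, the `biased` and `adversarial` predictors, and the `in_extrapolation` rule are hypothetical and chosen only to show why a PD plot can be manipulated. PD is computed by overwriting one feature in every row and averaging the predictions, so it queries the model on feature combinations that rarely or never occur in the data. A model that changes its output only on those combinations can flatten the PD curve while leaving predictions on the observed data untouched.

```python
import numpy as np

def partial_dependence(predict, X, feature, grid):
    """Empirical PD: set column `feature` to each grid value in all rows,
    then average the model's predictions."""
    values = []
    for v in grid:
        X_mod = X.copy()
        X_mod[:, feature] = v
        values.append(predict(X_mod).mean())
    return np.array(values)

# Toy data (assumption): x0 is a binary sensitive feature and x1 is a
# proxy that stays within 0.2 of x0 on every observed row.
rng = np.random.default_rng(0)
x0 = rng.integers(0, 2, size=1000).astype(float)
x1 = x0 + rng.uniform(-0.2, 0.2, size=1000)
X = np.column_stack([x0, x1])

def biased(X):
    # Hypothetical discriminatory model: prediction depends on x0.
    return 0.2 + 0.6 * X[:, 0]

def in_extrapolation(X):
    # Hypothetical rule: flag rows where x1 is far from x0, which never
    # happens in the observed data but does happen inside the PD loop.
    return np.abs(X[:, 1] - X[:, 0]) > 0.5

def adversarial(X):
    # Same as `biased` on realistic rows; on flagged rows, answer as if
    # x0 had its opposite value, which cancels the effect in the average.
    out = biased(X)
    mask = in_extrapolation(X)
    out[mask] = 0.2 + 0.6 * (1.0 - X[mask, 0])
    return out

grid = np.array([0.0, 1.0])
pd_original = partial_dependence(biased, X, feature=0, grid=grid)
pd_fooled = partial_dependence(adversarial, X, feature=0, grid=grid)
same_on_data = np.array_equal(biased(X), adversarial(X))
```

In this toy setting `pd_original` shows the dependence on `x0`, `pd_fooled` is flat across the grid, and `same_on_data` is true because no observed row falls in the flagged region. Real data would not separate this cleanly, which is why the abstract speaks of preserving most, not all, of the original predictions.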