Model extraction from counterfactual explanations


arXiv

AI Security Portal bot

The information in the literature database is collected automatically.

Source

https://arxiv.org/abs/2009.01884

PDF

https://arxiv.org/pdf/2009.01884

Bibliographic information

Authors: Ulrich Aïvodji, Alexandre Bolot, Sébastien Gambs
Published: 2020-09-04
Affiliation: Université du Québec à Montréal
Country of affiliation: Canada
Venue: Computing Research Repository (CoRR)

Labels estimated by AI

Causal interpretation, Adversarial attack, Model extraction attack

* These labels were added automatically by AI and may therefore be inaccurate.
For details, see "About the literature database".

Abstract

Post-hoc explanation techniques refer to a posteriori methods that can be used to explain how black-box machine learning models produce their outcomes. Among post-hoc explanation techniques, counterfactual explanations are becoming one of the most popular methods to achieve this objective. In particular, in addition to highlighting the most important features used by the black-box model, they provide users with actionable explanations in the form of data instances that would have received a different outcome. Nonetheless, by doing so, they also leak non-trivial information about the model itself, which raises privacy issues. In this work, we demonstrate how an adversary can leverage the information provided by counterfactual explanations to build high-fidelity and high-accuracy model extraction attacks. More precisely, our attack enables the adversary to build a faithful copy of a target model by accessing its counterfactual explanations. The empirical evaluation of the proposed attack on black-box models trained on real-world datasets demonstrates that they can achieve high-fidelity and high-accuracy extraction even under low query budgets.
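
The attack described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the linear target model (`W`, `B`), the projection-based explainer `target_counterfactual`, the 1-nearest-neighbour surrogate, and the query budget of 20 are all assumptions chosen to keep the example dependency-free. It shows only the general idea that each query yields two labelled points, the query itself and a counterfactual lying just across the decision boundary, and that a surrogate trained on both can be scored by its fidelity (agreement with the target).

```python
import math
import random

# Hypothetical target: a linear classifier the attacker cannot inspect.
W = (1.0, -2.0)
B = 0.5

def score(x):
    return W[0] * x[0] + W[1] * x[1] + B

def target_predict(x):
    return 1 if score(x) > 0 else 0

def target_counterfactual(x, overshoot=0.05):
    """Toy explainer (assumption): project x onto the decision boundary
    and step slightly past it, so the returned point gets the other label."""
    step = (1.0 + overshoot) * score(x) / (W[0] ** 2 + W[1] ** 2)
    return (x[0] - step * W[0], x[1] - step * W[1])

def extract(queries):
    """Attacker's training set: each query and its counterfactual, with labels."""
    data = []
    for x in queries:
        y = target_predict(x)
        data.append((x, y))
        data.append((target_counterfactual(x), 1 - y))
    return data

def surrogate_predict(data, x):
    """1-nearest-neighbour surrogate over the attacker's training set."""
    nearest = min(data, key=lambda item: math.dist(item[0], x))
    return nearest[1]

def fidelity(data, points):
    """Fraction of points on which the surrogate agrees with the target."""
    agree = sum(surrogate_predict(data, p) == target_predict(p) for p in points)
    return agree / len(points)

rng = random.Random(0)

def sample(n):
    return [(rng.uniform(-3, 3), rng.uniform(-3, 3)) for _ in range(n)]

queries = sample(20)          # low query budget
test_points = sample(2000)    # held-out points for measuring fidelity
attack_data = extract(queries)
baseline_data = [(x, target_predict(x)) for x in queries]

print("fidelity with counterfactuals:", fidelity(attack_data, test_points))
print("fidelity with predictions only:", fidelity(baseline_data, test_points))
```

The two printed values compare a surrogate built from predictions plus counterfactuals against one built from predictions alone under the same query budget. The numbers from this toy setup say nothing about the results reported in the paper.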