Detecting Adversarial Examples Is (Nearly) As Hard As Classifying Them

Authors: Florian Tramèr
Published: 2021-07-24 | Updated: 2022-06-16

Source: https://arxiv.org/abs/2107.11630

Labels Predicted by AI

Defense Mechanism, High Difficulty Sample, Role of Machine Learning

Please note that these labels were automatically added by AI. Therefore, they may not be entirely accurate.

Abstract

Making classifiers robust to adversarial examples is hard. Thus, many defenses tackle the seemingly easier task of detecting perturbed inputs. We show a barrier towards this goal. We prove a general hardness reduction between detection and classification of adversarial examples: given a robust detector for attacks at distance ε (in some metric), we can build a similarly robust (but inefficient) classifier for attacks at distance ε/2. Our reduction is computationally inefficient, and thus cannot be used to build practical classifiers. Instead, it is a useful sanity check to test whether empirical detection results imply something much stronger than the authors presumably anticipated. To illustrate, we revisit 13 detector defenses. For 11/13 cases, we show that the claimed detection results would imply an inefficient classifier with robustness far beyond the state-of-the-art.
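
The abstract does not spell out the construction, so the following is only a minimal sketch of how a detector-to-classifier reduction of this kind could look on a toy problem. It assumes the classifier brute-force searches the ball of radius ε/2 around the input for any point the detector accepts and returns that point's label. The names `classifier_from_detector` and `toy_detector`, the 1-D integer input grid, and the fallback label are all hypothetical choices made for illustration, not taken from the paper.

```python
from typing import Callable, Iterable, Optional

Label = int
# A detector either returns a label or rejects the input (None).
Detector = Callable[[int], Optional[Label]]


def classifier_from_detector(
    detect: Detector,
    candidates: Iterable[int],
    dist: Callable[[int, int], float],
    eps: float,
    fallback: Label = 0,
) -> Callable[[int], Label]:
    """Build a classifier by exhaustive search (assumed construction).

    For an input x, scan every candidate within distance eps/2 of x and
    return the label of the first candidate the detector does not reject.
    The exhaustive scan is what makes the construction inefficient.
    """
    pool = list(candidates)

    def classify(x: int) -> Label:
        for x2 in pool:
            if dist(x, x2) <= eps / 2:
                label = detect(x2)
                if label is not None:
                    return label
        return fallback  # arbitrary output when every nearby point is rejected

    return classify


def toy_detector(x: int) -> Optional[Label]:
    """Hypothetical detector on integers: rejects points near the boundary."""
    if x <= -3:
        return 0
    if x >= 3:
        return 1
    return None  # reject


# Toy setting: clean inputs at -5 (label 0) and 5 (label 1), eps = 4.
# Within distance 4 of each clean input the detector is either correct
# or rejects, so it is a robust detector at distance eps on this data.
EPS = 4
classify = classifier_from_detector(
    toy_detector,
    candidates=range(-10, 11),
    dist=lambda a, b: abs(a - b),
    eps=EPS,
)

# Every perturbation of a clean input within eps/2 = 2 keeps its label.
robust_at_half_eps = all(classify(x) == 0 for x in range(-7, -2)) and all(
    classify(x) == 1 for x in range(3, 8)
)
```

In this toy setting, any point within ε/2 of a perturbed input is within ε of the clean input by the triangle inequality, so the detector can only accept it with the correct label. The clean input itself lies in the search ball, so the search always finds at least one accepted point.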