These labels were automatically added by AI and may be inaccurate. For details, see About Literature Database.
Abstract
Adversarial examples threaten the integrity of machine learning systems with
alarming success rates even under constrained black-box conditions. Stateful
defenses have emerged as an effective countermeasure, detecting potential
attacks by maintaining a buffer of recent queries and detecting new queries
that are too similar. However, these defenses fundamentally pose a trade-off
between attack detection and false positive rates, and this trade-off is
typically optimized by hand-picking feature extractors and similarity
thresholds that empirically work well. There is little current understanding as
to the formal limits of this trade-off and the exact properties of the feature
extractors/underlying problem domain that influence it. This work aims to
address this gap by offering a theoretical characterization of the trade-off
between detection and false positive rates for stateful defenses. We provide
upper bounds for detection rates of a general class of feature extractors and
analyze the impact of this trade-off on the convergence of black-box attacks.
We then support our theoretical findings with empirical evaluations across
multiple datasets and stateful defenses.