Adversarial examples threaten the integrity of machine learning systems with
alarming success rates even under constrained black-box conditions. Stateful
defenses have emerged as an effective countermeasure, detecting potential
attacks by maintaining a buffer of recent queries and detecting new queries
that are too similar. However, these defenses fundamentally pose a trade-off
between attack detection and false positive rates, and this trade-off is
typically optimized by hand-picking feature extractors and similarity
thresholds that empirically work well. There is little current understanding as
to the formal limits of this trade-off and the exact properties of the feature
extractors/underlying problem domain that influence it. This work aims to
address this gap by offering a theoretical characterization of the trade-off
between detection and false positive rates for stateful defenses. We provide
upper bounds for detection rates of a general class of feature extractors and
analyze the impact of this trade-off on the convergence of black-box attacks.
We then support our theoretical findings with empirical evaluations across
multiple datasets and stateful defenses.