Deep neural networks (DNNs) are inherently susceptible to adversarial attacks
even under black-box settings, in which the adversary only has query access to
the target models. In practice, while it may be possible to detect such attacks
effectively (e.g., by observing a large number of similar but non-identical
queries), it is often challenging to infer the adversary's exact intent (e.g.,
the target class of the adversarial example the adversary attempts to craft),
especially during the early stages of an attack. Yet in many scenarios, knowing
this intent is crucial for effective deterrence and remediation of the threat.
In this paper, we present AdvMind, a new class of estimation models that
infer the adversary's intent in black-box adversarial attacks in a robust and
prompt manner. Specifically, to achieve robust detection, AdvMind accounts for
the adversary's adaptiveness, so that any attempt to conceal her target
significantly increases the attack cost (e.g., in terms of the number of
queries); to achieve prompt detection, AdvMind proactively synthesizes
plausible query results to solicit subsequent queries from the adversary that
maximally expose her intent. Through extensive empirical evaluation on
benchmark datasets and state-of-the-art black-box attacks, we demonstrate that
on average, AdvMind detects the adversary's intent with over 75% accuracy after
observing fewer than three query batches, while increasing the cost of
adaptive attacks by over 60%. We further discuss the possible synergy between
AdvMind and other defense methods against black-box adversarial attacks,
pointing to several promising research directions.