Recent studies have shown that adversarial examples in state-of-the-art image
classifiers trained by deep neural networks (DNN) can be easily generated when
the target model is transparent to an attacker, known as the white-box setting.
However, when attacking a deployed machine learning service, one can only
acquire the input-output correspondences of the target model; this is the
so-called black-box attack setting. The major drawback of existing black-box
attacks is the need for excessive model queries, which may give a false sense
of model robustness due to inefficient query designs. To bridge this gap, we
propose a generic framework for query-efficient black-box attacks. Our
framework, AutoZOOM, which is short for Autoencoder-based Zeroth Order
Optimization Method, has two novel building blocks towards efficient black-box
attacks: (i) an adaptive random gradient estimation strategy to balance query
counts and distortion, and (ii) an autoencoder that is either trained offline
with unlabeled data or a bilinear resizing operation for attack acceleration.
Experimental results suggest that, by applying AutoZOOM to a state-of-the-art
black-box attack (ZOO), a significant reduction in model queries can be
achieved without sacrificing the attack success rate and the visual quality of
the resulting adversarial examples. In particular, when compared to the
standard ZOO method, AutoZOOM can consistently reduce the mean query counts in
finding successful adversarial examples (or reaching the same distortion level)
by at least 93% on MNIST, CIFAR-10 and ImageNet datasets, leading to novel
insights on adversarial robustness.