In a model extraction attack, an adversary steals a copy of a remotely
deployed machine learning model, given oracle prediction access. We taxonomize
model extraction attacks around two objectives: *accuracy*, i.e., performing
well on the underlying learning task, and *fidelity*, i.e., matching the
predictions of the remote victim classifier on any input.
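
As a minimal sketch of how the two objectives differ, both can be estimated as agreement rates on a finite sample of inputs. The function names and the toy label arrays below are illustrative assumptions, not taken from the paper, and a finite sample can only approximate agreement "on any input".

```python
import numpy as np

def accuracy(extracted_preds, true_labels):
    """Fraction of inputs where the extracted model predicts the true label."""
    return float(np.mean(np.asarray(extracted_preds) == np.asarray(true_labels)))

def fidelity(extracted_preds, victim_preds):
    """Fraction of inputs where the extracted model agrees with the victim."""
    return float(np.mean(np.asarray(extracted_preds) == np.asarray(victim_preds)))

# Toy labels, made up purely for illustration.
true_labels = np.array([0, 1, 1, 0, 1])
victim_preds = np.array([0, 1, 0, 0, 1])     # the victim errs on index 2
extracted_preds = np.array([0, 1, 1, 0, 1])  # the extracted model is correct everywhere

acc = accuracy(extracted_preds, true_labels)   # agreement with ground truth
fid = fidelity(extracted_preds, victim_preds)  # agreement with the victim
```

In this toy case the extracted model has perfect accuracy but imperfect fidelity, because it does not reproduce the victim's mistake. The two objectives can therefore pull in different directions.
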
To extract a high-accuracy model, we develop a learning-based attack
that exploits the victim to supervise the training of an extracted model. Through
analytical and empirical arguments, we then explain the inherent limitations
that prevent any learning-based strategy from extracting a truly high-fidelity
model, that is, a functionally equivalent model whose predictions are
identical to those of the victim model on all possible inputs. Addressing these
limitations, we expand on prior work to develop the first practical
functionally equivalent extraction attack for direct extraction (i.e., without
training) of a model's weights.
We perform experiments on both academic datasets and a state-of-the-art image
classifier trained with 1 billion proprietary images. In addition to broadening
the scope of model extraction research, our work demonstrates the practicality
of model extraction attacks against production-grade systems.