Deep Neural Networks (DNNs) are often criticized for being susceptible to
adversarial attacks. Most successful defense strategies adopt adversarial
training or random input transformations that typically require retraining or
fine-tuning the model to achieve reasonable performance. In this work, our
investigation of the intermediate representations of a pre-trained DNN reveals
an intrinsic robustness to adversarial attacks.
We find that we can learn a generative classifier by statistically
characterizing the neural response of an intermediate layer to clean training
samples. The predictions of multiple such intermediate-layer-based classifiers,
when aggregated, show unexpected robustness to adversarial attacks.
Specifically, we devise an ensemble of these generative classifiers that
rank-aggregates their predictions via a Borda count-based consensus. Our
proposed approach uses only a subset of the clean training data and a pre-trained
model, and is agnostic to both the network architecture and the adversarial
attack generation method. Extensive experiments show that our defense
strategy achieves state-of-the-art performance on the ImageNet validation set.
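
The pipeline described above (per-layer generative classifiers whose rankings
are merged by a Borda count) can be illustrated with a minimal sketch. Several
things here are assumptions made for illustration and are not taken from the
text: each layer's response is modeled as a class-conditional Gaussian with a
shared diagonal variance, layer features are assumed to be already extracted
and flattened into vectors, and all function names (fit_gaussian_classifier,
class_scores, borda_aggregate, predict) are hypothetical.

```python
import numpy as np


def fit_gaussian_classifier(features, labels, num_classes, eps=1e-3):
    """Fit one generative classifier to the clean features of a single layer.

    Assumption: class-conditional Gaussians with a shared diagonal variance.
    features: (n_samples, dim) array, labels: (n_samples,) integer array.
    """
    means = np.stack(
        [features[labels == c].mean(axis=0) for c in range(num_classes)]
    )
    centered = features - means[labels]      # subtract each sample's class mean
    var = centered.var(axis=0) + eps         # eps avoids division by zero
    return means, var


def class_scores(x, means, var):
    """Per-class Gaussian log-likelihood of x, up to an additive constant."""
    return -0.5 * (((x - means) ** 2) / var).sum(axis=1)


def borda_aggregate(score_lists):
    """Borda count: a class ranked r-th (0 = best) by one classifier earns
    (num_classes - 1 - r) points; points are summed over all classifiers."""
    num_classes = len(score_lists[0])
    points = np.zeros(num_classes)
    for scores in score_lists:
        order = np.argsort(-np.asarray(scores), kind="stable")  # best first
        points[order] += np.arange(num_classes - 1, -1, -1)
    return points


def predict(layer_features, classifiers):
    """Rank classes with every layer's classifier, then take the Borda winner."""
    scores = [
        class_scores(x, means, var)
        for x, (means, var) in zip(layer_features, classifiers)
    ]
    return int(np.argmax(borda_aggregate(scores)))


if __name__ == "__main__":
    # Toy stand-in for two intermediate layers and two classes.
    labels = np.array([0, 0, 1, 1])
    layer_a = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]])
    layer_b = np.array([[1.0], [3.0], [-5.0], [-7.0]])
    classifiers = [
        fit_gaussian_classifier(layer_a, labels, num_classes=2),
        fit_gaussian_classifier(layer_b, labels, num_classes=2),
    ]
    test_input = [np.array([9.0, 11.0]), np.array([-6.0])]
    print("predicted class:", predict(test_input, classifiers))
```

Because the Borda count uses only each classifier's ranking of the classes
rather than its raw scores, classifiers built on layers with very different
feature scales can be combined without any calibration step.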