Abstract
Deep Neural Networks (DNNs) are often criticized for being susceptible to
adversarial attacks. Most successful defense strategies adopt adversarial
training or random input transformations that typically require retraining or
fine-tuning the model to achieve reasonable performance. In this work, our
investigations of intermediate representations of a pre-trained DNN lead to an
interesting discovery pointing to intrinsic robustness to adversarial attacks.
We find that we can learn a generative classifier by statistically
characterizing the neural response of an intermediate layer to clean training
samples. The predictions of multiple such intermediate-layer based classifiers,
when aggregated, show unexpected robustness to adversarial attacks.
Specifically, we devise an ensemble of these generative classifiers that
rank-aggregates their predictions via a Borda count-based consensus. Our
proposed approach uses only a subset of the clean training data and a pre-trained
model, yet is agnostic to both the network architecture and the adversarial
attack generation method. Extensive experiments show that our defense
strategy achieves state-of-the-art performance on the ImageNet validation set.
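
To illustrate the rank-aggregation step, the following is a minimal sketch of a Borda count over the outputs of several classifiers. It is not the authors' implementation: the abstract does not state the exact scoring or tie-breaking rule, so the sketch assumes the classic Borda scheme (the top-ranked of C classes earns C - 1 points, the last earns 0) and breaks ties by lower class index. The names `borda_aggregate` and `layer_scores`, and the toy scores, are hypothetical.

```python
from typing import List, Sequence


def borda_aggregate(score_lists: Sequence[Sequence[float]]) -> List[int]:
    """Return class indices ordered from most to least preferred by Borda count.

    score_lists[k][c] is the score that (hypothetical) classifier k assigns
    to class c; a higher score means the class is considered more likely.
    """
    num_classes = len(score_lists[0])
    points = [0] * num_classes
    for scores in score_lists:
        if len(scores) != num_classes:
            raise ValueError("all classifiers must score the same classes")
        # Rank classes for this classifier, best first.
        # Assumption: ties are broken in favour of the lower class index.
        ranking = sorted(range(num_classes), key=lambda c: (-scores[c], c))
        # The best class earns num_classes - 1 points, the worst earns 0.
        for position, c in enumerate(ranking):
            points[c] += num_classes - 1 - position
    # Consensus ranking: most total points first, same tie-breaking rule.
    return sorted(range(num_classes), key=lambda c: (-points[c], c))


# Three hypothetical intermediate-layer classifiers scoring three classes.
layer_scores = [
    [0.6, 0.3, 0.1],  # ranks class 0 > 1 > 2
    [0.2, 0.5, 0.3],  # ranks class 1 > 2 > 0
    [0.1, 0.5, 0.4],  # ranks class 1 > 2 > 0
]
consensus = borda_aggregate(layer_scores)
print(consensus[0])  # predicted class: 1
```

In this toy example class 1 collects 5 points while classes 0 and 2 collect 2 each, so the consensus prediction is class 1 even though the first classifier preferred class 0. Because only ranks are aggregated, a single classifier producing an extreme score cannot dominate the outcome.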