Neural networks are known to be vulnerable to adversarial attacks -- slight
but carefully constructed perturbations of the inputs that can drastically
impair the network's performance. Many defense methods have been proposed to
improve the robustness of deep networks by training them on adversarially
perturbed inputs. However, these models often remain vulnerable to new types of
attacks not seen during training, and even to slightly stronger versions of
previously seen attacks. In this work, we propose a novel approach to
adversarial robustness that builds on insights from the field of domain
adaptation. Our method, called Adversarial Feature Desensitization (AFD),
aims to learn features that are invariant to adversarial perturbations of
the inputs. This is achieved through a game in which we learn features that
are both predictive and robust (insensitive to adversarial attacks), i.e.,
features that cannot be used to discriminate between natural and
adversarial data. Empirical results on
several benchmarks demonstrate the effectiveness of the proposed approach
against a wide range of attack types and attack strengths. Our code is
available at https://github.com/BashivanLab/afd.
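
Conceptually, the game described above can be sketched as a domain-adversarial training loop: a discriminator learns to tell natural from adversarial features, while the feature extractor learns to stay predictive and to fool that discriminator. The PyTorch-style sketch below is illustrative only; the module names (feature_net, task_head, domain_disc), the attack helper make_adversarial, and the unweighted loss sum are placeholder assumptions, not the exact architecture or objective used in the paper or repository.

```python
import torch
import torch.nn.functional as F

def afd_step(feature_net, task_head, domain_disc, opt_feat, opt_disc,
             x_nat, y, make_adversarial):
    """One illustrative training step of the feature-desensitization game.

    opt_feat updates feature_net and task_head; opt_disc updates domain_disc.
    make_adversarial is a placeholder for any attack (e.g., PGD) run against
    the current model.
    """
    # Craft adversarial examples for the current model.
    x_adv = make_adversarial(feature_net, task_head, x_nat, y)

    # --- Discriminator update: distinguish natural from adversarial features.
    with torch.no_grad():
        f_nat = feature_net(x_nat)
        f_adv = feature_net(x_adv)
    d_nat = domain_disc(f_nat)   # logits for "natural"
    d_adv = domain_disc(f_adv)   # logits for "adversarial"
    disc_loss = (
        F.binary_cross_entropy_with_logits(d_nat, torch.ones_like(d_nat))
        + F.binary_cross_entropy_with_logits(d_adv, torch.zeros_like(d_adv))
    )
    opt_disc.zero_grad()
    disc_loss.backward()
    opt_disc.step()

    # --- Feature/classifier update: remain predictive on both natural and
    #     adversarial inputs while fooling the discriminator, pushing the
    #     features to become insensitive to the perturbation.
    f_nat = feature_net(x_nat)
    f_adv = feature_net(x_adv)
    task_loss = (
        F.cross_entropy(task_head(f_nat), y)
        + F.cross_entropy(task_head(f_adv), y)
    )
    fool_logits = domain_disc(f_adv)
    fool_loss = F.binary_cross_entropy_with_logits(
        fool_logits, torch.ones_like(fool_logits)
    )
    opt_feat.zero_grad()
    (task_loss + fool_loss).backward()
    opt_feat.step()
```

In this sketch the two optimizers implement the two sides of the game: the discriminator improves at separating the feature distributions of natural and adversarial inputs, and the feature extractor is penalized whenever such separation is possible, which is the desensitization objective the abstract refers to.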