Adversarial training is an effective methodology for training deep neural
networks that are robust against adversarial, norm-bounded perturbations.
However, the computational cost of adversarial training grows prohibitively as
the size of the model and number of input dimensions increase. Further,
training against less expensive, and therefore weaker, adversaries produces
models that are robust against weak attacks but break down under stronger
ones. This is often attributed to the phenomenon of gradient
obfuscation; such models have a highly non-linear loss surface in the vicinity
of training examples, making it hard for gradient-based attacks to succeed even
though adversarial examples still exist. In this work, we introduce a novel
regularizer that encourages the loss to behave linearly in the vicinity of the
training data, thereby penalizing gradient obfuscation while encouraging
robustness. We show via extensive experiments on CIFAR-10 and ImageNet that
models trained with our regularizer avoid gradient obfuscation and can be
trained significantly faster than adversarial training. Using this regularizer,
we exceed the current state of the art and achieve 47% adversarial accuracy for
ImageNet with l-infinity adversarial perturbations of radius 4/255 under an
untargeted, strong, white-box attack. Additionally, we match state-of-the-art
results for CIFAR-10 at a perturbation radius of 8/255.
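
To make the core idea concrete, the following is a minimal PyTorch sketch of a
local-linearity penalty in the spirit described above: it measures how far the
loss at a perturbed input deviates from its first-order (linear) approximation
around the clean input, and that deviation can be added to the training loss.
The function name, the default epsilon, and the single random inner
perturbation are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def local_linearity_penalty(model, x, y, epsilon=4 / 255):
    """Deviation of the loss from its linear approximation near x.

    A large value indicates a highly non-linear loss surface around x,
    the signature of gradient obfuscation that the regularizer penalizes.
    """
    # Clean loss and its gradient w.r.t. the input; create_graph=True
    # lets the penalty itself be differentiated during training.
    x = x.detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad = torch.autograd.grad(loss, x, create_graph=True)[0]

    # A perturbation inside the l-infinity ball of radius epsilon.
    # A single random draw keeps this sketch short; a stronger variant
    # would search for the worst-case perturbation in the ball.
    delta = torch.empty_like(x).uniform_(-epsilon, epsilon)

    # |loss(x + delta) - (loss(x) + delta . grad)|: how far the true
    # loss departs from the first-order Taylor expansion.
    loss_pert = F.cross_entropy(model(x + delta), y)
    linear_approx = loss + (delta * grad).sum()
    return (loss_pert - linear_approx).abs()
```

In training, such a penalty would be added to the clean loss with a weighting
coefficient. Since it requires only one gradient and one extra forward pass
per draw, rather than a multi-step attack at every iteration, it suggests one
way a regularizer of this form can be cheaper than full adversarial training.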