Deep networks are well-known to be fragile to adversarial attacks. We conduct
an empirical analysis of deep representations under the state-of-the-art attack
method called PGD, and find that the attack causes the internal representation
to shift closer to the "false" class. Motivated by this observation, we propose
to regularize the representation space under attack with metric learning to
produce more robust classifiers. By carefully sampling examples for metric
learning, our learned representation not only increases robustness, but also
detects previously unseen adversarial samples. Quantitative experiments show
improvement of robustness accuracy by up to 4% and detection efficiency by up
to 6% according to Area Under Curve score over prior work. The code of our work
is available at
https://github.com/columbia/Metric_Learning_Adversarial_Robustness.