This paper proposes a classification framework with a rejection option to
mitigate the performance deterioration caused by adversarial examples. While
recent machine learning algorithms achieve high prediction performance, they
are empirically vulnerable to adversarial examples, which are slightly
perturbed data samples that are wrongly classified. In real-world applications,
adversarial attacks using such adversarial examples could cause serious
problems. To address this, various methods have been proposed to obtain a classifier that
is robust against adversarial examples. Adversarial training is one of them,
which trains a classifier to minimize the worst-case loss under adversarial
attacks. To obtain a classifier that is more reliable under adversarial
attacks, we propose Adversarial Training with a Rejection Option (ATRO). ATRO
applies the adversarial training objective to a classifier and a rejection
function simultaneously, so that the trained classifier can abstain from
classification when it has insufficient confidence in its prediction for a
test data point. We examine the feasibility of the framework
using the surrogate maximum hinge loss and establish a generalization bound for
linear models. Furthermore, we empirically confirm the effectiveness of ATRO
using various models and real-world datasets.
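
To illustrate how a classifier and a rejection function can be trained against worst-case perturbations, the following is a minimal sketch for linear models. It is not the authors' implementation. It assumes one form of the maximum hinge loss from the learning-with-rejection literature, max(1 + (alpha/2)(r(x) - y h(x)), c(1 - beta r(x)), 0), with a linear classifier h(x) = w.x + b, a linear rejection function r(x) = u.x + a, labels in {-1, +1}, and an l-infinity perturbation budget eps. All names (w, b, u, a, c, alpha, beta, eps) are hypothetical and chosen for illustration. Under these assumptions the worst case over the perturbation has a closed form, because each term inside the max is linear in the perturbation.

```python
import numpy as np


def max_hinge_loss(w, b, u, a, x, y, c=0.3, alpha=1.0, beta=None):
    """Maximum hinge loss (assumed form) on clean inputs.

    x: (n, d) inputs, y: (n,) labels in {-1, +1}, c: rejection cost in (0, 0.5).
    """
    if beta is None:
        beta = 1.0 / (1.0 - 2.0 * c)  # assumed default; requires c < 0.5
    h = x @ w + b                      # classifier score
    r = x @ u + a                      # rejection score (r <= 0 means reject)
    t1 = 1.0 + 0.5 * alpha * (r - y * h)
    t2 = c * (1.0 - beta * r)
    return np.maximum(np.maximum(t1, t2), 0.0)


def worst_case_max_hinge_loss(w, b, u, a, x, y, eps,
                              c=0.3, alpha=1.0, beta=None):
    """Same loss, maximized over perturbations with ||delta||_inf <= eps.

    Each term is linear in delta, so its maximum adds eps times the l1 norm
    of the corresponding coefficient vector.
    """
    if beta is None:
        beta = 1.0 / (1.0 - 2.0 * c)
    h = x @ w + b
    r = x @ u + a
    # Coefficient of delta in (r - y * h) is (u - y * w), one row per sample.
    coef = u[None, :] - y[:, None] * w[None, :]
    t1 = 1.0 + 0.5 * alpha * (r - y * h + eps * np.abs(coef).sum(axis=1))
    t2 = c * (1.0 - beta * (r - eps * np.abs(u).sum()))
    return np.maximum(np.maximum(t1, t2), 0.0)


def predict_with_rejection(w, b, u, a, x):
    """Return +1 or -1 where r(x) > 0, and 0 (abstain) otherwise."""
    h = x @ w + b
    r = x @ u + a
    labels = np.where(h >= 0, 1, -1)
    return np.where(r > 0, labels, 0)
```

Minimizing the mean of `worst_case_max_hinge_loss` over (w, b, u, a) would be one way to train both functions jointly in this simplified setting. For nonlinear models the inner maximization generally has no closed form and would need an approximate attack instead.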