Adversarial classification is the task of performing robust classification in
the presence of a strategic attacker. Originating from information hiding and
multimedia forensics, adversarial classification has recently received
considerable attention in a broader security context. In the domain of machine
learning-based image classification, adversarial classification can be
interpreted as detecting so-called adversarial examples: slightly altered
versions of benign images that are specifically crafted to be misclassified
with high probability by the classifier under attack.
Neural networks, which dominate among modern image classifiers, have been shown
to be especially vulnerable to these adversarial examples.
However, detecting subtle changes in digital images has always been the goal
of multimedia forensics and steganalysis. In this paper, we highlight the
parallels between these two fields and secure machine learning.
Furthermore, we adapt a linear filter, similar to early steganalysis methods,
to detect adversarial examples generated with the projected gradient descent
(PGD) method, the state-of-the-art algorithm for crafting such examples. We test
our method on the MNIST database and show that, for several PGD parameter
combinations, it reliably detects adversarial examples.
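The abstract does not spell out the attack configuration or the filter itself; the following is a minimal sketch, assuming a differentiable PyTorch classifier, an L_inf-bounded PGD attack, and a common 3x3 high-pass residual kernel from early steganalysis standing in for the linear filter. The detection threshold would have to be calibrated on benign MNIST images, and none of these choices are claimed to match the paper's exact setup.

```python
import numpy as np
import torch
import torch.nn.functional as F
from scipy.signal import convolve2d

def pgd_attack(model, x, y, eps=0.3, alpha=0.01, steps=40):
    """L_inf PGD: repeated gradient-sign steps, projected back onto the
    eps-ball around the benign input x (pixel values kept in [0, 1])."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)
        x_adv = x_adv.clamp(0.0, 1.0)
    return x_adv.detach()

# A 3x3 high-pass kernel of the kind used to compute residuals in early
# steganalysis (assumed here; the paper's actual filter may differ).
HP_KERNEL = np.array([[-1.,  2., -1.],
                      [ 2., -4.,  2.],
                      [-1.,  2., -1.]])

def residual_statistic(img):
    """Mean absolute high-pass residual of a single 2-D image in [0, 1]."""
    return np.abs(convolve2d(img, HP_KERNEL, mode="valid")).mean()

def is_adversarial(img, threshold):
    """Flag an image whose residual energy exceeds a threshold calibrated
    on benign MNIST images (e.g. a high percentile of their statistics)."""
    return residual_statistic(img) > threshold
```

The intuition behind such a detector is that an L_inf-bounded PGD perturbation behaves like high-frequency noise spread over the otherwise smooth MNIST digits, so a simple linear high-pass residual can separate perturbed from benign images once a suitable threshold is chosen.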
Additionally, the combination of adversarial re-training and our detection
method effectively reduces the attack surface of neural networks. Thus, we
conclude that adversarial examples for image classification may not withstand
detection methods from steganalysis, and future work
should explore the effectiveness of known techniques from multimedia forensics
in other adversarial settings.