Adversarial attacks on machine learning-based classifiers, along with defense
mechanisms, have been widely studied in the context of single-label
classification problems. In this paper, we shift attention to multi-label
classification, where the availability of domain knowledge on the relationships
among the considered classes may offer a natural way to spot incoherent
predictions, i.e., predictions associated with adversarial examples lying
outside the training data distribution. We explore this intuition in a framework in
which first-order logic knowledge is converted into constraints and injected
into a semi-supervised learning problem. Within this setting, the constrained
classifier learns to fulfill the domain knowledge over the marginal
distribution, and can naturally reject samples with incoherent predictions.
Even though our method does not exploit any knowledge of attacks during
training, our experimental analysis surprisingly reveals that domain-knowledge
constraints can help detect adversarial examples effectively, especially if
such constraints are not known to the attacker.
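
To illustrate the rejection idea, the following is a minimal sketch (not the paper's actual code) of how logic constraints over the labels can flag incoherent multi-label predictions. The label set, the implication constraints, their product t-norm relaxation, and the rejection threshold are all illustrative assumptions.

```python
# Hypothetical example: constraint-based rejection of incoherent multi-label
# predictions, which are candidate adversarial examples.
import numpy as np

CLASSES = ["cat", "dog", "animal"]  # assumed label set for illustration


def constraint_violation(p):
    """Degree to which per-class probabilities p violate the assumed
    implications cat -> animal and dog -> animal (product t-norm relaxation)."""
    cat, dog, animal = p
    return max(cat * (1.0 - animal), dog * (1.0 - animal))


def reject_incoherent(batch, threshold=0.5):
    """Flag samples whose predictions are incoherent with the domain knowledge."""
    return [constraint_violation(p) > threshold for p in batch]


if __name__ == "__main__":
    preds = np.array([
        [0.9, 0.1, 0.95],  # coherent: "cat" predicted together with "animal"
        [0.9, 0.1, 0.05],  # incoherent: "cat" predicted without "animal"
    ])
    print(reject_incoherent(preds))  # [False, True]
```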