Adversarial examples (AEs) are crafted by adding human-imperceptible
perturbations to inputs such that a machine-learning-based classifier
incorrectly labels them. They have become a severe threat to the
trustworthiness of machine learning. While AEs in the image domain have been
well studied, audio AEs are less investigated. Recently, multiple techniques
have been proposed to generate audio AEs, which makes countermeasures against them an
urgent task. Our experiments show that, given an AE, the transcriptions
produced by different Automatic Speech Recognition (ASR) systems differ significantly,
as they use different architectures, parameters, and training datasets.
Inspired by Multiversion Programming, we propose a novel audio AE detection
approach, which utilizes multiple off-the-shelf ASR systems to determine
whether an audio input is an AE. Our evaluation shows that the approach achieves
detection accuracies above 98.6%.
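
The detection idea can be illustrated with a minimal sketch. It is not the implementation evaluated here: the similarity measure (Python's `difflib.SequenceMatcher`), the fixed `threshold` of 0.8, and the names `Transcriber`, `pairwise_similarities`, and `is_adversarial` are all assumptions made for illustration. Each ASR system is modeled as a hypothetical callable that maps raw audio to a transcription; an input is flagged when any two systems disagree too strongly.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import Callable, List, Sequence

# Hypothetical interface: an ASR system takes raw audio bytes and returns text.
Transcriber = Callable[[bytes], str]


def pairwise_similarities(transcripts: Sequence[str]) -> List[float]:
    """Similarity in [0, 1] for every pair of transcriptions."""
    return [
        SequenceMatcher(None, a.lower(), b.lower()).ratio()
        for a, b in combinations(transcripts, 2)
    ]


def is_adversarial(
    audio: bytes,
    asr_systems: Sequence[Transcriber],
    threshold: float = 0.8,  # assumed value, not taken from the evaluation
) -> bool:
    """Flag the input if any pair of ASR systems disagrees strongly."""
    if len(asr_systems) < 2:
        raise ValueError("at least two ASR systems are required")
    transcripts = [asr(audio) for asr in asr_systems]
    return min(pairwise_similarities(transcripts)) < threshold


# Stub ASR systems standing in for real off-the-shelf ones.
benign = [lambda a: "what a nice day", lambda a: "what a nice day"]
attacked = [lambda a: "open the front door", lambda a: "what a nice day"]

print(is_adversarial(b"", benign))    # transcriptions agree
print(is_adversarial(b"", attacked))  # transcriptions diverge
```

A fixed threshold is the simplest possible decision rule; a trained classifier over the similarity scores could replace it without changing the overall structure.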