Automatic speech recognition and voice identification systems are being
deployed in a wide array of applications, from providing control mechanisms to
devices lacking traditional interfaces, to the automatic transcription of
conversations and authentication of users. Many of these applications have
significant security and privacy considerations. We develop attacks that force
mistranscription and misidentification in state-of-the-art systems, with
minimal impact on human comprehension. Processing pipelines for modern systems
consist of signal preprocessing and feature extraction steps, whose
output is fed to a machine-learned model. Prior work has focused on the models,
using white-box knowledge to tailor model-specific attacks. We focus on the
pipeline stages before the models, which (unlike the models) are quite similar
across systems. As such, our attacks are black-box and transferable, and
demonstrably achieve mistranscription and misidentification rates as high as
100% by modifying only a few frames of audio. We perform a user study via Amazon
Mechanical Turk demonstrating that there is no statistically significant
difference between human perception of regular and perturbed audio. Our
findings suggest that models may learn aspects of speech that are generally not
perceived by human subjects, but that are crucial for model accuracy. We also
find that certain English-language phonemes (in particular, vowels) are
significantly more susceptible to our attack. We show that the attacks remain
effective when mounted over cellular networks, where signals are subject to
degradation from transcoding, jitter, and packet loss.
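The pipeline stages targeted above can be illustrated with a minimal sketch of a generic frame-based front end (pre-emphasis, windowing, per-frame magnitude spectra). This is an assumption-laden illustration of the general idea, not the attack or the feature extractor used in this work; the frame length, hop size, and perturbation magnitude are arbitrary choices made here for demonstration.

```python
import numpy as np

def extract_features(signal, frame_len=400, hop=160):
    """Generic frame-based front end (illustrative, not the paper's):
    pre-emphasis, Hamming windowing, log-magnitude FFT per frame."""
    # Pre-emphasis filter, a common preprocessing step.
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    window = np.hamming(frame_len)
    feats = []
    for i in range(n_frames):
        frame = emphasized[i * hop : i * hop + frame_len] * window
        mag = np.abs(np.fft.rfft(frame))
        feats.append(np.log(mag + 1e-8))  # log compression for stability
    return np.array(feats)

# 1 second of a 440 Hz tone at 16 kHz stands in for speech.
audio = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)

# Perturb only the first few frames' worth of samples with low-amplitude noise
# (hypothetical perturbation; the actual attacks are crafted, not random).
rng = np.random.default_rng(0)
perturbed = audio.copy()
perturbed[:3 * 400] += 0.05 * rng.standard_normal(3 * 400)

clean_feats = extract_features(audio)
pert_feats = extract_features(perturbed)

# Per-frame feature distance: only the early frames are affected.
diff = np.abs(clean_feats - pert_feats).mean(axis=1)
print(diff[:5])   # early frames differ
print(diff[-5:])  # later frames are untouched
```

Because front ends like this are largely shared across systems, a perturbation that distorts the extracted features transfers across the downstream models that consume them.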