Abstract
Despite remarkable improvements, automatic speech recognition is susceptible
to adversarial perturbations. Compared to attacks on standard machine learning
architectures, these attacks are significantly more challenging, especially
since the inputs to a speech recognition system are time series that contain
both acoustic and linguistic properties of speech. Extracting all
recognition-relevant information requires more complex pipelines and an
ensemble of specialized components. Consequently, an attacker needs to consider
the entire pipeline. In this paper, we present VENOMAVE, the first
training-time poisoning attack against speech recognition. As in the
predominantly studied evasion attacks, the goal is to lead the system to an
incorrect, attacker-chosen transcription of a target audio waveform. In
contrast to evasion attacks, however, we assume that the attacker
can only manipulate a small part of the training data without altering the
target audio waveform at runtime. We evaluate our attack on two datasets:
TIDIGITS and Speech Commands. When poisoning less than 0.17% of the dataset,
VENOMAVE achieves attack success rates of more than 80.0%, without access to
the victim's network architecture or hyperparameters. In a more realistic
scenario, when the target audio waveform is played over the air in different
rooms, VENOMAVE maintains a success rate of up to 73.3%. Finally, VENOMAVE
achieves an attack transferability rate of 36.4% between two different model
architectures.