Lea Schönherr (Ruhr University Bochum), Katharina Kohls (Ruhr University Bochum), Steffen Zeiler (Ruhr University Bochum), Thorsten Holz (Ruhr University Bochum), Dorothea Kolossa (Ruhr University Bochum)

Voice interfaces are becoming accepted widely as input methods for a diverse set of devices. This development is driven by rapid improvements in automatic speech recognition (ASR), which now performs on par with human listening in many tasks. These improvements base on an ongoing evolution of deep neural networks (DNNs) as the computational core of ASR. However, recent research results show that DNNs are vulnerable to adversarial perturbations, which allow attackers to force the transcription into a malicious output.

In this paper, we introduce a new type of adversarial examples based on *psychoacoustic hiding*. Our attack exploits the characteristics of DNN-based ASR systems, where we extend the original analysis procedure by an additional backpropagation step. We use this backpropagation to learn the degrees of freedom for the adversarial perturbation of the input signal, i.e., we apply a psychoacoustic model and manipulate the acoustic signal below the thresholds of human perception. To further minimize perceptibility of the perturbations, we use forced alignment to find the best fitting temporal alignment between the original audio sample and the malicious target transcription. These extensions allow us to embed an arbitrary audio input with a malicious voice command that is then transcribed by the ASR system, with the audio signal remaining barely distinguishable from the original signal. In an experimental evaluation, we attack the state-of-the-art speech recognition system *Kaldi* and determine the best performing parameter and analysis setup for different types of input. Our results show that we are successful in up to 98% of cases with a computational effort of fewer than two minutes for a ten-second audio file. Based on user studies, we found that none of our target transcriptions were audible to human listeners, who still understand the original speech content with unchanged accuracy.

View More Papers

We Value Your Privacy ... Now Take Some Cookies:...

Martin Degeling (Ruhr-Universität Bochum), Christine Utz (Ruhr-Universität Bochum), Christopher Lentzsch (Ruhr-Universität Bochum), Henry Hosseini (Ruhr-Universität Bochum), Florian Schaub (University of Michigan), Thorsten Holz (Ruhr-Universität Bochum)

Read More

Unveiling your keystrokes: A Cache-based Side-channel Attack on Graphics...

Daimeng Wang (University of California Riverside), Ajaya Neupane (University of California Riverside), Zhiyun Qian (University of California Riverside), Nael Abu-Ghazaleh (University of California Riverside), Srikanth V. Krishnamurthy (University of California Riverside), Edward J. M. Colbert (Virginia Tech), Paul Yu (U.S. Army Research Lab (ARL))

Read More

Measuring the Facebook Advertising Ecosystem

Athanasios Andreou (EURECOM), Márcio Silva (UFMG), Fabrício Benevenuto (UFMG), Oana Goga (Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG), Patrick Loiseau (Univ. Grenoble Alpes, CNRS, Inria, Grenoble INP, LIG & MPI-SWS), Alan Mislove (Northeastern University)

Read More

PeriScope: An Effective Probing and Fuzzing Framework for the...

Dokyung Song (University of California, Irvine), Felicitas Hetzelt (Technical University of Berlin), Dipanjan Das (University of California, Santa Barbara), Chad Spensky (University of California, Santa Barbara), Yeoul Na (University of California, Irvine), Stijn Volckaert (University of California, Irvine and KU Leuven), Giovanni Vigna (University of California, Santa Barbara), Christopher Kruegel (University of California, Santa Barbara),…

Read More