
Smart speaker hack could trigger secret messages hidden in audio files
Vulnerabilities have been found in smart home devices like Amazon’s Echo and Google Home that allow hackers to broadcast, via digital audio files, secret messages that cannot be heard or understood by humans.
A team from Ruhr-Universität Bochum succeeded in embedding secret commands for the Kaldi speech-recognition system – the software believed to be contained in Amazon’s Alexa and many other smart systems like it – into audio files.
While the commands are not audible to the human ear, Kaldi reacts to them. The researchers showed that they could hide any sentence they liked in different types of audio signal, such as speech, birds’ twittering or music, and it would still be understood by the system.
“A virtual assistant that can carry out online orders is one of many examples where such an attack could be exploited,” said professor Thorsten Holz who worked on the project. “We could manipulate an audio file, such as a song played on the radio, to contain a command to purchase a particular product.”
Similar attacks, known in technical jargon as adversarial examples, were described a few years ago for image-recognition software. They are more complicated to implement for speech signals, as the meaning of an audio signal emerges only over time, building up into a sentence.
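The core idea behind an adversarial example can be sketched in a few lines of Python: nudge each input value slightly in the direction that most increases the probability of the attacker’s target output. The toy linear model, the epsilon budget and all names below are illustrative assumptions; the team’s actual attack targets Kaldi, a far more complex recogniser.

```python
# Toy sketch of the adversarial-example idea (fast gradient sign method).
# A stand-in logistic-regression "recogniser" over raw audio samples is
# attacked by adding a tiny, sign-of-gradient perturbation. Illustrative
# only: the real attack operates on a full speech-recognition system.
import numpy as np

rng = np.random.default_rng(0)

n_samples = 16_000                                 # one second at 16 kHz
weights = rng.normal(scale=0.01, size=n_samples)   # stand-in for a trained model

def target_probability(audio: np.ndarray) -> float:
    """Probability the toy model assigns to the attacker's target command."""
    return 1.0 / (1.0 + np.exp(-weights @ audio))

audio = rng.normal(scale=0.1, size=n_samples)      # benign audio placeholder

# For logistic regression, the gradient of log-probability with respect to
# the input is (1 - p) * weights; step each sample by epsilon in its sign
# direction, so every change stays within a small per-sample budget.
p = target_probability(audio)
gradient = (1.0 - p) * weights
epsilon = 0.005                                    # tiny perturbation budget
adversarial = audio + epsilon * np.sign(gradient)

print(f"target probability before: {target_probability(audio):.3f}")
print(f"target probability after:  {target_probability(adversarial):.3f}")
```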
In order to incorporate the commands into the audio signals, the researchers use the psychoacoustic model of hearing or, more precisely, the masking effect, which is dependent on volume and frequency.
“When the auditory system is busy processing a loud sound of a certain frequency, we are no longer able to perceive other, quieter sounds at this frequency for a few milliseconds,” said researcher Dorothea Kolossa.
This fact is also used in the MP3 format, which omits inaudible areas to minimise file size. It was in these areas that the researchers hid the commands for the voice assistant.
For humans, the added components sound like random noise that is not (or hardly) noticeable in the overall signal. For the machine, however, they change the meaning: while the human hears statement A, the machine understands statement B.
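In crude numpy terms, the masking trick might look like the sketch below: measure how loud the carrier audio is in each frequency bin, then scale the hidden payload down so it always stays a fixed margin below that level. The 20 dB margin, the random “payload” and the single-window FFT are simplifying assumptions; the researchers use a proper psychoacoustic model, as in MP3 encoding.

```python
# Rough numpy-only sketch of frequency masking: inject a payload only at
# levels the louder carrier would mask. The 20 dB margin and the random
# payload are illustrative assumptions, not the paper's actual method.
import numpy as np

sample_rate = 16_000
t = np.arange(sample_rate) / sample_rate
carrier = 0.5 * np.sin(2 * np.pi * 440 * t)        # loud 440 Hz "music" tone

rng = np.random.default_rng(1)
payload = rng.normal(size=carrier.size)            # stand-in for a hidden command

carrier_spec = np.fft.rfft(carrier)
payload_spec = np.fft.rfft(payload)

margin_db = 20.0                                   # keep payload 20 dB below carrier
ceiling = np.abs(carrier_spec) * 10 ** (-margin_db / 20)

# Shrink each payload bin to the masking ceiling; bins where the carrier
# is silent are driven to (almost) zero, so nothing audible is added there.
scale = np.minimum(1.0, ceiling / np.maximum(np.abs(payload_spec), 1e-12))
masked_payload = np.fft.irfft(payload_spec * scale, n=carrier.size)

combined = carrier + masked_payload                # what the listener hears
print(f"carrier RMS:        {np.sqrt(np.mean(carrier ** 2)):.4f}")
print(f"hidden payload RMS: {np.sqrt(np.mean(masked_payload ** 2)):.6f}")
print(f"combined RMS:       {np.sqrt(np.mean(combined ** 2)):.4f}")
```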
The calculations for adding hidden information to ten seconds of an audio file take less than two minutes and are thus much faster than previously described attacks on speech-recognition systems.
The researchers have not yet carried out any attacks over the air. Instead, they have passed the manipulated audio files directly to Kaldi as input data.
In future studies, they want to show that the attack also works when the signal is played through a loudspeaker and reaches the voice assistant through the air.
“Due to the background noise, the attack will no longer be quite as efficient,” Lea Schönherr suspects, “but we assume that it will still work.”
The aim of the research is to make speech-recognition assistants more robust against attacks over the long term.
As a defence against the attack presented here, it is conceivable that the systems could calculate which parts of an audio signal are inaudible to humans and remove them.
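A minimal sketch of such a defence, under the same simplifying assumptions as the masking example above, might simply strip spectral content that falls far enough below the loudest components to be maskable; the 20 dB rule here is a hypothetical placeholder, not a proposal from the paper.

```python
# Complementary sketch of the defence idea: remove frequency bins quiet
# enough to be masked (and thus inaudible) before the audio reaches the
# recogniser. The simple loudest-bin threshold is an assumption.
import numpy as np

def strip_maskable_bins(audio: np.ndarray, margin_db: float = 20.0) -> np.ndarray:
    """Zero out frequency bins more than `margin_db` below the loudest bin."""
    spec = np.fft.rfft(audio)
    magnitude = np.abs(spec)
    floor = magnitude.max() * 10 ** (-margin_db / 20)
    spec[magnitude < floor] = 0.0                  # discard quiet, maskable bins
    return np.fft.irfft(spec, n=audio.size)

# A loud tone survives the filter; a faint hidden component is removed.
sample_rate = 16_000
t = np.arange(sample_rate) / sample_rate
tone = 0.5 * np.sin(2 * np.pi * 440 * t)
faint = 0.001 * np.sin(2 * np.pi * 3_000 * t)      # stand-in hidden payload

cleaned = strip_maskable_bins(tone + faint)
print(f"residual hidden energy: {np.sum((cleaned - tone) ** 2):.8f}")
```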
“However, there are certainly other ways to hide the secret commands in the files besides the MP3 principle,” Kolossa said. Again, these would require other protection mechanisms.
Holz said he believes that there is no present cause for concern regarding the potential for danger.
“Our attack does not yet work via the air interface,” he said. “In addition, speech-recognition assistants are not currently used in safety-relevant areas, but are only for convenience.
“Nevertheless, we must continue to work on the protection mechanisms as the systems become more sophisticated and popular.”
Last week, Amazon said it would update Alexa with new features designed to improve its capabilities, including the ability to understand when users are whispering and to detect the sound of smashing glass.