
Hidden Voice Commands Could Attack Your Smartphone, Study Shows

July 1, 2016 – Devices such as smartphones can be attacked through hidden voice commands that are not understandable to humans, according to a new study by Georgetown and University of California, Berkeley, computer scientists.

A paper describing how this can happen will be presented at the prestigious USENIX Security Symposium taking place in Austin, Texas, in August.

“Voice command systems are becoming ubiquitous,” notes Micah Sherr, a computer science department professor who worked with colleagues Clay Shields and Wenchao Zhou on the project. “The attack we envision as most feasible is that someone has a YouTube video of kittens or something popular and in the background, there’s something that says, open a URL.”

Kitten Attack

If an attacker can make your phone open a particular website, he explains, that website could have malware on it.

“So a possible scenario could be that a million people watch a kitten video, and 10,000 of them have their phones nearby and 5,000 of those phones obey the attacker’s voice commands and load a URL with malware on it,” Sherr says. “Then you have 5,000 smartphones under an attacker’s control.”
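
The arithmetic behind that scenario can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, and the two rates (1 percent of viewers with a phone nearby, 50 percent of those phones obeying) are simply the ratios implied by Sherr's hypothetical figures, not measurements from the study.

```python
def compromised_phones(viewers: int, nearby_rate: float, obey_rate: float) -> int:
    """Estimate how many phones end up loading the attacker's URL.

    All inputs are illustrative assumptions, not data from the paper.
    """
    # Viewers who have a phone within earshot of the video.
    nearby = round(viewers * nearby_rate)
    # Of those phones, the ones that act on the hidden command.
    return round(nearby * obey_rate)


viewers = 1_000_000
nearby_rate = 10_000 / 1_000_000  # 1 percent, implied by the quote
obey_rate = 5_000 / 10_000        # 50 percent, implied by the quote

print(compromised_phones(viewers, nearby_rate, obey_rate))  # 5000
```

Even with small rates at each step, a large enough audience leaves the attacker with thousands of devices, which is the point of the example.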

The researchers, who also include Georgetown Ph.D. students Tavish Vaidya and Yuankai Zhang, used their knowledge about how speech recognition systems work to construct audio recordings that can be understood as speech by computers but lack the necessary resolution for human comprehension.

“We learned that if you remove those parts and keep everything else, you get something that a computer can still understand but the human brain cannot,” Sherr explains. 

Human vs. Machine

The team also came up with a number of defenses.

One defense would alert the smartphone owner with a tone, but Sherr points out that this wouldn’t work if the person being attacked were watching a movie.

So the team also spent some time developing machine learning techniques that would help computers recognize the difference between human and machine-produced speech.

“Unfortunately, since speech recognition systems are evolving over time, we can’t say that this will protect against all non-foreseeable attacks in the future,” he says. “What we’ve done is come up with an interesting attack that hasn’t been looked at before.”

Spreading the Word

While some might wonder why researchers would reveal the workings behind a new attack that hackers might not have thought of, Sherr says this is considered responsible disclosure within the field of computer security.

“It’s kind of counterintuitive,” he notes. “But the worst thing you can do is sit there and don’t tell anybody, because then people who are criminal will take advantage of the system and no one knows about it. It’s better to describe an attack but also give people fair warning.”

Sherr, who has received numerous National Science Foundation grants for his work, also directs the interdisciplinary Georgetown Institute for Information Assurance, a National Security Agency Center for Academic Excellence in Cyber Defense Research.

Imperfect Science

The technology companies, he adds, are just trying to cater to the public and may not be willing to improve security at the cost of sales.

“Voice recognition is an imperfect science, as anybody who’s used Siri or any of these systems knows,” Sherr says. “The companies need to err on the side of being liberal in terms of what they accept as speech because if they were more conservative and filtered out what might not be speech, the accuracy of the system would fall to the level that no one would use it.”

Training a device to recognize nothing but the owner’s speech could take long periods of time, and people are usually unwilling to do that.

“I imagine that this is going to be somewhat of an arms race,” he says. “As speech recognition gets better and perhaps more closely models what the human brain does, there may be a different set of attacks.”