Robot voices emote, even with minimal training data

Image credit: Dreamstime

Researchers from the University of California-San Diego have presented a new method for making AI-generated voices, such as those used for virtual assistants, more expressive, while requiring only a minimal amount of training. The technique, which translates text to speech, can be applied to voices that never formed part of the system’s training set.

In addition to improving smartphones, smart home devices, and navigation systems, the method could help improve voiceovers in animated films, automatic translation of speech into multiple languages, and other applications. It could also help to create personalised speech interfaces that provide a digital voice for people who have lost the ability to speak, similar to the computer speech interface used by the late Stephen Hawking.

“We have been working in this area for a fairly long period of time,” said PhD candidate Shehzeen Hussain, who is based at the university’s School of Engineering. “We wanted to look at the challenge of not just synthesising speech but of adding expressive meaning to that speech.”

According to the researchers, existing methods fall short in two major ways. Some systems can synthesise expressive speech for a speaker, but only by using hours of training data for that specific individual. Others can synthesise speech from only a few minutes of speech data from a new speaker, but cannot generate expressive speech; they only translate text into relatively monotonous speech.

By contrast, this approach can generate expressive speech for a new subject.

The researchers flagged the pitch and rhythm of the speech in training samples as a proxy for emotion. This allowed their cloning system to generate expressive speech with minimal training, even for voices it had never encountered before.
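To make "pitch and rhythm" concrete, the sketch below extracts a rough per-frame pitch value and loudness from raw audio samples. It is an illustration only, not the researchers' pipeline: the autocorrelation pitch estimator, the function names (`synth_tone`, `estimate_pitch`, `prosody_track`), and the frame sizes and thresholds are all assumptions chosen for clarity. The pattern of voiced frames and silent frames serves here as a crude stand-in for rhythm.

```python
import math


def synth_tone(freq_hz, n_samples, sample_rate=8000):
    """Generate a pure sine tone (a stand-in for a voiced speech segment)."""
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate)
            for i in range(n_samples)]


def estimate_pitch(frame, sample_rate, fmin=60.0, fmax=400.0):
    """Estimate the fundamental frequency of one frame, in Hz.

    Picks the lag with the largest (unnormalised) autocorrelation
    inside the lag range that corresponds to fmin..fmax.
    """
    min_lag = int(sample_rate / fmax)
    max_lag = min(int(sample_rate / fmin), len(frame) - 1)
    best_lag, best_score = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        score = sum(frame[i] * frame[i + lag]
                    for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else 0.0


def prosody_track(samples, sample_rate, frame_len=400, hop=200,
                  silence_rms=0.01):
    """Return one (pitch_hz, rms) pair per frame.

    Frames quieter than silence_rms are treated as pauses and get
    pitch 0.0, so the track records timing as well as melody.
    """
    track = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        f0 = estimate_pitch(frame, sample_rate) if rms > silence_rms else 0.0
        track.append((f0, rms))
    return track


# Toy "utterance": a higher note, a pause, then a lower note.
SAMPLE_RATE = 8000
samples = (synth_tone(200, 800, SAMPLE_RATE)
           + [0.0] * 800
           + synth_tone(100, 800, SAMPLE_RATE))
track = prosody_track(samples, SAMPLE_RATE)
for f0, rms in track:
    print(f"pitch={f0:6.1f} Hz  rms={rms:.3f}")
```

Real speech is far messier than sine tones, so a practical system would rely on a more robust pitch tracker. The point is only that pitch and timing can be measured from audio directly, without any hand-written emotion labels.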

Writing in the study describing the approach, they said: “We demonstrate that our proposed model can make a new voice express, emote, sing, or copy the style of a given reference speech.”

Their method can learn speech directly from text, reconstruct a speech sample from a target speaker, and transfer the pitch and rhythm of speech from a different expressive speaker into cloned speech for the target speaker. 
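One simple way to picture the transfer of pitch from an expressive speaker to a target voice is to rescale the expressive speaker's pitch contour into the target speaker's natural range. The sketch below does this with mean and variance matching on log pitch, a standard voice-conversion trick. It is not taken from the study, whose system is a learned model; the function name `transfer_pitch` and the example numbers are hypothetical.

```python
import math
import statistics


def transfer_pitch(reference_f0, target_f0_samples):
    """Map a reference pitch contour into a target speaker's range.

    reference_f0: per-frame pitch (Hz) of the expressive speaker,
        with 0 marking unvoiced or silent frames.
    target_f0_samples: some voiced pitch values (Hz) from the
        target speaker, used only to estimate their typical range.
    Returns a contour with the reference's shape and timing but the
    target's average pitch and spread (matched in the log domain).
    """
    ref_log = [math.log(f) for f in reference_f0 if f > 0]
    tgt_log = [math.log(f) for f in target_f0_samples if f > 0]
    ref_mu, ref_sd = statistics.mean(ref_log), statistics.pstdev(ref_log)
    tgt_mu, tgt_sd = statistics.mean(tgt_log), statistics.pstdev(tgt_log)
    scale = tgt_sd / ref_sd if ref_sd > 0 else 1.0

    converted = []
    for f in reference_f0:
        if f <= 0:
            # Keep pauses where they were, which preserves the rhythm.
            converted.append(0.0)
        else:
            converted.append(
                math.exp(tgt_mu + (math.log(f) - ref_mu) * scale))
    return converted


# Hypothetical data: a lively high-pitched reference, a lower target voice.
reference = [200.0, 0.0, 250.0, 300.0, 0.0, 220.0]
target = [100.0, 110.0, 120.0, 105.0]
converted = transfer_pitch(reference, target)
print([round(f, 1) for f in converted])
```

The rises, falls and pauses of the reference survive the conversion, while the overall register moves to that of the target speaker.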

The team is aware that their work could be used to make deepfake videos and audio clips more accurate and persuasive. Therefore, they plan to release their code with a watermark that will identify the speech created by their method as cloned. 

“Expressive voice cloning would become a threat if you could make natural intonations,” said Paarth Neekhara, a lead author and a PhD candidate in computer science. “The more important challenge to address is detection of these media and we will be focusing on that next.” 

The method itself still needs to be improved, the researchers say, noting that it works better for English speakers and struggles with speakers who have a strong accent. Samples of AI-generated speech using this approach can be heard here.
