Synthetic voices

License your voice: AI solution paves way for synthetic vocals

Image credit: Getty Images

An AI company in the US has launched a hyper-realistic Voice-as-a-Service (VaaS) solution that enables users to ‘create, manage, license and monetise’ synthetic voices.

Synthetic voices have become increasingly ubiquitous in everyday life, from reading out the news on our smart speakers to giving us directions to a desired destination. Meanwhile, the need for media companies, brands and celebrities to have a ‘voice’ representing them on these growing audio-based media continues to expand.

Many brands and media companies are struggling to produce content at the rapid rate consumers expect, because of constraints related to talent, production, personalisation and licensing, according to California-based AI company Veritone. To address this, Veritone has launched a platform, called Marvel.AI, that enables content creators and media figures, among others, to generate deepfake clones of their voices to license whenever they please.

According to Veritone’s president Ryan Steelberg, the technology opens the door for celebrities and public figures to ‘digitally lend’ their voices not just for ads, but also for podcasts, audiobooks, video games and voice-over narration, without having to set foot inside a recording studio. “With complete control over their voice and its usage, any influencer, personality, or celebrity can literally be in multiple places at once,” he explains. The generated synthetic voice can also change its tone, speed, accent and pitch, and be made to speak a different language.

Before the launch of the platform, Veritone had been working with TV and radio company CBS News to repurpose much of the broadcaster’s audio for podcasts, and it was there that Steelberg ran into problems assembling the audio. “Sometimes there was not enough content in a format that we could edit,” he tells E&T. “It was very challenging for us to find the host, or presenter, and get them back into the studio to re-edit or re-dub parts of their programming using their own voice.”

Veritone wondered if creating synthetic voice models of this talent would resolve the recurring issue. “What if, once we create these models, we could polish them and make them hyper-realistic and accurate enough that we could do post-production work in auto-dubbing, using a synthetic version of that actor’s voice without having to get them back in the studio?” Steelberg says.

This was what fuelled the company’s ambitions to become a Voice-as-a-Service (VaaS) partner for these media companies. “We wanted to become a platform for anybody where we, in effect, would help manage and nurture that synthetic voice, whether that’s an individual who wants to work with Marvel.AI, or an entire broadcast network that wants us to help manage synthetic voice models of their talent,” he explains.

Indeed, Veritone said that Marvel.AI supports both text-to-speech and speech-to-speech processes and offers the first complete, end-to-end suite of voice capabilities and features. Built on aiWARE, Veritone’s proprietary operating system for AI, Marvel.AI also enables users to leverage multiple best-in-class voice engines, ensuring they use the best solution possible for their specific needs.

The process, also known as voice cloning, enables accurate replication of a person’s voice using artificial intelligence (AI), Steelberg says – the platform uses a machine-learning algorithm trained on audio samples of a particular person’s voice.
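Veritone has not published details of its models, but the underlying idea – distilling reference recordings into a compact representation of what makes a voice distinctive – can be illustrated with a deliberately simple sketch. Everything below (the averaged log-spectrum “voice print” and the cosine-similarity check) is a toy stand-in for the neural networks a production cloning system would actually use.

```python
import numpy as np

def voice_print(clips: list[np.ndarray], n_fft: int = 1024) -> np.ndarray:
    """Toy 'voice print': average log-magnitude spectrum over sample clips.

    A real cloning system trains neural speaker/TTS models instead; this
    stand-in only illustrates distilling reference audio into a compact
    representation of a voice.
    """
    frames = []
    for clip in clips:
        # Split the clip into non-overlapping frames and take their spectra.
        n_frames = len(clip) // n_fft
        for i in range(n_frames):
            frame = clip[i * n_fft:(i + 1) * n_fft]
            spectrum = np.abs(np.fft.rfft(frame * np.hanning(n_fft)))
            frames.append(np.log1p(spectrum))
    return np.mean(frames, axis=0)

def similarity(print_a: np.ndarray, print_b: np.ndarray) -> float:
    """Cosine similarity between two voice prints (1.0 = identical)."""
    return float(np.dot(print_a, print_b) /
                 (np.linalg.norm(print_a) * np.linalg.norm(print_b) + 1e-9))

# Random noise stands in for real recordings in this self-contained example.
rng = np.random.default_rng(0)
host_clips = [rng.normal(size=48_000) for _ in range(3)]   # "reference" audio
unknown_clip = rng.normal(size=48_000)                      # audio to compare

host_print = voice_print(host_clips)
print(f"match score: {similarity(host_print, voice_print([unknown_clip])):.3f}")
```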

Steelberg explains that, when building custom voices, the talent in question must give their written and verbal consent to proceed with the process. In fact, Veritone is working with the Open Voice Network, a non-profit industry association dedicated to developing standards and ethical-use guidelines for the use of voice across different platforms, to help protect a talent’s voice brand. “Once we have a person’s written and verbal consent, we can use that consent audio to verify the training data and the collection of this data – to verify it really is them,” Steelberg says.

Once the system collects enough audio recordings, Marvel.AI can produce audio files using the host’s synthetic voice in one of several ways, Steelberg explains. One is text-to-speech, where users import text into the synthetic voice model and the machine creates audio using a synthetic version of the talent’s voice – what started as text ends up as an audio file. Meanwhile, speech-to-speech uses the same training data, only the input is a voice actor imitating the “tone and inflection that makes people’s signature voices unique”, Steelberg adds.
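The article does not describe Marvel.AI’s actual interface, so the two entry points below – and the VoiceModel type – are hypothetical stubs, sketched only to show how the two modes differ: text-to-speech must generate pacing and inflection from scratch, while speech-to-speech keeps a voice actor’s performance and swaps in the talent’s timbre.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class VoiceModel:
    """Hypothetical handle to a trained synthetic voice (not a real API)."""
    talent_id: str
    sample_rate: int = 22_050

def synthesise_from_text(model: VoiceModel, text: str) -> np.ndarray:
    """Text-to-speech: the model must invent pacing, tone and inflection
    itself, because the only input is plain text."""
    # Placeholder output -- a real engine would return synthesised speech.
    return np.zeros(int(0.08 * model.sample_rate * len(text.split())))

def convert_speech(model: VoiceModel, actor_performance: np.ndarray) -> np.ndarray:
    """Speech-to-speech: a voice actor supplies the timing and inflection;
    only the timbre is replaced, so the output keeps the actor's rhythm."""
    return np.zeros_like(actor_performance)  # placeholder output

host = VoiceModel(talent_id="radio-host-001")
ad_read = synthesise_from_text(host, "Tune in tonight at eight.")
dubbed = convert_speech(host, actor_performance=np.zeros(3 * host.sample_rate))
```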

Veritone has the resources to ingest that content and extract and isolate the talent’s voice, Steelberg says. And when the company’s system has identified enough quality training data – from previous interviews, podcasts and videos the talent has done over their career – Veritone then uses a combination of different neural networks to generate a voice model. “Ultimately, we will generate a secure voice model, an AI model that is only made available to those with accurate credentials in a secure cloud environment.”

Steelberg says that once they have made a voice model for a talent, only the person in question, or a duly authorised person, would have access to use that model to create new content – the secure cloud environment ensures bad actors don’t misappropriate the voice model. “We index, cut and fingerprint that voice model,” he explains further. “And once audio files are produced, Veritone embeds inaudible ‘tokens’ into that audio file to verify it. So if we wanted to verify whether a voice came from Marvel.AI, we can audit it all the way back to the original training data that we initially collected.”
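Veritone has not said how its inaudible ‘tokens’ are embedded, so the sketch below uses the simplest possible stand-in: hiding a token ID in the least-significant bits of 16-bit PCM samples, with a registry mapping each token back to a voice model and training batch. It only illustrates the audit idea, not the company’s actual scheme, which would need to survive compression and editing.

```python
import numpy as np

TOKEN_BITS = 32

def embed_token(samples: np.ndarray, token: int) -> np.ndarray:
    """Hide a 32-bit token in the least-significant bits of the first 32
    samples of a 16-bit PCM signal (inaudible, but only a toy scheme)."""
    out = samples.copy()
    for i in range(TOKEN_BITS):
        bit = (token >> i) & 1
        out[i] = (out[i] & ~1) | bit   # overwrite the LSB with the token bit
    return out

def extract_token(samples: np.ndarray) -> int:
    """Read the token back out of the least-significant bits."""
    token = 0
    for i in range(TOKEN_BITS):
        token |= (int(samples[i]) & 1) << i
    return token

# Registry linking each issued token to the voice model and training batch,
# so a suspect clip can be audited back to its source data.
registry = {0x5EED01: {"voice_model": "radio-host-001", "training_batch": "2021-05-17"}}

audio = np.random.default_rng(1).integers(-2000, 2000, 48_000).astype(np.int16)
tagged = embed_token(audio, 0x5EED01)
print(registry.get(extract_token(tagged), "no recognised token in this file"))
```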

The platform’s potential applications are still being explored, but Marvel.AI can produce hundreds of derivatives of a piece of content. Steelberg says that since the platform launched back in May there has been high demand for creating ad copy. “Broadcasting companies can produce one piece of national copy and they can regionalise it for 50 different markets or different dialects,” he says. “The key is understanding the use case the talent is trying to achieve and then use different modalities for that execution.”
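A minimal sketch of that “one national copy, many regional reads” workflow follows. The market table, the {city} placeholder and the synthesise() stub are all invented for illustration; Marvel.AI’s real regionalisation pipeline is not public.

```python
# One piece of national ad copy, rendered per market with a regional accent
# or language, as Steelberg describes for broadcasters.
national_copy = "Visit our {city} store this weekend for the summer sale."

markets = [
    {"city": "Austin", "accent": "texan",       "language": "en-US"},
    {"city": "Boston", "accent": "new-england", "language": "en-US"},
    {"city": "Miami",  "accent": "neutral",     "language": "es-US"},
    # ...one entry per market, up to the 50 Steelberg mentions.
]

def synthesise(text: str, voice: str, accent: str, language: str) -> bytes:
    """Stub standing in for a call to a synthetic-voice engine."""
    return f"[{voice}/{accent}/{language}] {text}".encode()

regional_spots = {
    m["city"]: synthesise(
        national_copy.format(city=m["city"]),
        voice="national-announcer",
        accent=m["accent"],
        language=m["language"],
    )
    for m in markets
}

for city, audio in regional_spots.items():
    print(city, len(audio), "bytes")
```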

Veritone also believes that Marvel.AI can enable listeners to have virtual chats with their favourite player from a sports team, for example. “In a conversational AI approach, a listener can interact through their computer or over the phone, having a conversation, asking questions, and the feedback would come back in a trained voice from their favourite footballer or basketball player, as examples,” he explains.

The company is also working with the US Interactive Advertising Bureau on a clear but unobtrusive way to inform listeners that a synthetic voice is being used, and to assure them that the voice’s owner endorses its use. In media with a visual element, the disclosure would be displayed at the bottom of the screen, similar to disclosures about paid sponsorships; for a purely audio medium, it would be a 2-5 second audible tone. “We want to reassure listeners that celebrities and public figures really approved this synthetic content,” Steelberg says.

Steelberg also believes the technology could expand beyond advertising. Companies could use it in other forms of communication, and to improve accessibility for visually impaired users. Steelberg gives the example of a text message sent to a computer or phone that is read out loud. “If you send me a text and I’m using a text-to-speech editor, it will come out in a robotic voice. Why not have the voice actually be you?” This could create a more accessible and intimate experience for the recipient.
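In code, the “why not have the voice actually be you?” idea might look like the sketch below: map each sender to a consented synthetic-voice ID and fall back to a generic voice otherwise. The voice IDs and the speak() stub are illustrative assumptions, not a real API.

```python
# Senders who have licensed a synthetic voice and consented to its use.
consented_voices = {
    "+15551234567": "voice-model-ryan",
}

def speak(text: str, voice_id: str) -> None:
    """Stub for handing text to a text-to-speech engine with a chosen voice."""
    print(f"<{voice_id}> {text}")

def read_message_aloud(sender: str, text: str) -> None:
    # Use the sender's own synthetic voice if one exists and is consented,
    # otherwise fall back to the familiar generic assistant voice.
    speak(text, consented_voices.get(sender, "generic-assistant"))

read_message_aloud("+15551234567", "Running ten minutes late, see you soon.")
read_message_aloud("+15559876543", "Meeting moved to 3pm.")
```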

The company hopes synthetic voices will lead to many more original storylines and audio casts being produced purely programmatically, tailored to the interests and needs of specific people or organisations. Steelberg envisions a future in which “the art of the possible is programmatic content generation” and technology evolves to create new and innovative ways to produce content.

Audio: samples of Ryan Steelberg's real and synthetic voices, including examples of a female and a Spanish synthetic voice.
