Automated lip-syncing made possible with machine learning
Image credit: Dreamstime
Researchers from the University of Washington have developed a more realistic method for transforming audio clips into lip-synced video, using hours of footage of a single speaker to train a neural network to match sounds to lip movements.
This addresses a long-standing problem: synthesised human appearances often look unsettling and fake. The effect is particularly pronounced when obviously computerised animation around the mouth is matched to an audio recording.
Previous attempts at matching video to audio have involved filming multiple people in a studio repeating the same phrases, as investigators attempted to capture how sounds correspond to mouth shapes. The University of Washington researchers instead left this task to a neural network.
The machine learning program was trained on hours of matched video and audio of a person speaking, learning the relationships between phonemes and mouth movements. Once trained, it can convert recorded speech into appropriate mouth shapes, which are then superimposed on and blended into the speaker's face taken from a different video.
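The article does not detail the model itself, but the core idea of mapping per-frame audio features to mouth shapes can be illustrated with a toy sketch. Everything here is a hypothetical stand-in: the real system was trained on many hours of footage, whereas this example uses random placeholder data and a simple nearest-neighbour lookup rather than the researchers' actual network.

```python
import numpy as np

# Toy sketch of audio-to-mouth-shape mapping (NOT the researchers' method).
# Stand-in "training" data: per-frame audio feature vectors paired with
# mouth shapes, represented as 18 (x, y) landmark points around the lips.
rng = np.random.default_rng(0)
train_audio = rng.normal(size=(500, 13))     # e.g. 13 audio coefficients per frame
train_mouths = rng.normal(size=(500, 18, 2)) # matching mouth landmarks per frame

def predict_mouth_shapes(audio_frames):
    """For each new audio frame, find the closest training frame's audio
    features and return that frame's associated mouth landmarks."""
    shapes = []
    for frame in audio_frames:
        dists = np.linalg.norm(train_audio - frame, axis=1)
        shapes.append(train_mouths[np.argmin(dists)])
    return np.stack(shapes)

# Ten frames of new speech produce ten mouth shapes, one per frame.
new_audio = rng.normal(size=(10, 13))
mouths = predict_mouth_shapes(new_audio)
print(mouths.shape)  # (10, 18, 2)
```

In the actual research, a trained network plays the role of this lookup, and the predicted mouth shapes are then blended into target video frames.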
Using their new method, the researchers were able to generate realistic video footage of former US President Barack Obama speaking about fatherhood, terrorism and other subjects, using video footage from completely different speeches.
“These types of results have never been shown before,” said Professor Kemelmacher-Shlizerman.
“Realistic audio-to-video conversion has practical applications like improving video conferencing for meetings, as well as futuristic ones such as being able to hold a conversation with a historical figure in virtual reality by creating visuals just from audio. This is the kind of breakthrough that will help enable those next steps.”
The team chose to use Obama as a subject due to the ease of collecting 14 hours of presidential videos in the public domain, although they say that video chat tools like Skype and Google Hangouts could collect video and audio from any user to train the software. As streaming audio over the internet takes less bandwidth than video, the system could put an end to juddering, low-resolution video calls by matching the audio to the user’s image.
Adjusting the algorithms to generalise across situations could allow training on a far smaller dataset; the system could potentially begin to match mouth shapes to speech with as little as an hour of video.