Synthesia AI | Deep fakes get respectable

Posted on Oct 28, 2019 by Neal Romanek

Synthesia AI: A new start-up, Synthesia, aims to put AI-driven video production into the hands of creators

If Synthesia has its way, no one will know that Taylor Swift doesn’t speak perfect Mandarin.

The company, formed three years ago by a group of high-level researchers and entrepreneurs from UCL, Stanford, the Technical University of Munich, and Cambridge, has developed AI-powered video augmentation technology. Funded by venture capitalists from the US and UK, it hopes that its tools will give creators ‘superpowers’ when it comes to manipulating video.

Synthesia’s calling card has been a public service announcement for UK charity Malaria Must Die, in which football superstar David Beckham appears to deliver a message about the ravages of the deadly disease in nine languages, including Hindi, Spanish, Arabic and Mandarin. The voices are native speakers who have been affected by the disease, but Beckham’s mouth movement matches their lines. This effect could probably be approximated with simple dubbing, but Synthesia’s ENACT product creates essentially perfect lip-sync for each of the languages spoken, making it look, at the very least, like Beckham has been up all night making sure his non-English lines are perfect.

The production is simple: David Beckham sits down at a table with a cup of tea and speaks into camera. It’s essentially one shot with Beckham reading off a teleprompter. Every effort has been made to keep the shoot short and sweet. Beckham is one of the founding members of the Malaria No More UK Leadership Council, but it’s still a non-starter if he has to spend hours learning lines in multiple languages and then tries to deliver them in a single take. Synthesia’s ENACT technology allowed the footballer to deliver his lines in English; his lip-sync was then created from scratch to match the foreign-language voiceovers, faster and at lower cost than a traditional visual effects house could have managed.

Video Gaga: In the future video may be created in radically different ways, and that will change the ways it is used and consumed

Synthesia AI: Avatars for avatars

Synthesia has already built up experience with the technology in e-learning, where it is used to ‘rewrite’ a video of a speaker delivering a talk into a different language. Company co-founder and CEO, Victor Riparbelli, explains: “Instead of dubbing or subtitling it, which generally is a pretty bad experience, we have a voice actor perform that same video in a different language. We create a digital version of the real actor, and then we can make it look like the presenter is really speaking that different language.”

The video can be further customised for its audience, presenting different types of content depending on the circumstances.

“We could take a video and switch out the name, switch out the city, switch out the product, and it would be just like when you get an email personalised for you. In the case of an email, it’s a bit of text inserted programmatically. But programmatically creating videos, where you address someone with information specific to them, we can do at scale today. We have an engine for that,” Riparbelli adds.
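The email analogy Riparbelli draws can be sketched in a few lines. The snippet below is only an illustration of template-based personalisation, not Synthesia’s engine: the template text, the field names and the `render_scripts` helper are all hypothetical, and each rendered script would still have to be passed to a video-synthesis step that is not shown here.

```python
from string import Template

# Hypothetical script template. The placeholders mirror the name/city/product
# swaps described in the quote; they are not taken from any real Synthesia API.
SCRIPT = Template("Hi $name, great news for $city: $product is now available.")


def render_scripts(template, viewers):
    """Return one personalised script per viewer record."""
    return [template.substitute(viewer) for viewer in viewers]


# Made-up viewer records, one per personalised video.
viewers = [
    {"name": "Asha", "city": "Mumbai", "product": "the starter plan"},
    {"name": "Luis", "city": "Madrid", "product": "the team plan"},
]

for script in render_scripts(SCRIPT, viewers):
    print(script)
```

The text side of the job is no harder than a mail merge. The difficult part, and the part the company claims to have an engine for, is turning each script into convincing video.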

The next step, Riparbelli says, is to incorporate interactivity, where a viewer or user can engage in real time with the video, even ask it questions, and get back appropriate answers that appear to be delivered as spontaneous conversation.

“The range of applications is really wide. I view this as a foundational technology. What we’re doing today is very much around creating digital humans. We’re tracking towards being able to give Siri or Alexa a face, but one that is actually real. I think in two to four years, we’ll all have our own avatars for various purposes.” Riparbelli points to the latest update to FaceTime, which corrects eye placement so it appears that callers are looking directly at the camera – and so at each other – rather than looking at their screens. “That’s obviously a very small thing,” he says. “But that’s the first step.”

A recent ‘deep fake’ trial revolved around generating new video, based solely on an audio track, without any additional video reference. One of Riparbelli’s co-founders, Professor Matthias Niessner, was involved in the experiment and the subsequent paper.

Being able to customise video for each individual could lead to services like customised newscasts.

“You could programmatically create a newscast,” says Riparbelli. “If you go on Google News, it will show you 20 articles that are most relevant for you. For our system, we do the same thing. But instead of 20 text articles, it would just be a video at six minutes long, that would summarise each of the articles with a presenter selected for you.”

Can’t believe your eyes

Riparbelli is well aware of the ethical dilemmas raised by seamless alteration of real-life video – and of the public anxiety around it. Synthesia’s company website lists ‘Ethics’ in its main menu along with ‘Solutions’ and ‘Get in touch’.

The site states: ‘Soon these and other sophisticated technologies will be widespread and it is important that the risk of misuse is both acknowledged and alleviated as much as possible. In our view, the best approach is to create public awareness and develop technological security mechanisms to ensure all content created is consensual’.

When it becomes commonly accepted that the person we see on-screen may not even be a real person at all, or has at least been digitally manipulated, what will happen to our trust in media and in each other? And in a world where every image is suspect, how can we determine what is real and what isn’t?

Riparbelli is optimistic about human perceptiveness and about new technologies that can help sort truth from fabrication. He thinks that the fears around deep fakes are mostly overblown and notes that most online propaganda and ‘fake news’ is not generated out of whole cloth, but often merely manipulates the context of already existing material. We all know that opting for a frowning picture of the subject of an article rather than a smiling one can dramatically alter how a reader perceives that person. Or simply leaving a three-second silence after a person has spoken, rather than cutting away, can be the difference between making someone seem decisive or stupid.

Riparbelli supports the growing consensus that video content should carry some kind of fingerprint, allowing users to trace what they’ve seen back to its original source.

“A system like that could tell you: ‘These 30 seconds of material in this certain video came 100% from the BBC. And if you want to watch the rest of the original video – not just the 30 seconds this person has picked out – then here’s a link’. That’s missing today. Unless you do thorough research, and we know that people don’t, anyone can take a short segment of a video, change the context and suddenly somebody might seem really racist. You want to have a system in place that can at least assist with that. YouTube automatically lists songs that are in a video – they have to for the record labels. Every time a video is uploaded to YouTube, they will figure out what songs are in it. The same thing for video would be really cool.

“I’m really trying to spur collaboration between us and big tech and news organisations, because I think this is the general sort of problem. Media provenance is something that needs to be solved by several different parties.”
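As a rough illustration of the provenance lookup Riparbelli describes, the sketch below registers fingerprints for the segments of an original video and then traces a clip back to its source. It rests on loud assumptions: the `ProvenanceRegistry` class, the publisher name and the example URL are all hypothetical. An exact SHA-256 hash also stands in for the fingerprint, whereas a real system would need a perceptual fingerprint that survives re-encoding and cropping.

```python
import hashlib


def fingerprint(segment: bytes) -> str:
    """Exact-match stand-in for a video fingerprint (assumption: a real
    system would use a perceptual hash robust to re-encoding)."""
    return hashlib.sha256(segment).hexdigest()


class ProvenanceRegistry:
    """Hypothetical registry mapping segment fingerprints to their source."""

    def __init__(self):
        self._sources = {}

    def register(self, publisher, url, segments):
        # Record every segment of the original under its publisher and link.
        for segment in segments:
            self._sources[fingerprint(segment)] = (publisher, url)

    def trace(self, segments):
        # For each segment of a clip, return (publisher, url) or None if unknown.
        return [self._sources.get(fingerprint(segment)) for segment in segments]


registry = ProvenanceRegistry()
original = [b"segment-1", b"segment-2", b"segment-3"]
registry.register("BBC", "https://example.org/original", original)

# A clip that reuses one original segment and adds unregistered footage.
clip = [b"segment-2", b"unknown-footage"]
print(registry.trace(clip))
```

A viewer-facing tool could use the result to say which parts of a clip came from a known publisher and to link back to the full original.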

Synthesia is counting on its tools being able to dramatically expand what creators can do. Riparbelli sees a future where human performances can be created and edited on a laptop with as much ease as we currently edit sound and image.

It seems clear that the era of accepting video at face value is nearly over. If Riparbelli and his co-founders are on the right track, video is likely to become as malleable as text. While our ability to use video for communication and creative expression will increase astronomically, its status as the go-to medium for authenticity may vanish altogether.

This article originally appeared in the September 2019 issue of FEED magazine.
