Exploring DeepFakes

In this post, I explore the capabilities of this tech, describe how it works, and discuss potential applications.



How does it work?

 
At the core of the Deepfakes code is an autoencoder, a deep neural network that learns to take an input, compress it down into a small representation or encoding, and then regenerate the original input from that encoding.


In this standard autoencoder setup, the network is trying to learn how to create an encoding (the bits in the middle), from which it can regenerate the original image. With enough data, it will learn how to do this.

The bottleneck in the middle prevents the network from simply passing the image straight through: it has to learn a compact summary from which it can recreate the image. Hypothetically, the encodings capture broader patterns, like how and where to draw Jimmy Fallon’s eyebrow.
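
To make this concrete, here is a minimal sketch of such an autoencoder in Keras. It is illustrative only, not the actual Deepfakes network; the 64x64 input size and the layer dimensions are my own assumptions.

```python
# A minimal convolutional autoencoder sketch (illustrative, not the actual Deepfakes network).
# Assumes 64x64 RGB face crops; layer sizes are arbitrary choices.
from tensorflow.keras import layers, Model

def build_encoder():
    inputs = layers.Input(shape=(64, 64, 3))
    x = layers.Conv2D(64, 5, strides=2, padding="same", activation="relu")(inputs)
    x = layers.Conv2D(128, 5, strides=2, padding="same", activation="relu")(x)
    x = layers.Flatten()(x)
    encoding = layers.Dense(256, activation="relu")(x)  # the small encoding / bottleneck
    return Model(inputs, encoding, name="encoder")

def build_decoder(name="decoder"):
    encoding = layers.Input(shape=(256,))
    x = layers.Dense(16 * 16 * 128, activation="relu")(encoding)
    x = layers.Reshape((16, 16, 128))(x)
    x = layers.Conv2DTranspose(64, 5, strides=2, padding="same", activation="relu")(x)
    outputs = layers.Conv2DTranspose(3, 5, strides=2, padding="same", activation="sigmoid")(x)
    return Model(encoding, outputs, name=name)

# A standard autoencoder: encode the image, then decode it back to the original.
face_in = layers.Input(shape=(64, 64, 3))
encoder, decoder = build_encoder(), build_decoder()
autoencoder = Model(face_in, decoder(encoder(face_in)))
autoencoder.compile(optimizer="adam", loss="mae")  # train it to reconstruct its own input
```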

Deepfakes goes further by having one encoder to compress a face into an encoding, and two decoders: one to turn it back into person A (Fallon), the other into person B (Oliver). It’s easier to understand with a diagram:


There is only one encoder that is shared between the Fallon and Oliver cases, but the decoders are different. During training, the input faces are warped, to simulate the notion of “we want a face kind of like this”.

The diagram above shows how these three components get trained (a rough code sketch follows the list):

  1. We pass in a warped image of Fallon to the encoder and try to reconstruct Fallon’s face with Decoder A. This forces Decoder A to learn how to create Fallon’s face from a noisy input.
  2. Then, using the same encoder, we encode a warped version of Oliver’s face and try to reconstruct it using Decoder B.
  3. We keep doing this over and over again until the two decoders can create their respective faces, and the encoder has learned how to “capture the essence of a face” whether it be Fallon’s or Oliver’s.
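
As referenced above, here is how the same pieces can be rearranged into the shared-encoder, two-decoder setup with an alternating training step. It reuses the build_encoder and build_decoder helpers from the earlier sketch; the warp placeholder and the function names are illustrative assumptions, not the actual Deepfakes implementation.

```python
# Shared-encoder / two-decoder setup with alternating training (illustrative sketch).
# Reuses build_encoder() and build_decoder() from the autoencoder sketch above.
import numpy as np
from tensorflow.keras import layers, Model

encoder   = build_encoder()                  # one encoder, shared by both people
decoder_a = build_decoder("decoder_fallon")  # learns to draw Fallon
decoder_b = build_decoder("decoder_oliver")  # learns to draw Oliver

# Two autoencoders that share the exact same encoder weights.
face_in = layers.Input(shape=(64, 64, 3))
autoencoder_a = Model(face_in, decoder_a(encoder(face_in)))
autoencoder_b = Model(face_in, decoder_b(encoder(face_in)))
autoencoder_a.compile(optimizer="adam", loss="mae")
autoencoder_b.compile(optimizer="adam", loss="mae")

def warp(faces):
    # Placeholder for the random warping described above ("a face kind of like this");
    # the real code distorts geometry, here we just add a little noise for illustration.
    return faces + np.random.normal(0, 0.02, faces.shape).astype("float32")

def train_step(fallon_batch, oliver_batch):
    # 1. Reconstruct Fallon from a warped Fallon input (trains the encoder + Decoder A).
    # 2. Reconstruct Oliver from a warped Oliver input (trains the encoder + Decoder B).
    # 3. Repeat over and over until both decoders can create their respective faces.
    loss_a = autoencoder_a.train_on_batch(warp(fallon_batch), fallon_batch)
    loss_b = autoencoder_b.train_on_batch(warp(oliver_batch), oliver_batch)
    return loss_a, loss_b
```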

Once training is complete, we can perform a clever trick: pass in an image of Fallon into the encoder, and then instead of trying to reconstruct Fallon from the encoding, we now pass it to Decoder B to reconstruct Oliver.


This is how we run the model to generate images. The encoder captures the essence of Fallon’s face and gives it to Decoder B, which says “ah, another noisy input, but I’ve learned how to turn this into Oliver… voila!”
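
Continuing the sketch above, the swap at generation time is just a matter of routing the encoding to the other decoder. The face crop below is a random stand-in; in practice it would come from face detection and alignment on each video frame.

```python
# Inference-time swap, continuing the sketch above (reuses `encoder` and `decoder_b`).
import numpy as np

# Stand-in for a real aligned 64x64 face crop of Fallon, scaled to [0, 1].
fallon_face = np.random.rand(1, 64, 64, 3).astype("float32")

encoding = encoder.predict(fallon_face)     # capture the "essence" of the face
oliver_face = decoder_b.predict(encoding)   # reconstruct that face as Oliver
```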

It’s remarkable to think that the algorithm can learn how to generate these images just by seeing thousands of examples, but that’s exactly what has happened here, and with fairly decent results.

 

Limitations and learnings

 
While the results are exciting, there are clear limitations to what we can achieve with this technology today:

  • It only works if there are lots of images of the target: to put a person into a video, on the order of 300–2000 images of their face are needed so the network can learn how to recreate it. The number depends on how varied the faces are, and how closely they match the original video.

    → This works fine for celebs, or anyone with lots of photos online. But clearly, this won’t let you build a product that can swap anybody’s face.

  • You also need training data that is representative of the goal: the algorithm isn’t good at generating profile shots of Oliver, simply because it didn’t have many examples of Oliver looking to the side. In general, training images of your target need to approximate the orientation, facial expressions, and lighting in the videos you want to stick them into.

    → So if you’re building a face swap tool for the average person, given that most photos of them will be front-facing (e.g., selfies on Instagram), limit face swaps to mostly forward-facing videos. If you’re working with a celeb, it’s easier to get a diverse set of images. And if your target is helping you create data, go for quantity and variety so you can insert them into anything.


There just weren’t that many images of Oliver looking to the side, so the network was never able to learn by example and generate profile poses.

  • Building models can get expensive: it took 48 hours to get OK results, and 72 to get the fair ones above. At about $0.50 a GPU hour, it costs $36 to build a model just for swapping person A to B and vice versa, and that doesn’t include all the bandwidth needed to get training data, or the CPU and I/O to preprocess it. The biggest issue: you need one model for every pair of people, so work invested in a single model doesn’t scale to different faces. (Note: I used a single GPU machine; training could be parallelized across GPUs.)

    → This high model creation cost makes it hard to create a free or cheap app in the hopes it will go viral; of course, it’s not an issue if customers are willing to pay to generate models. (A rough cost calculator follows this list.)

  • Running models is fairly cheap, but not free: it takes about 5–20X the duration of a video to create a swap when running on a GPU, e.g. a 1-minute 1080p video takes about 18 minutes to generate. A GPU speeds up both the core model and the face detection code (i.e. is there a face in this frame that I even need to swap?). I haven’t tried conversion on just a CPU, but I’d venture it would be much slower.

    → There are a lot of inefficiencies in the current conversion process: waiting on I/O when reading and writing frames, no batching of frames sent to the GPU, no parallelism, etc. This could be made to work much faster, possibly with a time and cost tradeoff that makes CPUs just fine for conversion.

  • Reusing models reduces training time, and therefore cost: if you use the Fallon-to-Oliver model to convert, say, Jimmy Kimmel’s face to John Oliver’s, the results are typically quite poor. However, if you train a new model with images of Kimmel and Oliver, but start from the Fallon-to-Oliver trained model, it can learn to perform reasonable conversions in 20–50% of the time (12–36 hours instead of 72+ hours).

    → If one were to keep a stable of models for different face shapes, and use them as starting points, there may be considerable reductions in time and cost.
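
As a sanity check on the numbers above, here is a tiny back-of-the-envelope calculator. The rates and multipliers are simply the assumptions stated in this list, not measured benchmarks.

```python
# Back-of-the-envelope cost estimates using the assumptions from the list above.
GPU_COST_PER_HOUR = 0.50   # assumed cloud GPU price, $/hour
TRAIN_HOURS = 72           # hours to reach "fair" results on a single GPU
CONVERT_MULTIPLIER = 18    # ~5-20x the video duration; ~18x observed for 1080p

training_cost = GPU_COST_PER_HOUR * TRAIN_HOURS
print(f"Training one A<->B model: ~${training_cost:.2f}")  # ~$36

video_minutes = 1
convert_minutes = video_minutes * CONVERT_MULTIPLIER
convert_cost = (convert_minutes / 60) * GPU_COST_PER_HOUR
print(f"Converting a {video_minutes}-minute video: ~{convert_minutes} min, ~${convert_cost:.2f} of GPU time")
```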

These are tractable problems to be sure: tools can be built to collect images from online channels en masse; algorithms can help flag when there is insufficient or mismatched training data; clever optimizations or model reuse can help reduce training time; and a well-engineered system can make the entire process automatic.

But ultimately, the question is: why? Is there enough of a business model to make doing all this worth it?

 

So what potential applications are there?

 
Given what we’ve now learned about what’s possible, let’s talk about ways in which this could be useful:

Video Content Production

Hollywood has had this technology at its fingertips, but not at this low cost. If they can create great looking videos with this technique, it will change the demand for skilled editors over time.

But it could also open up new opportunities: for instance, making movies with unknown actors, and then superimposing famous celebrities onto them. This could work for YouTube videos or even news channels filmed by regular folks.

In more out-there scenarios, studios could change actors based on their target market (more Schwarzenegger for the Austrians), or Netflix could allow viewers to pick actors before hitting play. More likely, this tech could generate revenue for the estates of long-dead actors by bringing them back to life.

Deepfakes is the tip of the iceberg where methods of video generation are concerned. The video above demonstrates an approach for constructing a controllable 3D model of a person from a photo collection (paper, 2015). The lead author of that paper has other exciting work worth looking at; he is currently at Google Brain.

Social Apps

Some of the comment threads on DeepFakes videos on YouTube are abuzz about what a great meme generator this technology could be. JibJab is a company that has been selling video greeting cards with simple face swapping for years (they are hilarious). But the big opportunity is to create the next big viral hit; after all, photo filters attracted masses of people to Instagram and Snapchat, and face swapping apps have done well before.

Given how fun the results can be, there’s likely room for a hit viral app if you can get the costs low enough to generate these models.


StarGAN is a research paper that demonstrates generating different hair colors, gender, age, and even emotional expression using just an algorithm. I bet an app that lets you generate pouty faces could go viral.

Licensing Celebrity Faces

Imagine if Target could have a celebrity showcase their clothes for a month, just by paying her agent a fee, grabbing some existing headshots, and clicking a button. This would create a new revenue stream for celebrities, social media influencers, or anyone who happens to be in the spotlight at the moment. And it would give businesses another tool to promote brands and drive conversion. It also raises interesting legal questions about ownership of likeness, and business model questions on how to partition and price rights to use them.


Licensing faces is already happening. Looklet is a business that allows apparel companies to: (1) photograph their apparel on a mannequin, (2) pick accompanying clothes, (3) pick a model’s face and pose, and voila, have a glossy image ready for marketing. What’s more: they can restyle at will without models or photographers.

Personalized Advertising

Imagine a world where the ads you see as you surf the web include you, your friends, and your family. While this may come across as creepy today, is it so far-fetched to think that this could be the norm in a few years?

After all, we are visual creatures, and advertisers have been trying to elicit emotional responses from us for years, e.g. Coke may want to convey joy by putting your friends in a hip music video, or Allstate may tug at your fears by showing your family in an insurance ad. Or the approach may be more direct: Banana Republic could superimpose your face on a body type that matches yours, and convince you that it’s worth trying out their new leather jackets.

 

Conclusion

 
Whoever the original Deepfakes user is, they opened a Pandora’s box of difficult questions about how fake video generation will affect society. I hope that in the same way we have come to accept that images can easily be faked, we will adapt to video uncertainty too, though not everyone shares this hope.

What Deepfakes also did is shine a light on how interesting this technology is. Deep generative models like the autoencoder that Deepfakes uses allow us to create synthetic but realistic-looking data (including images and videos), simply by showing an algorithm lots of examples. This means that once these algorithms are turned into products, regular folks will have access to powerful tools that will make them more creative, hopefully towards positive ends.

There have already been some interesting applications of this technique, like style transfer apps that make your photos look like famous paintings, but given the high volume and exciting nature of the research that is being published in this space, there’s clearly a lot more to come.

I’m interested in exploring how to build value from the latest in AI research; if you have an interest in taking this technology to market to solve a real problem, please drop me a note.


Examples of results from the famous paper CycleGAN (ICCV 2017), which show an algorithm learning to turn zebras into horses, winter to summer, and regular photos into famous impressionist paintings.

 

Appendix

 
A few fun tidbits for the curious:

  • FakeApp is a free desktop app that lets you use the Deepfakes technology, provided that you have a GPU and are running Windows. But be warned, downloading and running anything potentially shady on your computer is dangerous, and there has already been much angst about FakeApp including a crypto-miner feature in the app (more here).
  • Derp Fakes is a YouTube channel with lots of funny SFW videos. I especially like this one; it stars Nicolas Cage as everyone in a scene from Lord of the Rings, and is both technically interesting and hilarious.
  • The original Deepfakes author is unknown (what if it’s Satoshi?… just kidding). It’s also believed that they are not the person behind FakeApp. While the author provided the original code, they left several things unclear, such as: what license the code is shared under, and what academic paper, if any, inspired their neural network architecture.
  • Deepfakes development and research continues through a community of contributors. Visit the deepfakes/faceswap Github repo to find the latest code. Also check out shaoanlu/faceswap-GAN to see a user researching other models (like a GAN), masking techniques (how the generated face is inserted into an image), and even a one-model-to-swap-them-all approach (so you don’t have to create models for each pair of people).
  • If you want to read more about this phenomenon, check out the original Motherboard article that brought this into the public eye; this post where a user of FakeApp swaps his spouse into videos (SFW); and this timeline with a compilation of other good reads. Also, just hours before I published this post, the NYTimes scooped my story! Read it here.
  • If you want to see other related research, check out this video where a user controls another person’s face in real time; this one where they generate a realistic video of Obama; and also see Lyrebird, a product that allows you to generate realistic audio of another person, given enough samples of their voice to train on.

 
Bio: Gaurav Oberoi (@goberoi) has been a founder, product manager, and engineer for over a decade in Seattle and Silicon Valley. He is currently exploring new ideas at the Allen Institute for AI.

Original. Reposted with permission.
