2021: A Year Full of Amazing AI papers — A Review

A curated list of the latest breakthroughs in AI by release date with a clear video explanation, link to a more in-depth article, and code.

comments

By Louis Bouchard, Focused on making AI accessible. "What's AI" on YouTube and Medium.

While the world is still recovering, research hasn't slowed its frenetic pace, especially in the field of artificial intelligence. What’s more, many important aspects were highlighted this year, like the ethical aspects, important biases, governance, transparency, and much more. Artificial intelligence and our understanding of the human brain and its link to AI are constantly evolving, showing promising applications improving our life's quality in the near future. Still, we ought to be careful with which technology we choose to apply.

"Science cannot tell us what we ought to do, only what we can do." - Jean-Paul Sartre, Being and Nothingness

Here are the most interesting research papers of the year, in case you missed any of them. In short, it is a curated list of the latest breakthroughs in AI and Data Science by release date with a clear video explanation, link to a more in-depth article, and code (if applicable). Enjoy the read!

DALL·E: Zero-Shot Text-to-Image Generation from OpenAI [1]

OpenAI successfully trained a network able to generate images from text captions. It is very similar to GPT-3 and Image GPT and produces amazing results.

Short Video Explanation:

Short read: OpenAI’s DALL·E: Text-to-Image Generation Explained
Paper: Zero-Shot Text-to-Image Generation
Code: Code & more information for the discrete VAE used for DALL·E

VOGUE: Try-On by StyleGAN Interpolation Optimization [2]

Google used a modified StyleGAN2 architecture to create an online fitting room where you can automatically try on any pants or shirts you want using only an image of yourself.

Short Video Explanation:

Short read: The AI-Powered Online Fitting Room: VOGUE
Paper: VOGUE: Try-On by StyleGAN Interpolation Optimization

Taming Transformers for High-Resolution Image Synthesis [3]

Tl;DR: They combined the efficiency of GANs and convolutional approaches with the expressivity of transformers to produce a powerful and time-efficient method for semantically-guided high-quality image synthesis.

Short Video Explanation:

Thinking Fast And Slow in AI [4]

Drawing inspiration from Human Capabilities Towards a more general and trustworthy AI & 10 Questions for the AI Research Community.

Short Video Explanation:

Short read: Third Wave of AI | Thinking Fast and Slow
Paper: Thinking Fast And Slow in AI

Automatic detection and quantification of floating marine macro-litter in aerial images [5]

Odei Garcia-Garin et al. from the University of Barcelona have developed a deep learning-based algorithm able to detect and quantify floating garbage from aerial images. They also made a web-oriented application allowing users to identify these garbages, called floating marine macro-litter, or FMML, within images of the sea surface.

Short Video Explanation:

ShaRF: Shape-conditioned Radiance Fields from a Single View [6]

Just imagine how cool it would be to just take a picture of an object and have it in 3D to insert in the movie or video game you are creating or in a 3D scene for an illustration.

Short Video Explanation:

Generative Adversarial Transformers [7]

They basically leverage transformers’ attention mechanism in the powerful StyleGAN2 architecture to make it even more powerful!

Short Video Explanation:

Short read: GANsformers: Scene Generation with Generative Adversarial Transformers
Paper: Generative Adversarial Transformers
Click here for the code

We Asked Artificial Intelligence to Create Dating Profiles. Would You Swipe Right? [8]

Would you swipe right on an AI profile? Can you distinguish an actual human from a machine? This is what this study reveals using AI-made-up people on dating apps.

Short Video Explanation:

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows [9]

Will Transformers Replace CNNs in Computer Vision? In less than 5 minutes, you will know how the transformer architecture can be applied to computer vision with a new paper called the Swin Transformer.

Short Video Explanation:

Image GANs Meet Differentiable Rendering for Inverse Graphics and Interpretable 3D Neural Rendering [10]

This promising model called GANverse3D only needs an image to create a 3D figure that can be customized and animated!

Short Video Explanation:

Deep nets: What have they ever done for vision? [11]

"I will openly share everything about deep nets for vision applications, their successes, and the limitations we have to address."

Short Video Explanation:

Short read: What is the state of AI in computer vision?
Paper: Deep nets: What have they ever done for vision?

Infinite Nature: Perpetual View Generation of Natural Scenes from a Single Image [12]

The next step for view synthesis: Perpetual View Generation, where the goal is to take an image to fly into it and explore the landscape!

Short Video Explanation:

Portable, Self-Contained Neuroprosthetic Hand with Deep Learning-Based Finger Control [13]

With this AI-powered nerve interface, the amputee can control a neuroprosthetic hand with life-like dexterity and intuitiveness.

Short Video Explanation:

Short read: amputee with an AI-Powered Hand!
Paper: Portable, Self-Contained Neuroprosthetic Hand with Deep Learning-Based Finger Control

Total Relighting: Learning to Relight Portraits for Background Replacement [14]

Properly relight any portrait based on the lighting of the new background you add. Have you ever wanted to change the background of a picture but have it look realistic? If you’ve already tried that, you already know that it isn’t simple. You can’t just take a picture of yourself in your home and change the background for a beach. It just looks bad and not realistic. Anyone will just say “that’s photoshopped” in a second. For movies and professional videos, you need the perfect lighting and artists to reproduce a high-quality image, and that’s super expensive. There’s no way you can do that with your own pictures. Or can you?

Short Video Explanation:

Short read: Realistic Lighting on Different Backgrounds
Paper: Total Relighting: Learning to Relight Portraits for Background Replacement

LASR: Learning Articulated Shape Reconstruction from a Monocular Video [15]

Generate 3D models of humans or animals moving from only a short video as input. This is a new method for generating 3D models of humans or animals moving from only a short video as input. Indeed, it actually understands that this is an odd shape, that it can move, but still needs to stay attached as this is still one "object" and not just many objects together...

Short Video Explanation:

Enhancing Photorealism Enhancement [16]

This AI can be applied live to the video game and transform every frame to look much more natural. The researchers from Intel Labs just published this paper called Enhancing Photorealism Enhancement. And if you think that this may be "just another GAN," taking a picture of the video game as an input and changing it following the style of the natural world, let me change your mind. They worked on this model for two years to make it extremely robust. It can be applied live to the video game and transform every frame to look much more natural. Just imagine the possibilities where you can put a lot less effort into the game graphic, make it super stable and complete, then improve the style using this model.

Short Video Explanation:

Short read: Is AI The Future Of Video Game Design? Enhancing Photorealism Enhancement
Paper: Enhancing Photorealism Enhancement
Click here for the code

DefakeHop: A Light-Weight High-Performance Deepfake Detector [17]

How to Spot a Deep Fake in 2021. Breakthrough US Army technology using artificial intelligence to find deepfakes.

While they seem like they’ve always been there, the very first realistic deepfake didn’t appear until 2017. It went from the first-ever resembling fake images automatically generated to today’s identical copy of someone on videos, with sound.

The reality is that we cannot see the difference between a real video or picture and a deepfake anymore. How can we tell what’s real from what isn’t? How can audio files or video files be used in court as proof if an AI can entirely generate them? Well, this new paper may provide answers to these questions. And the answer here may again be the use of artificial intelligence. The saying “I’ll believe it when I’ll see it” may soon change for “I’ll believe it when the AI tells me to believe it.”

Short Video Explanation:

High-Resolution Photorealistic Image Translation in Real-Time: A Laplacian Pyramid Translation Network [18]

Apply any style to your 4K image in real-time using this new machine learning-based approach!

Short Video Explanation:

Barbershop: GAN-based Image Compositing using Segmentation Masks [19]

This article is not about new technology in itself. Instead, it is about a new and exciting application of GANs. Indeed, you saw the title, and it wasn’t clickbait. This AI can transfer your hair to see how it would look like before committing to the change.

Short Video Explanation:

TextStyleBrush: Transfer of text aesthetics from a single example [20]

This new Facebook AI model can translate or edit text directly in the image in your own language, following the same style!

Imagine you are on vacation in another country where you do not speak the language. You want to try out a local restaurant, but their menu is in the language you don’t speak. I think this won’t be too hard to imagine as most of us have already faced this situation whether you see menu items or directions and you can’t understand what’s written. Well, in 2020, you would take out your phone and google translate what you see. In 2021 you don’t even need to open google translate anymore and try to write what you see one by one to translate it. Instead, you can simply use this new model by Facebook AI to translate every text in the image in your own language.

Short Video Explanation:

If you’d like to read more research papers as well, I recommend you read my article, where I share my best tips for finding and reading more research papers.

Animating Pictures with Eulerian Motion Fields [21]

This model takes a picture, understands which particles are supposed to be moving, and realistically animates them in an infinite loop while conserving the rest of the picture entirely, still creating amazing-looking videos like this one.

Short Video Explanation:

Short read: Create Realistic Animated Looping Videos from Pictures
Paper: Animating Pictures with Eulerian Motion Fields
Click here for the code

CVPR 2021 Best Paper Award: GIRAFFE - Controllable Image Generation [22]

Using a modified GAN architecture, they can move objects in the image without affecting the background or the other objects!

Short Video Explanation:

GitHub Copilot & Codex: Evaluating Large Language Models Trained on Code [23]

Find out how this new model from OpenAI Generates Code From Words!

Short Video Explanation:

Short read: OpenAI's New Code Generator: GitHub Copilot (and Codex)
Paper: Evaluating Large Language Models Trained on Code
Click here for the code

Apple: Recognizing People in Photos Through Private On-Device Machine Learning [24]

Using multiple machine learning-based algorithms running privately on your device, Apple allows you to accurately curate and organize your images and videos on iOS 15.

Short Video Explanation:

Image Synthesis and Editing with Stochastic Differential Equations [25]

Say goodbye to complex GAN and transformer architectures for image generation! This new method by Chenling Meng et al. from Stanford University and Carnegie Mellon University can generate new images from any user-based inputs. Even people like me with zero artistic skills can now generate beautiful images or modifications out of quick sketches.

Short Video Explanation:

Sketch Your Own GAN [26]

Make GANs training easier for everyone by generating Images following a sketch! Indeed, whit this new method, you can control your GAN’s outputs based on the simplest type of knowledge you could provide: hand-drawn sketches.

Short Video Explanation:

Short read: Make GANs Training Easier for Everyone : Generate Images Following a Sketch
Paper: Sketch Your Own GAN
Click here for the code

Tesla's Autopilot Explained [27]

If you wonder how a Tesla car can not only see but navigate the roads with other vehicles, this is the video you were waiting for. A couple of days ago was the first Tesla AI day where Andrej Karpathy, the Director of AI at Tesla, and others presented how Tesla’s autopilot works from the image acquisition through their eight cameras to the navigation process on the roads.

Short Video Explanation:

Short read: Tesla's Autopilot Explained

Styleclip: Text-driven manipulation of StyleGAN imagery [28]

AI could generate images, and then, using a lot of brainpower and trial and error, researchers could control the results following specific styles. Now, with this new model, you can do that using only text!

Short Video Explanation:

TimeLens: Event-based Video Frame Interpolation [29]

TimeLens can understand the movement of the particles in-between the frames of a video to reconstruct what really happened at a speed even our eyes cannot see. In fact, it achieves results that our intelligent phones and no other models could reach before!

Short Video Explanation:

Short read: How to Make Slow Motion Videos With AI!
Paper: TimeLens: Event-based Video Frame Interpolation
Click here for the code

Diverse Generation from a Single Video Made Possible [30]

Have you ever wanted to edit a video?

Remove or add someone, change the background, make it last a bit longer, or change the resolution to fit a specific aspect ratio without compressing or stretching it. For those of you who already ran advertisement campaigns, you certainly wanted to have variations of your videos for AB testing and see what works best. Well, this new research by Niv Haim et al. can help you do all of these out of a single video and in HD!

Indeed, using a simple video, you can perform any tasks I just mentioned in seconds or a few minutes for high-quality videos. You can basically use it for any video manipulation or video generation application you have in mind. It even outperforms GANs in all ways and doesn’t use any deep learning fancy research nor requires a huge and impractical dataset! And the best thing is that this technique is scalable to high-resolution videos.

Short Video Explanation:

Skillful Precipitation Nowcasting using Deep Generative Models of Radar [31]

DeepMind just released a Generative model able to outperform widely-used nowcasting methods in 89% of situations for its accuracy and usefulness assessed by more than 50 expert meteorologists! Their model focuses on predicting precipitations in the next 2 hours and achieves that surprisingly well. It is a generative model, which means that it will generate the forecasts instead of simply predicting them. It basically takes radar data from the past to create future radar data. So using both time and spatial components from the past, they can generate what it will look like in the near future.

You can see this as the same as Snapchat filters, taking your face and generating a new face with modifications on it. To train such a generative model, you need a bunch of data from both the human faces and the kind of face you want to generate. Then, using a very similar model trained for many hours, you will have a powerful generative model. This kind of model often uses GANs architectures for training purposes and then uses the generator model independently.

Short Video Explanation:

The Cocktail Fork Problem: Three-Stem Audio Separation for Real-World Soundtracks [32]

Have you ever tuned in to a video or a TV show, and the actors were completely inaudible, or the music was way too loud? Well, this problem, also called the cocktail party problem, may never happen again. Mitsubishi and Indiana University just published a new model as well as a new dataset tackling this task of identifying the right soundtrack. For example, if we take the same audio clip we just ran with the music way too loud, you can simply turn up or down the audio track you want to give more importance to the speech than the music.

The problem here is isolating any independent sound source from a complex acoustic scene like a movie scene or a youtube video where some sounds are not well balanced. Sometimes you simply cannot hear some actors because of the music playing or explosions or other ambient sounds in the background. Well, if you successfully isolate the different categories in a soundtrack, it means that you can also turn up or down only one of them, like turning down the music a bit to hear all the other actors correctly. This is exactly what the researchers achieved.

Short Video Explanation:

ADOP: Approximate Differentiable One-Pixel Point Rendering [33]

Imagine you want to generate a 3D model or simply a fluid video out of a bunch of pictures you took. Well, it is now possible! I don't want to give out too much, but the results are simply amazing, and you need to check it out by yourself!

Short Video Explanation:

(Style)CLIPDraw: Coupling Content and Style in Text-to-Drawing Synthesis [34]

Have you ever dreamed of taking the style of a picture, like this cool TikTok drawing style on the left and applying it to a new picture of your choice? Well, I did, and it has never been easier to do. In fact, you can even achieve that from only text and can try it right now with this new method and their Google Colab notebook available for everyone (see references). Simply take a picture of the style you want to copy, enter the text you want to generate, and this algorithm will generate a new picture out of it! Just look back at the results above--such a big step forward! The results are extremely impressive, especially if you consider that they were made from a single line of text!

Short Video Explanation:

Short read: Text-to-Drawing Synthesis With Artistic Control | CLIPDraw & StyleCLIPDraw
Paper (CLIPDraw): CLIPDraw: exploring text-to-drawing synthesis through language-image encoders
Paper (StyleCLIPDraw): StyleCLIPDraw: Coupling Content and Style in Text-to-Drawing Synthesis
CLIPDraw Colab demo
StyleCLIPDraw Colab demo

SwinIR: Image restoration using swin transformer [35]

Have you ever had an image you really liked and could only manage to find a small version of it that looked like this image below on the left? How cool would it be if you could take this image and make it twice look as good? It’s great, but what if you could make it even four or eight times more high definition? Now we’re talking, just look at that.

Here we enhanced the resolution of the image by a factor of four, meaning that we have four times more height and width pixels for more details, making it look a lot smoother. The best thing is that this is done within a few seconds, completely automatically, and works with pretty much any image. Oh, and you can even use it yourself with a demo they made available.

Short Video Explanation:

Short read: SwinIR: Image restoration using swin transformer
Paper: SwinIR: Image restoration using swin transformer
Click here for the code
Demo

Paper references

[1] A. Ramesh et al., Zero-shot text-to-image generation, 2021. arXiv:2102.12092

[2] Lewis, Kathleen M, et al., (2021), VOGUE: Try-On by StyleGAN Interpolation Optimization.

[3] Taming Transformers for High-Resolution Image Synthesis, Esser, et al., 2020.

[4] Thinking Fast And Slow in AI, Booch, et al., (2020), https://arxiv.org/abs/2010.06002.

[5] Odei Garcia-Garin et al., Automatic detection and quantification of floating marine macro-litter in aerial images: Introducing a novel deep learning approach connected to a web application in R, Environmental Pollution, https://doi.org/10.1016/j.envpol.2021.116490.

[6] Rematas, K., Martin-Brualla, R., and Ferrari, V., “ShaRF: Shape-conditioned Radiance Fields from a Single View”, (2021), https://arxiv.org/abs/2102.08860

[7] Drew A. Hudson and C. Lawrence Zitnick, Generative Adversarial Transformers, (2021)

[8] Sandra Bryant et al., “We Asked Artificial Intelligence to Create Dating Profiles. Would You Swipe Right?”, (2021), UNSW Sydney blog.

[9] Liu, Z. et al., 2021, “Swin Transformer: Hierarchical Vision Transformer using Shifted Windows”, arXiv preprint https://arxiv.org/abs/2103.14030v1

[10] Zhang, Y., Chen, W., Ling, H., Gao, J., Zhang, Y., Torralba, A. and Fidler, S., 2020. Image gans meet differentiable rendering for inverse graphics and interpretable 3d neural rendering. arXiv preprint arXiv:2010.09125.

[11] Yuille, A.L., and Liu, C., 2021. Deep nets: What have they ever done for vision?. International Journal of Computer Vision, 129(3), pp.781–802, https://arxiv.org/abs/1805.04025.

[12] Liu, A., Tucker, R., Jampani, V., Makadia, A., Snavely, N. and Kanazawa, A., 2020. Infinite Nature: Perpetual View Generation of Natural Scenes from a Single Image, https://arxiv.org/pdf/2012.09855.pdf

[13] Nguyen & Drealan et al. (2021) A Portable, Self-Contained Neuroprosthetic Hand with Deep Learning-Based Finger Control: https://arxiv.org/abs/2103.13452

[14] Pandey et al., 2021, Total Relighting: Learning to Relight Portraits for Background Replacement, doi: 10.1145/3450626.3459872, https://augmentedperception.github.io/total_relighting/total_relighting_paper.pdf.

[15] Gengshan Yang et al., (2021), LASR: Learning Articulated Shape Reconstruction from a Monocular Video, CVPR, https://lasr-google.github.io/.

[16] Richter, Abu AlHaija, Koltun, (2021), "Enhancing Photorealism Enhancement", https://intel-isl.github.io/PhotorealismEnhancement/.

[17] DeepFakeHop: Chen, Hong-Shuo, et al., (2021), “DefakeHop: A Light-Weight High-Performance Deepfake Detector.” ArXiv abs/2103.06929.

[18] Liang, Jie and Zeng, Hui and Zhang, Lei, (2021), "High-Resolution Photorealistic Image Translation in Real-Time: A Laplacian Pyramid Translation Network", https://export.arxiv.org/pdf/2105.09188.pdf.

[19] Peihao Zhu et al., (2021), Barbershop, https://arxiv.org/pdf/2106.01505.pdf.

[20] Praveen Krishnan, Rama Kovvuri, Guan Pang, Boris Vassilev, and Tal Hassner, Facebook AI, (2021), ”TextStyleBrush: Transfer of text aesthetics from a single example”.

[21] Holynski, Aleksander, et al. “Animating Pictures with Eulerian Motion Fields.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.

[22] Michael Niemeyer and Andreas Geiger, (2021), "GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields", Published in CVPR 2021.

[23] Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H.P.D.O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G. and Ray, A., 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.

[24] Apple, “Recognizing People in Photos Through Private On-Device Machine Learning”, (2021), https://machinelearning.apple.com/research/recognizing-people-photos

[25] Meng, C., Song, Y., Song, J., Wu, J., Zhu, J.Y. and Ermon, S., 2021. Sdedit: Image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073.

[26] Wang, S.Y., Bau, D. and Zhu, J.Y., 2021. Sketch Your Own GAN. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 14050-14060).

[27] “Tesla AI Day”, Tesla, August 19th, 2021, https://youtu.be/j0z4FweCy4M

[28] Patashnik, Or, et al., (2021), “Styleclip: Text-driven manipulation of StyleGAN imagery.”, https://arxiv.org/abs/2103.17249

[29] Stepan Tulyakov*, Daniel Gehrig*, Stamatios Georgoulis, Julius Erbach, Mathias Gehrig, Yuanyou Li, Davide Scaramuzza, TimeLens: Event-based Video Frame Interpolation, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, 2021, http://rpg.ifi.uzh.ch/docs/CVPR21_Gehrig.pdf

[30] Haim, N., Feinstein, B., Granot, N., Shocher, A., Bagon, S., Dekel, T., & Irani, M. (2021). Diverse Generation from a Single Video Made Possible, https://arxiv.org/abs/2109.08591.

[31] Ravuri, S., Lenc, K., Willson, M., Kangin, D., Lam, R., Mirowski, P., Fitzsimons, M., Athanassiadou, M., Kashem, S., Madge, S. and Prudden, R., 2021. Skillful Precipitation Nowcasting using Deep Generative Models of Radar, https://www.nature.com/articles/s41586-021-03854-z

[32] Petermann, D., Wichern, G., Wang, Z., & Roux, J.L. (2021). The Cocktail Fork Problem: Three-Stem Audio Separation for Real-World Soundtracks. https://arxiv.org/pdf/2110.09958.pdf.

[33] Rückert, D., Franke, L. and Stamminger, M., 2021. ADOP: Approximate Differentiable One-Pixel Point Rendering, https://arxiv.org/pdf/2110.06635.pdf.

[34] a) CLIPDraw: exploring text-to-drawing synthesis through language-image encoders
b) StyleCLIPDraw: Schaldenbrand, P., Liu, Z. and Oh, J., 2021. StyleCLIPDraw: Coupling Content and Style in Text-to-Drawing Synthesis.

[35] Liang, J., Cao, J., Sun, G., Zhang, K., Van Gool, L. and Timofte, R., 2021. SwinIR: Image restoration using swin transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 1833–1844).

Original. Reposted with permission.

Related: