Key Trends and Takeaways from RE•WORK Deep Learning Summit Montreal – Part 1: Computer Vision

Read up on what you missed from the RE•WORK Deep Learning Summit Montreal, held October 10 & 11, including talks from Aaron Courville, Ira Kemelmacher-Shlizerman, Roland Memisevic, and Raquel Urtasun.

Last week I was fortunate enough to attend the RE•WORK Deep Learning Summit Montreal (October 10 & 11), where I was able to take in a number of quality talks and meet with other attendees. The conference was split into 2 tracks -- Research Advancements and Business Applications -- and featured a wide array of top neural network researchers and academics, as well as business leaders. With its interesting mix of industry and academia, RE•WORK did more than enough to prove its professionalism and attention to detail, and that is without mentioning the calibre of speakers it secured for the event.

Prevailing themes and concepts of the Research Advancements track, in which I was enrolled, included:

  • Computer Vision - The track started off with a hefty dose of one of the domains currently most associated with neural networks, with related talks sprinkled throughout the rest of the schedule as well.
  • Convolutional Neural Networks - This could be considered a subset of the above point, but CNNs are given separate, top-level treatment at the event (plus, not all CNN talks have vision- or image-related applications).
  • Neural Networks Models, Architecture & Frameworks - A block of disparate but loosely related talks.
  • Speech Recognition - Alongside computer vision and image recognition, the other most widely recognized application domain for neural networks had its own block of talks.
  • The Pioneers - Lastly, the event (rightfully) placed a high emphasis on the presence of deep learning demigods Yoshua Bengio, Yann LeCun, and Geoff Hinton. This is the first event at which all 3 have appeared as speakers, as well as collectively taken part in a panel discussion. This is a real feather in RE•WORK's cap, especially pulling it off in Montreal, the emerging epicenter of deep learning and AI research and the trio's stomping ground (more broadly speaking, Canada).

What follows is a summary of some of my favorite talks from the conference, with this selection revolving around the visual reasoning & computer vision blocks which started the conference off. A full listing of the speakers and schedule can be found here.

Aaron Courville, of the University of Montreal, kicked off the Research Advancements track of the conference with his talk, Visual Reasoning via Feature-wise Linear Modulation. He started with a quick summary of the rapid progress in computer vision over the past few years, from CNNs to the availability of large datasets and an increase in computational capacity (notably GPUs). He was also deliberate in pointing out, however, that innovative models and learning algorithms did not play second fiddle to these advances.

Aaron then introduced his recent work on visual question answering and visual dialog, namely 'Guess What?' (CVPR 2017), a 2-player game in which questions are posed to a system in order to identify the object in an image that the system has selected as the "what" to guess. The game is composed of 2 asymmetric agents: an oracle, which is essentially a supervised learning task, and a questioner, a reinforcement learning task. The questioner poses questions to the oracle, to which the oracle can answer yes or no. Clearly, the oracle must be trained to have some understanding of what is taking place in a given image.


Previous work on conditioned reasoning includes work utilizing the CLEVR dataset at FAIR, as well as Santoro et al. (2017). On the CLEVR dataset, the Guess What? system achieves SOTA accuracy with no program data and without a strongly task-oriented model design. These results are based on the current research for the system, which employs Feature-wise Linear Modulation (FiLM)-based visual reasoning (see the above-mentioned paper by Courville et al.).
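At its core, FiLM is a simple idea: a conditioning network (fed, say, the question text) predicts a per-channel scale γ and shift β, which then modulate the CNN's feature maps. A minimal NumPy sketch of just that modulation step (the function and variable names are mine, not from the paper):

```python
import numpy as np

def film(features, gamma, beta):
    """Feature-wise Linear Modulation: scale and shift each channel of a
    (channels, height, width) feature map by conditioning-dependent
    parameters gamma and beta (one value per channel)."""
    # Broadcast the per-channel parameters across the spatial dimensions.
    return gamma[:, None, None] * features + beta[:, None, None]

# Toy example: 2 channels of 3x3 features.
features = np.ones((2, 3, 3))
gamma = np.array([2.0, 0.5])   # in practice, predicted from the question
beta = np.array([1.0, -1.0])   # likewise

out = film(features, gamma, beta)
# channel 0 becomes 2*1 + 1 = 3.0, channel 1 becomes 0.5*1 - 1 = -0.5
```

In the actual architecture these γ and β values come from a learned network conditioned on language, which is what lets the same visual features be reused across very different questions.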

It is worth noting that the student spearheading this project is an undergraduate, as per Courville.

Courville closed by noting that visual question answering, visual dialog, and visual reasoning all offer a way to explore the detailed semantics of the scenes laid out in images, and responsibly qualified this by stating that the devil is in the details: proper architecture has a huge impact on performance, an observation which generalizes to almost all types of deep learning tasks. He also singled out FiLM and its conditional scaling and shifting as invaluable tools for integrating language into these types of visual tasks.

Next up was Raquel Urtasun, of both Uber ATG and the University of Toronto, who treated us to Deep Learning for Self-Driving Cars. Raquel started off by securing some applause with her answer to the question of why she is not in Silicon Valley: because Canada! She then led us through some of the standard self-driving car fare, but did so in an engaging manner, which is formidable given how well-trodden a go-to deep learning presentation topic self-driving cars have become.

Urtasun began by covering what could reasonably be considered the autonomous vehicle basics: pixel segmentation (boundaries are important!), bounding boxes, and the importance of tracking vehicles over time. She then moved on to building and maintaining maps, which was genuinely insightful.

Raquel noted that the traditional method of map-building -- slow, repetitive, on-the-ground road traversal -- is expensive. Instead, let's do it from the air: planes, drones, satellites, and LIDAR can all be employed to scan and record topography from afar, at scale. Map-building, she noted, is also a collaborative process, and can (should?) be crafted from a combination of sources, including aerial imagery, library maps, ground surveys, cars, panoramas, and more.

Urtasun stressed the point that autonomous driving systems should be informed by a combination of sensors -- GPS, cameras, etc. -- but that we must be certain that the job can be adequately performed even if any single sensor fails. Without GPS availability while driving amongst the skyscrapers of downtown Toronto, for example, we should still be confident that autonomous driving is to be trusted based on the data and sensors at its disposal at that time.

Ira Kemelmacher-Shlizerman, of the Allen School of Computer Science and Facebook, then presented her talk, Learning Lip Sync from Audio.


In an interesting presentation, Ira discussed what the impact of such a technology might be, including people modeling in the form of both telepresence and higher quality video transmission (video up-scaling). She also discussed the difficulties of going from a one-dimensional signal (audio) to three-dimensional motion video, and the added layer of ideally capturing facial expressions and matching body language, etc.

The work of Kemelmacher-Shlizerman and her colleagues followed a 4-stage algorithm:

  1. Audio signal to lip outlines, making use of a recurrent neural network
  2. Map the lip outlines to images
  3. Sharpen the image, including the lip lines
  4. Post-processing: the "final composition," which blends this work with actual video
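The four stages above can be sketched as a simple pipeline. Note that every function below is a hypothetical stand-in of my own (with trivial stub implementations), intended only to show how the stages chain together, not the authors' actual code:

```python
import numpy as np

def audio_to_lip_outlines(audio_frames):
    """Stage 1: an RNN maps per-frame audio features to sparse lip
    landmarks. Stubbed here as a fixed random linear projection to
    18 (x, y) landmark coordinates per frame."""
    rng = np.random.default_rng(0)
    projection = rng.standard_normal((audio_frames.shape[1], 18 * 2))
    return audio_frames @ projection          # (frames, 36)

def lip_outlines_to_images(outlines):
    """Stage 2: synthesize a mouth texture from each landmark set.
    Stubbed as a blank 64x64 image per frame."""
    return np.zeros((outlines.shape[0], 64, 64))

def sharpen(images):
    """Stage 3: enhance detail in the synthesized mouth. Identity stub."""
    return images

def composite(mouth_images, target_video):
    """Stage 4: blend the synthesized mouth region into the target
    video frames (the blending itself is omitted in this sketch)."""
    return target_video

# 100 frames of 13-dimensional audio features (e.g. MFCCs).
audio = np.random.default_rng(1).standard_normal((100, 13))
video = np.zeros((100, 256, 256))
result = composite(sharpen(lip_outlines_to_images(audio_to_lip_outlines(audio))), video)
```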

Training required a large amount of video data in the public domain, which is why Barack Obama was settled upon as a subject. Crucial for training was the ability to peek into the future, in order to differentiate between, for example, an audio snippet which includes "teapot" vs. "tea house."
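One common way to implement that peek into the future is to pair each frame with a window that includes the next several audio frames, effectively delaying the model's output until it has heard more of the word. A minimal sketch of that windowing (the lookahead size is purely illustrative):

```python
import numpy as np

def with_lookahead(audio_frames, lookahead=5):
    """Pair the audio frame at time t with the next `lookahead` frames,
    so a prediction for t can hear the rest of the word ("teapot" vs.
    "tea house") before committing. The end of the sequence is
    zero-padded so every frame gets a full window."""
    frames, dims = audio_frames.shape
    padded = np.vstack([audio_frames, np.zeros((lookahead, dims))])
    # Row t is frame t concatenated with its `lookahead` future frames.
    return np.stack([padded[t:t + lookahead + 1].ravel() for t in range(frames)])

x = np.ones((10, 13))          # 10 frames of 13-dim features
windows = with_lookahead(x)    # (10, 6 * 13) = (10, 78)
```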

Step 4, post-processing, included locating the most appropriate section of video onto which to map the lip-synced images. Time-warping was one of the tricks her team employed to help match body and head motion as desired. For example, it would not be advantageous to have a visibly happy person discussing some sad event.

Kemelmacher-Shlizerman points out that the results are not perfect -- color bleeding, double chins, lack of emotion modeling -- but it is impressive work on which iterative improvements will undoubtedly be quickly built.

Roland Memisevic of Twenty Billion Neurons gave a talk titled Common Sense Video Understanding at TwentyBN. Roland presented great research, but, like Aaron Courville, first took a few minutes to pay tribute to deep learning innovation itself.

Roland stressed that deep learning is not a linear progression, and that successive deep learning breakthroughs are greater than the sum of their parts, while recognizing that these advances are driven by the massive amounts of data we now have at our fingertips as well as by available computational power.

He then turned to the research being done at Twenty Billion Neurons: their progress in video understanding, and the need it addresses for developing inference capabilities in AI in general. Their goal at Twenty Billion Neurons is nothing less than the admirable (and lofty) one of "enabling video understanding."

Difficulties they face include the fact that video is more structured than objects, as it layers sequences and interactions on top of what are relatively simple object descriptions. Training is also difficult: it is not easy to locate the particular video examples you are looking for at any given time, and creating video data takes enormous amounts of planning before execution, and refinement afterward. However you slice it, the video data creation/curation problem is very real. Contrastive classes -- picking up vs. putting down, standing up vs. sitting down, etc. -- make data collection and creation more difficult still. Yet while these contrastive classes make learning harder, they also strengthen the resulting neural networks.

This is but a sampling of some of the interesting talks from RE•WORK Deep Learning Summit Montreal. I hope to share some more takeaways later in the week.