Age of AI Conference 2018 – Day 1 Highlights

These are some of the highlights from Day 1 of the Age of AI Conference, held on January 31 and February 1, 2018, at the Regency Ballroom in San Francisco.

The Conference grew out of the San Francisco Artificial Intelligence meetup that Emil Mikhailov started as a place for interested people to learn, network and share.  The community now boasts 4,700+ members and has previously hosted heavyweights like Andrew Ng and Nvidia CEO Jensen Huang.

The Regency Ballroom boasts a good location and acoustics.  The best part was the technical focus of the Conference, well punctuated with some ‘global minima’ but thought-provoking touches.  I will strive to do some justice to the rich technical content.

Here are the highlights of Day 1, Wednesday, January 31.


Balaji Lakshminarayanan, Senior Research Scientist, DeepMind

Understanding Generative Adversarial Networks


Key Points:

  • This talk delved into some of the theory behind Generative Adversarial Networks (GANs) and how GANs relate to other ideas in probabilistic machine learning.  Most models in machine learning and statistics are prescribed probabilistic models: they come with a conditional log-likelihood function, e.g. object recognition classifiers.  Implicit probabilistic models merely use a stochastic procedure to generate data.  E.g. if we know broadly how a system works, we can simulate it to generate data for studying ecology, climate and weather patterns.
  • Hypothesis testing is a principle for learning in implicit generative models, and it is carried out by density ratio estimation.  There are four approaches: class probability estimation, density ratio matching, divergence minimization and moment matching.  A summary of these approaches is in the figure below.
  • The team trained a generator by maximum likelihood and by Wasserstein GAN (WGAN), and compared the two using two tools: real NVP to compute exact log-probability densities, and an independent critic to compare approximate Wasserstein distances on the validation set.  They found that Wasserstein distance can be used to compare models, that it can be approximated by training a critic, and that training by WGAN leads to better samples but worse log-probabilities.
  • For learning latent variable models (that is, statistical models that have hidden or unobserved variables), two popular approaches are variational auto-encoders (VAEs) and GANs.  GANs can train on large datasets, are fast to simulate and, when trained on images, generate visually compelling samples.  But they can become unstable in optimization, leading to mode-collapse, where the generated data does not represent the diversity of the underlying data distribution.  VAEs support inference of the latent variables, which is very useful in representation learning and visualization, and do not suffer from mode-collapse, but alas, they generate blurry images.  Hence the rationale for combining the best of the two methods.  The team found that gradient penalties stabilize (non-Wasserstein) GANs as well, and that one needs to think about both the ideal loss function and the optimization.
  • GANs for imitation learning: see this YouTube video (link) that summarizes the work by Josh Merel et al.  Balaji concluded with other areas of exciting research: using ideas from the convergence of Nash equilibria, and connections to Reinforcement Learning (RL) and control theory.
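
To make the class probability estimation approach from the list above concrete, here is a minimal sketch in plain Python.  It is not code from the talk: the two 1-D Gaussians standing in for "real" and "generated" data, the logistic classifier, and the function names are all assumptions chosen for illustration.  The idea is that a classifier D trained to tell the two sample sets apart yields a density ratio estimate D(x) / (1 - D(x)) when both classes are the same size.

```python
import math
import random


def sigmoid(t):
    # Numerically stable logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def fit_classifier(real, fake, steps=300, lr=0.5):
    """Fit D(x) = sigmoid(w*x + b) to separate real (label 1) from fake (label 0)."""
    data = [(x, 1.0) for x in real] + [(x, 0.0) for x in fake]
    w, b = 0.0, 0.0
    n = len(data)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in data:
            err = y - sigmoid(w * x + b)  # gradient of the log-likelihood
            grad_w += err * x
            grad_b += err
        # Gradient ascent on the mean log-likelihood.
        w += lr * grad_w / n
        b += lr * grad_b / n
    return w, b


def density_ratio(x, w, b):
    """Estimate p_real(x) / p_fake(x) as D(x) / (1 - D(x)), assuming equal class sizes."""
    d = sigmoid(w * x + b)
    return d / (1.0 - d)


# Hypothetical toy data: N(1, 1) plays the data, N(-1, 1) plays the generator.
rng = random.Random(0)
real = [rng.gauss(1.0, 1.0) for _ in range(1000)]
fake = [rng.gauss(-1.0, 1.0) for _ in range(1000)]
w, b = fit_classifier(real, fake)
```

For this particular toy pair the true log-ratio is 2x, so the fitted `w` should land near 2 and `b` near 0; with real images neither the ratio nor the classifier would be this simple.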


Learning in Implicit Generative Models

Comparison of Maximum Likelihood and GAN-based training of Real NVPs

Variational Approaches for Auto-Encoding Generative Adversarial Networks

Many Paths to Equilibrium: GANs Do Not Need to Decrease Divergence At Every Step

Learning human behaviors from motion capture by adversarial imitation

Balaji’s website


Tarin Ziyaee, CTO at



Voyage was spun out of Udacity’s self-driving car program and co-founded by Udacity Vice President Oliver Cameron.  According to Crunchbase, InMotion Ventures led the $15M investment in this driverless car startup, which is working at ‘Level 4 automation’.

Key Points:

  • Autonomous driving today is where flight was about 100 years ago.  Progress is going to be incremental.  Voyage focuses on the algorithmic part and has partners such as Carmera (HD maps) providing other expertise.
  • Voyage has deployed in Florida (The Villages) and in California, in private retirement communities, as a door-to-door self-driving taxi service for residents.  Currently, every car has a safety driver on board.
  • It’s a geofenced area with 150,000 residents and 750 miles of road, which keeps the problem bounded.  Yet it has all the complexities one might find elsewhere, including darting deer, waddling ducks, cyclists, the occasional wedding, etc.  Voyage cars can go up to 25-30 miles per hour.
  • Considering the tough competition in the autonomous driving space, Voyage is playing with different monetization models.
  • Three tenets have guided Voyage’s ‘hygienic design principles’:
    1. Do not infer that which you can measure.
    2. Universal approximators are good, but universally approximating, not so good.
    3. Don’t boil the ocean.


Tarin’s Google Scholar:


Augustus Odena, Researcher at Google Brain

GANs and Geometry


Key Points:

  • GAN variants have been spawning like rabbits, but this study pointed out that none outperformed the original.  GANs are also hampered by unstable training and by the lack of proper evaluation metrics.
  • This paper showed that GAN training can be decomposed into three geometric steps: searching for a separating hyperplane, updating the discriminator parameters away from the separating hyperplane, and updating the generator along the normal vector of the separating hyperplane.  The geometric GAN converges to a Nash equilibrium between the discriminator and the generator.  However, GANs are usually trained with gradient descent techniques designed to find a low value of a cost function, not a Nash equilibrium, so these algorithms may fail to converge.
  • In vector calculus, the matrix of all first-order partial derivatives of a vector-valued function is called the Jacobian matrix.  When this matrix is square, both the matrix and its determinant are referred to as the Jacobian in the literature.
  • One starts with the Jacobian of the generator in a GAN.  The generator takes elements of Z to elements of X.  Thus, its Jacobian has shape dim(X) x dim(Z).  There is a different Jacobian J_z for every point z in Z.  J_z tells how sensitive G(z) is to changes in z.
  • Two main methods to evaluate GANs: Inception Score and Frechet Inception Distance (FID)
  • Unconditioned GAN: there is no control over the modes of the data generated.  It is possible to condition the GAN by feeding extra information (e.g. a class label or data from another modality) to both the generator and the discriminator as an additional input layer.  But then, how does one measure this conditioning?  Is it related in any way to the Inception Score and/or FID?
  • It turns out there is a (surprising) correspondence, and the two are causally linked.  Here’s how to tell:
    • Feed noise z and slightly perturbed noise z’ through the generator.
    • Measure how different G(z) and G(z’) are
    • If too different, penalty!
    • If too same, penalty!
  • Thus, one can measure the conditioning of the generator.  It corresponds to the Inception Score and the FID.  One can intervene to improve the conditioning which makes the GAN perform better.
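
The Jacobian shape and the perturbation penalty described above can be sketched in a few lines of plain Python.  This is an illustration under stated assumptions, not the speaker's implementation: the toy linear `generator`, the finite-difference `jacobian`, the bounds `lo` and `hi`, and the squared-hinge form of the penalty are all hypothetical choices.

```python
import math
import random


def generator(z, scale):
    """Toy generator G: R^2 -> R^3 (hypothetical; a real G is a neural network)."""
    return [scale * z[0], scale * z[1], scale * (z[0] + z[1])]


def jacobian(g, z, eps=1e-6):
    """Finite-difference Jacobian of g at z, with shape dim(X) x dim(Z)."""
    base = g(z)
    cols = []
    for j in range(len(z)):
        bumped = list(z)
        bumped[j] += eps
        out = g(bumped)
        cols.append([(o - b0) / eps for o, b0 in zip(out, base)])
    # cols[j][i] holds dG_i/dz_j; transpose so that rows index outputs.
    return [[cols[j][i] for j in range(len(z))] for i in range(len(base))]


def norm(v):
    return math.sqrt(sum(c * c for c in v))


def conditioning_penalty(g, z, z_prime, lo, hi):
    """Penalize ||G(z) - G(z')|| / ||z - z'|| whenever it leaves [lo, hi]."""
    dx = [a - b for a, b in zip(g(z), g(z_prime))]
    dz = [a - b for a, b in zip(z, z_prime)]
    q = norm(dx) / norm(dz)
    too_different = (max(q, hi) - hi) ** 2  # outputs moved too much
    too_same = (min(q, lo) - lo) ** 2       # outputs barely moved
    return too_different + too_same


rng = random.Random(0)
z = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
z_prime = [zi + 0.01 * rng.gauss(0.0, 1.0) for zi in z]  # slightly perturbed noise

J = jacobian(lambda v: generator(v, 1.0), z)
well = conditioning_penalty(lambda v: generator(v, 1.0), z, z_prime, lo=0.5, hi=4.0)
wild = conditioning_penalty(lambda v: generator(v, 10.0), z, z_prime, lo=0.5, hi=4.0)
flat = conditioning_penalty(lambda v: generator(v, 0.01), z, z_prime, lo=0.5, hi=4.0)
```

Here `J` has three rows and two columns, matching dim(X) x dim(Z).  The well-scaled generator incurs no penalty, while the over-sensitive (`wild`) and under-sensitive (`flat`) ones do; in training, such a term would be added to the generator loss.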


Geometric GAN

Improved Techniques for Training GANs

Augustus’ Google Scholar link:


Roman Trusov, Researcher at

Semantic Segmentation in the Wild


Key Points:

  • Semantic segmentation means understanding an image by assigning an object class to each pixel.  So, the task is to group pixels into regions that contain objects of a certain class.  Applications include robot vision, autonomous driving and medical imaging.
  • When performing semantic segmentation with neural networks, traditional feature extraction is redundant, since the network builds a ‘deep representation’ from the whole image, and it can even be detrimental to quality.  Segment the image first and then apply feature extraction.
  • There is no consensus on the training routine: use a large batch or a small learning rate.
  • An inference engine is needed.  Depending on the architecture, 3x-5x speedups may be seen if some best practices are followed.  These include conversion to a static graph, dynamic memory allocation, graph optimization, disabling the backward pass, etc.
  • There is a tradeoff between accuracy and performance, so semantic segmentation does not readily scale to frame-by-frame execution in real time.  Video from a dashcam at 15fps, or at most 25fps, is doable.
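
As a small illustration of the two ideas above, per-pixel class assignment and the per-frame time budget, here is a sketch in plain Python.  The class names, the score values and the 50 ms inference time are made-up numbers for the example, and a real network would produce the scores.

```python
# Hypothetical class list for a dashcam scene.
CLASSES = ["road", "car", "pedestrian"]


def segment(scores):
    """Turn per-pixel class scores (H x W x C) into a label map by argmax."""
    return [
        [max(range(len(px)), key=px.__getitem__) for px in row]
        for row in scores
    ]


def frame_budget_ms(fps):
    """Time available to process one frame at the given frame rate."""
    return 1000.0 / fps


def keeps_up(inference_ms, fps):
    """True if one inference pass fits inside the per-frame budget."""
    return inference_ms <= frame_budget_ms(fps)


# A 2 x 2 "image" with one score per class at every pixel.
scores = [
    [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]],
    [[0.6, 0.3, 0.1], [0.2, 0.2, 0.6]],
]
label_map = segment(scores)
names = [[CLASSES[i] for i in row] for row in label_map]
```

With an assumed 50 ms per inference pass, `keeps_up(50, 15)` holds because the budget at 15fps is about 66.7 ms, while `keeps_up(50, 25)` does not because the budget at 25fps is 40 ms.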


Roman Trusov’s Quora page:


Christian Szegedy, Research Scientist at Google

Towards Auto-Formalization of Mathematics


Key Points:

  • Expressing ideas in mathematics makes solutions easier to implement and self-referential.  Mathematics has natural reproducibility, is the language of choice for anything related to reasoning, allows the deepest, most hierarchical and most complex content ever created, and is required for programming, physics, etc.
  • The Mizar Mathematical Library, collected over 44 years, is built on Mizar, a system for formalizing and proof-checking mathematics invented by Andrzej Trybulec.  Its verification engine is designed to preserve human understanding of proof steps.
  • Do computers really understand text?  Recurrent neural networks have improved machine translation.  The idea is to apply an auto-formalization approach to NLP.  The hope is that, at the end of this process, the system will become a strong translator between formal and informal language.  Once you have this kind of mathematical language interpreter, it could be extended to almost anything.
  • The challenges are two-fold.  One, premise selection: picking a few of the possible 150,000 premises needs 100% recall.  A previously proposed approach uses k-NN search with hand-crafted features.  Two, the large search space means one has to use brute force, so a fast hand-crafted heuristic is used for selecting the next proof step.
  • Using deep learning for premise selection avoids hand-engineering features and is an important step towards automatic theorem proving.
  • Using a few proof guidance strategies with deep neural networks, they found first-order proofs for 7.36% of the first-order logic translations of the Mizar Mathematical Library theorems that previously had no proofs generated by automated theorem provers.
  • Humans prefer higher-order logic, and there are four major theorem provers:
    • Isabelle (SML)
    • Coq (OCaml)
    • HOL4 (Poly-ML)
    • HOL-light (OCaml)

    Christian’s team uses the HOL-light (OCaml) theorem prover.
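
The k-NN premise selection baseline mentioned in the key points can be sketched as follows.  This is a toy under stated assumptions, not the method from the cited papers: the tiny `PROVED` library, the symbol-set features, the Jaccard similarity and the voting scheme are all hypothetical stand-ins for the hand-crafted features used in practice.

```python
def jaccard(a, b):
    """Similarity of two symbol sets: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_premises(conjecture_symbols, proved, k=2, limit=3):
    """Rank premises by similarity-weighted votes from the k nearest proved theorems."""
    nearest = sorted(
        proved.items(),
        key=lambda item: jaccard(conjecture_symbols, item[1]["symbols"]),
        reverse=True,
    )[:k]
    votes = {}
    for _, info in nearest:
        sim = jaccard(conjecture_symbols, info["symbols"])
        for premise in info["premises"]:
            votes[premise] = votes.get(premise, 0.0) + sim
    # Highest vote first; break ties alphabetically for a stable order.
    ranked = sorted(votes, key=lambda p: (-votes[p], p))
    return ranked[:limit]


# Hypothetical library: each proved theorem lists its symbols and the premises its proof used.
PROVED = {
    "add_comm": {"symbols": {"+", "=", "nat"}, "premises": ["add_succ", "add_zero"]},
    "mul_comm": {"symbols": {"*", "=", "nat", "+"}, "premises": ["mul_succ", "add_comm"]},
    "subset_trans": {"symbols": {"subset", "set"}, "premises": ["subset_def"]},
}

conjecture = {"+", "=", "nat", "0"}
suggested = select_premises(conjecture, PROVED)
```

The conjecture shares most of its symbols with the two arithmetic theorems, so their premises are suggested and the set-theory premise is ignored.  The deep learning approach discussed in the talk replaces the hand-written similarity with a learned one.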



Google’s Multilingual Neural Machine Translation System

Kaliszyk, Cezary, and Josef Urban.  “MizAR 40 for Mizar 40”.  Journal of Automated Reasoning 55.3 (2015): 245-256

Schulz, Stephan.  “E - a brainiac theorem prover.”  AI Communications 15.2-3 (2002): 111-126

DeepMath - Deep Sequence Models for Premise Selection

Deep Network Guided Proof Search

Christian’s Google Scholar page: