Deep Learning and Startups: Notes on Rework Conference, San Francisco

The Rework Deep Learning conference came to San Francisco this past January, and showcased both prominent deep learning researchers and startups. Get an overview of the proceedings with notes from an attendee.



Deep Learning

Speaker/Company name: Christopher Manning/Professor of CS & Linguistics at Stanford

Theme of presentation: Deep Learning the Precise Meaning of Language.

Technology advancements: Ratios of co-occurrence probabilities can encode meaning components, and give large gains over other popular methods (categorical CRF, SVD, C&W). The hard problem of compositionality remains: how do we go from understanding smaller things to understanding larger things? They have exploited transfer learning/adaptation and ensemble modeling (leveraging DL) to gain a 25% improvement in translating TED talks. They are now developing representations of universal dependencies (universal across languages) with a neural-network-based dependency parser.
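To make the co-occurrence-ratio idea concrete, here is a toy Python sketch (the corpus and probe words are made up for illustration, not from the talk): the ratio P(k | ice) / P(k | steam) comes out large for ice-related words, small for steam-related words, and near 1 for neutral ones.

```python
from collections import Counter, defaultdict

# Tiny synthetic corpus, purely for illustration.
corpus = [
    "ice is cold and solid".split(),
    "steam is hot gas".split(),
    "ice and water and snow".split(),
    "steam and water and heat".split(),
]

# Count word co-occurrences within each sentence (window = whole sentence).
cooc = defaultdict(Counter)
for sent in corpus:
    for w in sent:
        for c in sent:
            if w != c:
                cooc[w][c] += 1

def p(context, word, smooth=1e-2):
    """Estimate P(context | word) from co-occurrence counts (smoothed to avoid /0)."""
    total = sum(cooc[word].values())
    return (cooc[word][context] + smooth) / (total + smooth)

# Large ratio -> related to "ice"; small -> related to "steam"; ~1 -> neutral.
for k in ["solid", "gas", "water"]:
    print(k, p(k, "ice") / p(k, "steam"))
```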

Key take-away: Language is the way we transfer knowledge over time and space. If we start to unlock natural-language understanding, we can allow computers to learn from our wealth of digitized resources.

Speaker/Company name: Andrej Karpathy/PhD Student at Stanford

Theme of presentation: Visualizing and Understanding Recurrent Networks

Recurrent Neural Networks (RNNs) offer flexibility: instead of requiring a fixed input and output size, they consume sequences and produce sequences. This makes them great for sentiment classification and for video (sequences of frames), since there is no arbitrary cutoff on what the output must be a function of.

Great rhetorical flourish: “Generate, sample or hallucinate the output.”

Example applications: generate poetry by feeding the works of Shakespeare into a character-level RNN; generate LaTeX documents of algebraic geometry research that almost compile; generate C code. All of these are really good at capturing the statistical patterns of the data.

Requires only 112 lines of Python code (char-rnn (Torch7))!
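For a sense of what such a model looks like, here is a minimal sampling sketch of a character-level RNN in plain numpy (the weights below are random, so the output is gibberish until trained; the real char-rnn code also contains the training loop):

```python
import numpy as np

# Minimal character-level RNN sampling sketch.
vocab = list("abcdefghijklmnopqrstuvwxyz \n")
V, H = len(vocab), 64                     # vocab size, hidden size
Wxh = np.random.randn(H, V) * 0.01        # input -> hidden
Whh = np.random.randn(H, H) * 0.01        # hidden -> hidden
Why = np.random.randn(V, H) * 0.01        # hidden -> output logits
bh, by = np.zeros(H), np.zeros(V)

def sample(seed_char, n_chars):
    """Sample n_chars characters one at a time, feeding each output back in."""
    x = np.zeros(V); x[vocab.index(seed_char)] = 1
    h = np.zeros(H)
    out = []
    for _ in range(n_chars):
        h = np.tanh(Wxh @ x + Whh @ h + bh)        # recurrent state update
        logits = Why @ h + by
        p = np.exp(logits) / np.exp(logits).sum()  # softmax over characters
        idx = np.random.choice(V, p=p)             # "hallucinate" the next char
        out.append(vocab[idx])
        x = np.zeros(V); x[idx] = 1
    return "".join(out)

print(sample("t", 100))
```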

He is currently studying how this works and how things change at scale. There are interpretable cells in the RNN, e.g., ones that detect quotes or new lines. He has also combined an RNN with a CNN, which gives good descriptions of images (along with funny failure cases); another use is querying an image database with text. The architecture sticks a ConvNet and an RNN together: a test image is fed through the CNN, and instead of a classifier at the end, the representation is redirected into the RNN, which gives the probability of the first word of the description, and so on. Training data comes from Amazon Mechanical Turk. An extended model tackles the joint task of detection and description; it is fully convolutional and fully differentiable, and can take a small piece of text as a query and look through images for it.
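A rough schematic of that CNN-into-RNN wiring, written here in PyTorch with placeholder module names and sizes (an illustration of the idea, not the actual NeuralTalk/DenseCap code):

```python
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    """Schematic caption decoder: a CNN image feature seeds an LSTM that
    emits a probability distribution over the next word at each step."""
    def __init__(self, feat_dim=2048, embed_dim=256, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.init_h = nn.Linear(feat_dim, hidden_dim)   # image feature -> initial state
        self.init_c = nn.Linear(feat_dim, hidden_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.to_vocab = nn.Linear(hidden_dim, vocab_size)

    def forward(self, cnn_features, word_ids):
        # cnn_features: (batch, feat_dim) from a pretrained ConvNet (not shown)
        # word_ids:     (batch, seq_len) ground-truth words during training
        h0 = self.init_h(cnn_features).unsqueeze(0)      # (1, batch, hidden)
        c0 = self.init_c(cnn_features).unsqueeze(0)
        emb = self.embed(word_ids)                       # (batch, seq_len, embed)
        out, _ = self.lstm(emb, (h0, c0))
        return self.to_vocab(out)                        # logits over next words

# Example with random stand-ins for the CNN feature and a short caption.
decoder = CaptionDecoder()
logits = decoder(torch.randn(2, 2048), torch.randint(0, 10000, (2, 5)))
print(logits.shape)  # torch.Size([2, 5, 10000])
```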

Check out his class at Stanford with Fei Fei Li and Justin Johnson at http://cs231n.stanford.edu/ .

Speaker/Company name: Ian Goodfellow/Research Scientist at Google

Theme of presentation: Tutorial on Optimization for Deep Networks

Best practices and how they work: the right metaphor for analyzing the contribution of each layer of a neural network is to think of the network as a sequence of matrix multiplications (one per layer). Of course, as the number of layers grows it becomes difficult to figure out the effect of any one layer. It turns out that, in practice, the error surface has a thin region of minimal error with very large slopes around it, which makes it easy to overshoot.
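A tiny numerical illustration of that metaphor (the matrices below are arbitrary stand-ins): repeatedly multiplying by slightly-too-small or slightly-too-large matrices makes the output shrink or blow up exponentially with depth, which is one way to see why deep error surfaces develop cliffs that are easy to overshoot.

```python
import numpy as np

# A deep *linear* network is just a product of matrices: y = W_L ... W_2 W_1 x.
# Scaling every layer slightly up or down changes the output exponentially with depth.
rng = np.random.default_rng(0)
x = rng.standard_normal(10)

for scale in [0.9, 1.0, 1.1]:
    h = x
    for _ in range(50):                      # 50 "layers"
        W = scale * np.eye(10)               # stand-in for a weight matrix
        h = W @ h
    print(f"scale={scale}: output norm after 50 layers = {np.linalg.norm(h):.3g}")
# scale 0.9 -> vanishes (~0.005x), scale 1.1 -> explodes (~117x)
```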

Objections: there are no mathematical guarantees that we find good minima (the surface is of course not convex!). However, plotting error over time in the application cases gives a surprisingly smooth and monotonic curve.

Theoretical developments: batch normalization normalizes each layer's activations by their mean and standard deviation, so that gradient descent can easily adjust to differences in how each layer affects the output. In practice, this makes training easier and faster.
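A minimal numpy sketch of the batch-norm transform (training-mode statistics only; the running averages used at test time are omitted):

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each feature by its batch mean and standard deviation,
    then apply a learnable scale (gamma) and shift (beta)."""
    mean = x.mean(axis=0)            # per-feature mean over the mini-batch
    var = x.var(axis=0)              # per-feature variance over the mini-batch
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

# A mini-batch of 32 examples with 4 features on very different scales.
x = np.random.randn(32, 4) * np.array([1.0, 10.0, 100.0, 0.01])
y = batch_norm(x)
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))  # ~0 mean, ~1 std per feature
```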

Speaker/Company name: Roland Memisevic/Assistant Professor University of Montreal

Theme of presentation: Deep Learning as a Compute Paradigm.

The switch from CPUs to GPUs let us exploit the parallelism inherent in neural networks and made the current successes possible. How can we go a step further and make things more efficient at the hardware level? In the DL paradigm, a specific model is more accurate but needs more data, while a generic model is less accurate but needs less data. In the classic programming paradigm, specific code is faster but tedious to write, while generic code is slower but easy to write. In sum, DL is more about accuracy than specificity.
Humans sit much further toward the generic end: we use very little data and solve things in generality (think embodied cognition). Humans are champions at solving new verticals with architecture pre-made for another task.

Key take-away: We should model the hardware to exploit the parallel operations in neural networks.

Also check out Twenty Billion Neurons (new startup).

Panel: How Important Will Deep Learning Applications Be for Future Economic Growth?

Is concern about malevolent AI hype or reality? Answer: hype; it makes a nice news story.

DL divides the haves and the have-nots; will its progress worsen this divide? Answer: DL is very open-source in nature, so it is actually far more accessible to companies than previous technology. The rising tide lets specialists focus on their specialty instead of reinventing the wheel every time, which enables them to do more. Concluding thoughts: echoing Andrew Ng's opinion, we need regulation aimed at safety rather than prescribed solutions. Government is not the expert, so it should set guidelines rather than prescribe what a solution should be.

DeepMind

Speaker/Company name: Oriol Vinyals/Google DeepMind

Theme of presentation: Sequence to Sequence Learning @Google Brain.

Technology advancements: language modeling using RNNs. Thanks to LSTMs, Theano, and Torch, implementation is easier than ever. Sequence-to-sequence learning lets us try many new things and get better at image captioning, speech, parsing, dialogue, video generation, and the learning of algorithms. The key to success is to believe it will work and to work hard at optimizing parameters.
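A rough schematic of the encoder-decoder ("sequence to sequence") setup, sketched here in PyTorch with placeholder sizes rather than the actual Google Brain models:

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Schematic sequence-to-sequence model: an encoder LSTM reads the input
    sequence into a fixed state, and a decoder LSTM emits the output sequence."""
    def __init__(self, src_vocab=8000, tgt_vocab=8000, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, embed_dim)
        self.tgt_embed = nn.Embedding(tgt_vocab, embed_dim)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        _, state = self.encoder(self.src_embed(src_ids))            # encode the source
        dec_out, _ = self.decoder(self.tgt_embed(tgt_ids), state)   # condition decoder on it
        return self.out(dec_out)                                    # logits over target words

model = Seq2Seq()
logits = model(torch.randint(0, 8000, (4, 12)), torch.randint(0, 8000, (4, 9)))
print(logits.shape)  # torch.Size([4, 9, 8000])
```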

Key take-away: sequences have become first-class citizens. With enough data, these models can learn very complicated functions. Attention made recurrent models even better.

Speaker/Company name: Aditya Khosla/PhD Student MIT

Theme of presentation: Understanding human behaviour through visual media.

Can we predict what kind of images people will like or remember? Can we predict personality from a picture? Someone’s political inclinations through an image? Can we predict a person’s state of mind?

Technology: developed the iTracker CNN to figure out where people are looking. It is within 2 cm of error, and they expect it to get even better. A demo can be found at: http://gazefollow.csail.mit.edu
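The general pattern, sketched in PyTorch with a placeholder architecture (the real iTracker uses separate eye and face crops plus a face-position grid as inputs, not a single image like this):

```python
import torch
import torch.nn as nn

# Toy gaze-regression network: image in, (x, y) gaze location out.
gaze_net = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=5, stride=2), nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, 2),                      # predicted (x, y) on the screen, in cm
)

images = torch.randn(8, 3, 224, 224)       # batch of face images (placeholder)
targets = torch.randn(8, 2)                # true gaze locations (placeholder)
loss = nn.MSELoss()(gaze_net(images), targets)   # squared error on gaze coordinates
print(loss.item())
```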

Application: good for diagnosing autism and schizophrenia. Obviously also advertising.

Pieter Abbeel

Associate professor at Berkeley EECS

Deep reinforcement learning for robots.

What is the standard approach for robots?

The standard pipeline: percepts (e.g., accelerometer readings), a hand-engineered state estimate, a hand-designed control policy class, hand-tuned free parameters (ten or so), and finally motor commands. The idea is to replace the three middle steps with a deep neural network. This mirrors the revolution in vision and speech recognition.

But it is not the same, because robotics is NOT a supervised learning problem. Robotics has a feedback loop and a sparse reward function (cook dinner; only after the dinner is presented do we give it stars).

Three challenges include stability (if a car veers off track, it ends up in states that are not represented in the data) and sparse supervision.

Deep reinforcement learning for locomotion: give the robot a reward based on how hard it hits the ground and how far forward it gets. The approach transfers across different types of robots and is very flexible.
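A minimal sketch of the idea in PyTorch: a neural-network policy maps raw percepts directly to motor commands, and a REINFORCE-style policy-gradient update reinforces trajectories in proportion to their total reward. The percepts, action sizes, and reward below are stubs, and Abbeel's actual work uses more sophisticated algorithms (e.g., guided policy search, TRPO).

```python
import torch
import torch.nn as nn

# Policy network replacing the hand-engineered middle of the pipeline:
# raw percepts in, a distribution over motor commands (torques) out.
policy = nn.Sequential(nn.Linear(24, 64), nn.Tanh(), nn.Linear(64, 6 * 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def act(obs):
    """Sample continuous motor commands from a Gaussian policy."""
    out = policy(obs)
    mean, log_std = out[:6], out[6:]
    dist = torch.distributions.Normal(mean, log_std.exp())
    action = dist.sample()
    return action, dist.log_prob(action).sum()

def reinforce_update(log_probs, rewards):
    """REINFORCE: push up the log-probability of the trajectory's actions
    in proportion to its total reward (e.g., distance moved forward minus
    a penalty for hitting the ground hard)."""
    total_reward = sum(rewards)
    loss = -total_reward * torch.stack(log_probs).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# One made-up "episode" with random percepts and rewards, just to show the shapes.
log_probs, rewards = [], []
for _ in range(50):
    action, logp = act(torch.randn(24))
    log_probs.append(logp)
    rewards.append(float(torch.randn(())))   # stand-in for the locomotion reward
reinforce_update(log_probs, rewards)
```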

Frontiers/future: memory, estimation, temporal hierarchy/goal setting, and shared and transfer learning.

Lise Getoor, University of California, Santa Cruz

Scalable Collective Reasoning in Graphs.

Setting: structured data. "Big Data is not flat": it is multimodal and spatio-temporal.
Too often we take nicely structured data, flatten it into a table, and fail to leverage the structure.

NEED: machine learning for graphs, i.e., methods whose inputs are graphs (not just sequences).

Challenges: components are inter- and intra-dependent.

Goal: given an input graph, infer an output graph.

Key idea: predictions/outputs depend on each other, so joint reasoning is required. Challenge: really large and really loopy graphs.

Tool: Probabilistic Soft Logic (PSL), a declarative probabilistic programming language for collective inference problems on richly structured graph data. It solves most of these problems.
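PSL has its own rule language and tooling, so rather than guess at its API, here is a much simpler stand-in for collective inference on a graph (iterative neighbourhood averaging), just to illustrate why predictions that depend on each other must be inferred jointly:

```python
import numpy as np

# Tiny graph: nodes 0-5 with an adjacency list. Nodes 0 and 5 have observed
# labels (1.0 and 0.0); the rest are inferred jointly from their neighbours.
edges = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
labels = {0: 1.0, 5: 0.0}                        # observed nodes
scores = {n: labels.get(n, 0.5) for n in edges}  # soft (continuous) predictions

# Collective inference: repeatedly update each unknown node toward the
# average of its neighbours until the predictions stop changing.
for _ in range(100):
    for n in edges:
        if n not in labels:
            scores[n] = np.mean([scores[m] for m in edges[n]])

print({n: round(s, 2) for n, s in scores.items()})
# Interior nodes settle between the two observed labels: joint, not independent, predictions.
```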

To summarize: we need techniques that deal with structured data, and there are new opportunities for algorithms that do so. Their tool for tackling this is PSL. Contact: getoor@ucsc.edu

Bio: Lisha Li (@lishali88) is currently a Ph.D. student in statistics at U.C. Berkeley. She interned as a data scientist at Pinterest and is interested in venture capital.

Original. Reposted with permission.
