Deep Learning in Neural Networks: An Overview

This post summarizes Schmidhuber's now-classic (and still relevant) 35-page summary of 900 deep learning papers, giving an overview of the state of deep learning as of 2014. A great introduction to a great paper!

Supervised and Unsupervised Learning

Supervised learning assumes that input events are independent of earlier output events (which may affect the environment through actions causing subsequent perceptions). This assumption does not hold in the broader fields of Sequential Decision Making and Reinforcement Learning.

Early supervised NNs were essentially variants of linear regression methods going back to at least the early 1800s! In 1979, the Neocognitron was the first artificial NN that deserved the attribute deep, based on neurophysiological insights from studies of the visual cortex of cats carried out in the 1960s.

It introduced convolutional NNs (today often called CNNs or convnets), where the (typically rectangular) receptive field of a convolutional unit with a given weight vector (a filter) is shifted step-by-step across a 2-dimensional array of input values, such as the pixels of an image (usually there are several such filters). The resulting 2D array of subsequent activation events of this unit can then provide inputs to higher-level units, and so on… The Neocognitron is very similar to the architecture of modern, contest-winning, purely supervised, feedforward, gradient-based Deep Learners with alternating convolutional and downsampling layers.
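To make the filter-shifting concrete, here is a minimal sketch in plain Python. The function name `convolve2d_valid`, the stride of 1 and the absence of padding are assumptions chosen for illustration, and (as in most NN code) the filter is not flipped.

```python
def convolve2d_valid(image, kernel):
    """Slide `kernel` over `image` (lists of lists), stride 1, no padding."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(ih - kh + 1):
        row = []
        for c in range(iw - kw + 1):
            # The same weights (the filter) are reused at every position.
            acc = 0.0
            for dr in range(kh):
                for dc in range(kw):
                    acc += image[r + dr][c + dc] * kernel[dr][dc]
            row.append(acc)
        out.append(row)
    return out

image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
edge_filter = [[-1, 1]]  # responds where values increase from left to right
feature_map = convolve2d_valid(image, edge_filter)
print(feature_map)  # [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]
```

The resulting `feature_map` is the 2D array of activations that would feed the next layer.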

The Neocognitron did not use backpropagation (BP). The first neural-network-specific application of efficient backpropagation was described in 1981. For weight-sharing FNNs or RNNs with activation spreading through some differentiable function f_t, a single iteration of gradient descent through backpropagation computes changes of all the weights w_i. Forward and backward passes are reiterated until sufficient performance is reached.

The backpropagation algorithm itself is wonderfully simple. In the description below, d_t is the desired target output value and v(t,k) is a function giving the index of the weight that connects t and k. Each weight w_i is associated with a real-valued variable Δ_i initialized to 0.

[Figure: Deep learning algorithm]
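As a rough illustration of the forward/backward loop, here is a minimal sketch in plain Python. It is not the paper's algorithm: it assumes a small feedforward net without weight sharing (one tanh hidden layer, one linear output, squared error), and all function names, the learning rate and the toy data are made up for the example.

```python
import math
import random

def forward(w, x):
    """One hidden layer of tanh units, one linear output unit."""
    hidden = [math.tanh(sum(wi * xi for wi, xi in zip(row, x))) for row in w["hidden"]]
    out = sum(wo * h for wo, h in zip(w["out"], hidden))
    return hidden, out

def backward(w, x, d):
    """Gradients of E = 0.5 * (out - d)**2 with respect to all weights."""
    hidden, out = forward(w, x)
    err = out - d  # error signal at the output unit
    grad_out = [err * h for h in hidden]
    # Chain rule: send the error signal back through each tanh unit.
    err_hidden = [err * wo * (1.0 - h * h) for wo, h in zip(w["out"], hidden)]
    grad_hidden = [[eh * xi for xi in x] for eh in err_hidden]
    return {"hidden": grad_hidden, "out": grad_out}

def train(w, data, lr=0.1, epochs=200):
    """Reiterate forward and backward passes, updating the weights each time."""
    for _ in range(epochs):
        for x, d in data:
            g = backward(w, x, d)
            w["out"] = [wo - lr * go for wo, go in zip(w["out"], g["out"])]
            w["hidden"] = [[wh - lr * gh for wh, gh in zip(wr, gr)]
                           for wr, gr in zip(w["hidden"], g["hidden"])]
    return w

def total_error(w, data):
    return sum(0.5 * (forward(w, x)[1] - d) ** 2 for x, d in data)

random.seed(0)
weights = {
    "hidden": [[random.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(3)],
    "out": [random.uniform(-0.5, 0.5) for _ in range(3)],
}
# Each input carries a constant 1.0 as a bias term; the toy target is d = a - b.
data = [([1.0, a, b], a - b) for a in (0.0, 1.0) for b in (0.0, 1.0)]
before = total_error(weights, data)
train(weights, data)
after = total_error(weights, data)
print(before, after)  # the error should drop after training
```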

As of 2014, this simple BP method is still the central learning algorithm for FNNs and RNNs. Notably, most contest-winning NNs up to 2014 did not augment supervised BP by some sort of unsupervised learning…

In 1989, backpropagation was combined with Neocognitron-like weight-sharing convolutional neural layers with adaptive connections…

This combination, augmented by Max-Pooling and sped up on graphics cards, has become an essential ingredient of many modern, competition-winning, feedforward, visual Deep Learners. This work also introduced the MNIST data set of handwritten digits, which over time has become perhaps the most famous benchmark of machine learning.

In Max-Pooling, a 2-dimensional layer or array of unit activations is partitioned into smaller rectangular arrays. Each is replaced in a downsampling layer by the activation of its maximally active unit.
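A minimal sketch of this downsampling step in plain Python. The function name `max_pool` and the choice to drop rows or columns that do not fill a whole block are assumptions for illustration.

```python
def max_pool(activations, ph, pw):
    """Partition a 2D list into non-overlapping ph x pw blocks and keep each
    block's maximum. Leftover rows/columns that do not fill a block are dropped."""
    rows, cols = len(activations), len(activations[0])
    pooled = []
    for r in range(0, rows - ph + 1, ph):
        pooled.append([
            max(activations[r + dr][c + dc] for dr in range(ph) for dc in range(pw))
            for c in range(0, cols - pw + 1, pw)
        ])
    return pooled

layer = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [2, 6, 3, 3],
]
print(max_pool(layer, 2, 2))  # [[4, 5], [6, 7]]
```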


Through the 1980s, it seemed that although BP allows for deep problems in principle, in practice it only worked for shallow ones. The reason for this was only fully understood in 1991, via Hochreiter’s diploma thesis:

Typical deep NNs suffer from the now-famous problem of vanishing or exploding gradients. With standard activation functions, cumulative backpropagated error signals either shrink rapidly, or grow out of bounds. In fact, they decay exponentially in the number of layers or CAP (credit assignment path) depth, or they explode. This is also known as the long time lag problem.
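The effect is easy to reproduce in a toy setting. The sketch below assumes a chain of single units that all share one weight, so that each layer multiplies the backpropagated signal by weight * f'(net). The function names are hypothetical, and the identity activation in the second case is used only to isolate the effect of a weight larger than 1.

```python
import math

def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))

def sigmoid_prime(net):
    s = sigmoid(net)
    return s * (1.0 - s)  # never larger than 0.25

def identity(net):
    return net

def identity_prime(net):
    return 1.0

def backprop_scale(weight, depth, f, f_prime, x=0.5):
    """Factor by which an error signal is scaled after `depth` layers."""
    # Forward pass: record the net input of every unit in the chain.
    nets = []
    a = x
    for _ in range(depth):
        net = weight * a
        nets.append(net)
        a = f(net)
    # Backward pass: each layer contributes a factor weight * f'(net).
    scale = 1.0
    for net in reversed(nets):
        scale *= weight * f_prime(net)
    return scale

for depth in (1, 5, 10, 20):
    vanishing = backprop_scale(1.0, depth, sigmoid, sigmoid_prime)
    exploding = backprop_scale(1.5, depth, identity, identity_prime)
    print(depth, vanishing, exploding)
```

In the sigmoid chain each factor is at most 0.25, so the signal shrinks at least as fast as 0.25 to the power of the depth. In the identity chain the signal grows as 1.5 to the power of the depth.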

This is the Fundamental Deep Learning Problem, and several ways of partially overcoming it have been explored over the years.

  1. Unsupervised pre-training can facilitate subsequent supervised credit assignment through BP.
  2. LSTM-like networks (described shortly) alleviate the problem through a special architecture unaffected by it.
  3. Use of GPUs with much more computing power allows propagating errors a few layers further down within reasonable time, even in traditional NNs. “That is basically what is winning many of the image recognition competitions now, although this does not really overcome the problem in a fundamental way.”
  4. Hessian-free optimization (an advanced form of gradient descent) can alleviate the problem for FNNs and RNNs.
  5. The space of weight matrices can be searched without relying on error gradients at all, thus avoiding the problem. Random weight guessing sometimes works better than more sophisticated methods! Other alternatives include Universal Search and the use of linear methods.

A working Very Deep Learner of 1991 could perform credit assignment across hundreds of nonlinear operators or neural layers by using unsupervised pre-training for a hierarchy of RNNs. The basic idea is still relevant today. Each RNN is trained for a while in unsupervised fashion to predict its next input. From then on, only unexpected inputs (errors) convey new information and get fed to the next higher RNN which thus ticks on a slower, self-organising time scale. It can easily be shown that no information gets lost. It just gets compressed…
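The "only unexpected inputs get passed up" idea can be sketched without any neural network at all. In the toy below, a lookup table stands in for the trained RNN predictor (a deliberate simplification, not the 1991 system), only one level of the hierarchy is shown, and the names `compress` and `decompress` are made up for the example.

```python
def compress(sequence):
    """Keep only the symbols that a simple predictor failed to predict.

    The predictor guesses that each symbol is followed by whatever followed
    it last time (a stand-in for the trained RNN in the text).
    """
    successor = {}   # learned table: symbol -> predicted next symbol
    unexpected = []  # (time step, symbol) pairs passed to the next level
    prev = None
    for t, sym in enumerate(sequence):
        if successor.get(prev) != sym:
            unexpected.append((t, sym))
        successor[prev] = sym
        prev = sym
    return unexpected

def decompress(unexpected, length):
    """Rebuild the sequence by rerunning the same predictor plus corrections."""
    corrections = dict(unexpected)
    successor = {}
    prev = None
    out = []
    for t in range(length):
        sym = corrections[t] if t in corrections else successor[prev]
        out.append(sym)
        successor[prev] = sym
        prev = sym
    return out

seq = list("abcabcabcabxabcabc")
events = compress(seq)
print(len(seq), len(events))                # fewer events than inputs
print(decompress(events, len(seq)) == seq)  # True: nothing was lost
```

Because the higher level sees fewer events than the lower level sees inputs, it effectively runs on a slower time scale, while the round trip shows that the original sequence is still fully recoverable.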

The supervised Long Short-Term Memory (LSTM) RNN could eventually perform similar feats to this deep RNN hierarchy, but without needing any unsupervised pre-training.

The basic LSTM idea is very simple. Some of the units are called Constant Error Carousels (CECs). Each CEC uses the identity function as its activation function f, and has a connection to itself with a fixed weight of 1.0. Because f’s derivative is constantly 1.0 (for f(x) = x, d/dx f(x) = 1), errors backpropagated through a CEC cannot vanish or explode but stay as they are (unless they ‘flow out’ of the CEC to other, typically adaptive parts of the NN). CECs are connected to several nonlinear adaptive units (some with multiplicative activation functions) needed for learning nonlinear behavior… CECs are the main reason why LSTM nets can learn to discover the importance of (and memorize) events that happened thousands of discrete time steps ago, while previous RNNs already failed in case of minimal time lags of 10 steps.
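A small numerical sketch of why the fixed self-weight of 1.0 matters. It models only a single self-connected unit with identity activation, not a full LSTM cell with gates, and the function name is hypothetical.

```python
def backprop_through_self_loop(error, self_weight, steps):
    """Send `error` back through `steps` time steps of a unit with identity
    activation (derivative 1.0) and a recurrent self-connection."""
    identity_slope = 1.0
    for _ in range(steps):
        error *= self_weight * identity_slope
    return error

print(backprop_through_self_loop(1.0, 1.0, 1000))  # CEC: stays exactly 1.0
print(backprop_through_self_loop(1.0, 0.9, 1000))  # shrinks towards zero
print(backprop_through_self_loop(1.0, 1.1, 1000))  # grows very large
```

Only the self-weight of exactly 1.0 carries the error unchanged across 1000 steps. Any other value makes it decay or blow up exponentially.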

Speech recognition up until the early 2000s was dominated by Hidden Markov Models combined with FNNs. But when trained from scratch, LSTM obtained results comparable to these systems. By 2007, LSTM was outperforming them. “Recently, LSTM RNN / HMM hybrids obtained the best known performance on medium-vocabulary and large-vocabulary speech recognition.” LSTM RNNs have also won several international pattern recognition competitions and set numerous benchmark records on large and complex data sets.

Many competition-winning deep learning systems today are either stacks of LSTM RNNs trained using Connectionist Temporal Classification (CTC), or GPU-based Max-Pooling CNNs (GPU-MPCNNs). CTC is a gradient-based method for finding RNN weights that maximize the probability of teacher-given label sequences, given (typically much longer and higher-dimensional) streams of real-valued input vectors.

An ensemble of GPU-MPCNNs was the first system to achieve superhuman visual pattern recognition in a controlled competition, namely, the IJCNN 2011 traffic sign recognition contest in San Jose… The GPU-MPCNN ensemble obtained a 0.56% error rate and was twice as good as human test subjects, three times better than the closest artificial NN competitor, and six times better than the best non-neural method.

An ensemble of GPU-MPCNNs was also the first to achieve human-competitive performance (around 0.2% error) on MNIST in 2012. In the same year, GPU-MPCNNs achieved the best results on the ImageNet classification benchmark, and in a visual object detection contest – the ICPR 2012 Contest on Mitosis Detection in Breast Cancer Histological Images.

Such biomedical applications may turn out to be among the most important applications of DL. The world spends over 10% of GDP on healthcare (> 6 trillion USD per year), much of it on medical diagnosis through expensive experts. Partial automation of this could not only save lots of money, but also make expert diagnostics accessible to many who currently cannot afford it.