arXiv Paper Spotlight: Why Does Deep and Cheap Learning Work So Well?

The recent paper at hand approaches explaining deep learning from a different perspective, that of physics, and discusses the role of "cheap learning" (parameter reduction) and how it relates back to this innovative perspective.



Why does deep learning work so well? And... cheap learning?

arXiv

A recent paper by Henry W. Lin (Harvard) and Max Tegmark (MIT), titled "Why does deep and cheap learning work so well?" looks to examine from a different perspective what it is about deep learning that makes it work so well. It also introduces (at least, to me) the term "cheap learning."

First off, to be clear, "cheap learning" does not refer to using a low end GPU; instead, the following explains its relationship to parameter reduction:

[A]lthough well-known mathematical theorems guarantee that neural networks can approximate arbitrary functions well, the class of functions of practical interest can be approximated through "cheap learning" with exponentially fewer parameters than generic ones, because they have simplifying properties tracing back to the laws of physics.

This central idea of this paper is that neural network success owes as much to physics as it does to mathematics (perhaps more), and that simplistic physics functions owing to concepts such as symmetry, locality, compositionality, and polynomial log-probability can be viewed similarly to deep learning's relationship with the reality which it seeks to model. You may have heard something about this in September; this is the paper on which said news was based.

Pyramid

More from the abstract:

We further argue that when the statistical process generating the data is of a certain hierarchical form prevalent in physics and machine-learning, a deep neural network can be more efficient than a shallow one. We formalize these claims using information theory and discuss the relation to renormalization group procedures. We prove various "no-flattening theorems" showing when such efficient deep networks cannot be accurately approximated by shallow ones without efficiency loss: flattening even linear functions can be costly, and flattening polynomials is exponentially expensive; we use group theoretic techniques to show that n variables cannot be multiplied using fewer than 2^n neurons in a single hidden layer.

Information loss

With a bit of maths and a few theorems, the paper proceeds as follows: presents "shallow" neural network results, with a handful of layers; demonstrates how increasing network depth provides polynomial or exponential efficiency gains without adding expressivity; summarizes conclusions and discusses a technical point about renormalization and deep learning.

An interesting excerpt from the paper's conclusion:

Whereas previous universality theorems guarantee that there exists a neural network that approximates any smooth function to within an error ε, they cannot guarantee that the size of the neural network does not grow to infinity with shrinking ε or that the activation function σ does not become pathological. We show constructively that given a multivariate polynomial and any generic non-linearity, a neural network with a fixed size and a generic smooth activation function can indeed approximate the polynomial highly efficiently.

The abstract can be found here, while this is a direct link to the paper.

Related: