The hard thing about deep learning
It’s easy to optimize simple neural networks, let’s say single layer perceptron. But, as network becomes deeper, the optmization problem becomes crucial. This article discusses about such optimization problems with deep neural networks.
By Reza Zadeh, Founder and CEO of Matroid.
At the heart of deep learning lies a hard optimization problem. So hard that for several decades after the introduction of neural networks, the difficulty of optimization on deep neural networks was a barrier to their mainstream usage and contributed to their decline in the 1990s and 2000s. Since then, we have overcome this issue. In this post, I explore the “hardness” in optimizing neural networks and see what the theory has to say. In a nutshell: the deeper the network becomes, the harder the optimization problem becomes.
This post originally appeared on OReilly.com, organizers of Strata Hadoop World. Republished with permission.
Strata + Hadoop World  March 13–16, 2017  San Jose, CA
“One of the most valuable events to advance your career.”
Strata + Hadoop World is a rich learning experience at the intersection of data science and business. Thousands of innovators, leaders, and practitioners gather to develop new skills, share best practices, and discover how tools and technologies are evolving to meet new challenges. Find out how big data, machine learning, and analytics are changing not only business, but society itself at Strata + Hadoop World. Save 20% on most passes with discount code PCKDNG.
Fig. Rastrigin Function.(source: Diegotorquemada on Wikimedia Commons).
The simplest neural network is the singlenode perceptron, whose optimization problem is convex. The nice thing about convex optimization problems is that all local minima are also global minima. There is a rich variety of optimization algorithms to handle convex optimization problems, and every few years a better polynomialtime algorithm for convex optimization is discovered. Optimizing weights for a single neuron is easy using convex optimization (see graphic below). Let’s see what happens when we go past a single neuron.
Fig. Left: a convex function. Right: a nonconvex function. It is much easier to find the bottom of the surface in the convex function than the nonconvex surface. (Source: Reza Zadeh)
The next natural step is to add many neurons, while keeping a single layer. For the singlelayer, nnode perceptron neural network, if there exist edge weights so that the network correctly classifies a given training set, then such weights can be found in polynomial time in n using linear programming, which is also a special subset of convex optimization. A natural question arises: can we make similar guarantees about deeper neural networks, with more than one layer? Unfortunately not.
To provably solve optimization problems for general neural networks with two or more layers, the algorithms that would be necessary hit some of the biggest open problems in computer science. So, we don’t think there’s much hope for machine learning researchers to try to find algorithms that are provably optimal for deep networks. This is because the problem is NPhard, meaning that provably solving it in polynomial time would also solve thousands of open problems that have been open for decades. Indeed, in 1988 J. Stephen Judd shows the following problem to be NPhard:
Given a general neural network and a set of training examples, does there exist a set of edge weights for the network so that the network produces the correct output for all the training examples?
Judd also shows that the problem remains NPhard even if it only requires a network to produce the correct output for just twothirdsof the training examples, which implies that even approximately training a neural network is intrinsically difficult in the worst case. In 1993, Blum and Rivest make the news worse: even a simple network with just two layers and three nodes is NPhard to train!
Theoretically, there is contrast of deep learning with many simpler models in machine learning, such as support vector machines and logistic regression, that have mathematical guarantees stating the optimization can be performed in polynomial time. With many of these simpler models, we are able to guarantee you won’t find any better model by running an optimization algorithm longer than polynomial time. But the optimization algorithms that exist for deep neural networks don’t afford such guarantees. You don’t know after you’ve trained a deep neural network whether, given your setup, this is the best model you could have found. So, you’re left wondering if you would have a better model if you kept on training.
Thankfully, we can get around these hardness results in practice very effectively: running typical gradient descent optimization methods gives good enough local minima that we can make tremendous progress on many natural problems, such as image recognition, speech recognition, and machine translation. We simply ignore the hardness results and run as many iterations of gradient descent as time allows.
It seems traditional theoretical results in optimization are heavyhanded – we can largely get past them by way of engineering and mathematical tricks, heuristics, adding more machines, and using new hardware such as GPUs. There is active research into teasing out why typical optimization algorithms work so well, despite the hardness results.
The reasons for the success of deep learning go far beyond overcoming the optimization problem. The architecture of the network, amount of training data, loss function, and regularization all play crucial roles in obtaining leading quality numbers in many machine learning tasks. In future posts, I will address some more modern theoretical results that cover these other aspects, to explain why neural networks work so well on a variety of tasks.
Bio: Reza Zadeh is an Adjunct Professor at Stanford University, and Founder and CEO at Matroid. His work focuses on machine learning, distributed computing, and discrete applied mathematics. He has served on the Technical Advisory Board of Microsoft and Databricks.
Related:
Top Stories Past 30 Days  


