Deep Learning 101: Demystifying Tensors
Many deep-learning systems available today are based on tensor algebra, but tensor algebra isn’t tied to deep learning. It isn’t hard to get started with tensor abuse, but it can be hard to stop.
By Ted Dunning, Chief Applications Architect at MapR Technologies.
Tensors and new machine learning tools such as TensorFlow are hot topics these days, especially among people looking for ways to dive into deep learning. It turns out that, when you look past all the buzz, there are some fundamentally powerful, useful and usable methods that take advantage of what tensors have to offer, and not just for deep learning. Here’s why.
If computing can be said to have traditions, then numerical computing using linear algebra is one of the most venerable. Packages like LINPACK and the later LAPACK are now very old but are still going strong. At its core, linear algebra consists of fairly simple and very regular operations involving repeated multiplication and addition on one- and two-dimensional arrays of numbers (often called vectors and matrices in this context), and it is tremendously general in the sense that many problems can be solved or approximated by linear methods. These range from rendering the images in computer games to nuclear weapon design, as well as a huge range of other applications between these extremes.
Partly because of their regularity and partly because they can be implemented in ways that use lots of parallelism, programs written using linear algebra can be evaluated by computers at enormous speeds. In terms of raw potential performance, the evolution from the Cray-1 to today’s GPUs represents an increase in performance of more than 30,000 times, and when you consider clusters with lots of GPUs, the performance potential is roughly a million times what was once the fastest computer on the planet, at a tiny fraction of the cost.
The historical pattern has been, however, that we have to move to higher and higher levels of abstraction in order to take advantage of new processors. The Cray-1 and its vector-oriented follow-ons required that programs be rewritten to use vector operations (like the dot product) in order to realize full performance. Later machines have required that algorithms be formulated in terms of matrix-vector operations or even matrix-matrix operations to push hardware as hard as it would go.
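The three levels of abstraction mentioned above (vector, matrix-vector and matrix-matrix operations) can be sketched in plain Python. This is only an illustration of how each level is built from multiply-and-add; the function names are made up for this example, and real numerical libraries implement the same operations in heavily optimized native code.

```python
# Three levels of linear-algebra building blocks, written with plain lists.
# Illustrative only: the names dot, matvec and matmul are chosen for this sketch.

def dot(x, y):
    """Vector-vector: sum of elementwise products."""
    return sum(a * b for a, b in zip(x, y))

def matvec(A, x):
    """Matrix-vector: one dot product per row of A."""
    return [dot(row, x) for row in A]

def matmul(A, B):
    """Matrix-matrix: one dot product per (row of A, column of B) pair."""
    cols = list(zip(*B))  # columns of B
    return [[dot(row, col) for col in cols] for row in A]

A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]
x = [1, 1]

print(dot(x, x))      # 2
print(matvec(A, x))   # [3, 7]
print(matmul(A, B))   # [[2, 1], [4, 3]]
```

Each step up packs more independent multiply-and-add work into a single call, which is what gives the hardware room to run things in parallel.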
We are at just such a threshold right now. The difference is that there isn’t anywhere to go beyond matrix-matrix operations. That is, there isn’t anywhere else to go using linear algebra.
But we don’t have to restrict ourselves to linear algebra. As it turns out, we can move up the mathematical food chain a bit. It has long been known that there are bigger fish in the mathematical abstraction sea than just matrices. One such candidate is called a tensor. Tensors figure prominently in the mathematical underpinnings of general relativity and are fundamental to other branches of physics as well. But just as the mathematical concepts of matrices and vectors can be simplified down to the arrays we use in computers, tensors can also be simplified and represented as multidimensional arrays and some related operations. Unfortunately, things aren’t quite as simple as they were with matrices and vectors, largely because there isn’t an obvious and simple set of operations to be performed on tensors as there is with matrices and vectors.
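As a rough illustration of a tensor in the computing sense, the sketch below stores a three-index array as nested Python lists and applies one common pattern of operation to it: a stack of matrices, each multiplied by its own vector. The helper names `shape` and `batched_matvec` are invented for this example.

```python
# A three-index array stored as nested lists: T[b][i][j].
# Here it is read as a stack of two 2x2 matrices (the index b picks one).
T = [
    [[1, 0], [0, 1]],   # b = 0: identity
    [[2, 0], [0, 3]],   # b = 1: diagonal scaling
]
V = [[5, 7], [5, 7]]    # one vector per matrix: V[b][j]

def shape(t):
    """Return the dimensions of a regular (non-ragged) nested list."""
    dims = []
    while isinstance(t, list):
        dims.append(len(t))
        t = t[0]
    return tuple(dims)

def batched_matvec(T, V):
    """out[b][i] = sum over j of T[b][i][j] * V[b][j]."""
    return [
        [sum(t_ij * v_j for t_ij, v_j in zip(row, v)) for row in mat]
        for mat, v in zip(T, V)
    ]

print(shape(T))               # (2, 2, 2)
print(batched_matvec(T, V))   # [[5, 7], [10, 21]]
```

The index `j` is summed away while `b` and `i` survive. Choosing which indices to sum over and which to keep is the kind of pattern that replaces the fixed menu of matrix and vector operations.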
There is some really good news, though. Even though we can’t write just a few operations on tensors, we can write down a set of patterns of operations on tensors. That isn’t quite enough, however, because programs written in terms of these patterns can’t be executed at all efficiently as they were written. The rest of the good news is that our inefficient but easy to write programs can be transformed (almost) automatically into programs that do execute pretty efficiently.
Even better, this transformation can be implemented without having to build yet another new computer language. What is done instead is a diabolical trick: in TensorFlow when we write code like this:
What really happens is that a data structure like the one shown in Figure 1 is constructed:
Figure 1. The code above is translated into a data structure which can be restructured and converted into machine executable form. Translating the code into a user-visible data structure allows the program we wrote to be rewritten for more efficient execution or to allow a derivative to be computed so advanced optimizers can be used.
This data structure isn’t actually executed in the program that we have above. Because of this, there is a chance for TensorFlow to rewrite the data structure into much more efficient code before we actually try to run it. That might involve small or large restructuring of what we thought we were asking the computer to do. It might also include generation of actual executable code for the CPU of the machine we are using, for the cluster we are using, or for any GPUs we have handy.
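The general idea of recording a computation as a data structure, rewriting it, and only then running it can be sketched in a few lines of plain Python. This is a toy under stated assumptions: the `Node` class and the helpers `const`, `var`, `show`, `simplify` and `evaluate` are hypothetical names invented for this sketch and are not part of TensorFlow or any other library.

```python
class Node:
    """A node in a toy expression graph (hypothetical, not a TensorFlow class)."""
    def __init__(self, op, inputs=(), value=None, name=None):
        self.op = op
        self.inputs = tuple(inputs)
        self.value = value
        self.name = name

    # Operators build new graph nodes instead of computing numbers.
    def __add__(self, other):
        return Node("add", (self, other))

    def __mul__(self, other):
        return Node("mul", (self, other))

def const(v):
    return Node("const", value=v)

def var(name):
    return Node("var", name=name)

def show(n):
    """Render the graph as a fully parenthesized expression."""
    if n.op == "const":
        return repr(n.value)
    if n.op == "var":
        return n.name
    sym = "+" if n.op == "add" else "*"
    return "(" + show(n.inputs[0]) + " " + sym + " " + show(n.inputs[1]) + ")"

def simplify(n):
    """Rewrite the graph: fold constants and drop multiplication by 1."""
    if n.op in ("const", "var"):
        return n
    a, b = (simplify(i) for i in n.inputs)
    if a.op == "const" and b.op == "const":
        return const(a.value + b.value if n.op == "add" else a.value * b.value)
    if n.op == "mul":
        if a.op == "const" and a.value == 1:
            return b
        if b.op == "const" and b.value == 1:
            return a
    return Node(n.op, (a, b))

def evaluate(n, env):
    """Only here does any arithmetic on the variables actually happen."""
    if n.op == "const":
        return n.value
    if n.op == "var":
        return env[n.name]
    a, b = (evaluate(i, env) for i in n.inputs)
    return a + b if n.op == "add" else a * b

x = var("x")
y = (x * const(1)) + (const(2) * const(3))   # builds a graph, computes nothing

print(show(y))                           # ((x * 1) + (2 * 3))
print(show(simplify(y)))                 # (x + 6)
print(evaluate(simplify(y), {"x": 4}))   # 10
```

The line that defines `y` looks like ordinary arithmetic, yet it only assembles a tree. Because the tree is an ordinary data structure, a separate pass is free to rewrite it before anything runs.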
The cool thing about this is that we can write very simple programs that achieve amazing results.
But this is only the beginning.
Doing Something Useful (but Different)
TensorFlow and systems like it are all about taking programs that describe a machine learning architecture (such as a deep neural network) and adjusting the parameters of that architecture to minimize some sort of error value. They do this not only by creating a data structure representing our program, but also by producing a data structure that represents the gradient of the error value with respect to all of the parameters of our model. Having such a gradient function makes the optimization go much more easily.
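One way a gradient can be derived mechanically from a recorded computation is reverse-mode automatic differentiation. The sketch below is a minimal scalar version written for illustration; the `Var` class and `backward` function are hypothetical names, and production systems work on whole tensors and use far more elaborate machinery.

```python
class Var:
    """A scalar that records how it was computed (toy reverse-mode autodiff)."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # pairs of (input Var, local derivative)
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __sub__(self, other):
        return Var(self.value - other.value, ((self, 1.0), (other, -1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

def backward(output):
    """Accumulate d(output)/d(node) for every node, outputs before inputs."""
    order, seen = [], set()

    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)

    visit(output)
    output.grad = 1.0
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += node.grad * local   # chain rule

# Squared error of a tiny linear model w * x + b on a single data point.
w, b = Var(2.0), Var(1.0)
x, target = Var(3.0), Var(10.0)
residual = (w * x + b) - target   # 2*3 + 1 - 10 = -3
error = residual * residual       # 9

backward(error)
print(error.value)   # 9.0
print(w.grad)        # 2 * residual * x = -18.0
print(b.grad)        # 2 * residual     = -6.0
```

Nothing about the derivative of the error was written by hand here. It falls out of the recorded operations, which is why the same machinery works for models that have nothing to do with neural networks.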
But here is the kicker. You can write programs using TensorFlow or Caffe or any of the other systems that work basically the same way. But the programs that you write don’t have to optimize machine learning functions. You can write programs that optimize all kinds of programs if the ones you write use the tensor notation provided by the package that you have chosen. The automatic differentiation and the state-of-the-art optimizers and the compilation down to efficient GPU code all still work in your favor.
As a quick example, Figure 2 shows a simple model of home energy use.
Figure 2. A plot of the daily energy used by a home (the circles) as a function of temperature on the horizontal axis. A piece-wise linear model of the energy usage has been superimposed over the usage data. The parameters of the model would normally be a matrix, but when we talk about a million models, we can use a tensor.
This figure shows a single home’s energy usage and a model of that usage. To get a single model is no great thing, but to find this model, I had to write some real code and then that code had to be run across millions of homes to get models for each. With TensorFlow, we could create models for all of the houses at once, and we could use a more efficient optimizer than the one I used originally to get this model. The result is a million models being optimized all at once with much higher efficiency than was possible with my original program. I could theoretically have hand-optimized my code and could have hand-derived a derivative function, but the time required to do so, and, more importantly, the time required to debug the result made that impossible within the time I had to do the model.
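To make the many-models-at-once idea concrete, here is a small sketch in plain Python. Everything specific in it is an assumption made for illustration: the hinge-shaped model `energy = base + slope * max(0, BALANCE - temperature)`, the fixed breakpoint `BALANCE`, the synthetic data and the function name `fit_all` are not taken from the original analysis. The gradient is also written out by hand, which is exactly the step a tensor system would derive automatically.

```python
# Hypothetical model: energy = base + slope * max(0, BALANCE - temperature)
BALANCE = 15.0   # assumed fixed breakpoint; not taken from the article
temps = [0.0, 5.0, 10.0, 15.0, 20.0, 25.0]
hinge = [max(0.0, BALANCE - t) for t in temps]

# Synthetic, noise-free usage for three homes, generated from known parameters.
true_params = [(10.0, 2.0), (20.0, 1.0), (5.0, 3.0)]   # (base, slope) per home
usage = [[b + s * h for h in hinge] for b, s in true_params]

def fit_all(usage, hinge, steps=5000, lr=0.01):
    """Gradient descent on mean squared error, updating every home each step."""
    n = len(hinge)
    params = [[0.0, 0.0] for _ in usage]   # one (base, slope) row per home
    for _ in range(steps):
        for p, y in zip(params, usage):
            resid = [p[0] + p[1] * h - yi for h, yi in zip(hinge, y)]
            grad_base = 2.0 * sum(resid) / n
            grad_slope = 2.0 * sum(r * h for r, h in zip(resid, hinge)) / n
            p[0] -= lr * grad_base
            p[1] -= lr * grad_slope
    return params

fitted = fit_all(usage, hinge)
for (b, s), (tb, ts) in zip(fitted, true_params):
    print(round(b, 3), round(s, 3), "expected", tb, ts)
```

The table `params` has one row per home. In a tensor system that table would be a single array, the inner loop over homes would become one array operation, and the gradient lines would be generated rather than hand-written.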
Examples like this show how tensor-based computational systems like TensorFlow (or Caffe or Theano or MXNet or whatever your favorite is) can be used for optimization problems that are very, very different from deep learning.
It may happen that the best use of machine learning software (for you) is to not do machine learning at all.
Bio: Ted Dunning is Chief Applications Architect at MapR Technologies and is active in the open source community. He has been an early employee in a number of successful startups including MapR.
He currently serves on the Board of Directors of the Apache Foundation, as a champion and mentor for a large number of projects, and as committer and PMC member of the Apache ZooKeeper and Drill projects. He developed the t-digest algorithm used to estimate extreme quantiles.