Deep Learning Breakthrough: a sub-linear deep learning algorithm that does not need a GPU?

Deep Learning sits at the forefront of many important advances underway in machine learning. With backpropagation being a primary training method, its computational inefficiencies require sophisticated hardware, such as GPUs. Learn about this recent breakthrough algorithmic advancement with improvements to the backpropgation calculations on a CPU that outperforms large neural network training with a GPU.

comments

By Anshumali Shrivastava, Rice University.

We all know that deep neural networks (DNNs) are spearheading the current AI revolution. The backpropagation algorithm is a commonly used method for training DNNs. Backpropagation consists of forward and backward pass over the network, which can be written as a sequence of matrix multiplications. GPUs (Graphical Processing Units) have a unique advantage here because they have thousands of processing cores optimized for Matrix multiplication. They can perform an order of magnitude faster matrix multiplications.

Wasteful Computations in Backpropagation

Backpropagation is not an efficient algorithm. Prof. Geoffrey Hinton, one of the pioneers of backpropagation, has identified several of its vital shortcomings in an interview. The core idea behind the SLIDE algorithm is to minimize wasteful computations in backpropagation. Imagine a vast neural network with hundreds of millions of parameters. As training progresses, different neurons specialize and focus (with large activations) on very different characteristics of the data. The result is a network where only a few neurons are important for any given input, as demonstrated by the famous cat neuron article. We call this characteristic adaptive sparsity.

Backpropagation is oblivious to adaptive sparsity. Instead, backpropagation updates millions of parameters using full matrix multiplication. Most of these updates are unnecessary.

Since 2013, adaptive sparsity has been exploited in several ways to improve generalization. For example, a NeurIPS 2013 paper showed we could limit the weight updates to a tiny subset of the neurons in each layer. The subset was selected using adaptive sampling. The sampling is adaptive in the sense that it favors neurons with large values of activation during sampling. The authors argue that sparse adaptive updates support better generalization leading to better accuracy.

Unfortunately, these sparse updates are far from being computationally efficient. We need to compute all the activations for each input to determine which neurons to update. Since every neuron must be accessed, the number of operations is linear in the number of parameters. Thus, if we have one million parameters, we must perform one million arithmetic operations for every input.

Detour: The Fundamental Puzzle of Inner Product Sampling

Algorithms for efficient sparse updated are locked behind an algorithmic puzzle. The puzzle is: “How can we sample neurons with large activations without computing all of the activations? A key observation is that the activations are a monotonic function of the inner product between neuron weights and the input. Thus, we can reduce the problem to: “How can we efficiently sample weight vectors that will have a large inner product with the input?”

The problem seems a little absurd because we need to know the values of the inner products to tell which ones are large. The problem is analogous to a similarity search problem in disguise, where the inner product is the similarity measure, and the input is the query. In web search, it is well known that we can find documents that are similar (or relevant) to the query without linear computations using indexing.

Hash Tables for Adaptive Sampling

A hash function h(x) maps an input x to an integer, called a hash value or fingerprint. Let us assume that h(x) has a special property: given two vectors, the hash values of both vectors are the same with probability Pr⁡(h(x)=h(y))=g(⟨x,y⟩), where g is a monotonic function. This property describes a locality sensitive hash (LSH) function. In simpler terms, if two vectors x and y have a large inner product, then their corresponding hash values h(x) and h(y) are very likely to be identical. If the vectors have a small inner product, then h(x) and h(y) will most likely be different. We showed how to construct a hash function with this property in this paper, which received the outstanding paper award in NIPS 2014 (now NeurIPS).

In early 2016, we showed that (paper) our functions lead to an efficient sampling algorithm for neurons. The idea is to index all the neurons into a hash table using h, as shown in Figure 2. We store a reference to neuron hash at the memory location pointed to by its fingerprint . After preprocessing the neurons, we can efficiently sample neurons for any input x. We simply go to the memory location h(x) and choose neurons from there. This process ensures hash collisions, and hence we only select neurons that are likely to have large inner products. Our sampling process is provably adaptive to the activation values without needing to explicitly compute all activations.

Figure 1: Illustration of Hash tables for sampling neurons. The HASH function satisfied the LSH property.

Subsequent work has shown that probabilistic hash tables can lead to O(1) importance sampling for efficient estimation of partition functions, kernels, and stochastic gradients.

Figure 2: SLIDE Algorithm Illustration. Every layer has hash tables for efficiently identifying neurons with high activation. for every input, we only need hash computation and a few memory probes to obtain a very sparse snapshot of the network.

The SLIDE Algorithm: Sparse Backpropagation with Hash Tables

The SLIDE algorithm (Figure 2) uses hash tables to select neurons during training. Complete implementation details are available in our paper.

SLIDE begins by initializing the neural network parameters and pre-indexing all of the neurons into LSH hash tables.

To train the network with a given input, we first compute the LSH hash value (H1) of the input. Then, we look up neurons with H1 as the address (memory location) and sample neurons from the corresponding bucket. We only compute the activations for these neurons, and we set the activations for other neurons to 0. In the next layer, we use the sparse activation output of the previous layer as the input. The input is sparse because the only non-zero activations are from the selected neurons. We obtain a new hash value H2 for our new input and repeat the sampling process for the next layer. For every input, we only need a few hash values and memory lookups to select a sparse set of important nodes in the network. We only feedforward and backpropagate on the selected neurons. Finally, we update the position of the selected neurons in the hash tables using the updated weights.

During training, all operations are of the order of active neurons (sub-linear) rather than the total number of neurons (linear). In our experiments, we show that only 0.5% of the neurons are needed to get the same accuracy as full backpropagation. While SLIDE introduces a small overhead to maintain the hash tables, our operations are around 200x less expensive when compared to standard backpropagation.

Parallel Gradient Updates

Since only 0.5% of the nodes are active for a given input, it is unlikely that two unrelated inputs x and y will share many selected neurons. As a result, we can perform asynchronous gradient updates in parallel. The HOGWILD paper shows that this idea is both theoretically sound and practical for gradient descent.

To summarize, the SLIDE algorithm requires orders of magnitude fewer arithmetic operations on each input. When combined with asynchronous data-parallel gradient updates, we have a highly efficient training process. In contrast, standard backpropagation on a GPU performs a costly arithmetic operation. GPU parallelization is limited to the scaling up matrix multiplications rather than data-parallel processes.

Figure 3: Training time comparison of SLIDE (in red) against TF-GPU (in blue) and TF-CPU (in black). The x-axis is on the log scale. SLIDE was implemented in C++ and experiments were done on the same CPU as TF-CPU. TF-CPU used the most optimized Intel’s version (MKL-DNN).

Comparison with TensorFlow Backpropagation on CPU and V100 GPU

SLIDE was compared on large, fully connected neural networks with more than 100 million parameters, commonly encountered in commercial recommendation engines. We highlight one of the training plots from the paper. The code, along with benchmark scripts are publicly available on Github. The data is publicly available Amazon-670k dataset from Kaggle.

Figure 3. illustrates the complete progress of training with SLIDE (Red Curve) running on a 44-core Intel Xeon CPU (2 sockets of 22 cores each). SLIDE is compared with the most optimized implementations of backpropagation available via TensorFlow on V100 GPU (TF-GPU) (Blue curve) as well as Intel optimized TensorFlow on the same 44-core CPU (TF-CPU) (Black curve). The running time is on the log scale. TF-CPU is slow compared to TF-GPU, indicating the advantages of GPUs for backpropagation on this large workload. The exciting part is that a C++ implementation of the efficient SLIDE algorithm, on the same CPU, significantly outperforms the GPU acceleration. SLIDE converges 3.5x faster than TensorFlow on V100 GPU (1 hour vs. 3.5 hours).

Bio: Anshumali Shrivastava (@Anshumali_) is an assistant professor in the computer science department at Rice University. His broad research interests include randomized algorithms for large-scale machine learning. In 2018, Science news named him one of the Top-10 scientists under 40 to watch. He is a recipient of the National Science Foundation CAREER Award, a Young Investigator Award from the Air Force Office of Scientific Research, and a machine learning research award from Amazon. He has won numerous paper awards, including Best Paper Award at NIPS 2014 and Most Reproducible Paper Award at SIGMOD 2019. His work has been featured in several media outlets including the New York Times, IEEE Spectrum, Science News, Engadget, and ArsTechnica.

Related: