Parallelism in Machine Learning: GPUs, CUDA, and Practical Applications

A lack of parallel processing in machine learning tasks limits performance, and while introducing parallelism takes effort, it may very well be worth the trouble. Read on for an introductory overview of GPU-based parallelism, the CUDA framework, and some thoughts on practical implementation.

Importantly, hosts and devices possess their own memory spaces, independent of one another. All threads on a CUDA device share a single global memory space. The first requirement for launching kernels and spawning numerous device threads for computation is to copy the required data from host to device memory. Once computation is complete, the results must be copied back in the reverse direction. This is all facilitated via CUDA extensions, and occurs at a heavily abstracted layer from the programmer’s point of view.

When managing device-side resources, proper allocation of blocks to kernels is critical; too few leads to a lack of computational power, while too many results in wasted threads, which could have been assigned to other simultaneously executing kernels. As a specific example, this could translate to too few threads allocated to a particular fold during k-fold cross-validation model building, leading to a much longer validation process than intended.

Conversely, it could result in too many threads being assigned during k-fold cross-validation model building, leaving large numbers of unused threads and extending the amount of time required for all folds to complete their model validation.

Fortunately, the management of device resources, including the number of threads assigned to blocks and, ultimately, to kernels, is user-definable (within some upper bounds, such as a maximum of 1024 threads per block). CUDA also allows this management to be semi-automated: memory allocation calls and kernel launch parameters accept arithmetic expressions as arguments, so that, for example, a program can calculate the size of a data structure such as an array or a matrix at run time and request the amount of memory and the thread dimensions that would be appropriate for its computations.
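To make the sizing arithmetic concrete, here is a minimal sketch in plain Python rather than CUDA C. The function names and the default of 256 threads per block are assumptions chosen for illustration; only the 1024-thread limit comes from the text above. It shows the ceiling division commonly used to pick a block count, and how a thread's global index is derived from its block and thread indices.

```python
MAX_THREADS_PER_BLOCK = 1024  # upper bound mentioned in the text


def blocks_needed(n_elements, threads_per_block=256):
    """Number of blocks required so that every element gets one thread."""
    if not 1 <= threads_per_block <= MAX_THREADS_PER_BLOCK:
        raise ValueError("threads_per_block must be between 1 and 1024")
    # Ceiling division: round up so a partially filled final block is launched.
    return (n_elements + threads_per_block - 1) // threads_per_block


def global_index(block_idx, block_dim, thread_idx):
    """Mirrors the usual CUDA expression blockIdx.x * blockDim.x + threadIdx.x."""
    return block_idx * block_dim + thread_idx


n = 1000                      # hypothetical array length
threads = 256                 # hypothetical block size
blocks = blocks_needed(n, threads)
idle_threads = blocks * threads - n
print(blocks, idle_threads)   # 4 24
```

With these hypothetical numbers, four blocks cover the 1000 elements and 24 threads in the last block have nothing to do, which is the "wasted threads" trade-off in miniature.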

Fig. 4: CUDA Grid Organization.

Consider matrix multiplication, an aspect of linear regression which we propose to parallelize, and its implementation on the CUDA architecture. Proceeding at a high level, without regard to matrix sizes, we can say that we have 2 matrices to multiply, M and N, and that the result will be stored in matrix P. First, we allocate space for matrices M and N in device global memory, as well as space for the resulting matrix P. We then copy matrices M and N to the device.

Assuming for simplicity that all matrices fit into a single block, we will have each thread in the block compute one element of P. To accomplish this, each thread loads a row of M and a column of N, computes their dot product, and stores it in the appropriate element of P. As all of these dot products are computed in parallel, the total time it will take to perform the matrix multiplication is, ideally, roughly the time that it takes to perform a single dot product computation. Once complete, matrix P is copied from device memory back to host memory, where it can be further used by serial code, if necessary. Typically such a kernel operation would be followed by deallocation of device memory.
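The one-thread-per-element idea can be sketched in plain Python. This is an emulation, not CUDA code: the function names are made up for illustration, and the nested loop visits sequentially the (row, column) pairs that a GPU would hand out to separate threads.

```python
def compute_element(M, N, row, col):
    """Work done by one hypothetical thread: a single dot product."""
    return sum(M[row][k] * N[k][col] for k in range(len(N)))


def matmul_one_thread_per_element(M, N):
    rows, cols = len(M), len(N[0])
    P = [[0] * cols for _ in range(rows)]
    # On a GPU these iterations would run concurrently, one per thread.
    # Here they run one after another; no iteration depends on any other.
    for row in range(rows):
        for col in range(cols):
            P[row][col] = compute_element(M, N, row, col)
    return P


M = [[1, 2], [3, 4]]
N = [[5, 6], [7, 8]]
P = matmul_one_thread_per_element(M, N)
print(P)  # [[19, 22], [43, 50]]
```

Because `compute_element` reads only M and N and writes a single cell of P, the order in which the cells are computed does not matter, which is what makes the computation a good candidate for parallel execution.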

This is a high-level overview. In practice, additional tasks need to be performed, such as determining block sizes, as stated above. It is also a single, specific example; however, while the memory management and device computation details will necessarily differ from one algorithm to the next, the overall pattern generalizes to our various tasks: identify parallelizable computations, allocate device memory, copy data to device, perform parallelized computation, copy result back to host, continue with serial code. Note that the memory allocation and data copying overhead can easily become a bottleneck here, with any potential computational time savings stymied by these processes.
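A back-of-the-envelope model helps show when the copying overhead cancels out the gains. The sketch below is plain Python with entirely hypothetical timings and parameter names; it simply adds allocation and transfer costs to the accelerated compute time and compares the total with the serial time.

```python
def offloaded_time(serial_compute, speedup, transfer_in, transfer_out, alloc=0.0):
    """Estimated wall time when the computation is moved to the device."""
    return alloc + transfer_in + serial_compute / speedup + transfer_out


def worth_offloading(serial_compute, speedup, transfer_in, transfer_out, alloc=0.0):
    """True when the device path, overhead included, beats the serial path."""
    total = offloaded_time(serial_compute, speedup, transfer_in, transfer_out, alloc)
    return total < serial_compute


# Hypothetical timings in milliseconds, chosen only for illustration.
print(worth_offloading(100.0, 20.0, 10.0, 10.0))  # True: 25 ms versus 100 ms
print(worth_offloading(10.0, 20.0, 10.0, 10.0))   # False: 20.5 ms versus 10 ms
```

In the second case the computation itself is twenty times faster on the device, yet the round trip through device memory makes the overall job slower.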

Algorithmic Applications in Machine Learning

Given the proper data, knowledge of algorithm implementation, and ambition, there is no limit to what you can attempt with parallel processing in machine learning. Of course, and as mentioned above, identifying parallelizable portions of code is the most difficult task, and there may not be any in a given algorithm.

A good place to start is matrix multiplication, as treated above, which is widely used in implementing linear regression algorithms. An implementation of linear regression on GPGPU can be found here. The paper "Performance Improvement of Data Mining in Weka through GPU Acceleration" notes speed increases and provides some additional insight into conceptualizing parallelism algorithmically.

Another common machine learning task which is ripe for parallelization is distance calculation. Euclidean distance is a very common metric which must be calculated over and over again in numerous algorithms, including k-means clustering. Since the individual distance calculations within an iteration do not depend on one another, they could be performed in parallel (if we set aside, for the moment, memory management overhead as a potential bottleneck to contend with).
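The independence of these calculations can be sketched with Python's standard library. This is an illustration of the structure only, not a GPU implementation: the function names and sample points are assumptions, and a thread pool in CPython will not deliver a real speed-up for pure-Python arithmetic. The point is that each data point's nearest-centroid search needs nothing from any other point.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def euclidean(p, q):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def nearest_centroid(point, centroids):
    """Index of the centroid closest to point; depends on no other point."""
    distances = [euclidean(point, c) for c in centroids]
    return distances.index(min(distances))


def assign_points(points, centroids, workers=4):
    # Each point is handled independently, so the work maps cleanly onto a pool.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: nearest_centroid(p, centroids), points))


points = [(0.0, 0.0), (1.0, 1.0), (9.0, 9.0), (10.0, 8.0)]
centroids = [(0.0, 1.0), (10.0, 10.0)]
print(assign_points(points, centroids))  # [0, 0, 1, 1]
```

This corresponds to the assignment step of one k-means iteration; on a GPU, each point (or each point-centroid pair) would typically be given its own thread.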

Fig. 5: k-fold Cross-validation.

While these aforementioned shared statistical tasks could benefit from efficiency of execution, there is an additional aspect of the machine learning data flow which could potentially allow for even more significant gains. A common evaluation technique regularly employed in machine learning model validation is k-fold cross-validation, involving the intensive, not necessarily sequential processing of dataset segments. k-fold cross-validation is a deterministic procedure, carried out by leaving out one of k segments, or folds, of a dataset, training on the other k-1 segments, and using the held-out segment for testing; this process is then repeated k times, so that each fold serves as the test set once, with the individual prediction error results being combined and averaged into a single, integrated error estimate. This provides variability in the data used for training and testing, with the goal of producing the most accurate predictive models possible.
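Since no fold depends on another, the folds can be evaluated concurrently. The following is a minimal sketch in plain Python, with assumptions stated up front: the function names are invented for illustration, the "model" is a stand-in that simply predicts the mean of the training targets, and a thread pool is used only to show the structure (a real speed-up would need processes or a GPU).

```python
from concurrent.futures import ThreadPoolExecutor


def make_folds(n_samples, k):
    """Split indices 0..n_samples-1 into k contiguous folds."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def evaluate_fold(test_idx, ys):
    """Train on everything outside test_idx and return the test MSE."""
    held_out = set(test_idx)
    train_y = [y for i, y in enumerate(ys) if i not in held_out]
    prediction = sum(train_y) / len(train_y)  # stand-in model: training mean
    return sum((ys[i] - prediction) ** 2 for i in test_idx) / len(test_idx)


def cross_validate(ys, k, workers=None):
    folds = make_folds(len(ys), k)
    # Each fold is an independent train/test job, so they can run side by side.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        errors = list(pool.map(lambda fold: evaluate_fold(fold, ys), folds))
    return sum(errors) / k, errors


mean_error, fold_errors = cross_validate([1.0, 2.0, 3.0, 4.0], k=2)
print(mean_error, fold_errors)  # 4.25 [4.25, 4.25]
```

Swapping the stand-in for a real training routine leaves the structure unchanged: k independent jobs whose error results are averaged at the end.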

This extensive model validation, when performed sequentially, can be relatively time-consuming, especially when each fold is paired with a computationally expensive algorithmic task such as linear regression matrix multiplication. As k-fold cross-validation is a standard method for estimating a given machine learning algorithm’s error rate, attempting to increase the speed at which this estimation occurs seems particularly worthy of effort. A very high-level view of how to do so was sketched earlier in this article.

A consideration for those using Python goes beyond algorithm design, and relates to optimized native code and runtime comparisons with parallel implementations. While beyond the scope of this discussion, you may want to read more about this topic here.

Thinking algorithmically is necessary to leverage finite computational resources in any situation, and machine learning is no different. With some clever thinking, an in-depth understanding of what you are attempting, and a collection of tools and their documentation, you never know what you may be able to achieve. Trust me when I say that, after doing some related work in grad school, finding opportunities to experiment takes much less time than you might think. Familiarize yourself with a code base, read a few tutorials, and get to work.

Parallel computing, GPUs, and traditional machine learning can be good friends, and I challenge you to dig deeper and discover the potential for yourself.