
Here’s how you can accelerate your Data Science on GPU


Data Scientists need computing power. Whether you’re processing a big dataset with Pandas or running some computation on a massive matrix with Numpy, you’ll need a powerful machine to get the job done in a reasonable amount of time.




Over the past several years, Python libraries commonly used by Data Scientists have gotten pretty good at leveraging CPU power.

Pandas, with its underlying code written in C, does a fine job of handling datasets even over 100GB in size. And if you don’t have enough RAM to fit such a dataset, you can always use the convenient chunking functions to process the data one piece at a time.
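As a minimal sketch of the chunked approach, assuming the data lives in a CSV file: pandas.read_csv accepts a chunksize argument and then yields one DataFrame per chunk instead of loading everything at once. The tiny in-memory CSV and the "value" column are made up for illustration; a real file path would take the place of the StringIO object.

```python
import io
import pandas as pd

# Stand-in for a large CSV file on disk (hypothetical data: the numbers 0..9)
csv_text = "value\n" + "\n".join(str(i) for i in range(10))

total = 0
n_rows = 0
# With chunksize set, read_csv returns an iterator of DataFrames
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total += chunk["value"].sum()   # aggregate each piece separately
    n_rows += len(chunk)

mean = total / n_rows
print(n_rows, total, mean)
```

Only one chunk is held in memory at a time, so the peak memory use depends on chunksize rather than on the size of the file.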

 

GPUs vs CPUs: Parallel Processing

 
With massive data, a CPU just isn’t going to cut it.

A dataset over 100GB in size is going to have many data points, in the millions or even billions. With that many points to process, it doesn’t matter how fast your CPU is; it simply doesn’t have enough cores to do efficient parallel processing. If your CPU has 20 cores (which would be a fairly expensive CPU), you can only process 20 data points at a time!

CPUs are still the better choice for tasks where clock speed matters more than parallelism, or where no GPU implementation exists. If there is a GPU implementation of the process you are trying to perform, and the task can benefit from parallel processing, then a GPU will be far more effective.

Figure: How a multi-core system can process data faster. For a single-core system (left), all 10 tasks go to a single node. For the dual-core system (right), each node takes on 5 tasks, thereby doubling the processing speed.
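The arithmetic behind this picture can be sketched in a few lines. This is an idealized model that assumes every task takes the same time and the tasks are fully independent; the function name rounds_needed is made up for illustration.

```python
import math

def rounds_needed(n_tasks, n_cores):
    """Number of sequential rounds when each core handles one task per round."""
    return math.ceil(n_tasks / n_cores)

print(rounds_needed(10, 1))   # single core: 10 rounds
print(rounds_needed(10, 2))   # dual core: 5 rounds, i.e. twice as fast
print(rounds_needed(1_000_000_000, 20))  # a billion points on 20 cores
```

Under these assumptions, doubling the number of cores halves the number of rounds, which is why hardware with thousands of cores is attractive for very large datasets.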

Deep Learning has already seen its fair share of leveraging GPUs. Many of the convolution operations done in Deep Learning are repetitive and as such can be greatly accelerated on GPUs, in some cases by hundreds of times.

Data Science today is no different as many repetitive operations are performed on large datasets with libraries like Pandas, Numpy, and Scikit-Learn. These operations aren’t too complex to implement on the GPU either.

Finally, there’s a solution.

 

GPU Acceleration with Rapids

 
Rapids is a suite of software libraries designed for accelerating Data Science by leveraging GPUs. It uses low-level CUDA code for fast, GPU-optimized implementations of algorithms while still having an easy-to-use Python layer on top.

The beauty of Rapids is that it integrates smoothly with Data Science libraries: things like Pandas dataframes are easily passed through to Rapids for GPU acceleration. The diagram below illustrates how Rapids achieves low-level acceleration while maintaining an easy-to-use top layer.

Figure: Rapids architecture diagram.

Rapids leverages several Python libraries:

  • cuDF: Python GPU DataFrames. It can do almost everything Pandas can in terms of data handling and manipulation.
  • cuML: Python GPU Machine Learning. It contains many of the ML algorithms that Scikit-Learn has, all in a very similar format.
  • cuGraph: Python GPU graph processing. It contains many common graph analytics algorithms including PageRank and various similarity metrics.

 

A Tutorial for how to use Rapids

 

Installation
Now you’ll see how to use Rapids!

To install it, head over to the Rapids website, where you’ll find the installation instructions. You can install it directly on your machine through Conda or simply pull the Docker container.

When installing, you can set your system specs such as CUDA version and which libraries you would like to install. For example, I have CUDA 10.0 and wanted to install all the libraries, so my install command was:

conda install -c nvidia -c rapidsai -c numba -c conda-forge -c pytorch -c defaults cudf=0.8 cuml=0.8 cugraph=0.8 python=3.6 cudatoolkit=10.0


Once that command finishes running, you’re ready to start doing GPU-accelerated Data Science.

Setting up our data
For this tutorial, we’re going to go through a modified version of the DBSCAN demo. I ran the tests on an Nvidia Data Science Workstation, which came with 2 GPUs.

DBSCAN is a density-based clustering algorithm that can automatically classify groups of data, without the user having to specify how many groups there are. There’s an implementation of it in Scikit-Learn.
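To make the idea concrete, here is a minimal pure-Python sketch of the DBSCAN procedure. It is written for readability rather than speed and is not the Scikit-Learn or cuML implementation; the function names and the toy points are made up for illustration. As in Scikit-Learn, a point counts toward its own neighbourhood when compared against min_samples.

```python
from collections import deque
from math import dist

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def dbscan(points, eps, min_samples):
    labels = [None] * len(points)   # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_samples:
            labels[i] = NOISE       # may later be claimed as a border point
            continue
        cluster += 1                # points[i] is a core point: new cluster
        labels[i] = cluster
        queue = deque(neighbors)
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is also a core point: keep growing
    return labels

# Two small groups and one isolated point (hypothetical data)
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (50, 50)]
print(dbscan(points, eps=1.5, min_samples=2))  # [0, 0, 0, 1, 1, -1]
```

Note that the number of clusters is never passed in: it falls out of the density parameters eps and min_samples, and points that belong to no dense region are labelled -1 as noise.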

We’ll start by getting all of our imports set up: libraries for loading data, visualising data, and applying ML models.

import os
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.datasets import make_circles


The make_circles function will automatically create a complex distribution of data resembling two circles that we’ll apply DBSCAN on.

Let’s start by creating our dataset of 100,000 points and visualising it in a plot:

# Generate 100,000 points arranged in two noisy concentric circles
X, y = make_circles(n_samples=int(1e5), factor=.35, noise=.05)
# Scale both coordinates by 3
X *= 3
plt.scatter(X[:, 0], X[:, 1])
plt.show()


Figure: Scatter plot of the generated two-circle dataset.

DBSCAN on CPU
Running DBSCAN on CPU is easy with Scikit-Learn. We’ll import our algorithm and set up some parameters.

from sklearn.cluster import DBSCAN
db = DBSCAN(eps=0.6, min_samples=2)


We can now apply DBSCAN on our circle data with a single function call from Scikit-Learn. Putting %%time at the top of the cell tells Jupyter Notebook to measure the cell’s run time.

%%time
y_db = db.fit_predict(X)


For those 100,000 points, the run time was 8.31 seconds. The resulting plot is shown below.

Figure: Result of running DBSCAN on the CPU using Scikit-Learn.
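The %%time magic only works inside Jupyter/IPython. In a plain Python script, a similar wall-clock measurement can be sketched with time.perf_counter from the standard library. The helper name timed is made up for illustration, and sum stands in for the call being measured.

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Stand-in workload; in the tutorial this would be db.fit_predict with X
result, seconds = timed(sum, range(1_000_000))
print(f"run time: {seconds:.4f} s")
```

With the objects defined above, the equivalent of the timed cell would be y_db, seconds = timed(db.fit_predict, X).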

DBSCAN with Rapids on GPU
Now let’s make things faster with Rapids!

First, we’ll convert our data to a pandas.DataFrame and use that to create a cudf.DataFrame. Pandas dataframes are converted seamlessly to cuDF dataframes without any change in the data format.

import pandas as pd
import cudf

X_df = pd.DataFrame({'fea%d'%i: X[:, i] for i in range(X.shape[1])})
X_gpu = cudf.DataFrame.from_pandas(X_df)


We’ll then import and initialise a special version of DBSCAN from cuML, one that is GPU accelerated. The function format of the cuML version of DBSCAN is the exact same as that of Scikit-Learn — same parameters, same style, same functions.

from cuml import DBSCAN as cumlDBSCAN

db_gpu = cumlDBSCAN(eps=0.6, min_samples=2)


Finally, we can run our prediction function for the GPU DBSCAN while measuring the run time.

%%time
y_db_gpu = db_gpu.fit_predict(X_gpu)


The GPU version has a run time of 4.22 seconds — almost a 2X speedup. The resulting plot is the exact same as the CPU version too, since we are using the same algorithm.

Figure: Result of running DBSCAN on the GPU using cuML.
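Rather than comparing the two plots by eye, the cluster assignments can be compared directly. The sketch below checks whether two label sequences describe the same grouping, even if the cluster numbers differ between runs. The function name same_partition and the example labels are made up for illustration.

```python
def same_partition(labels_a, labels_b, noise=-1):
    """True if both labelings group the points identically, up to renaming clusters."""
    if len(labels_a) != len(labels_b):
        return False
    a_to_b, b_to_a = {}, {}
    for a, b in zip(labels_a, labels_b):
        # Noise must stay noise; it is not an ordinary cluster id
        if (a == noise) != (b == noise):
            return False
        # Each cluster id on one side must map to exactly one id on the other
        if a_to_b.setdefault(a, b) != b or b_to_a.setdefault(b, a) != a:
            return False
    return True

print(same_partition([0, 0, 1, -1], [1, 1, 0, -1]))  # True: same groups, ids swapped
print(same_partition([0, 0, 1], [0, 1, 1]))          # False: different groups
```

To apply this to y_db and y_db_gpu, the GPU result would first have to be copied back to host memory, for example with to_pandas(), assuming fit_predict returns a cuDF Series in the installed cuML version. Border points in DBSCAN can legitimately land in different clusters depending on processing order, so a small number of mismatches does not necessarily indicate a bug.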

 

Getting super speed with Rapids GPU

 
The amount of speedup we get from Rapids depends on how much data we are processing. A good rule of thumb is that larger datasets benefit more from GPU acceleration. There is some overhead associated with transferring data between the CPU and GPU, and that overhead becomes more “worth it” with larger datasets.
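A toy cost model shows why the overhead matters less as the data grows. Everything here is an assumption for illustration: the run time is modelled as linear in the number of points (real DBSCAN is not), and the overhead and per-point costs are invented numbers, not measurements.

```python
def cpu_time(n_points, cpu_s_per_point):
    """Modelled CPU run time: no transfer cost, slower per point."""
    return n_points * cpu_s_per_point

def gpu_time(n_points, transfer_overhead_s, gpu_s_per_point):
    """Modelled GPU run time: fixed transfer overhead plus faster per-point work."""
    return transfer_overhead_s + n_points * gpu_s_per_point

def break_even_points(transfer_overhead_s, cpu_s_per_point, gpu_s_per_point):
    """Dataset size at which the modelled CPU and GPU run times are equal."""
    return transfer_overhead_s / (cpu_s_per_point - gpu_s_per_point)

# Hypothetical costs: 2 s overhead, 1 ms per point on CPU, 0.1 ms on GPU
OVERHEAD, CPU_COST, GPU_COST = 2.0, 1e-3, 1e-4

for n in (1_000, 100_000):
    c = cpu_time(n, CPU_COST)
    g = gpu_time(n, OVERHEAD, GPU_COST)
    print(f"{n:>7} points: CPU {c:.1f} s, GPU {g:.1f} s, speedup {c / g:.2f}X")

print(f"break-even near {break_even_points(OVERHEAD, CPU_COST, GPU_COST):.0f} points")
```

In this model the CPU wins on the small dataset because the fixed overhead dominates, while on the large dataset the overhead is a small fraction of the total and the speedup approaches the ratio of the per-point costs.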

We can illustrate this with a simple example.

We’re going to create a Numpy array of random numbers and apply DBSCAN on it. We’ll compare the speed of our regular CPU DBSCAN and the GPU version from cuML, while increasing and decreasing the number of data points to see how it affects our run time.

The code below illustrates this test:

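import numpy as np
import pandas as pd
import cudf
from sklearn.cluster import DBSCAN
from cuml import DBSCAN as cumlDBSCAN

# Random data; change n_rows to vary the number of data points
n_rows, n_cols = 10000, 100
X = np.random.rand(n_rows, n_cols)
print(X.shape)

# Copy the data to the GPU as a cuDF dataframe
X_df = pd.DataFrame({'fea%d'%i: X[:, i] for i in range(X.shape[1])})
X_gpu = cudf.DataFrame.from_pandas(X_df)

db = DBSCAN(eps=3, min_samples=2)
db_gpu = cumlDBSCAN(eps=3, min_samples=2)

# Run each timed call in its own notebook cell (%%time must be the first line of a cell)
%%time
y_db = db.fit_predict(X)

%%time
y_db_gpu = db_gpu.fit_predict(X_gpu)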


Check out the plot of the results from Matplotlib down below:

Figure: Speedup of GPU DBSCAN over CPU DBSCAN as the number of data points grows.

The amount of speedup rises quite drastically when using the GPU instead of the CPU. Even at 10,000 points (far left) we still get a speedup of 4.54X. On the higher end of things, with 10,000,000 points we get a speedup of 88.04X when switching to GPU!
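A sweep like this can be sketched as a small harness that times two functions over a list of dataset sizes and reports the speedup as CPU time divided by GPU time. The names sweep, time_call and speedup are made up for illustration, and the demo uses cheap stand-in workloads so that it runs without a GPU; it is not the exact code used to produce the numbers above.

```python
import time

def time_call(fn, data):
    """Wall-clock seconds taken by fn(data)."""
    start = time.perf_counter()
    fn(data)
    return time.perf_counter() - start

def sweep(cpu_fit, gpu_fit, make_data, sizes):
    """Time both functions on freshly generated data of each size."""
    rows = []
    for n in sizes:
        data = make_data(n)
        rows.append((n, time_call(cpu_fit, data), time_call(gpu_fit, data)))
    return rows

def speedup(cpu_seconds, gpu_seconds):
    """How many times faster the GPU run was."""
    return cpu_seconds / gpu_seconds

# Stand-ins: sorted() plays the slow path, max() the fast path
for n, cpu_s, gpu_s in sweep(sorted, max, lambda n: list(range(n, 0, -1)), [1_000, 100_000]):
    print(f"{n:>7} points: {cpu_s:.5f} s vs {gpu_s:.5f} s")

# The timings reported earlier for the 100,000-point circles dataset
print(round(speedup(8.31, 4.22), 2))  # 1.97
```

For the real comparison, cpu_fit and gpu_fit would wrap the fit_predict calls of the two DBSCAN objects, with make_data producing the random array. If gpu_fit also performs the conversion to a cuDF dataframe, the transfer time is included in the measurement, which differs from the timed cells above.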

 


 
Bio: George Seif is a Certified Nerd and AI / Machine Learning Engineer.

Original. Reposted with permission.
