Super Charge Python with Pandas on GPUs Using Saturn Cloud

Saturn Cloud is a tool that gives you 10 hours of GPU computing and 3 hours of Dask cluster computing each month for free. In this tutorial, you will learn how to use these free resources to process data using Pandas on a GPU. The experiments show that Pandas on a CPU is over 1,000,000% slower than running Pandas on a Dask cluster of GPUs.

By Tyler Folkman, Head of Artificial Intelligence at BEN Group

Photo by Guillaume Jaillet on Unsplash


I once asked on LinkedIn what libraries people are most likely to import when first opening up a Jupyter Notebook.

Do you know what the number one response was?



The Pandas library is used extensively by data scientists, but the truth is it can often be quite slow. And slow data processing can lead to a lot of delays in a project. If generating a new feature for your model takes even 10 minutes every time, you find yourself sitting around just waiting (or doing other things :) ).



The time it takes for data to process and models to train is similar to the compiling time for programmers. And while you might enjoy some of your data processing breaks due to long processing times, I want to show you a better way.

I want to show you how you can easily avoid being over 1,000,000% slower when using Pandas.


The Setup

Before we dig into the experiments, let’s talk about the hardware we will be using.

To make this incredibly easy for you to follow along with, we will be using Saturn Cloud.

Source: Saturn Cloud


Saturn Cloud is a really slick platform that gives you access to:

  • 10 hours of free Jupyter per month (including GPU)
  • 3 hours of Dask per month (including GPU)
  • Dashboard deployment

You can use it for free, which makes it a great place to experiment with large-scale data processing!

When you start up a project on Saturn, you get access to two very important pieces of hardware: a Jupyter server and a Dask cluster.

Source: Saturn Cloud


Our Jupyter server is running with 4 cores, 16 GB of RAM, and a single GPU. The Dask cluster has 3 workers, each with 4 cores, 16 GB of RAM, and a single GPU.

The other amazing thing you get when selecting the RAPIDS base project is a Python kernel with all of NVIDIA’s RAPIDS libraries, including cuDF, which allows us to run our Pandas processing on a GPU.

Getting this set up with a few clicks is no small win. Even if you have a computer with a GPU, getting the RAPIDS libraries working can be time-consuming because you have to make sure you have the right drivers and CUDA library.


The Data

Alright, now that we have our hardware set up, we need to talk about the dataset we will be using.

We will use the Yellow Taxi trip record data.

These data have 7,667,792 rows and 18 columns. Here is what you get when running info() on the data frame:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7667792 entries, 0 to 7667791
Data columns (total 18 columns):
 #   Column                 Dtype         
---  ------                 -----         
 0   VendorID               int64         
 1   tpep_pickup_datetime   datetime64[ns]
 2   tpep_dropoff_datetime  datetime64[ns]
 3   passenger_count        int64         
 4   trip_distance          float64       
 5   RatecodeID             int64         
 6   store_and_fwd_flag     object        
 7   PULocationID           int64         
 8   DOLocationID           int64         
 9   payment_type           int64         
 10  fare_amount            float64       
 11  extra                  float64       
 12  mta_tax                float64       
 13  tip_amount             float64       
 14  tolls_amount           float64       
 15  improvement_surcharge  float64       
 16  total_amount           float64       
 17  congestion_surcharge   float64       
dtypes: datetime64[ns](2), float64(9), int64(6), object(1)
memory usage: 1.0+ GB

So, a decently sized dataset, but nothing too crazy. It comes in at about 1GB of memory usage.
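As a rough sanity check on that figure, the arithmetic below multiplies the row and column counts from the info() output by 8 bytes per value. This is a sketch under the assumption of a 64-bit Python build, where int64, float64, and datetime64[ns] values and object pointers each take 8 bytes. The "+" in the reported memory usage reflects that the contents of the object column are not counted.

```python
# Sizes taken from the info() output above.
n_rows = 7_667_792
n_columns = 18
# Assumption: 8 bytes per value (int64, float64, datetime64[ns], object pointers).
bytes_per_value = 8

total_bytes = n_rows * n_columns * bytes_per_value
total_gib = total_bytes / 1024**3

print(f"{total_bytes:,} bytes is about {total_gib:.2f} GiB")
```

The result lands just above 1 GiB, which is consistent with the reported "1.0+ GB".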


The Function

Lastly, for our experiments, we need a function to run on our data. I selected a pretty simple function that creates a new feature for a potential model. Our function will calculate the total amount paid divided by the trip distance.

def calculate_total_per_mile(row):
    try:
        total_per_mile = row.total_amount / row.trip_distance
    except ZeroDivisionError:
        # Trips with a zero trip_distance get a value of 0
        total_per_mile = 0
    return total_per_mile

When we find a zero value for trip_distance, we will just return a value of zero.

Note: You can easily vectorize this function as taxi_df["total_per_mile"] = taxi_df.total_amount / taxi_df.trip_distance, which would be significantly faster. We are using a function instead so we can compare apply speeds for our experiments. If you’re curious, the vectorized version took about 50 ms.
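One detail worth keeping in mind: plain vectorized division does not raise on a zero trip_distance, so it will not return 0 the way the row-wise function does. Here is a minimal sketch of a vectorized version that mirrors the zero handling. The tiny sample_df is made-up data for illustration, not the taxi dataset.

```python
import pandas as pd

# Tiny made-up sample standing in for the taxi data (illustrative values only).
sample_df = pd.DataFrame(
    {"total_amount": [10.0, 7.5, 3.0], "trip_distance": [2.0, 0.0, 1.5]}
)

# Vectorized division: a zero distance yields inf instead of raising an error.
ratio = sample_df["total_amount"] / sample_df["trip_distance"]

# Keep the ratio where the distance is non-zero, otherwise use 0,
# mirroring what the row-wise function returns.
sample_df["total_per_mile"] = ratio.where(sample_df["trip_distance"] != 0, 0.0)

print(sample_df)
```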

Experiment #1 — Raw Pandas

For our first experiment, we are just going to read the data using raw Pandas. Here is the code:

import pandas as pd

taxi_df = pd.read_csv(
    "s3://nyc-tlc/trip data/yellow_tripdata_2019-01.csv",
    parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"],
)

With the data read in, we can apply our function to all our rows:

taxi_df['total_per_mile'] = taxi_df.apply(lambda x: calculate_total_per_mile(x), axis=1)

It took 159,198 ms (2 minutes and 39.20 seconds) to run this calculation.
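If you want to take similar measurements yourself, one simple option is a small timing helper built on time.perf_counter. This is only a sketch: the article does not say how its timings were taken, and time_call_ms is a hypothetical helper name, not part of Pandas.

```python
import time


def time_call_ms(func, *args, **kwargs):
    """Run func once and return (result, elapsed time in milliseconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms


# Stand-in workload; swap in the apply call you want to measure.
result, elapsed_ms = time_call_ms(sum, range(1_000_000))
print(f"Took {elapsed_ms:.1f} ms")
```

A single run can be noisy, so repeating the call a few times and comparing the results gives a more reliable picture.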

While that isn’t terribly slow, it is definitely slow enough that you notice the time it takes and it throws off your flow. I found myself sitting around waiting, and that waiting time could easily tempt you to get distracted by email, Slack, or social media. And those types of distractions can really kill productivity well beyond the almost 3 minutes this calculation took to run.

Can we do better?


Experiment #2 — Parallel Pandas

Swifter is a library that makes it incredibly simple to use all the threads of your CPU when running Pandas apply.

Since apply is easily parallelized because you can just break the data frame into chunks for each thread, this should help. Here is the code:

taxi_df['total_per_mile'] = taxi_df.swifter.apply(lambda x: calculate_total_per_mile(x), axis=1)

Adding swifter to the mix brought our processing time down to 88,690 ms (1 minute and 28.69 seconds). Raw Pandas was 1.795 times slower, and you definitely notice the difference when running the code. That being said, you still find yourself waiting for it to finish.
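To make the chunking idea concrete, here is a minimal sketch of the split, apply, and concatenate pattern. It is not how swifter is implemented. The names chunked_apply and apply_chunk and the sample data are hypothetical, and it uses threads only to keep the example short. For CPU-bound Python code, a real speedup generally needs separate processes.

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd


def total_per_mile(row):
    # Explicit zero check so the sketch does not depend on exception behavior.
    if row.trip_distance == 0:
        return 0.0
    return row.total_amount / row.trip_distance


def apply_chunk(chunk):
    # Ordinary row-wise apply, restricted to one slice of the data frame.
    return chunk.apply(total_per_mile, axis=1)


def chunked_apply(df, n_chunks):
    # Split the rows into roughly equal contiguous chunks.
    size = max(1, -(-len(df) // n_chunks))  # ceiling division
    chunks = [df.iloc[i:i + size] for i in range(0, len(df), size)]
    # Process each chunk in its own worker, then stitch the results together.
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        parts = list(pool.map(apply_chunk, chunks))
    return pd.concat(parts)


# Made-up sample data for illustration.
sample_df = pd.DataFrame(
    {
        "total_amount": [10.0, 7.5, 3.0, 8.0, 4.0],
        "trip_distance": [2.0, 0.0, 1.5, 2.0, 4.0],
    }
)
sample_df["total_per_mile"] = chunked_apply(sample_df, n_chunks=2)
print(sample_df)
```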


Experiment #3 — Pandas on a GPU

This is where things start to get interesting. The cuDF library will allow us to run our function on the GPU. The cuDF library “provides a pandas-like API that will be familiar to data engineers & data scientists, so they can use it to easily accelerate their workflows without going into the details of CUDA programming.”

To use the cuDF library, though, we have to change the format of our function slightly:

def calculate_total_per_mile(total_amount, trip_distance, out):
    for i, (ta, td) in enumerate(zip(total_amount, trip_distance)):
        total_per_mile = ta / td
        out[i] = total_per_mile

Now our function takes in total_amount, trip_distance, and out. The out parameter is the array in which we will store the results. Our function then loops over all the values of total_amount and trip_distance from our data frame, calculates our total_per_mile, and stores the results in out.
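To see the input-arrays-plus-output-array pattern in isolation, you can call a function of the same shape on plain NumPy arrays on the CPU. This is only an emulation for illustration and does not use cuDF or a GPU. The zero-distance guard is an addition that is not in the GPU function above, and the sample values are made up.

```python
import numpy as np


def calculate_total_per_mile(total_amount, trip_distance, out):
    for i, (ta, td) in enumerate(zip(total_amount, trip_distance)):
        # Added guard (not in the original GPU function): zero distance gives 0.
        out[i] = ta / td if td != 0 else 0.0


# Made-up sample columns standing in for the data frame columns.
total_amount = np.array([10.0, 7.5, 3.0])
trip_distance = np.array([2.0, 0.0, 1.5])

# The caller allocates the output array; the function fills it in place.
out = np.empty_like(total_amount)
calculate_total_per_mile(total_amount, trip_distance, out)

print(out)
```

Notice that the function returns nothing: results come back through out, which is why the output column has to be declared separately when the function is applied.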

We also have to change how we apply this function to our data frame:

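import numpy as np

taxi = taxi.apply_rows(calculate_total_per_mile,
    incols={'total_amount': 'total_amount', 'trip_distance': 'trip_distance'},
    outcols={'out': np.float64},
    kwargs={},  # assumption: no extra keyword arguments are passed to the function
)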

We now specify what the input columns are from our data frame and how they map to the function parameters. We also specify which parameter is the output (outcols) and what type of value it is (np.float64).

Note: reading in the data remains the same except you use cudf.read_csv() instead of pd.read_csv().

Using the cuDF library and leveraging our GPU takes the processing time down to 43 ms! That means raw pandas was 3,702 times slower! That is insane! At this point, the processing time doesn’t feel like a delay at all. You run the function and before you know it, you have results. Honestly, I was amazed by how much faster it was to run our processing on the GPU.

But! We have one more experiment to run to see if we can make it even faster.


Experiment #4 — Pandas on Multiple GPUs

dask_cudf is a library that we can use to leverage our Dask cluster, which has 3 workers, each with a GPU. Essentially, we will be running our function on 3 GPUs distributed across our Dask cluster.

This might sound complicated, but Saturn makes it really easy. With the following code, you can connect to your Dask cluster:

from dask.distributed import Client, wait
from dask_saturn import SaturnCluster

n_workers = 3
cluster = SaturnCluster(n_workers=n_workers)
client = Client(cluster)

Once connected, you read in the data like so:

import dask_cudf

taxi_dc = dask_cudf.read_csv(
    "s3://nyc-tlc/trip data/yellow_tripdata_2019-01.csv",
    parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"],
    storage_options={"anon": True},
)

And finally, you can run the function in the exact same way you did with cuDF. The only difference is that we are now running on a data frame read in with the dask_cudf library, so we will be leveraging our Dask cluster.

How long did it take?

12 ms

That is 13,267 times faster than raw Pandas. Put another way, Pandas is 1,326,650% slower than running on a cluster of 3 GPUs.
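If you want to check the comparisons, the ratios follow directly from the timings reported in each experiment. A short sketch of the arithmetic, using only those reported numbers:

```python
# Timings in milliseconds, as reported in the four experiments.
timings_ms = {
    "raw pandas": 159_198,
    "swifter": 88_690,
    "cuDF (1 GPU)": 43,
    "dask_cudf (3 GPUs)": 12,
}

baseline_ms = timings_ms["raw pandas"]

# How many times slower raw Pandas is than each approach.
speedups = {name: baseline_ms / ms for name, ms in timings_ms.items()}

for name, ratio in speedups.items():
    print(f"{name}: raw Pandas takes {ratio:,.3f}x as long ({ratio * 100:,.0f}%)")
```

The percentage in the text is this ratio multiplied by 100.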

Wow. That is fast. You could get a much larger dataset and still run this function fast enough to not notice much of a delay at all.

Note: This is still over 4 times faster than the Pandas vectorized version of our function!

Speed Matters

Hopefully, I’ve convinced you to drop everything right now and go try out Pandas on a GPU using Saturn Cloud.

You could argue that almost 3 minutes to wait for a function to run really isn’t that long, but when you’re focused on programming and in the flow, 3 minutes really feels like forever. It is enough time that you start to get distracted and could easily end up wasting even more time via distraction.

And if you have even larger data, those wait times will only get longer.

So, go try it out for yourself. I think you will be amazed by how much better 12 ms or even 43 ms feels when running on a GPU as compared to over 159,000 ms.

Also, thank you to Saturn Cloud for working with me on this article! It was my first deep dive into the platform and I was truly impressed.

Bio: Tyler Folkman is the Head of Artificial Intelligence at BEN Group. His work explores applications of machine learning in disrupting and transforming the entertainment and marketing industries. Tyler's work has earned multiple patents in the areas of entity resolution and knowledge extraction from unstructured data.