Pandas on Steroids: End to End Data Science in Python with Dask
End to end parallelized data science from reading big data to data manipulation to visualisation to machine learning.
By Ravi Shankar, Data Scientist
As the saying goes, a data scientist spends 90% of their time cleaning data and 10% complaining about it. The complaints range from data size, faulty distributions, null values, randomness and systematic errors in data capture to differences between train and test sets, and the list goes on and on.
One common bottleneck is sheer data size: either the data doesn't fit into memory, or the processing time is so long (on the order of several minutes) that the flow of pattern analysis is lost. Data scientists are by nature curious human beings who want to identify and interpret patterns normally hidden from a cursory drag-and-drop glance. They need to wear multiple hats and make the data confess via repeated torture (read: iterations 😂).
During exploratory data analysis, even a minimal dataset such as the New York Taxi Fare dataset (https://www.kaggle.com/c/new-york-city-taxi-fare-prediction), with 6 columns covering ID, fare, time of trip, passengers and location, can prompt questions such as:
- How have fares changed year over year?
- Has the number of trips increased over the years?
- Do people prefer traveling alone or in company?
- Have short-distance rides increased as people have become lazier?
- At what time of day and on which day of the week do people travel?
- Have new hotspots emerged in the city recently, apart from the regular airport pickups and drop-offs?
- Are people taking more inter-city trips?
- Has traffic increased, leading to higher fares or longer trip times for the same distances?
- Are there clusters of pickup and drop-off points, or areas that see high traffic?
- Are there outliers in the data, e.g. zero distance with a fare of $100+?
- Does demand change during the holiday season, and do airport trips increase?
- Is there any correlation between weather (rain or snow) and taxi demand?
Even after answering these questions, multiple sub-threads can emerge: can we predict how the Covid-affected New Year is going to look, how does the annual NY marathon shift taxi demand, and is a particular route more prone to multiple passengers (party hub) than to single passengers (airport to suburbs)?
To quench this curiosity, time is of the essence, and it's criminal to keep data scientists waiting 5+ minutes to read a CSV file (55 million rows) or to add a column and then aggregate. Also, since the majority of data scientists are self-taught and so used to the pandas data frame API, they wouldn't want to start the learning process all over again with a different API like numba, Spark or datatable. I have tried juggling between dplyr (R), pandas (Python) and PySpark (Spark), and it is a bit unfulfilling and unproductive considering the need for a uniform pipeline and code syntax. However, for the curious folks, I have written a PySpark starter guide here: https://medium.com/@ravishankar_22148/billions-of-rows-milliseconds-of-time-pyspark-starter-guide-c1f984023bf2
In the subsequent sections, I provide a hands-on guide to Dask with minimal architectural change from our beloved pandas:
1. Data Read and Profiling
Dask vs Pandas speed
How is Dask able to process data ~90X faster, i.e. in under 1 second versus 91 seconds in pandas?
What makes Dask so popular is that it makes analytics scalable in Python without requiring you to switch back and forth between SQL, Scala and Python. The magical feature is that this tool requires minimal code changes. It splits a large data frame into many smaller pandas data frames (partitions) and operates on them in parallel to enable fast calculations.
2. Data Aggregation
With almost no change from the pandas API, Dask is able to perform aggregation and sorting in milliseconds. Note the .compute() call at the end of the lazy computation, which brings the result of the big-data computation into memory as a pandas data frame.
3. Machine Learning
The code snippet below provides a working example of feature engineering and ML model building in Dask using XGBoost.
Feature Engineering and ML Model with Dask
Dask is a powerful tool offering parallel computing, big data handling and the ability to build an end-to-end data science pipeline. It has a gentle learning curve, as the API is almost identical to pandas, and it can handle out-of-memory computations (data ~10X the size of RAM) like a breeze.
Since this is a living blog, I will be writing subsequent parts in the Dask series, where we will target the Kaggle leaderboard using parallel processing. Let me know in the comments if you face any issues setting up Dask, are unable to perform any Dask operations, or just want a general chit-chat. Happy Learning!!!
Bio: Ravi Shankar is a Data Scientist-II at Amazon Pricing.
Original. Reposted with permission.