How to Scale Sklearn with Dask
Here's how Dask builds on the familiar building blocks of sklearn to take ML modeling workflows to the next level of scalability through high-performance parallel computing.

Image by Author | Ideogram
Dask is a set of Python libraries for parallel computing at scale, enabling efficient task execution across multiple cores or clusters. Combined with machine learning (ML) libraries like Sklearn (scikit-learn), Dask provides scalable data preprocessing, model training, and hyperparameter tuning for large datasets.
This article takes a tutorial-style approach, walking you through how to use Dask together with Sklearn to scale ML modeling workflows beyond Sklearn's original capabilities.
Step-by-Step Tutorial
As usual with any Python project, everything starts with installing and importing the necessary libraries. The code below was run in a Google Colab notebook, so the required installations may vary depending on the development environment you are using.
!pip install dask distributed dask_ml
import numpy as np
import pandas as pd
import dask
import dask.dataframe as dd
import dask.distributed
from dask_ml.preprocessing import StandardScaler
from dask_ml.model_selection import train_test_split
from dask_ml.linear_model import LogisticRegression
import matplotlib.pyplot as plt
We start by defining a function to load and preprocess the dataset. Although Dask is intended for much larger datasets, in this tutorial we use a medium-sized dataset for illustrative purposes: the Chicago ridership open dataset, in a stored version ready to load directly from a GitHub URL.
DATASET_URL = "https://raw.githubusercontent.com/gakudo-ai/open-datasets/refs/heads/main/CTA_-_Ridership_-_Daily_Boarding_Totals.csv"
def load_and_preprocess_dataset(url):
    # Load the dataset using Dask to handle large files efficiently
    ddf = dd.read_csv(url, parse_dates=['service_date'])
    # Basic data cleaning and feature engineering
    ddf['DayOfWeek'] = ddf['service_date'].dt.dayofweek
    ddf['Month'] = ddf['service_date'].dt.month
    ddf['IsWeekend'] = ddf['DayOfWeek'].isin([5, 6]).astype(int)
    # Create a binary classification target:
    # predict whether ridership is above the median (a high-ridership day)
    median_ridership = ddf['total_rides'].median().compute()
    ddf['HighRidership'] = (ddf['total_rides'] > median_ridership).astype(int)
    return ddf
Important remarks about what we just did in the above code:
- Dask provides a dataframe package similar to Pandas dataframes (we aliased it as 'dd' when importing it), suited to managing large data volumes more efficiently; a short sketch of its lazy behavior follows this list.
- The dataset was originally intended for time series forecasting, namely predicting daily bus and train boardings. We reformulate it as a binary classification task by adding a new target variable that labels each day's ridership as high or low, depending on whether the daily total of boardings is above or below the median.
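To make that laziness tangible, here is a minimal standalone sketch (not part of the tutorial pipeline; the variable names are illustrative, and it reuses the DATASET_URL defined above): Dask dataframe operations only build a task graph until you explicitly call .compute().
# Minimal sketch: Dask dataframe operations stay lazy until .compute() is called
ddf_demo = dd.read_csv(DATASET_URL, parse_dates=['service_date'])
print(ddf_demo.npartitions)                  # number of partitions Dask can process in parallel
lazy_mean = ddf_demo['total_rides'].mean()   # builds a task graph; nothing is computed yet
print(lazy_mean.compute())                   # triggers the actual (parallel) computation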
Let's continue adding some more code:
client = dask.distributed.Client()
print("Dask Dashboard URL:", client.dashboard_link)
ddf = load_and_preprocess_dataset(DATASET_URL)
feature_columns = ['DayOfWeek', 'Month', 'IsWeekend']
target_column = 'HighRidership'
X = ddf[feature_columns].to_dask_array(lengths=True)  # lengths=True computes concrete chunk sizes
y = ddf[target_column].to_dask_array(lengths=True)
In the above code, we just:
- Initialized a Dask distributed client.
- Loaded and preprocessed the data by using the previously defined function.
- Selected three predictor features and the newly created binary class for our ML task.
- Converted the selected features and target to Dask arrays for compatibility: most ML models and estimators in Dask are best suited to operating on Dask arrays. Setting lengths=True ensures that the sizes of the data chunks Dask uses internally for parallel computation are known and aligned in the upcoming transformations (you can inspect them right after this list).
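If you are curious what lengths=True actually buys you, a quick optional check on the arrays we just created shows the now-concrete shapes and chunk layout:
# Optional check: with lengths=True, row counts and chunk sizes are concrete
print(X.shape)    # overall shape of the feature array
print(X.chunks)   # per-dimension chunk sizes Dask uses to parallelize work
print(y.chunks)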
Next, we scale the features and split the data into training and test sets. As you will see, we start using Dask counterparts of familiar sklearn functionality: concretely, StandardScaler and train_test_split. It looks like sklearn, but it's Dask! And, of course, the train-test splitting happens in a distributed fashion.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)
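Because Dask evaluates lazily, every pass the model makes over X_train would otherwise re-execute the scaling and splitting graph. If the training split fits in your workers' memory, you can optionally persist it before fitting; this is a small optimization sketch, not a required step of the tutorial:
# Optional: keep the scaled, split training data in (distributed) memory
# so model fitting does not re-run the whole preprocessing graph
X_train, y_train = dask.persist(X_train, y_train)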
We are ready to train our logistic regression classifier! As the code below shows, the process, classes, and methods used to train the model and evaluate it on the training and test sets look almost identical to sklearn's, with one small nuance: because Dask computes metrics lazily, we need to append a .compute() call to the instructions that calculate the model's accuracy.
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
train_score = model.score(X_train, y_train).compute()
test_score = model.score(X_test, y_test).compute()
print(f"Training Accuracy: {train_score}")
print(f"Testing Accuracy: {test_score}")
Output:
Training Accuracy: 0.7851586807716241
Testing Accuracy: 0.7879353233830846
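If you need metrics beyond the estimator's built-in score, dask_ml also ships distributed versions of several sklearn metrics that accept Dask arrays directly. A minimal sketch, assuming the model and test split from above (and assuming dask_ml's accuracy_score computes its result eagerly by default):
from dask_ml.metrics import accuracy_score

y_pred = model.predict(X_test)         # predictions as a lazy Dask array
print(accuracy_score(y_test, y_pred))  # distributed accuracy computation over the test set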
A good practice when you are done using Dask in your project is to close the client session:
client.close()
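As a side note, the client can also be used as a context manager, which closes the session automatically even if an error interrupts the workflow; a minimal pattern sketch:
from dask.distributed import Client

# The client is closed automatically when the with-block exits
with Client() as client:
    print("Dask Dashboard URL:", client.dashboard_link)
    # ... run the workflow here ...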
Wrap Up
This article illustrated how to use the Dask library and its functionality to scale machine learning model development. By mirroring many of the interfaces and procedures used in sklearn, Dask makes it easy for developers familiar with the well-known machine learning library to transition into more scalable ML workflows that leverage parallel and distributed computing.
Iván Palomares Carrascosa is a leader, writer, speaker, and adviser in AI, machine learning, deep learning & LLMs. He trains and guides others in harnessing AI in the real world.