How to Run Parallel Time Series Analysis with Dask
Dask is a Python library for parallel computing, which is useful for data-intensive applications like advanced data analytics and machine learning solutions. One scenario where Dask's parallel computing capabilities shine is time series analysis and forecasting.
In this article, we show you how to run parallel time series analysis with Dask, through a practical Python-based tutorial.
Step-by-Step Tutorial
As usual with any Python-related project, we install and import the necessary libraries and packages, including Dask dependencies. The code below has been run in a Google Colab notebook, and the actual installations you may need will depend on the development environment you are working with.
!pip install dask dask[distributed]
import dask.dataframe as dd                                    # parallel, pandas-like DataFrames
import matplotlib.pyplot as plt                                 # plotting
import seaborn as sns                                           # plot styling (optional)
import dask.distributed                                         # distributed scheduler client
from dask.diagnostics import ProgressBar, Profiler, visualize  # diagnostics tools
We will use a publicly available dataset of daily bus and train ridership in Chicago, US. The official dataset page documents the data attributes in detail; for convenience, we will access a copy hosted in the GitHub repository pointed to by the URL below.
DATASET_URL = "https://raw.githubusercontent.com/gakudo-ai/open-datasets/refs/heads/main/CTA_-_Ridership_-_Daily_Boarding_Totals.csv"
def prepare_time_series(url):
    # Lazily load the CSV into a Dask DataFrame, parsing the date column
    ddf = dd.read_csv(url, parse_dates=['service_date'])
    # Derive calendar-based features from the service date
    ddf['DayOfWeek'] = ddf['service_date'].dt.dayofweek
    ddf['Month'] = ddf['service_date'].dt.month
    ddf['IsWeekend'] = ddf['DayOfWeek'].isin([5, 6]).astype(int)
    return ddf
The function we just defined loads the dataset and parses its date attribute, so that the data is recognized as time series data from the very beginning. The key to leveraging Dask's parallel computing lies precisely in this initial step: we use dask.dataframe (imported as dd), a data structure similar to the pandas DataFrame but designed for parallel data processing and computation. Next, the date attribute, service_date, is decomposed into more granular attributes, namely the day of the week, the month, and a flag indicating whether the date falls on a weekend or a weekday.
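Before moving on, it is worth checking that the loading step behaves as expected. The short sketch below is an optional sanity check, not part of the original pipeline: it calls prepare_time_series on the dataset URL and previews the result. Note that head() only computes the first partition, while the rest of the DataFrame stays lazy.
ddf = prepare_time_series(DATASET_URL)
print(ddf.npartitions)   # number of partitions Dask can process in parallel
print(ddf.dtypes)        # column types, including the derived calendar features
print(ddf.head())        # head() triggers a small computation on the first partition only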
We now define the core function for this tutorial, where the time series analysis takes place. The function's body is wrapped with the context managers Dask provides for diagnostics integration: Profiler() records the tasks executed during the computation, ProgressBar() shows real-time computation progress, and visualize() is called at the end of the analysis pipeline to render a computational profile of the recorded tasks. Inside the function, we aggregate boardings by quarter and analyze their variability, compare weekday vs. weekend boardings, and build a seasonal heatmap.
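Before looking at the full function, here is a minimal sketch of just the diagnostics pattern, assuming ddf was created with prepare_time_series(DATASET_URL) as in the preview above; visualize() typically requires bokeh to be installed to render the profile plot.
with Profiler() as prof:      # records which tasks ran and how long they took
    with ProgressBar():       # prints a live progress bar for compute() calls
        monthly_totals = ddf.groupby('Month')['total_rides'].sum().compute()
visualize(prof)               # renders the recorded task profile (needs bokeh)
print(monthly_totals)
Note that the dask.diagnostics tools instrument Dask's single-machine schedulers; once a distributed client is started (as we do in the main function later), the distributed dashboard becomes the primary monitoring tool.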
def advanced_time_series_analysis(ddf):
    with Profiler() as prof:
        with ProgressBar():
            # Aggregate daily boardings into quarterly statistics (executed in parallel)
            quarterly_stats = (
                ddf.groupby([ddf['service_date'].dt.year, ddf['service_date'].dt.quarter])['total_rides']
                .agg(['sum', 'mean', 'count', 'std'])
                .compute()
            )
            # Materialize the full dataset as a pandas DataFrame for plotting
            pdf = ddf.compute()

            plt.figure(figsize=(15, 10))

            # 1. Quarterly total boardings over time
            plt.subplot(2, 2, 1)
            quarterly_stats['sum'].plot(kind='line', title='Quarterly Total Boardings')
            plt.xlabel('Year-Quarter')
            plt.ylabel('Total Boardings')
            plt.xticks(rotation=45)

            # 2. Variability of quarterly boardings
            plt.subplot(2, 2, 2)
            quarterly_stats['std'].plot(kind='bar', title='Quarterly Boarding Variability')
            plt.xlabel('Year-Quarter')
            plt.ylabel('Standard Deviation')
            plt.xticks(rotation=45)

            # 3. Weekday vs. weekend boarding distributions
            plt.subplot(2, 2, 3)
            weekday_stats = pdf[pdf['IsWeekend'] == 0]['total_rides']
            weekend_stats = pdf[pdf['IsWeekend'] == 1]['total_rides']
            plt.boxplot([weekday_stats, weekend_stats], labels=['Weekday', 'Weekend'])
            plt.title('Boarding Distribution: Weekday vs Weekend')
            plt.ylabel('Total Boardings')

            # 4. Seasonal heatmap of average boardings by month and year
            plt.subplot(2, 2, 4)
            seasonal_pivot = pdf.pivot_table(
                index='Month',
                columns=pdf['service_date'].dt.year,
                values='total_rides',
                aggfunc='mean'
            )
            plt.imshow(seasonal_pivot, cmap='YlGnBu', aspect='auto')
            plt.colorbar(label='Avg. Boardings')
            plt.title('Seasonal Boarding Heatmap')
            plt.xlabel('Year')
            plt.ylabel('Month')

            plt.tight_layout()
            plt.show()

    # Render the computational profile recorded by the Profiler
    visualize(prof)
    return quarterly_stats
Let's revisit what happens at each stage of the data analysis workflow above:
- Since the time series dataset spans from 2001 to 2022, we first aggregate the daily data into quarterly summaries. Dask builds a lazy task graph of the grouping and aggregation steps, and calling compute() is what actually triggers their parallel execution (see the short sketch after this list).
- We then create a four-panel visualization dashboard. The first panel shows quarterly total boardings (notice the sharp drop at the start of the 2020 pandemic!). The second plot displays the variability in quarterly boardings, and the third one shows weekday vs. weekend boarding distributions via boxplots. In the last plot, a heatmap indicates boarding levels by month and year, with clear peak periods (darker blue tones) registered in central summer months like August.
- The function returns the aggregated quarterly statistics initially computed to build the visualizations.
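To make the lazy-versus-eager distinction concrete, here is another small sketch, again assuming ddf obtained from prepare_time_series: nothing is actually read or aggregated until compute() is called.
lazy_means = ddf.groupby('Month')['total_rides'].mean()   # builds a task graph, no data processed yet
print(type(lazy_means))                                    # still a lazy Dask object
monthly_means = lazy_means.compute()                       # triggers parallel execution, returns a pandas Series
print(monthly_means)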
Output Visualizations
[Figure: four-panel dashboard with quarterly total boardings, quarterly boarding variability, weekday vs. weekend boxplots, and the seasonal heatmap]
It only remains to define the main function that puts the two previously defined custom functions to work.
def main():
    # Start a Dask distributed client (spins up a local cluster by default)
    client = dask.distributed.Client()
    print("Dask Dashboard URL:", client.dashboard_link)
    try:
        ddf = prepare_time_series(DATASET_URL)
        quarterly_stats = advanced_time_series_analysis(ddf)
        print("\nQuarterly Ridership Statistics:")
        print(quarterly_stats)
    finally:
        client.close()

if __name__ == "__main__":
    main()
Importantly, as shown in the code above, we initialize a distributed client before running the analysis: Client() spins up a local cluster of workers and exposes a monitoring dashboard, whose URL we print so that the computation can be followed in the browser. Next, the two defined functions are invoked and, finally, the client connection is closed.
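By default, Client() chooses a worker and memory configuration based on the local machine, but the local cluster can also be sized explicitly. The snippet below is a hypothetical variation, not part of the tutorial's pipeline, and the worker counts and memory limit are purely illustrative values.
client = dask.distributed.Client(
    n_workers=4,              # number of worker processes (illustrative)
    threads_per_worker=2,     # threads within each worker (illustrative)
    memory_limit="2GB"        # memory cap per worker (illustrative)
)
print("Dask Dashboard URL:", client.dashboard_link)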
Output (printing quarterly statistics):
Quarterly Ridership Statistics:
                                 sum          mean  count            std
service_date service_date
2001         1             119464146  1.327379e+06     90  414382.163454
             2             122835569  1.349841e+06     91  390145.157073
             3             120878456  1.313896e+06     92  377016.351655
             4             120227586  1.306822e+06     92  429737.152536
2002         1             115775156  1.286391e+06     90  404905.943323
...                              ...           ...    ...            ...
2021         4              57095640  6.206048e+05     92  173110.966947
2022         1              51122612  5.680290e+05     90  166575.867387
             2              62381411  6.855100e+05     91  151206.910674
             3              66662974  7.245975e+05     92  167509.373303
             4              23576296  7.605257e+05     31  180793.919850
Wrapping Up
This article demonstrated how to use the Dask parallel computing framework to efficiently run time series analysis workflows. Dask hides much of the complexity of parallel computation by mirroring the data structures and APIs commonly used in popular Python libraries for data analysis, machine learning, and more.
Iván Palomares Carrascosa is a leader, writer, speaker, and adviser in AI, machine learning, deep learning & LLMs. He trains and guides others in harnessing AI in the real world.