How to Run Parallel Time Series Analysis with Dask
Dask is a Python library for parallel computing, which is useful for data-intensive applications like advanced data analytics and machine learning solutions. One scenario where Dask's parallel computing capabilities shine is time series analysis and forecasting.
In this article, we show you how to run parallel time series analysis with Dask, through a practical Python-based tutorial.
Step-by-Step Tutorial
As usual with any Python-related project, we install and import the necessary libraries and packages, including Dask dependencies. The code below has been run in a Google Colab notebook, and the actual installations you may need will depend on the development environment you are working with.
!pip install dask dask[distributed]
import dask.dataframe as dd                                    # parallel, pandas-like DataFrames
import matplotlib.pyplot as plt                                 # plotting
import seaborn as sns                                           # plot styling (optional)
import dask.distributed                                         # distributed scheduler client
from dask.diagnostics import ProgressBar, Profiler, visualize  # diagnostics tools
We will use a publicly available dataset of daily bus and train ridership in Chicago, US. The official dataset page documents the data attributes in detail; for convenience, we will access a copy hosted in the GitHub repository pointed to by the URL below.
DATASET_URL = "https://raw.githubusercontent.com/gakudo-ai/open-datasets/refs/heads/main/CTA_-_Ridership_-_Daily_Boarding_Totals.csv"
def prepare_time_series(url):
    # Lazily load the CSV into a Dask DataFrame, parsing the date column
    ddf = dd.read_csv(url, parse_dates=['service_date'])
    # Derive calendar-based features from the service date
    ddf['DayOfWeek'] = ddf['service_date'].dt.dayofweek
    ddf['Month'] = ddf['service_date'].dt.month
    ddf['IsWeekend'] = ddf['DayOfWeek'].isin([5, 6]).astype(int)
    return ddf
The function we just defined loads the dataset and parses its date attribute, so that the data is recognized as time series data from the very beginning. The key to leveraging Dask's parallel computing lies precisely in this initial step: we use dask.dataframe (imported as dd), a data structure similar to the pandas DataFrame but designed for parallel data processing and computation. Next, the date attribute, service_date, is decomposed into more granular attributes, namely the day of the week, the month, and a flag indicating whether the date falls on a weekend or a weekday.
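Before moving on, it is worth checking that the loading step behaves as expected. The short sketch below is an optional sanity check, not part of the original pipeline: it calls prepare_time_series on the dataset URL and previews the result. Note that head() only computes the first partition, while the rest of the DataFrame stays lazy.
ddf = prepare_time_series(DATASET_URL)
print(ddf.npartitions)   # number of partitions Dask can process in parallel
print(ddf.dtypes)        # column types, including the derived calendar features
print(ddf.head())        # head() triggers a small computation on the first partition only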
We now define the core function for this tutorial, where the time series analysis takes place. The function's body is wrapped with the context managers Dask provides for diagnostics integration: Profiler() records the tasks executed during the computation, ProgressBar() shows real-time computation progress, and visualize() is called at the end of the analysis pipeline to render a computational profile of the recorded tasks. Inside the function, we aggregate boardings by quarter and analyze their variability, compare weekday vs. weekend boardings, and build a seasonal heatmap.
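Before looking at the full function, here is a minimal sketch of just the diagnostics pattern, assuming ddf was created with prepare_time_series(DATASET_URL) as in the preview above; visualize() typically requires bokeh to be installed to render the profile plot.
with Profiler() as prof:      # records which tasks ran and how long they took
    with ProgressBar():       # prints a live progress bar for compute() calls
        monthly_totals = ddf.groupby('Month')['total_rides'].sum().compute()
visualize(prof)               # renders the recorded task profile (needs bokeh)
print(monthly_totals)
Note that the dask.diagnostics tools instrument Dask's single-machine schedulers; once a distributed client is started (as we do in the main function later), the distributed dashboard becomes the primary monitoring tool.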
def advanced_time_series_analysis(ddf):
    with Profiler() as prof:
        with ProgressBar():
            # Aggregate daily boardings into quarterly statistics (executed in parallel)
            quarterly_stats = (
                ddf.groupby([ddf['service_date'].dt.year, ddf['service_date'].dt.quarter])['total_rides']
                .agg(['sum', 'mean', 'count', 'std'])
                .compute()
            )
            # Materialize the full dataset as a pandas DataFrame for plotting
            pdf = ddf.compute()

            plt.figure(figsize=(15, 10))

            # 1. Quarterly total boardings over time
            plt.subplot(2, 2, 1)
            quarterly_stats['sum'].plot(kind='line', title='Quarterly Total Boardings')
            plt.xlabel('Year-Quarter')
            plt.ylabel('Total Boardings')
            plt.xticks(rotation=45)

            # 2. Variability of quarterly boardings
            plt.subplot(2, 2, 2)
            quarterly_stats['std'].plot(kind='bar', title='Quarterly Boarding Variability')
            plt.xlabel('Year-Quarter')
            plt.ylabel('Standard Deviation')
            plt.xticks(rotation=45)

            # 3. Weekday vs. weekend boarding distributions
            plt.subplot(2, 2, 3)
            weekday_stats = pdf[pdf['IsWeekend'] == 0]['total_rides']
            weekend_stats = pdf[pdf['IsWeekend'] == 1]['total_rides']
            plt.boxplot([weekday_stats, weekend_stats], labels=['Weekday', 'Weekend'])
            plt.title('Boarding Distribution: Weekday vs Weekend')
            plt.ylabel('Total Boardings')

            # 4. Seasonal heatmap of average boardings by month and year
            plt.subplot(2, 2, 4)
            seasonal_pivot = pdf.pivot_table(
                index='Month',
                columns=pdf['service_date'].dt.year,
                values='total_rides',
                aggfunc='mean'
            )
            plt.imshow(seasonal_pivot, cmap='YlGnBu', aspect='auto')
            plt.colorbar(label='Avg. Boardings')
            plt.title('Seasonal Boarding Heatmap')
            plt.xlabel('Year')
            plt.ylabel('Month')

            plt.tight_layout()
            plt.show()

    # Render the computational profile recorded by the Profiler
    visualize(prof)
    return quarterly_stats
Let's revisit what happens at each stage of the data analysis workflow above:
- Since the time series dataset spans from 2001 to 2022, we first aggregate the daily data into quarterly summaries. Dask builds a lazy task graph of the grouping and aggregation steps, and calling compute() is what actually triggers their parallel execution (see the short sketch after this list).
- We then create a four-panel visualization dashboard. The first panel shows quarterly total boardings (notice the sharp drop at the start of the 2020 pandemic!). The second plot displays the variability in quarterly boardings, and the third one shows weekday vs. weekend boarding distributions via boxplots. In the last plot, a heatmap indicates boarding levels by month and year, with clear peak periods (darker blue tones) registered in central summer months like August.
- The function returns the aggregated quarterly statistics initially computed to build the visualizations.
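To make the lazy-versus-eager distinction concrete, here is another small sketch, again assuming ddf obtained from prepare_time_series: nothing is actually read or aggregated until compute() is called.
lazy_means = ddf.groupby('Month')['total_rides'].mean()   # builds a task graph, no data processed yet
print(type(lazy_means))                                    # still a lazy Dask object
monthly_means = lazy_means.compute()                       # triggers parallel execution, returns a pandas Series
print(monthly_means)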
Output Visualizations
[Figure: four-panel dashboard with quarterly total boardings, quarterly boarding variability, weekday vs. weekend boxplots, and the seasonal heatmap]
It only remains to define the main function that puts the two previously defined custom functions to work.
def main():
    # Start a Dask distributed client (spins up a local cluster by default)
    client = dask.distributed.Client()
    print("Dask Dashboard URL:", client.dashboard_link)
    try:
        ddf = prepare_time_series(DATASET_URL)
        quarterly_stats = advanced_time_series_analysis(ddf)
        print("\nQuarterly Ridership Statistics:")
        print(quarterly_stats)
    finally:
        client.close()

if __name__ == "__main__":
    main()
Importantly, as shown in the code above, we initialize a distributed client before running the analysis: Client() spins up a local cluster of workers and exposes a monitoring dashboard, whose URL we print so that the computation can be followed in the browser. Next, the two defined functions are invoked and, finally, the client connection is closed.
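By default, Client() chooses a worker and memory configuration based on the local machine, but the local cluster can also be sized explicitly. The snippet below is a hypothetical variation, not part of the tutorial's pipeline, and the worker counts and memory limit are purely illustrative values.
client = dask.distributed.Client(
    n_workers=4,              # number of worker processes (illustrative)
    threads_per_worker=2,     # threads within each worker (illustrative)
    memory_limit="2GB"        # memory cap per worker (illustrative)
)
print("Dask Dashboard URL:", client.dashboard_link)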
Output (printing quarterly statistics):
Quarterly Ridership Statistics:
                                 sum          mean  count            std
service_date service_date
2001         1             119464146  1.327379e+06     90  414382.163454
             2             122835569  1.349841e+06     91  390145.157073
             3             120878456  1.313896e+06     92  377016.351655
             4             120227586  1.306822e+06     92  429737.152536
2002         1             115775156  1.286391e+06     90  404905.943323
...                              ...           ...    ...            ...
2021         4              57095640  6.206048e+05     92  173110.966947
2022         1              51122612  5.680290e+05     90  166575.867387
             2              62381411  6.855100e+05     91  151206.910674
             3              66662974  7.245975e+05     92  167509.373303
             4              23576296  7.605257e+05     31  180793.919850
Wrapping Up
This article demonstrated how to use the Dask parallel computing framework to efficiently run time series analysis workflows. Dask hides much of the complexity of parallel computation by mirroring the data structures and APIs commonly used in popular Python libraries for data analysis, machine learning, and more.
Iván Palomares Carrascosa is a leader, writer, speaker, and adviser in AI, machine learning, deep learning & LLMs. He trains and guides others in harnessing AI in the real world.