Managing Machine Learning Workflows with Scikit-learn Pipelines Part 1: A Gentle Introduction

Scikit-learn's Pipeline class is designed as a manageable way to apply a series of data transformations followed by the application of an estimator.


Are you familiar with Scikit-learn Pipelines?

They are an extremely simple yet very useful tool for managing machine learning workflows.

A typical machine learning task generally involves data preparation to varying degrees. We won't get into the wide array of activities which make up data preparation here, but there are many. Such tasks are known for taking up a large proportion of time spent on any given machine learning task.

After a dataset is cleaned up from a potential initial state of massive disarray, however, there are still several less intensive yet no less important data preprocessing steps, such as feature extraction, feature scaling, and dimensionality reduction, to name just a few.

Maybe your preprocessing requires only one of these transformations, such as some form of scaling. But maybe you need to string a number of transformations together, and ultimately finish off with an estimator of some sort. This is where Scikit-learn Pipelines can be helpful.

Scikit-learn's Pipeline class is designed as a manageable way to apply a series of data transformations followed by the application of an estimator. In fact, that's really all it is:

Pipeline of transforms with a final estimator.
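To make that one-line description concrete, here is a minimal sketch of a two-step pipeline. The step names ("scaler", "clf") and the choice of a scaler followed by logistic regression are assumptions picked purely for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Each step is a (name, object) pair. Every step except the last must be a
# transformer; the last step is the final estimator.
# The step names "scaler" and "clf" are arbitrary labels chosen for this sketch.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression()),
])

# fit() fits and applies each transformer in order, then fits the final estimator
# on the transformed data.
pipe.fit(X, y)

# predict() pushes new data through the already-fitted transformers before
# handing it to the estimator.
predictions = pipe.predict(X[:5])
print(predictions)
```

The whole chain behaves like a single estimator, which is what makes the later steps (scoring, comparing, persisting) so compact.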

That's it. Ultimately, this simple tool is useful for:

  • Convenience in creating a coherent and easy-to-understand workflow
  • Enforcing workflow implementation and the desired order of step applications
  • Reproducibility
  • Persistence of entire pipeline objects (which supports both reproducibility and convenience)

So let's have a quick look at Pipelines. Specifically, here is what we will do.

Build 3 pipelines, each with a different estimator (a classification algorithm: logistic regression, a support vector machine, and a decision tree), using default hyperparameters.

To demonstrate pipeline transforms, we will perform:

  • feature scaling
  • dimensionality reduction, using PCA to project the data onto a 2-dimensional space

We will then end with fitting to our final estimators.
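The plan so far might be sketched roughly as follows. The three classifiers are the ones named in the results reported further down; the 80/20 train/test split, the random_state values, the step names, and the helper build_pipeline are assumptions made for this sketch, not details confirmed by the text.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Assumed split: 80% train, 20% test, with a fixed seed for repeatability.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


def build_pipeline(estimator):
    """Hypothetical helper: scaling, then PCA to 2 components, then the estimator."""
    return Pipeline([
        ("scaler", StandardScaler()),
        ("pca", PCA(n_components=2)),
        ("clf", estimator),
    ])


# One pipeline per classifier, all with default hyperparameters
# (random_state is fixed on the tree only so that reruns give the same model).
pipelines = {
    "Logistic Regression": build_pipeline(LogisticRegression()),
    "Support Vector Machine": build_pipeline(SVC()),
    "Decision Tree": build_pipeline(DecisionTreeClassifier(random_state=42)),
}

# Fitting a pipeline fits the scaler and PCA on the training data only,
# then fits the classifier on the 2-dimensional projection.
for name, pipe in pipelines.items():
    pipe.fit(X_train, y_train)

# Slicing off the final estimator leaves just the transforms, which lets us
# confirm that the classifier sees 2 features per sample.
projected = pipelines["Logistic Regression"][:-1].transform(X_test)
print(projected.shape)
```

Because the transforms live inside the pipeline, the scaler and PCA are never fitted on the test data, which is one of the quieter benefits of enforcing step order this way.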

Afterward, and almost completely unrelated, in order to make this a little more like a full-fledged workflow (it still isn't, but closer), we will:

  • Follow up by scoring test data
  • Compare pipeline model accuracies
  • Identify the "best" model, meaning that which has the highest accuracy on our test data
  • Persist (save to file) the entire pipeline of the "best" model
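These four follow-up steps could look something like the sketch below. It rebuilds the same three pipelines so that it runs on its own; the split, the random_state values, the use of joblib for persistence, and the file name best_pipeline.joblib are all assumptions for illustration. Your accuracies may differ depending on the split and library version.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42  # assumed split and seed
)

estimators = {
    "Logistic Regression": LogisticRegression(),
    "Support Vector Machine": SVC(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}
pipelines = {
    name: Pipeline([
        ("scaler", StandardScaler()),
        ("pca", PCA(n_components=2)),
        ("clf", estimator),
    ])
    for name, estimator in estimators.items()
}

# Score each fitted pipeline on the held-out test data.
# For classifiers, score() returns mean accuracy.
accuracies = {}
for name, pipe in pipelines.items():
    pipe.fit(X_train, y_train)
    accuracies[name] = pipe.score(X_test, y_test)
    print(f"{name} pipeline test accuracy: {accuracies[name]:.3f}")

# The "best" model here is simply the one with the highest test accuracy
# (on a tie, max() keeps the first one encountered).
best_name = max(accuracies, key=accuracies.get)
print(f"Classifier with best accuracy: {best_name}")

# Persist the whole pipeline (scaler, PCA, and classifier together).
# The file name and temp-directory location are hypothetical.
model_path = Path(tempfile.gettempdir()) / "best_pipeline.joblib"
joblib.dump(pipelines[best_name], model_path)
print(f"Saved {best_name} pipeline to file")

# Loading it back gives a ready-to-use object; no separate scaler or PCA
# needs to be stored or re-fitted.
restored = joblib.load(model_path)
```

Saving the pipeline rather than the bare classifier means the preprocessing travels with the model, so new raw data can be passed straight to restored.predict().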

Granted, given that we will use default hyperparameters, this likely won't result in the most accurate possible models, but it will provide a sense of how to use simple pipelines. We will come back to the question of more complex modeling, hyperparameter tuning, and model evaluation afterward.

Oh, and for additional simplicity, we are using the iris dataset. The code is well commented and should be easy to follow.

Let's run our script and see what happens.

 $ python3

Logistic Regression pipeline test accuracy: 0.933
Support Vector Machine pipeline test accuracy: 0.900
Decision Tree pipeline test accuracy: 0.867
Classifier with best accuracy: Logistic Regression
Saved Logistic Regression pipeline to file

So there you have it: a simple implementation of Scikit-learn pipelines. In this particular case, our logistic regression-based pipeline with default parameters achieved the highest accuracy on the test data.

As mentioned above, however, these results likely don't represent our best efforts. What if we did want to test out a series of different hyperparameters? Can we use grid search? Can we incorporate automated methods for tuning these hyperparameters? Can AutoML fit into this picture somewhere? What about using cross-validation?

Over the next couple of posts we will take a look at these additional issues, and see how these simple pieces fit together to make pipelines much more powerful than they may first appear to be given our initial example.