Managing Machine Learning Workflows with Scikit-learn Pipelines Part 3: Multiple Models, Pipelines, and Grid Searches

In this post, we will use grid search to optimize models built from a number of different types of estimators. We will then compare the models and properly evaluate the best hyperparameters that each has to offer.

First, I know I promised last post that we would be past the toy datasets, but for comparison purposes we will be sticking with iris for a bit longer. I think it's best that we can still compare apples to apples throughout our entire process.

Thus far, in the previous 2 posts, we have:

  • Introduced Scikit-learn pipelines
  • Demonstrated their basic usage by creating and comparing some pipelines
  • Introduced grid search
  • Demonstrated how pipelines and grid search work together by using grid search to find optimized hyperparameters, which were then applied to an embedded pipeline

Here's what we plan to do moving forward:

  • In this post, we will be using grid search to optimize models built from a number of different types of estimators, which we will then compare
  • In the follow-up post, we will pivot toward using automated machine learning techniques to assist in the optimization of model hyperparameters, the end result of which will be an automatically generated, optimized Scikit-learn pipeline script file, courtesy of TPOT

There won't be much to re-explain this time; I recommend that you read the first post in this series to get a gentle introduction to pipelines in Scikit-learn, and the second post in this series for an overview of integrating grid search into your pipelines. What we will now do is build a series of pipelines of different estimators, using grid search for hyperparameter optimization, after which we will compare these various apples and oranges to determine the most accurate ("best") model.

The code below is well-commented and, if you have read the first 2 installments, should be easy to follow.

Note that there is a lot of opportunity for refactoring here. For example, each pipeline is defined explicitly, whereas a simple function could be used as a generator instead; the same goes for grid search objects. The longer form, again, hopefully allows for some better apples to apples comparisons in our next post.
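The refactoring mentioned above could look something like the following sketch. The helper name `make_grid_search`, the step names `scl` and `pca`, the 2-component PCA, the random seeds, and the small parameter grids are all assumptions made for illustration; only the `clf__` parameter prefix and the 20% test split are taken from this post. It is not the code that produced the results shown here.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def make_grid_search(clf, param_grid, use_pca=False, cv=5):
    """Hypothetical helper: scale, optionally apply PCA, then grid search the classifier."""
    steps = [("scl", StandardScaler())]
    if use_pca:
        steps.append(("pca", PCA(n_components=2)))
    steps.append(("clf", clf))
    return GridSearchCV(Pipeline(steps), param_grid, scoring="accuracy", cv=cv)


# Illustrative grids only; keys use the "clf__" prefix to reach the final pipeline step.
candidates = {
    "Random Forest": (
        RandomForestClassifier(random_state=42),
        {"clf__criterion": ["gini", "entropy"], "clf__max_depth": [3, 5]},
    ),
    "Support Vector Machine": (
        SVC(random_state=42),
        {"clf__kernel": ["linear", "rbf"], "clf__C": [1, 3]},
    ),
}

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# One fitted grid search per estimator, with and without PCA.
searches = {}
for name, (clf, grid) in candidates.items():
    for use_pca in (False, True):
        label = name + (" w/PCA" if use_pca else "")
        searches[label] = make_grid_search(clf, grid, use_pca).fit(X_train, y_train)

for label, gs in searches.items():
    print(label, gs.best_params_)
```

Reusing one classifier instance across two pipelines is safe in this sketch because `GridSearchCV` clones the estimator before fitting.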

Let's try it out.

  $ python3

And here's the output:

Performing model optimizations...

Estimator: Logistic Regression
Best params: {'clf__penalty': 'l1', 'clf__C': 1.0, 'clf__solver': 'liblinear'}
Best training accuracy: 0.917
Test set accuracy score for best params: 0.967 

Estimator: Logistic Regression w/PCA
Best params: {'clf__penalty': 'l1', 'clf__C': 0.5, 'clf__solver': 'liblinear'}
Best training accuracy: 0.858
Test set accuracy score for best params: 0.933 

Estimator: Random Forest
Best params: {'clf__criterion': 'gini', 'clf__min_samples_split': 2, 'clf__max_depth': 3, 'clf__min_samples_leaf': 2}
Best training accuracy: 0.942
Test set accuracy score for best params: 1.000 

Estimator: Random Forest w/PCA
Best params: {'clf__criterion': 'entropy', 'clf__min_samples_split': 3, 'clf__max_depth': 5, 'clf__min_samples_leaf': 1}
Best training accuracy: 0.917
Test set accuracy score for best params: 0.900 

Estimator: Support Vector Machine
Best params: {'clf__kernel': 'linear', 'clf__C': 3}
Best training accuracy: 0.967
Test set accuracy score for best params: 0.967 

Estimator: Support Vector Machine w/PCA
Best params: {'clf__kernel': 'rbf', 'clf__C': 4}
Best training accuracy: 0.925
Test set accuracy score for best params: 0.900 

Classifier with best test set accuracy: Random Forest

Saved Random Forest grid search pipeline to file: best_gs_pipeline.pkl

Note, importantly, that after fitting our estimators we tested each resulting model, using the best parameters from each of the 6 grid searches, on our test dataset. We did not do this last post, even though we were comparing different models to one another; with other concepts being introduced, the otherwise crucial step of comparing models on previously unseen test data was overlooked until now. Our example shows why this step is necessary.
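As a minimal sketch of that evaluation step, the snippet below fits a single grid search on a training split and then scores the refit best pipeline on the held-out 20%. It assumes the "best training accuracy" figure corresponds to the grid search's `best_score_` (the mean cross-validated accuracy on the training split), which this post does not confirm; the step names, parameter grid, seed, and stratified split are likewise illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Hold out 20% of the data; the grid search never sees it during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

pipe = Pipeline([("scl", StandardScaler()), ("clf", SVC(random_state=42))])
grid = {"clf__kernel": ["linear", "rbf"], "clf__C": [1, 3]}

gs = GridSearchCV(pipe, grid, scoring="accuracy", cv=5)
gs.fit(X_train, y_train)

# Mean cross-validated accuracy of the best parameter combination (training split only).
cv_accuracy = gs.best_score_
# Accuracy of the refit best pipeline on the previously unseen holdout data.
test_accuracy = gs.score(X_test, y_test)

print("Best params:", gs.best_params_)
print("Best training (CV) accuracy: %.3f" % cv_accuracy)
print("Test set accuracy for best params: %.3f" % test_accuracy)
```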

As shown above, the model which performed the "best" on our training data (highest training accuracy) was the support vector machine (without PCA), with the linear kernel and a C value of 3 (controlling the amount of regularization), which learned how to accurately classify 96.7% of training instances. However, the model which performed best on the test data (the 20% of our dataset unseen by all models until after they were trained) was the random forest (without PCA), using the Gini criterion, a minimum samples split of 2, a max depth of 3, and minimum samples per leaf of 2, which managed to accurately classify 100% of the unseen data instances. Note that this model had a lower training accuracy of 94.2%.
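To make the comparison concrete, the short sketch below takes the accuracy figures copied from the output above and picks the winner under each criterion. The dictionary layout and variable names are just one illustrative way to organize the numbers.

```python
# (training accuracy, test accuracy) for each grid search, copied from the output above
results = {
    "Logistic Regression": (0.917, 0.967),
    "Logistic Regression w/PCA": (0.858, 0.933),
    "Random Forest": (0.942, 1.000),
    "Random Forest w/PCA": (0.917, 0.900),
    "Support Vector Machine": (0.967, 0.967),
    "Support Vector Machine w/PCA": (0.925, 0.900),
}

# Rank by training accuracy and by test accuracy separately.
best_on_train = max(results, key=lambda name: results[name][0])
best_on_test = max(results, key=lambda name: results[name][1])

print("Best training accuracy:", best_on_train)  # Support Vector Machine
print("Best test accuracy:", best_on_test)       # Random Forest
```

The two criteria disagree, which is why selecting on training accuracy alone would have picked a different model.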

So, beyond seeing how we can mix and match a variety of different estimator types, grid search parameter combinations, and data transformations, as well as how to accurately compare the trained models, it should be apparent that evaluating different models should always include testing them on previously unseen holdout data.

Sick of pipelines yet? Next time we'll look at an alternative approach to automating their construction.