Managing Machine Learning Workflows with Scikit-learn Pipelines Part 3: Multiple Models, Pipelines, and Grid Searches
In this post, we will use grid search to optimize models built from a number of different types of estimators, and then compare those models using the best hyperparameters found for each.
First, I know that I promised we would be past the toy datasets last post, but for comparison purposes we will be sticking with iris for a bit longer. I think it's best we are able to still compare apples to apples throughout our entire process.
Thus far, in the previous 2 posts, we have:
 Introduced Scikit-learn pipelines
 Demonstrated their basic usage by creating and comparing some pipelines
 Introduced grid search
 Demonstrated how pipelines and grid search work together by using grid search to find optimized hyperparameters, which were then applied to an embedded pipeline
Here's what we plan to do moving forward:
 In this post, we will use grid search to optimize models built from a number of different types of estimators, which we will then compare
 In the follow-up post, we will pivot toward using automated machine learning techniques to assist in the optimization of model hyperparameters, the end result of which will be an automatically generated, optimized Scikit-learn pipeline script file, courtesy of TPOT
There won't be much to re-explain this time; I recommend that you read the first post in this series for a gentle introduction to pipelines in Scikit-learn, and the second post for an overview of integrating grid search into your pipelines. What we will now do is build a series of pipelines around different estimators, use grid search for hyperparameter optimization, and then compare these various apples and oranges to determine the most accurate ("best") model.
The code below is well-commented and, if you have read the first 2 installments, should be easy to follow.
Note that there is a lot of opportunity for refactoring here. For example, each pipeline is defined explicitly, whereas a simple function could be used as a generator instead; the same goes for grid search objects. The longer form, again, hopefully allows for some better apples to apples comparisons in our next post.
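As a rough illustration of the refactoring idea just mentioned, the sketch below builds all 6 grid search objects from one small function instead of defining each pipeline by hand. Everything in it is an assumption made for illustration: the helper name `make_search` is hypothetical, the step names `scl` and `pca` and the choice of 2 principal components are guesses (only the `clf` step name is visible in the output), the parameter grids are deliberately tiny, and the logistic regression uses the library's default solver rather than the `liblinear` solver that appears in the output.

```python
from sklearn.base import clone
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def make_search(clf, param_grid, use_pca=False, cv=10):
    """Hypothetical helper: scale, optionally apply PCA, then classify,
    and wrap the resulting pipeline in a grid search."""
    steps = [("scl", StandardScaler())]
    if use_pca:
        steps.append(("pca", PCA(n_components=2)))  # assumed component count
    steps.append(("clf", clone(clf)))  # fresh copy for every pipeline
    return GridSearchCV(Pipeline(steps), param_grid, scoring="accuracy", cv=cv)


# Illustrative grids only; keys use the "clf__" prefix to reach the classifier step.
estimators = {
    "Logistic Regression": (
        LogisticRegression(max_iter=1000, random_state=42),
        {"clf__C": [0.5, 1.0]},
    ),
    "Random Forest": (
        RandomForestClassifier(random_state=42),
        {"clf__criterion": ["gini", "entropy"], "clf__max_depth": [3, 5]},
    ),
    "Support Vector Machine": (
        SVC(random_state=42),
        {"clf__kernel": ["linear", "rbf"], "clf__C": [1, 3]},
    ),
}

# One grid search per estimator, with and without PCA.
searches = {}
for name, (clf, grid) in estimators.items():
    searches[name] = make_search(clf, grid)
    searches[name + " w/PCA"] = make_search(clf, grid, use_pca=True)
```

The trade-off is the one noted above: the compact form is easier to extend, while the explicit form makes each pipeline easier to inspect side by side.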
Let's try it out.
$ python3 pipelines3.py
And here's the output:
Performing model optimizations...

Estimator: Logistic Regression
Best params: {'clf__penalty': 'l1', 'clf__C': 1.0, 'clf__solver': 'liblinear'}
Best training accuracy: 0.917
Test set accuracy score for best params: 0.967

Estimator: Logistic Regression w/PCA
Best params: {'clf__penalty': 'l1', 'clf__C': 0.5, 'clf__solver': 'liblinear'}
Best training accuracy: 0.858
Test set accuracy score for best params: 0.933

Estimator: Random Forest
Best params: {'clf__criterion': 'gini', 'clf__min_samples_split': 2, 'clf__max_depth': 3, 'clf__min_samples_leaf': 2}
Best training accuracy: 0.942
Test set accuracy score for best params: 1.000

Estimator: Random Forest w/PCA
Best params: {'clf__criterion': 'entropy', 'clf__min_samples_split': 3, 'clf__max_depth': 5, 'clf__min_samples_leaf': 1}
Best training accuracy: 0.917
Test set accuracy score for best params: 0.900

Estimator: Support Vector Machine
Best params: {'clf__kernel': 'linear', 'clf__C': 3}
Best training accuracy: 0.967
Test set accuracy score for best params: 0.967

Estimator: Support Vector Machine w/PCA
Best params: {'clf__kernel': 'rbf', 'clf__C': 4}
Best training accuracy: 0.925
Test set accuracy score for best params: 0.900

Classifier with best test set accuracy: Random Forest

Saved Random Forest grid search pipeline to file: best_gs_pipeline.pkl
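The last line of the output reports that the winning grid search was written to best_gs_pipeline.pkl. One common way to do that is with joblib, and the sketch below assumes that approach; the script's actual saving code is not shown here. It fits a single small grid search (with an assumed `scl` scaling step and an illustrative grid), saves it, and reloads it. The file goes into a temporary directory so that nothing in the working directory is overwritten.

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# A single, deliberately small grid search to have something to persist.
gs = GridSearchCV(
    Pipeline([
        ("scl", StandardScaler()),
        ("clf", RandomForestClassifier(n_estimators=20, random_state=42)),
    ]),
    {"clf__max_depth": [3, 5]},
    scoring="accuracy",
    cv=5,
)
gs.fit(X, y)

# Save the fitted grid search object, then load it back.
path = os.path.join(tempfile.mkdtemp(), "best_gs_pipeline.pkl")
joblib.dump(gs, path)
restored = joblib.load(path)

# The restored object should behave exactly like the original.
same_predictions = bool((restored.predict(X) == gs.predict(X)).all())
print("Restored best params:", restored.best_params_)
print("Predictions identical after reload:", same_predictions)
```

Saving `gs.best_estimator_` instead of `gs` would keep only the refit pipeline and drop the search results, which is often enough for deployment.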
Note, importantly, that after fitting our estimators we tested the model with the best parameters from each of the 6 grid searches on our test dataset. We did not do this last post: although we were comparing different models to one another, the crucial step of evaluating them on previously unseen test data was set aside while other concepts were being introduced. Our example shows why this step is necessary.
As shown above, the model which performed "best" on our training data (highest training accuracy) was the support vector machine (without PCA), with the linear kernel and a C value of 3 (controlling the amount of regularization), which learned to accurately classify 96.7% of training instances. However, the model which performed best on the test data (the 20% of our dataset unseen by all models until after they were trained) was the random forest (without PCA), using the Gini criterion, a minimum samples split of 2, a max depth of 3, and a minimum of 2 samples per leaf, which managed to accurately classify 100% of the unseen data instances. Note that this model had a lower training accuracy of 94.2%.
So, beyond seeing how we can mix and match a variety of different estimator types, grid search parameter combinations, and data transformations, as well as how to accurately compare the trained models, it should be apparent that evaluating different models should always include testing them on previously unseen holdout data.
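The comparison described above can be sketched in a few lines. This is a minimal illustration rather than the script that produced the output: it uses only two estimators, small illustrative grids, an assumed `scl` scaling step, and an arbitrary `random_state`, so its numbers will not match those reported. The training-side figure here is `best_score_`, the mean cross-validated accuracy of the best parameter setting on the training split; treating that as the counterpart of "Best training accuracy" in the output is an assumption.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Hold out 20% of the data; no model sees it until after training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

searches = {
    "Random Forest": GridSearchCV(
        Pipeline([
            ("scl", StandardScaler()),
            ("clf", RandomForestClassifier(n_estimators=50, random_state=42)),
        ]),
        {"clf__max_depth": [3, 5]},
        scoring="accuracy",
        cv=5,
    ),
    "Support Vector Machine": GridSearchCV(
        Pipeline([("scl", StandardScaler()), ("clf", SVC(random_state=42))]),
        {"clf__kernel": ["linear", "rbf"], "clf__C": [1, 3]},
        scoring="accuracy",
        cv=5,
    ),
}

results = {}
for name, gs in searches.items():
    gs.fit(X_train, y_train)            # search and refit on training data only
    test_acc = gs.score(X_test, y_test)  # accuracy of the refit best model
    results[name] = (gs.best_score_, test_acc)
    print("Estimator: %s" % name)
    print("  Best params: %s" % gs.best_params_)
    print("  Cross-validated training accuracy: %.3f" % gs.best_score_)
    print("  Test set accuracy: %.3f" % test_acc)

# Rank by held-out accuracy, not by the training-side score.
best_name = max(results, key=lambda n: results[n][1])
print("Classifier with best test set accuracy: %s" % best_name)
```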
Sick of pipelines yet? Next time we'll look at an alternative approach to automating their construction.
Related:
 Managing Machine Learning Workflows with Scikit-learn Pipelines Part 1: A Gentle Introduction
 Managing Machine Learning Workflows with Scikit-learn Pipelines Part 2: Integrating Grid Search
 Using Genetic Algorithm for Optimizing Recurrent Neural Networks

