Managing Machine Learning Workflows with Scikit-learn Pipelines Part 2: Integrating Grid Search
In our last post we looked at Scikit-learn pipelines as a method for simplifying machine learning workflows. Designed as a manageable way to apply a series of data transformations followed by the application of an estimator, pipelines were noted as being a simple tool useful mostly for:
 Convenience in creating a coherent and easy-to-understand workflow
 Enforcing workflow implementation and the desired order of step applications
 Reproducibility
 Value in persistence of entire pipeline objects (goes to reproducibility and convenience)
Another simple yet powerful technique we can pair with pipelines to improve performance is grid search, which attempts to optimize model hyperparameter combinations. Exhaustive grid search (as opposed to alternative hyperparameter optimization schemes such as randomized search) tests and compares all possible combinations of the desired hyperparameter values, so the number of candidate models multiplies with every hyperparameter added to the grid. The payoff for what could end up being exorbitant run times is (hopefully) the best model the grid has to offer.
From the official documentation:
The grid search provided by GridSearchCV exhaustively generates candidates from a grid of parameter values specified with the param_grid parameter. For instance, the following param_grid:
param_grid = [
    {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
    {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]
specifies that two grids should be explored: one with a linear kernel and C values in [1, 10, 100, 1000], and the second one with an RBF kernel, and the cross-product of C values ranging in [1, 10, 100, 1000] and gamma values in [0.001, 0.0001].
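To make the size of that search concrete, here is a small sketch that enumerates the candidates for the documentation's param_grid using Scikit-learn's ParameterGrid helper. The variable names are ours, chosen only for illustration.

```python
from sklearn.model_selection import ParameterGrid

# The example grid from the documentation: two separate sub-grids
param_grid = [
    {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
    {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]

# Count the candidates in each sub-grid, then in the whole grid
n_linear = len(ParameterGrid(param_grid[0]))  # 4 values of C
n_rbf = len(ParameterGrid(param_grid[1]))     # 4 values of C x 2 values of gamma
grid = ParameterGrid(param_grid)

print(n_linear, n_rbf, len(grid))  # 4 8 12

# Each candidate is a plain dict of hyperparameter settings
candidates = list(grid)
print(candidates[0])
```

The sub-grids are summed rather than multiplied together, which is why splitting a search into several dictionaries can keep the candidate count down.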
Let's first recall the code from the previous post, and run a modified excerpt. Since we will be using a single pipeline for this exercise, we have no need for a full set as in the last post. We will use the iris dataset once again.
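The original listing is not reproduced here, so what follows is a minimal sketch of what such a pipeline script might look like. The step names 'scl' and 'pca', the 80/20 split, and the random_state values are our assumptions; only the 'clf' step name, the decision tree estimator, and the two-component PCA come from this post.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Load the iris data and hold out a test set (split sizes are assumed)
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)

# Scale, project onto 2 principal components, then fit a decision tree
pipe_dt = Pipeline([
    ('scl', StandardScaler()),
    ('pca', PCA(n_components=2)),
    ('clf', DecisionTreeClassifier(random_state=42)),
])

pipe_dt.fit(X_train, y_train)

# Report accuracy on the held-out data and the tree's hyperparameters
test_accuracy = pipe_dt.score(X_test, y_test)
print('Test accuracy: %.3f' % test_accuracy)
print('Model hyperparameters:', pipe_dt.named_steps['clf'].get_params())
```

The exact accuracy and the set of hyperparameters printed will depend on the installed Scikit-learn version and on the split used.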
Let's bring this very simple pipeline to life.
$ python3 pipelines2a.py
And the model's returned accuracy and hyperparameters:
Test accuracy: 0.867
Model hyperparameters: {'min_impurity_decrease': 0.0, 'min_weight_fraction_leaf': 0.0, 'max_leaf_nodes': None, 'max_depth': None, 'min_impurity_split': None, 'random_state': 42, 'class_weight': None, 'min_samples_leaf': 1, 'splitter': 'best', 'max_features': None, 'presort': False, 'min_samples_split': 2, 'criterion': 'gini'}
Note, once again, that we are applying feature scaling, then dimensionality reduction (using PCA to project the data onto a two-dimensional space), and finally our estimator.
Now let's add grid search to our pipeline, with the hopes of optimizing our model's hyperparameters and improving its accuracy. Are the default model parameters the best bet? Let's find out.
Since our model uses a decision tree estimator, we will use grid search to optimize the following hyperparameters:
 criterion: This is the function used to evaluate the quality of the split; we will use both options available in Scikit-learn: Gini impurity and information gain (entropy)
 min_samples_leaf: This is the minimum number of samples required for a valid leaf node; we will use the integer range 1 to 5
 max_depth: This is the maximum depth of the tree; we will use the integer range 1 to 5
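 min_samples_split: This is the minimum number of samples required in order to split a non-leaf node; we will use the integer range 1 to 5 (note that Scikit-learn requires an integer value of at least 2 for this parameter, so a value of 1 will be rejected)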
 presort: This indicates whether or not to presort the data in order to speed up the location of best splits during fitting; this does not have any effect on the resulting model accuracy (only on training times), but has been included for the benefit of using a True/False hyperparameter in our grid search model (fun, right?!?)
Here is the code to use exhaustive grid search in our adapted pipeline example.
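The original listing is not reproduced here, so the following is a sketch under stated assumptions rather than the exact script. The 'scl' and 'pca' step names, the data split, and the variable names are ours. Two deliberate departures from the hyperparameter list above: presort is left out because the parameter has been removed from recent Scikit-learn releases, and min_samples_split starts at 2 because an integer value of 1 is rejected. That leaves 2 * 5 * 5 * 4 = 200 candidates in this sketch.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)

pipe_dt = Pipeline([
    ('scl', StandardScaler()),
    ('pca', PCA(n_components=2)),
    ('clf', DecisionTreeClassifier(random_state=42)),
])

param_range = [1, 2, 3, 4, 5]

# Keys are '<step name>__<parameter name>' so that the grid search
# knows which pipeline step each hyperparameter belongs to
grid_params = {
    'clf__criterion': ['gini', 'entropy'],
    'clf__min_samples_leaf': param_range,
    'clf__max_depth': param_range,
    'clf__min_samples_split': param_range[1:],  # integers must be >= 2
}

# The pipeline itself is the estimator being tuned
gs = GridSearchCV(estimator=pipe_dt,
                  param_grid=grid_params,
                  scoring='accuracy',
                  cv=10)

# Fitting happens at the level of the grid search object
gs.fit(X_train, y_train)

print('Best accuracy: %.3f' % gs.best_score_)
print('Best params:', gs.best_params_)
```

Here best_score_ is the mean cross-validated accuracy of the best candidate, and best_params_ holds the winning combination keyed by the same prefixed names used in the grid.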
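Importantly, note that our pipeline is the estimator in the grid search object, and that it is at the level of the grid search object that we fit our model(s). Also note that our grid parameter space is defined in a dictionary and then fed to our grid search object; each key is prefixed with the name of the pipeline step it applies to, followed by a double underscore (the clf__ prefix seen in the results below).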
What else is happening during the grid search object's creation? In order to score our resulting models (there are a potential 2 * 5 * 5 * 5 * 2 = 500), we direct our grid search to evaluate them by their accuracy. We have also specified a cross-validation splitting strategy of 10 folds, so each candidate's accuracy is measured on the held-out fold of each split of the training data and averaged across folds, rather than measured on our separate test set. Note the following about GridSearchCV:
The parameters of the estimator used to apply these methods are optimized by cross-validated grid-search over a parameter grid.
Finally, of course, our model is fit.
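As a quick check on the arithmetic, the candidate count quoted above can be reproduced with nothing but the standard library. This sketch simply enumerates the value ranges as listed in this post; it does not validate them against Scikit-learn, and the variable names are ours.

```python
from itertools import product

# Hyperparameter values exactly as listed in the post
grid = {
    'criterion': ['gini', 'entropy'],
    'min_samples_leaf': [1, 2, 3, 4, 5],
    'max_depth': [1, 2, 3, 4, 5],
    'min_samples_split': [1, 2, 3, 4, 5],
    'presort': [True, False],
}
n_folds = 10

# Every combination of one value per hyperparameter is one candidate model
candidates = list(product(*grid.values()))
n_candidates = len(candidates)

# Each candidate is fit once per cross-validation fold
n_cv_fits = n_candidates * n_folds

print(n_candidates, n_cv_fits)  # 500 5000
```

With GridSearchCV's default refit=True there is one more fit on top of that, when the best combination is retrained on the whole training set.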
You will want to check out the official GridSearchCV documentation for information on all of the other useful configuration options, including, but not limited to, parallelism.
Let's try it out.
$ python3 pipelines2b.py
And here's the result:
Best accuracy: 0.925
Best params: {'clf__min_samples_split': 2, 'clf__criterion': 'gini', 'clf__max_depth': 2, 'clf__min_samples_leaf': 1, 'clf__presort': True}
The script reports back the highest attained accuracy (0.925), which is higher than the default model's 0.867, for not much additional computation, at least not in absolute terms, given our toy dataset. Keep in mind that 0.925 is a cross-validated score computed on the training data while 0.867 was measured on the held-out test set, so the two figures are not strictly comparable. Our exhaustive approach, which included 500 models in this case, could have had much more serious computational impacts on a formidable dataset, as you could imagine.
The script also reports back the optimal hyperparameter configuration for the model with the highest accuracy, which can be seen above. This difference in our simple example should be evidence enough to suggest that Scikit-learn defaults should not be followed blindly.
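For a like-for-like comparison with the default model, one option is to score the refit best estimator on the same held-out test set. The sketch below does that with a deliberately small grid to keep it quick. The step names 'scl' and 'pca', the data split, and the reduced grid are our assumptions, so the numbers it prints will not necessarily match those reported above.

```python
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)

pipe_dt = Pipeline([
    ('scl', StandardScaler()),
    ('pca', PCA(n_components=2)),
    ('clf', DecisionTreeClassifier(random_state=42)),
])

# Baseline: the pipeline with default tree hyperparameters
baseline = clone(pipe_dt).fit(X_train, y_train)
baseline_test_accuracy = baseline.score(X_test, y_test)

# A reduced grid (2 x 5 = 10 candidates), for illustration only
grid_params = {
    'clf__criterion': ['gini', 'entropy'],
    'clf__max_depth': [1, 2, 3, 4, 5],
}
gs = GridSearchCV(pipe_dt, grid_params, scoring='accuracy', cv=10)
gs.fit(X_train, y_train)

# Cross-validated accuracy on the training data...
cv_accuracy = gs.best_score_
# ...versus accuracy of the refit best estimator on the test set
test_accuracy = gs.score(X_test, y_test)

print('Baseline test accuracy: %.3f' % baseline_test_accuracy)
print('Best CV accuracy:       %.3f' % cv_accuracy)
print('Tuned test accuracy:    %.3f' % test_accuracy)
print('Best params:', gs.best_params_)
```

The refit model is also available directly as gs.best_estimator_, which is itself a fitted pipeline and can be persisted like any other.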
This all seems overly simple. And it is. Scikit-learn is almost too easy to use, once you know what options are available. Our use of toy datasets is not making it seem any more complex either.
But look at it this way: pipelines and grid search go together like chocolate and peanut butter, and now that we have looked at the basics of how they work together, we can take on some more difficult challenges. More complex data? Sure. Compare a variety of estimators and numerous parameter search spaces for each in order to find the "true" most optimized model possible? Why not? Throw a combination of pipeline transformations into the mix for fun and profit? Yasss!
Join us next time when we will push beyond the basics of configuration and the toy datasets.