Managing Machine Learning Workflows with Scikit-learn Pipelines Part 2: Integrating Grid Search

Another simple yet powerful technique we can pair with pipelines to improve performance is grid search, which attempts to optimize model hyperparameter combinations.


In our last post we looked at Scikit-learn pipelines as a method for simplifying machine learning workflows. Designed as a manageable way to apply a series of data transformations followed by the application of an estimator, pipelines were noted as being a simple tool useful mostly for:

  • Convenience in creating a coherent and easy-to-understand workflow
  • Enforcing workflow implementation and the desired order of step applications
  • Reproducibility
  • Value in persistence of entire pipeline objects (which speaks to both reproducibility and convenience)

Another simple yet powerful technique we can pair with pipelines to improve performance is grid search, which attempts to optimize model hyperparameter combinations. Exhaustive grid search -- as opposed to alternate hyperparameter combination optimization schemes such as randomized optimization -- tests and compares all possible combinations of desired hyperparameter values, an exercise in exponential growth. The trade-off for what can end up being exorbitant run times is (hopefully) the best-optimized model possible.

From the official documentation:

The grid search provided by GridSearchCV exhaustively generates candidates from a grid of parameter values specified with the param_grid parameter. For instance, the following param_grid:

param_grid = [
  {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
  {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]

specifies that two grids should be explored: one with a linear kernel and C values in [1, 10, 100, 1000], and the second one with an RBF kernel, and the cross-product of C values ranging in [1, 10, 100, 1000] and gamma values in [0.001, 0.0001].

Let's first recall the code from the previous post, and run a modified excerpt. Since we will be using a single pipeline for this exercise, we have no need for a full set as in the last post. We will use the iris dataset once again.
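A minimal sketch of what pipelines-2a.py likely looks like, assuming an 80/20 train/test split and the step names 'scl' and 'pca' (the step name 'clf' and random_state=42 are consistent with the output shown below):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Load the iris dataset and hold out a test set
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)

# Scale features, project onto 2 principal components, then classify
pipe = Pipeline([('scl', StandardScaler()),
                 ('pca', PCA(n_components=2)),
                 ('clf', DecisionTreeClassifier(random_state=42))])

pipe.fit(X_train, y_train)

print('Test accuracy: %.3f' % pipe.score(X_test, y_test))
print('\nModel hyperparameters:\n', pipe.named_steps['clf'].get_params())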

Let's bring this very simple pipeline to life.

  $ python3 pipelines-2a.py


And the model's returned accuracy and hyperparameters:

Test accuracy: 0.867

Model hyperparameters:
 {'min_impurity_decrease': 0.0, 'min_weight_fraction_leaf': 0.0, 'max_leaf_nodes': None, 'max_depth': None,
  'min_impurity_split': None, 'random_state': 42, 'class_weight': None, 'min_samples_leaf': 1, 'splitter': 'best', 
  'max_features': None, 'presort': False, 'min_samples_split': 2, 'criterion': 'gini'}


Note, once again, that we are applying feature scaling, dimensionality reduction (using PCA to project the data onto two-dimensional space), and finally applying our estimator.

Now let's add grid search to our pipeline, with the hopes of optimizing our model's hyperparameters and improving its accuracy. Are the default model parameters the best bet? Let's find out.

Since our model uses a decision tree estimator, we will use grid search to optimize the following hyperparameters:

  • criterion - This is the function used to evaluate the quality of the split; we will use both options available in Scikit-learn: Gini impurity and information gain (entropy)
  • min_samples_leaf - This is the minimum number of samples required for a valid leaf node; we will use the integer range 1 to 5
  • max_depth - This is the maximum depth of the tree; we will use the integer range 1 to 5
  • min_samples_split - This is the minimum number of samples required in order to split a non-leaf node; we will use the integer range 1 to 5
  • presort - This indicates whether or not to presort the data in order to speed up the location of best splits during fitting; this does not have any effect on the resulting model accuracy (only on training times), but has been included for the benefit of using a True/False hyperparameter in our grid search model (fun, right?!?)
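In code, this search space becomes a dictionary whose keys follow scikit-learn's step__parameter naming convention; the clf__ prefixes route each set of values to the pipeline step named 'clf'. A sketch:

# Grid of candidate hyperparameter values for the 'clf' pipeline step
# (note: newer scikit-learn versions removed presort in 0.24 and
# require min_samples_split >= 2, so adjust accordingly there)
param_grid = {
    'clf__criterion': ['gini', 'entropy'],
    'clf__min_samples_leaf': [1, 2, 3, 4, 5],
    'clf__max_depth': [1, 2, 3, 4, 5],
    'clf__min_samples_split': [1, 2, 3, 4, 5],
    'clf__presort': [True, False]
}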

Here is the code to use exhaustive grid search in our adapted pipeline example.
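A minimal sketch, reusing the pipe and param_grid objects from the snippets above:

from sklearn.model_selection import GridSearchCV

# The pipeline is the estimator handed to grid search; candidates
# are scored by accuracy over a 10-fold cross-validation split
gs = GridSearchCV(estimator=pipe,
                  param_grid=param_grid,
                  scoring='accuracy',
                  cv=10)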

Of importance, note that our pipeline is the estimator in the grid search object, and that it is at the level of the grid search object that we fit our model(s). Also note that our grid parameter space is defined in a dictionary, which is then fed to our grid search object.

What else is happening during the grid search object's creation? In order to score the resulting models (there are potentially 2 * 5 * 5 * 5 * 2 = 500 of them), we direct our grid search to evaluate them by their accuracy on the held-out folds. We have also specified a cross-validation splitting strategy of 10 folds. Note the following about GridSearchCV:

The parameters of the estimator used to apply these methods are optimized by cross-validated grid-search over a parameter grid.

Finally, of course, our model is fit.
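Continuing the sketch, fitting at the level of the grid search object runs the entire search and exposes the winning configuration:

# Fitting evaluates every candidate: 500 combinations x 10 folds
gs.fit(X_train, y_train)

print('Best accuracy: %.3f' % gs.best_score_)
print('\nBest params:\n', gs.best_params_)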

You will want to check out the official GridSearchCV module documentation for information on all of the other useful configurations, including, but not limited to, parallelism.

Let's try it out.

  $ python3 pipelines-2b.py


And here's the result:

Best accuracy: 0.925

Best params:
 {'clf__min_samples_split': 2, 'clf__criterion': 'gini', 'clf__max_depth': 2, 
  'clf__min_samples_leaf': 1, 'clf__presort': True}


The script reports back the highest attained accuracy (0.925), which is clearly better than the 0.867 achieved with default hyperparameters, and at little additional computational cost, at least in absolute terms, given our toy dataset. As you can imagine, though, our exhaustive approach, which covered 500 models in this case, could have a much more serious computational impact on a formidable dataset.

The script also reports back the optimal hyperparameter configuration for the model with the highest accuracy, which can be seen above. This difference in our simple example should be evidence enough to suggest that Scikit-learn defaults should not be followed blindly.

This all seems overly simple. And it is. Scikit-learn is almost too easy to use, once you know what options are available. Our use of toy datasets is not making it seem any more complex either.

But look at it this way: pipelines and grid search go together like chocolate and peanut butter, and now that we have looked at the basics of how they work together, we can take on some more difficult challenges. More complex data? Sure. Compare a variety of estimators and numerous parameter search spaces for each in order to find the "true" most optimized model possible? Why not? Throw a combination of pipeline transformations into the mix for fun and profit? Yasss!

Join us next time when we will push beyond the basics of configuration and the toy datasets.
