Machine Learning Pipeline Optimization with TPOT

Let's revisit the automated machine learning project TPOT, and get back up to speed on using open source AutoML tools on our way to building a fully-automated prediction pipeline.




 

AutoML & TPOT

 
It's been a while since I've had a look at TPOT, the Tree-based Pipeline Optimization Tool. TPOT is a Python automated machine learning (AutoML) tool for optimizing machine learning pipelines through the use of genetic programming. We are told by the authors to consider it our "data science assistant."

The rationale for AutoML stems from this idea: if numerous machine learning models must be built, using a variety of algorithms and a number of differing hyperparameter configurations, then this model building can be automated, as can the comparison of model performance and accuracy.

I want to have a fresh look at TPOT to see if we can flesh out an actual fully-automated assistant for data scientists. What if we could expand on the functionality of TPOT and build an end-to-end prediction pipeline, one we could point at a dataset and get predictions out the other end, with no intervention in between? Sure, other tools for this exist, but what better way to understand the machine learning pipeline process, and any particular resulting pipeline, than building it ourselves and making the decisions about what happens along the way?

The goal wouldn't necessarily be to cut the data scientist out of the loop altogether, but to provide a baseline, or a number of possible solutions, against which to compare hand-crafted machine learning pipelines. While the assistant toils in the background, the data scientist can attempt more clever approaches. At the very least, the resulting prediction pipelines could be good starting points for a data scientist to tweak manually after the fact, with much of the rote work taken care of on her behalf.

An AutoML "solution" could include the tasks of data preprocessing, feature engineering, algorithm selection, algorithm architecture search, and hyperparameter tuning, or some subset or variation of these distinct tasks. Thus, automated machine learning can now be thought of as anything from solely performing a single task, such as automated feature engineering, all the way through to a fully-automated pipeline, from data preprocessing, to feature engineering, to algorithm selection, and so on. So why not build something that does it all?

Anyhow, the first step of this plan is to refamiliarize ourselves with TPOT, the project that will eventually be at the center of our fully-automated prediction pipeline optimizer. TPOT is an open source Python tool which "automatically creates and optimizes machine learning pipelines using genetic programming," and it works in tandem with Scikit-learn, describing itself as a Scikit-learn wrapper. The end result is automated hyperparameter selection, modeling with a variety of algorithms, and exploration of numerous feature representations, all leading to iterative model building and model evaluation.
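If you haven't used TPOT before, basic usage looks something like the following. This is a minimal sketch along the lines of the example script in the TPOT repository, here using Scikit-learn's digits dataset:

from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# Load a sample dataset and split into training and testing sets
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
                                                    train_size=0.75, test_size=0.25)

# Run the genetic programming optimization over candidate pipelines
tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2)
tpot.fit(X_train, y_train)

# Score the best pipeline on held-out data, and export it as Python code
print(tpot.score(X_test, y_test))
tpot.export('tpot_digits_pipeline.py')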

 

Figure: Aspects of a machine learning pipeline automated by TPOT (source)

 

Pipeline Optimization

 
We will take a look at something a little more involved than the simple yet perfectly useful example script that can be found in the TPOT repository. The code should be straightforward and fairly easy to follow, so I won't go over it with a fine-toothed comb.
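A minimal sketch of what such a script might look like follows. Treat it as an approximation: the dataset, fold count, scoring metric, and iteration settings are inferred from the output further below, and the timing and pipeline-comparison details are assumptions.

from timeit import default_timer as timer

from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, train_test_split
from tpot import TPOTClassifier

# Settings inferred from the output below
ITERATIONS = 3        # separate optimization runs
GENERATIONS = 5       # genetic programming generations per run
POPULATION_SIZE = 50  # pipelines per generation
N_FOLDS = 10          # stratified k-fold cross-validation
SCORING = 'accuracy'

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

print(f"Optimizing prediction pipeline for the iris dataset with stratified "
      f"k-fold cross-validation using the {SCORING} scoring metric\n")

best_pipelines = []
for i in range(ITERATIONS):
    print(f"Pipeline optimization iteration: {i}")
    tpot = TPOTClassifier(generations=GENERATIONS,
                          population_size=POPULATION_SIZE,
                          cv=StratifiedKFold(n_splits=N_FOLDS),
                          scoring=SCORING,
                          random_state=42,  # matches the random_state visible in the output
                          verbosity=0)
    start = timer()
    tpot.fit(X_train, y_train)
    print(f">>> elapsed time: {timer() - start} seconds")
    print(f">>> pipeline score on test data: {tpot.score(X_test, y_test)}\n")
    best_pipelines.append(tpot.fitted_pipeline_)
    tpot.export(f'tpot_iris_pipeline_{i}.py')

# Compare the best pipelines found across iterations
if all(str(p) == str(best_pipelines[0]) for p in best_pipelines):
    print('All best pipelines were the same:\n')
    print(best_pipelines[0])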

 

 
Here is an example output from running our optimization script:

 

Optimizing prediction pipeline for the iris dataset with stratified k-fold cross-validation using the accuracy scoring metric

Pipeline optimization iteration: 0
>>> elapsed time: 135.48434898200503 seconds
>>> pipeline score on test data: 1.0

Pipeline optimization iteration: 1
>>> elapsed time: 132.3554882509925 seconds
>>> pipeline score on test data: 1.0

Pipeline optimization iteration: 2
>>> elapsed time: 133.29390010499628 seconds
>>> pipeline score on test data: 1.0

All best pipelines were the same:

Pipeline(memory=None,
         steps=[('stackingestimator',
                 StackingEstimator(estimator=KNeighborsClassifier(algorithm='auto',
                                                                  leaf_size=30,
                                                                  metric='minkowski',
                                                                  metric_params=None,
                                                                  n_jobs=None,
                                                                  n_neighbors=11,
                                                                  p=1,
                                                                  weights='uniform'))),
                ('extratreesclassifier',
                 ExtraTreesClassifier(bootstrap=True, ccp_alpha=0.0,
                                      class_weight=None, criterion='gini',
                                      max_depth=None,
                                      max_features=0.9000000000000001,
                                      max_leaf_nodes=None, max_samples=None,
                                      min_impurity_decrease=0.0,
                                      min_impurity_split=None,
                                      min_samples_leaf=18, min_samples_split=14,
                                      min_weight_fraction_leaf=0.0,
                                      n_estimators=100, n_jobs=None,
                                      oob_score=False, random_state=42,
                                      verbose=0, warm_start=False))],
         verbose=False)


The output provides some basic info on the pipeline iterations. If you can't tell from the combination of the script and its output, we have run the optimization process a total of 3 separate times; with each of these, we have used stratified 10-fold cross-validation; and the genetic optimization process has run for 5 generations on a population size of 50 for each of these iterations. Can you figure out how many pipelines were tested during the process? This is something we will have to give consideration to moving forward, not least for the practical reasons associated with computation time.
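As a hint for that last question: the TPOT documentation notes that a run will evaluate population_size + generations * offspring_size pipeline configurations, with offspring_size defaulting to population_size, and each configuration is scored with k-fold cross-validation. A quick back-of-the-envelope calculation under those defaults:

population_size = 50
generations = 5
offspring_size = population_size  # TPOT's default

pipelines_per_iteration = population_size + generations * offspring_size  # 300
cv_fits_per_iteration = pipelines_per_iteration * 10                      # 3,000 (10-fold CV)
total_cv_fits = cv_fits_per_iteration * 3                                 # 9,000 (3 iterations)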

As you may recall, TPOT outputs the best pipeline — or pipelines, upon multiple iterations — to file; this file can then be used to recreate the same experiment, or to apply the same pipeline to new data. We will harness this as we move forward creating our fully-automated end-to-end prediction pipeline.

In our case, our script noted that the resulting pipelines were all identical, and so output only one of them. This is a reasonable result on such a small dataset, but due to the nature of genetic optimization, the best pipelines could differ between iterations on larger, more complex data.
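For reference, that winning pipeline can be recreated directly in Scikit-learn, much as TPOT's exported code does. A sketch, assuming X_train, y_train, and X_test hold the relevant data splits:

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from tpot.builtins import StackingEstimator

# The best pipeline found above: KNN class predictions stacked on as
# synthetic features, feeding an extremely randomized trees classifier
exported_pipeline = make_pipeline(
    StackingEstimator(estimator=KNeighborsClassifier(n_neighbors=11, p=1,
                                                     weights='uniform')),
    ExtraTreesClassifier(bootstrap=True, max_features=0.9000000000000001,
                         min_samples_leaf=18, min_samples_split=14,
                         n_estimators=100, random_state=42)
)

exported_pipeline.fit(X_train, y_train)          # assumed training split
predictions = exported_pipeline.predict(X_test)  # reuse on new data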

 

Summary

 
Some things we tried with our script this time that we did not in the past:

  • Cross-validation for model evaluation
  • Iterating on the modeling more than once — likely not useful on such a small dataset, but it may well be as we progress
  • Comparing resulting pipelines on these multiple iterations — are they all the same?
  • Also worth noting: TPOT now uses PyTorch under the hood to build neural networks for prediction

Maybe you already see some ways we can improve on the above. Some specific things we might want to address in our future implementations:

  • We would want to think about our dataset splitting proportions in order to have the ideal amount of training, validation, and testing data
  • As we are using cross-validation for training and validation (related to the above point), we would want to hang on to our testing data to use only on our best performing model, as opposed to on each one
  • While feature selection/engineering/construction is handled by TPOT, categorical variables still need to be converted to numerical form before being fed in, so we will want to automate that conversion (see the sketch after this list)
  • We will want to be able to deal with a wider array of datasets :)
  • Much, much more!
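On the categorical variable point above, a minimal sketch of the kind of encoding step we might automate in front of TPOT (the dataset and column names here are hypothetical):

import pandas as pd

def encode_categoricals(df: pd.DataFrame) -> pd.DataFrame:
    """One-hot encode non-numeric columns so TPOT receives an
    all-numeric feature matrix."""
    categorical_cols = df.select_dtypes(include=['object', 'category']).columns
    return pd.get_dummies(df, columns=list(categorical_cols))

# Hypothetical usage:
# df = pd.read_csv('some_dataset.csv')
# X = encode_categoricals(df.drop(columns=['target']))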

These points, while important for actual modeling, aren't really an issue right now, since our focus was only on putting the structure in place to iteratively build and evaluate machine learning pipelines. We can address these legitimate concerns as we move forward.

I encourage you to have a look at the TPOT documentation to see what it has in store for us as we leverage it to help build an end-to-end prediction pipeline.

 
 
Matthew Mayo (@mattmayo13) is a Data Scientist and the Editor-in-Chief of KDnuggets, the seminal online Data Science and Machine Learning resource. His interests lie in natural language processing, algorithm design and optimization, unsupervised learning, neural networks, and automated approaches to machine learning. Matthew holds a Master's degree in computer science and a graduate diploma in data mining. He can be reached at editor1 at kdnuggets[dot]com.