Using AutoML to Generate Machine Learning Pipelines with TPOT

Thus far in this series of posts we have hand-crafted machine learning pipelines, tuned their hyperparameters, and performed model selection ourselves.

This post will take a different approach to constructing pipelines. As the title gives away, instead of hand-crafting pipelines, optimizing hyperparameters, and performing model selection ourselves, we will automate these processes. We will use the automated machine learning tool TPOT to do some of our heavy lifting.

But first, what is automated machine learning (AutoML)?

Data scientist and leading automated machine learning proponent Randy Olson states that effective machine learning design requires us to:

  • Always tune the hyperparameters for our models
  • Always try out many different models
  • Always explore numerous feature representations for our data

Therefore, at a fundamental level, we can consider AutoML to be the tasks of algorithm selection, hyperparameter tuning, iterative modeling, and model assessment.
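
For a sense of what these tasks look like when done by hand, here is a minimal scikit-learn sketch; the candidate algorithms, hyperparameter grids, and choice of the iris dataset are illustrative assumptions, not anything prescribed above:

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV

    X, y = load_iris(return_X_y=True)

    # Candidate algorithms and hyperparameter grids, all chosen by hand
    candidates = [
        (LogisticRegression(max_iter=1000), {'C': [0.1, 1.0, 10.0]}),
        (RandomForestClassifier(), {'n_estimators': [50, 100], 'max_depth': [3, None]}),
    ]

    # The manual loop: algorithm selection, hyperparameter tuning, and model assessment
    for estimator, grid in candidates:
        search = GridSearchCV(estimator, grid, cv=5)
        search.fit(X, y)
        print(type(estimator).__name__, search.best_params_, round(search.best_score_, 4))

An AutoML system takes over exactly this loop, and typically also searches over preprocessing steps and feature representations.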

For insight into why this is so, consider what AI Researcher and Stanford University PhD candidate S. Zayd Enam wrote some time ago in his fantastic blog post titled "Why is machine learning 'hard'?":

The difficulty is that machine learning is a fundamentally hard debugging problem. Debugging for machine learning happens in two cases: 1) your algorithm doesn't work or 2) your algorithm doesn't work well enough. [...] Very rarely does an algorithm work the first time and so this ends up being where the majority of time is spent in building algorithms.

Basically, if an algorithm does not work, or does not work well enough, the process of choosing and refining it becomes iterative, and this iteration exposes an opportunity for automation, hence automated machine learning.

I have previously attempted to capture AutoML's essence as follows:

If, as Sebastian Raschka has described it, computer programming is about automation, and machine learning is "all about automating automation," then automated machine learning is "the automation of automating automation." Follow me, here: programming relieves us by managing rote tasks; machine learning allows computers to learn how to best perform these rote tasks; automated machine learning allows for computers to learn how to optimize the outcome of learning how to perform these rote actions.

This is a very powerful idea; while we previously had to worry about tuning parameters and hyperparameters ourselves, automated machine learning systems can learn the best way to tune these for optimal outcomes by any of a number of possible methods.

The rationale for AutoML stems from this idea: if numerous machine learning models must be built, using a variety of algorithms and a number of differing hyperparameter configurations, then this model building can be automated, as can the comparison of model performance and accuracy.

So how do we actually do this? TPOT is a Python tool which "automatically creates and optimizes machine learning pipelines using genetic programming." TPOT works in tandem with Scikit-learn, describing itself as a Scikit-learn wrapper. TPOT is open source, written in Python, and aimed at simplifying the machine learning process by way of an AutoML approach based on genetic programming. The end result is automated hyperparameter selection, modeling with a variety of algorithms, and exploration of numerous feature representations, all leading to iterative model building and model evaluation.

You can read more about using TPOT here, and can read an interview with the project's (former) lead developer here.

And now back to the topic at hand. We are interested in using TPOT to generate optimized machine learning pipelines for us. You will likely be surprised at how easy it is to do so.

The code:
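
A sketch of what this script (saved as tpot_test.py) plausibly looks like follows, assuming TPOT is installed (e.g. via pip) along with scikit-learn and XGBoost; the iris dataset is confirmed later in the post, while the exact train/test split and the timing and accuracy prints are assumptions reconstructed from the output shown further below:

    import time

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from tpot import TPOTClassifier

    # Load the iris dataset and hold out a test set
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, test_size=0.25)

    # Configure the genetic programming search
    tpot = TPOTClassifier(generations=10, verbosity=2)

    # Run the search, report elapsed time and test accuracy, and export the winning pipeline
    start = time.time()
    tpot.fit(X_train, y_train)
    print('TPOT classifier finished in %s seconds' % (time.time() - start))
    print('Best pipeline test accuracy: %.3f' % tpot.score(X_test, y_test))
    tpot.export('tpot_iris_pipeline.py')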

Note that almost everything we need is accomplished with this single line:

tpot = TPOTClassifier(generations=10, verbosity=2)


This begins the genetic programming-based hyperparameter selection, modeling with a variety of algorithms, and exploration of feature representations. The result should be an optimized model.

When it is finished, TPOT will display the hyperparameters of the "best" pipeline it found, report that pipeline's accuracy on the test data, and output the pipeline as an execution-ready Python script file for later use and further investigation.

Let's run the code with this:

  $ python3 tpot_test.py


And here's what happens:

Generation 1 - Current best internal CV score: 0.9833333333333332
Generation 2 - Current best internal CV score: 0.9833333333333332
Generation 3 - Current best internal CV score: 0.9833333333333332
Generation 4 - Current best internal CV score: 0.9833333333333332
Generation 5 - Current best internal CV score: 0.9833333333333332
Generation 6 - Current best internal CV score: 0.9833333333333332
Generation 7 - Current best internal CV score: 0.9833333333333332
Generation 8 - Current best internal CV score: 0.9833333333333332
Generation 9 - Current best internal CV score: 0.9833333333333334
Generation 10 - Current best internal CV score: 0.9833333333333334

Best pipeline: XGBClassifier(RobustScaler(PolynomialFeatures(XGBClassifier(LogisticRegression(input_matrix, C=10.0, dual=False, penalty=l2), learning_rate=0.001, max_depth=6, min_child_weight=9, n_estimators=100, nthread=1, subsample=0.9000000000000001), degree=2, include_bias=False, interaction_only=False)), learning_rate=0.01, max_depth=4, min_child_weight=13, n_estimators=100, nthread=1, subsample=0.9000000000000001)
TPOT classifier finished in 409.84891986846924 seconds
Best pipeline test accuracy: 1.000


And there you have it. The process took just under 7 minutes to complete, and the resulting XGBoost-based pipeline accurately classified 100% of the test data instances. This is obviously a toy dataset, and the cross-validation score barely changed over the generations, but having now used iris for a variety of pipeline construction approaches (including with TPOT), we are ready to move on to some other data.

The above execution generates this Python pipeline script and saves it as tpot_iris_pipeline.py:
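
TPOT's exported scripts follow a standard template: they rebuild the winning pipeline as a scikit-learn pipeline, wrapping any nested classifiers as StackingEstimator steps, and read training data from a placeholder CSV path. The sketch below reconstructs that pipeline from the "Best pipeline" line printed above, substituting the iris dataset for the placeholder so it runs end to end; the exact template details vary by TPOT version:

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures, RobustScaler
    from tpot.builtins import StackingEstimator
    from xgboost import XGBClassifier

    # Iris substituted for TPOT's placeholder CSV so the sketch is runnable
    X, y = load_iris(return_X_y=True)
    training_features, testing_features, training_target, testing_target = train_test_split(X, y)

    # The nested classifiers in the printed pipeline become StackingEstimator steps,
    # applied innermost first
    exported_pipeline = make_pipeline(
        StackingEstimator(estimator=LogisticRegression(C=10.0, dual=False, penalty="l2")),
        StackingEstimator(estimator=XGBClassifier(learning_rate=0.001, max_depth=6,
                                                  min_child_weight=9, n_estimators=100,
                                                  nthread=1, subsample=0.9000000000000001)),
        PolynomialFeatures(degree=2, include_bias=False, interaction_only=False),
        RobustScaler(),
        XGBClassifier(learning_rate=0.01, max_depth=4, min_child_weight=13,
                      n_estimators=100, nthread=1, subsample=0.9000000000000001),
    )

    exported_pipeline.fit(training_features, training_target)
    print(exported_pipeline.score(testing_features, testing_target))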

Remember that TPOT uses a genetic approach to optimization, so subsequent executions may produce different results. Don't blame me; blame evolution.

This process may have generated more questions than answers, especially if automated approaches to model optimization are new to you. We will revisit these concepts in a future post and expand on them. For now, just remember that AutoML -- judging from this single example, mind you -- seems like it may be a powerful tool for model optimization.

You may want to follow this up by reading The Current State of Automated Machine Learning, an article I wrote one year ago, but which is still relevant.

 