
TPOT: A Python Tool for Automating Data Science


TPOT is an open-source Python data science automation tool that optimizes a series of feature preprocessors and models to maximize cross-validation accuracy on a data set.



By Randy Olson, University of Pennsylvania.

Machine learning is often touted as:

A field of study that gives computers the ability to learn without being explicitly programmed.

Despite this common claim, anyone who has worked in the field knows that designing effective machine learning systems is a tedious endeavor, one that typically requires considerable experience with machine learning algorithms, expert knowledge of the problem domain, and brute-force search to get right. Thus, contrary to what machine learning enthusiasts would have us believe, machine learning still requires a considerable amount of explicit programming.

In this article, we’re going to go over three aspects of machine learning pipeline design that tend to be tedious but nonetheless important. After that, we’re going to step through a demo for a tool that intelligently automates the process of machine learning pipeline design, so we can spend our time working on the more interesting aspects of data science.

Let’s get started.

Model hyperparameter tuning is important

One of the most tedious parts of machine learning is model hyperparameter tuning.

Support vector machines require us to select the ideal kernel, the kernel's parameters, and the penalty parameter C. Artificial neural networks require us to tune the number of hidden layers, the number of hidden nodes, and many more hyperparameters. Even random forests require us to tune, at a minimum, the number of trees in the ensemble.
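To make that concrete, here is where those hyperparameters appear in scikit-learn's estimator constructors. This is only an illustrative sketch: the specific values below are arbitrary placeholders, not recommendations.

from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier

# Support vector machine: kernel choice, kernel parameter gamma, and penalty C
svm = SVC(kernel='rbf', gamma=0.01, C=1.0)

# Neural network: two hidden layers of 50 nodes each, plus a learning rate
nn = MLPClassifier(hidden_layer_sizes=(50, 50), learning_rate_init=0.001)

# Random forest: number of trees in the ensemble
forest = RandomForestClassifier(n_estimators=100)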

All of these hyperparameters can have significant impacts on how well the model performs. For example, on the MNIST handwritten digit data set:

[Figure: sample MNIST handwritten digits]

If we fit a random forest classifier with only 10 trees (scikit-learn’s default):

import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

mnist_data = pd.read_csv('https://raw.githubusercontent.com/rhiever/Data-Analysis-and-Machine-Learning-Projects/master/tpot-demo/mnist.csv.gz', sep='\t', compression='gzip')  

cv_scores = cross_val_score(RandomForestClassifier(n_estimators=10, n_jobs=-1),  
                            X=mnist_data.drop('class', axis=1).values,  
                            y=mnist_data.loc[:, 'class'].values,  
                            cv=10)  

print(cv_scores)  

[ 0.93461813  0.96287836  0.94688749  0.94072275  0.95114286  0.94570653  
  0.94884253  0.94311848  0.93825043  0.95668954]  

print(np.mean(cv_scores))  

0.946885709001


The random forest achieves an average of 94.7% cross-validation accuracy on MNIST. However, what if we tuned that hyperparameter a little bit and provided the random forest with 100 trees instead?

cv_scores = cross_val_score(RandomForestClassifier(n_estimators=100, n_jobs=-1),  
                            X=mnist_data.drop('class', axis=1).values,  
                            y=mnist_data.loc[:, 'class'].values,  
                            cv=10)  

print(cv_scores)  

[ 0.96259814  0.97829812  0.9684466   0.96700471  0.966       0.96399486  
  0.97113461  0.96755752  0.96397942  0.97684391]  

print(np.mean(cv_scores))  

0.968585789367


With such a minor change, we improved the random forest’s average cross-validation accuracy from 94.7% to 96.9%. This small improvement in accuracy can translate into millions of additional digits classified correctly if we’re applying this model on the scale of, say, processing addresses for the U.S. Postal Service.

Never use the defaults for your model. Hyperparameter tuning is vitally important for every machine learning project.
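If sweeping those values by hand feels tedious, scikit-learn's GridSearchCV can automate the search over a grid of candidate settings. Here's a minimal sketch, reusing the mnist_data DataFrame loaded above; the grid values are arbitrary choices for illustration.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Candidate values for the number of trees -- illustrative, not exhaustive
param_grid = {'n_estimators': [10, 25, 50, 100]}

# Cross-validate every candidate setting and keep the best-scoring one
search = GridSearchCV(RandomForestClassifier(n_jobs=-1),
                      param_grid=param_grid,
                      cv=10)
search.fit(mnist_data.drop('class', axis=1).values,
           mnist_data.loc[:, 'class'].values)

print(search.best_params_)
print(search.best_score_)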

Model selection is important

We all love to think that our favorite model will perform well on every machine learning problem, but different models are better suited for different tasks.

For example, if we’re working on a signal processing problem where we need to classify whether there’s a “hill” or “valley” in the time series:

[Figure: example "hill" and "valley" time series]

And we apply a “tuned” random forest to the problem:

import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
  
hill_valley_data = pd.read_csv('https://raw.githubusercontent.com/rhiever/Data-Analysis-and-Machine-Learning-Projects/master/tpot-demo/Hill_Valley_without_noise.csv.gz', sep='\t', compression='gzip')  
  
cv_scores = cross_val_score(RandomForestClassifier(n_estimators=100, n_jobs=-1),  
                            X=hill_valley_data.drop('class', axis=1).values,  
                            y=hill_valley_data.loc[:, 'class'].values,  
                            cv=10)  
  
print(cv_scores)  
  
[ 0.64754098  0.64754098  0.57024793  0.61983471  0.62809917  0.61983471  
  0.70247934  0.59504132  0.49586777  0.65289256]  
  
print(np.mean(cv_scores))  
  
0.617937948787


Then we'll find that the random forest isn't well-suited to signal processing tasks like this one: it achieves a disappointing average of 61.8% cross-validation accuracy.

What if we tried a different model, such as a logistic regression?

cv_scores = cross_val_score(LogisticRegression(),  
                            X=hill_valley_data.drop('class', axis=1).values,  
                            y=hill_valley_data.loc[:, 'class'].values,  
                            cv=10)  
  
print(cv_scores)  
  
[ 1.          1.          1.          0.99173554  1.          0.98347107  
  1.          0.99173554  1.          1.        ]  
  
print(np.mean(cv_scores))  
  
0.996694214876


We’ll find that a logistic regression is well-suited for this signal processing task—in fact, it easily achieves near-100% cross-validation accuracy without any hyperparameter tuning at all.

Always try out several different models for every machine learning task you work on. Trying out, and tuning, different models is another tedious yet vitally important step of machine learning pipeline design.
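One lightweight way to build that habit is to loop over several candidate models and cross-validate each one. A minimal sketch, reusing the hill_valley_data DataFrame from above; the model list is just an illustrative starting point.

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

X = hill_valley_data.drop('class', axis=1).values
y = hill_valley_data.loc[:, 'class'].values

# Candidate models -- an illustrative list, not an exhaustive one
models = [('random forest', RandomForestClassifier(n_estimators=100, n_jobs=-1)),
          ('logistic regression', LogisticRegression()),
          ('k-nearest neighbors', KNeighborsClassifier())]

# Report each model's average cross-validation accuracy
for name, model in models:
    scores = cross_val_score(model, X=X, y=y, cv=10)
    print('{}: {:.3f} average CV accuracy'.format(name, scores.mean()))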