
TPOT: A Python Tool for Automating Data Science


TPOT is an open-source Python data science automation tool, which operates by optimizing a series of feature preprocessors and models, in order to maximize cross-validation accuracy on data sets.



Feature preprocessing is important

 
As we’ve seen in the previous two examples, machine learning model performance is also affected by how the features are represented. Feature preprocessing is a step in machine learning pipelines where we reshape the features in a manner that makes the data set easier for models to classify.

For example, if we’re working on a harder version of the “hill” vs. “valley” signal processing problem with noise:

[Figure: TPOT hills and valleys with noise]

And we apply a “tuned” random forest to the problem:

import pandas as pd  
import numpy as np  
from sklearn.ensemble import RandomForestClassifier  
from sklearn.decomposition import PCA  
from sklearn.pipeline import make_pipeline  
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation was removed in scikit-learn 0.20  
  
hill_valley_noisy_data = pd.read_csv('https://raw.githubusercontent.com/rhiever/Data-Analysis-and-Machine-Learning-Projects/master/tpot-demo/Hill_Valley_with_noise.csv.gz', sep='\t', compression='gzip')  
  
cv_scores = cross_val_score(RandomForestClassifier(n_estimators=100, n_jobs=-1),  
                            X=hill_valley_noisy_data.drop('class', axis=1).values,  
                            y=hill_valley_noisy_data.loc[:, 'class'].values,  
                            cv=10)  
  
print(cv_scores)  
  
[ 0.52459016  0.51639344  0.57377049  0.6147541   0.6557377   0.56557377  
  0.575       0.575       0.60833333  0.575     ]  
  
print(np.mean(cv_scores))  
  
0.578415300546


We’ll again find that the “tuned” random forest averages a disappointing 57.8% cross-validation accuracy.

However, if we preprocess the features—denoising them via Principal Component Analysis (PCA), for example:

cv_scores = cross_val_score(make_pipeline(PCA(n_components=10),  
                                          RandomForestClassifier(n_estimators=100,  
                                                                 n_jobs=-1)),  
                            X=hill_valley_noisy_data.drop('class', axis=1).values,  
                            y=hill_valley_noisy_data.loc[:, 'class'].values,  
                            cv=10)  
  
print(cv_scores)  
  
[ 0.96721311  0.98360656  0.8852459   0.96721311  0.95081967  0.93442623  
  0.91666667  0.89166667  0.94166667  0.95833333]  
  
print(np.mean(cv_scores))  
  
0.93968579235


We’ll find that the random forest now achieves an average of 94% cross-validation accuracy by applying a simple feature preprocessing step.
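To see why PCA can act as a denoiser, here is a minimal sketch on synthetic data (not the hill/valley data set). It assumes each signal is a mix of three smooth curves plus Gaussian noise; the curves, noise level, and variable names are all made up for illustration. Projecting onto the top few principal components and mapping back should discard most of the noise, because the noise is spread across every direction while the signal is not.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)

# Hypothetical synthetic data: 200 signals of length 100, each a random
# mix of three smooth basis curves, plus Gaussian noise.
t = np.linspace(0, 1, 100)
basis = np.vstack([np.sin(2 * np.pi * t), np.cos(2 * np.pi * t), t])
clean = rng.normal(size=(200, 3)) @ basis
noisy = clean + rng.normal(scale=0.5, size=clean.shape)

# Project onto the top 3 principal components, then map back to the
# original feature space. Noise outside that subspace is dropped.
pca = PCA(n_components=3)
denoised = pca.inverse_transform(pca.fit_transform(noisy))

# Mean squared error relative to the noise-free signals.
noisy_mse = np.mean((noisy - clean) ** 2)
denoised_mse = np.mean((denoised - clean) ** 2)
print(noisy_mse, denoised_mse)
```

In the pipeline above, the random forest is trained on the low-dimensional projection itself rather than on the reconstruction, but the intuition is the same: the leading components keep the structure and leave much of the noise behind.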

Always explore numerous feature representations for your data. Machines learn differently from humans, and a feature representation that makes sense to us may not make sense to the machine.
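One way to make this a habit is to loop over a handful of candidate representations and score each with the same model. The sketch below uses a synthetic data set from make_classification and three arbitrarily chosen representations; the names and settings are assumptions for illustration, not a recommended search space.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data set; swap in your own X and y.
X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=5, random_state=0)

# Each candidate is a list of preprocessing steps applied before the model.
candidates = {
    'raw features': [],
    'standardized': [StandardScaler()],
    'PCA (5 components)': [PCA(n_components=5)],
}

results = {}
for name, steps in candidates.items():
    model = make_pipeline(*steps,
                          RandomForestClassifier(n_estimators=50,
                                                 random_state=0))
    # Mean accuracy over 5 cross-validation folds.
    results[name] = cross_val_score(model, X, y, cv=5).mean()

for name, score in results.items():
    print(name, round(score, 3))
```

Which representation wins depends entirely on the data, which is the point: we can't know ahead of time, so we have to try them.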

Automating data science with TPOT

 
To summarize what we’ve learned so far about effective machine learning system design, we should:

  1. Always tune the hyperparameters for our models
  2. Always try out many different models
  3. Always explore numerous feature representations for our data

We must also consider the following:

  1. There are thousands of possible hyperparameter configurations for every model
  2. There are dozens of popular machine learning models
  3. There are dozens of popular feature preprocessing methods
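To get a feel for how quickly these numbers multiply, here is a rough back-of-the-envelope sketch. The grid values and the counts of models and preprocessors are invented for illustration, and it assumes (unrealistically) that every model has a grid of the same size.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical, deliberately small grid for a single random forest.
rf_grid = {
    'n_estimators': [10, 50, 100, 500],
    'max_depth': [None, 5, 10, 20],
    'max_features': ['sqrt', 'log2', None],
    'min_samples_split': [2, 5, 10],
    'criterion': ['gini', 'entropy'],
}
n_rf_configs = len(ParameterGrid(rf_grid))  # 4 * 4 * 3 * 3 * 2 = 288

# Illustrative counts only, not measurements.
n_preprocessors = 12
n_models = 24

# One preprocessor followed by one model, each model with a grid this size.
n_pipelines = n_preprocessors * n_models * n_rf_configs
print(n_rf_configs, n_pipelines)  # prints: 288 82944
```

Even with these modest assumptions, an exhaustive search would mean fitting tens of thousands of pipelines, and that is before chaining more than one preprocessor.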

This is why it can be so tedious to design effective machine learning systems. This is also why my collaborators and I created TPOT, an open source Python tool that intelligently automates the entire process.

If your data set is compatible with scikit-learn, then TPOT will automatically optimize a series of feature preprocessors and models that maximize the cross-validation accuracy on the data set. For example, if we want TPOT to solve the noisy “hill” vs. “valley” classification problem:

(Before running the code below, make sure to install TPOT first.)

import pandas as pd  
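from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20  
from tpot import TPOT  # later TPOT releases name this class TPOTClassifier  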
  
hill_valley_noisy_data = pd.read_csv('https://raw.githubusercontent.com/rhiever/Data-Analysis-and-Machine-Learning-Projects/master/tpot-demo/Hill_Valley_with_noise.csv.gz', sep='\t', compression='gzip')  
  
X = hill_valley_noisy_data.drop('class', axis=1).values  
y = hill_valley_noisy_data.loc[:, 'class'].values  
  
X_train, X_test, y_train, y_test = train_test_split(X, y,  
                                                    train_size=0.75,  
                                                    test_size=0.25)  
  
my_tpot = TPOT(generations=10)  
my_tpot.fit(X_train, y_train)  
  
print(my_tpot.score(X_test, y_test))  
  
0.960352039038


Depending on the machine you’re running it on, 10 TPOT generations should take about 5 minutes to complete. During this time, you’re free to browse Hacker News, refill your cup of coffee, or admire the beautiful weather outside. In the meantime, TPOT will handle all of the work for you.

After about 5 minutes of optimization, TPOT will discover a pipeline that achieves 96% accuracy on the held-out test set for the noisy “hill” vs. “valley” problem, which is better than the hand-designed pipeline we created above!

If we want to see what pipeline TPOT created, TPOT can export the corresponding scikit-learn code for us with the export() command:

my_tpot.export('exported_pipeline.py')


which will look something like:

import pandas as pd  
from sklearn.cross_validation import train_test_split  
from sklearn.linear_model import LogisticRegression  
  
# NOTE: Make sure that the class is labeled 'class' in the data file  
tpot_data = pd.read_csv('https://raw.githubusercontent.com/rhiever/Data-Analysis-and-Machine-Learning-Projects/master/tpot-demo/Hill_Valley_with_noise.csv.gz', sep='\t', compression='gzip')  
  
training_indices, testing_indices = train_test_split(tpot_data.index,  
                                                     stratify=tpot_data['class'].values,  
                                                     train_size=0.75,  
                                                     test_size=0.25)  
  
result1 = tpot_data.copy()  
  
# Perform classification with a logistic regression classifier  
lrc1 = LogisticRegression(C=0.0001)  
lrc1.fit(result1.loc[training_indices].drop('class', axis=1).values,  
         result1.loc[training_indices, 'class'].values)  
result1['lrc1-classification'] = lrc1.predict(result1.drop('class', axis=1).values)


and shows us that a tuned logistic regression is probably the best model for this problem.

We’ve designed TPOT to be an end-to-end automated machine learning system, which can act as a drop-in replacement for any scikit-learn model that you’re currently using in your workflow.
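What makes a drop-in replacement possible is the shared fit() and score() interface. As a minimal sketch, the helper below (a hypothetical function, not part of TPOT or scikit-learn) only relies on those two methods, so it should work with any object that provides them. It is demonstrated with a plain logistic regression on synthetic data so that it runs without TPOT installed.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def holdout_accuracy(estimator, X, y, random_state=0):
    """Fit on 75% of the data and return accuracy on the remaining 25%."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=0.75, test_size=0.25, random_state=random_state)
    estimator.fit(X_train, y_train)
    return estimator.score(X_test, y_test)


# Hypothetical stand-in data set.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
accuracy = holdout_accuracy(LogisticRegression(), X, y)
print(accuracy)

# The TPOT object used earlier in this article is called with the same
# fit() and score() methods, so it could be passed to holdout_accuracy()
# in place of LogisticRegression().
```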

If TPOT sounds like the tool you’ve been looking for, here are a few links that you may find useful:

And as always, please feel free to get in touch.

You can find all of the code used in this article on my GitHub. Enjoy!

Bio: Dr. Randy Olson is a postdoctoral researcher at the University of Pennsylvania. As a member of Prof. Jason H. Moore's research lab, he studies biologically-inspired AI and its applications to biomedical problems.

Original. Reposted with permission.
