2018's Top 7 Python Libraries for Data Science and AI

This is a list of the best libraries that changed our lives this year, compiled from my weekly digests.



Editor's note: This post covers Favio's selections for the top 7 Python libraries of 2018. Tomorrow's post will cover his top 7 R packages of the year. You can also explore additional Python libraries for data science.


 

Introduction

If you follow me, you know that this year I started a series called Weekly Digest for Data Science and AI: Python & R, where I highlighted the best libraries, repos, packages, and tools that help us be better data scientists for all kinds of tasks.

The great folks at Heartbeat sponsored a lot of these digests, and they asked me to create a list of the best of the best—those libraries that really changed or improved the way we worked this year (and beyond).

If you want to read the past digests, take a look here:

Weekly Digest for Data Science and AI - Revue
Personal newsletter of Favio Vázquez (www.getrevue.co)

Disclaimer: This list is based on the libraries and packages I reviewed in my personal newsletter. All of them were trending in one way or another among programmers, data scientists, and AI enthusiasts. Some of them were created before 2018, but if they were still trending this year, they were eligible for the list.

 

Top 7 for Python

 

7. AdaNet — Fast and flexible AutoML with learning guarantees.

 


https://github.com/tensorflow/adanet
AdaNet is a lightweight and scalable TensorFlow AutoML framework for training and deploying adaptive neural networks using the AdaNet algorithm [Cortes et al. ICML 2017]. AdaNet combines several learned subnetworks in order to mitigate the complexity inherent in designing effective neural networks.

This package will help you select optimal neural network architectures by implementing an adaptive algorithm that learns a neural architecture as an ensemble of subnetworks.

You will need to know TensorFlow to use the package because it implements a TensorFlow Estimator, but that also simplifies your machine learning programming by encapsulating training, evaluation, prediction, and export for serving.

You can build an ensemble of neural networks, and the library will help you optimize an objective that balances the trade-offs between the ensemble’s performance on the training set and its ability to generalize to unseen data.

Installation

adanet depends on bug fixes and enhancements not present in TensorFlow releases prior to 1.7. You must install or upgrade your TensorFlow package to at least 1.7:

$ pip install "tensorflow>=1.7.0"

 

Installing from source

To install from source, you’ll first need to install Bazel by following its installation instructions.

Next clone adanet and cd into its root directory:

$ git clone https://github.com/tensorflow/adanet && cd adanet

 

From the adanet root directory run the tests:

$ cd adanet
$ bazel test -c opt //...

 

Once you have verified that everything works well, install adanet as a pip package.
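The exact command for that step isn’t shown here; assuming a standard setup.py at the repo root, something like the following usually does it (double-check the adanet README for your version):

$ pip install .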

You’re now ready to experiment with adanet.

import adanet

 

Usage

Here you can find two examples of how to use the package:

tensorflow/adanet
Fast and flexible AutoML with learning guarantees. (github.com)
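To give you a flavor of the API, here’s a minimal sketch in the spirit of those examples. It assumes the AutoEnsembleEstimator interface from the adanet docs and a TF 1.x-era setup; the random toy data, candidate pool, and step counts are purely illustrative:

import adanet
import numpy as np
import tensorflow as tf

# Random toy data (1,000 examples, 10 features, 3 classes) just to exercise the API
x = np.random.rand(1000, 10).astype(np.float32)
y = np.random.randint(0, 3, size=(1000, 1))

train_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={"x": x}, y=y, batch_size=32, num_epochs=None, shuffle=True)
eval_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={"x": x}, y=y, batch_size=32, num_epochs=1, shuffle=False)

feature_columns = [tf.feature_column.numeric_column("x", shape=[10])]
head = tf.contrib.estimator.multi_class_head(n_classes=3)

# AdaNet learns how to ensemble the candidates in this pool
estimator = adanet.AutoEnsembleEstimator(
    head=head,
    candidate_pool=[
        tf.contrib.estimator.LinearEstimator(
            head=head, feature_columns=feature_columns),
        tf.contrib.estimator.DNNEstimator(
            head=head, feature_columns=feature_columns,
            hidden_units=[64, 32]),
    ],
    max_iteration_steps=100)

# Train and evaluate like any other TensorFlow Estimator
estimator.train(input_fn=train_input_fn, max_steps=300)
print(estimator.evaluate(input_fn=eval_input_fn))

Because the result is a regular TensorFlow Estimator, training, evaluation, prediction, and export for serving all work the same way they do for any other tf.estimator model.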

You can read more about it in the original blog post:

Introducing AdaNet: Fast and Flexible AutoML with Learning Guarantees
Posted by Charles Weill, Software Engineer, Google AI, NYC (ai.googleblog.com)

 

6. TPOT — An automated Python machine learning tool that optimizes machine learning pipelines using genetic programming.

 


https://github.com/EpistasisLab/tpot
Previously I talked about Auto-Keras, a great library for AutoML in the Pythonic world. Well, I have another very interesting tool for that.

The name is TPOT (Tree-based Pipeline Optimization Tool), and it’s an amazing library. It’s basically a Python automated machine learning tool that optimizes machine learning pipelines using genetic programming.

TPOT can automate a lot of things: feature selection, model selection, feature construction, and much more. Luckily, TPOT is built on top of Scikit-learn, so if you’re a Python machine learning practitioner, all of the code it generates should look familiar.

What it does is automate the most tedious parts of machine learning by intelligently exploring thousands of possible pipelines to find the best one for your data, and then it provides you with the Python code for the best pipeline it found so you can tinker with the pipeline from there.

You can see how it works in the pipeline diagram in the TPOT documentation.

For more details, you can read these great articles by Matthew Mayo:

Using AutoML to Generate Machine Learning Pipelines with TPOT (www.kdnuggets.com)

and Randy Olson:

TPOT: A Python Tool for Automating Data Science
By Randy Olson, University of Pennsylvania (www.kdnuggets.com)

Installation

You actually need to follow some instructions before installing TPOT. Here they are:

Installation — TPOT
Optionally, you can install XGBoost if you would like TPOT to use the eXtreme Gradient Boosting models. (epistasislab.github.io)

After that you can just run:

pip install tpot

 

Examples:

First let’s start with the basic Iris dataset:

So here we built a very basic TPOT run that searches for the best ML pipeline to predict iris.target, and then we export that pipeline. After that, what we have to do is very simple: open the generated .py file, and you’ll see:

import numpy as np

from sklearn.kernel_approximation import RBFSampler
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# NOTE: Make sure that the class is labeled 'class' in the data file
tpot_data = np.recfromcsv('PATH/TO/DATA/FILE', delimiter='COLUMN_SEPARATOR', dtype=np.float64)
features = np.delete(tpot_data.view(np.float64).reshape(tpot_data.size, -1), tpot_data.dtype.names.index('class'), axis=1)
training_features, testing_features, training_classes, testing_classes = \
    train_test_split(features, tpot_data['class'], random_state=42)

exported_pipeline = make_pipeline(
    RBFSampler(gamma=0.8500000000000001),
    DecisionTreeClassifier(criterion="entropy", max_depth=3, min_samples_leaf=4, min_samples_split=9)
)

exported_pipeline.fit(training_features, training_classes)
results = exported_pipeline.predict(testing_features)

 

And that’s it. You built a classifier for the Iris dataset in a simple but powerful way.

Let’s go the MNIST dataset now:

As you can see, we did the same thing. Open the generated .py file again and you’ll see:

import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# NOTE: Make sure that the class is labeled 'class' in the data file
tpot_data = np.recfromcsv('PATH/TO/DATA/FILE', delimiter='COLUMN_SEPARATOR', dtype=np.float64)
features = np.delete(tpot_data.view(np.float64).reshape(tpot_data.size, -1), tpot_data.dtype.names.index('class'), axis=1)
training_features, testing_features, training_classes, testing_classes = \
    train_test_split(features, tpot_data['class'], random_state=42)

exported_pipeline = KNeighborsClassifier(n_neighbors=4, p=2, weights="distance")

exported_pipeline.fit(training_features, training_classes)
results = exported_pipeline.predict(testing_features)

 

Super easy and fun. Check out the examples, try it, and please give the project a star!

 

5. SHAP — A unified approach to explain the output of any machine learning model

 


https://github.com/slundberg/shap
Explaining machine learning models isn’t always easy. Yet it’s so important for a range of business applications. Luckily, there are some great libraries that help us with this task. In many applications, we need to know, understand, or prove how input variables are used in the model, and how they impact final model predictions.

SHAP (SHapley Additive exPlanations) is a unified approach to explain the output of any machine learning model. SHAP connects game theory with local explanations, uniting several previous methods and representing the only possible consistent and locally accurate additive feature attribution method based on expectations.
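Concretely, “additive feature attribution” means the explanation itself is a simple linear model over simplified binary inputs. In the notation of the SHAP paper:

$$ g(z') = \phi_0 + \sum_{i=1}^{M} \phi_i z'_i $$

where z'_i is 1 if feature i is present and 0 otherwise, M is the number of features, and phi_i is the Shapley value credited to feature i.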

Installation

SHAP can be installed from PyPI

pip install shap

 

or conda-forge

conda install -c conda-forge shap

 

Usage

There are tons of different models and ways to use the package. Here, I’ll show one example using the DeepExplainer.

Deep SHAP is a high-speed approximation algorithm for SHAP values in deep learning models that builds on a connection with DeepLIFT, as described in the SHAP NIPS paper, “A Unified Approach to Interpreting Model Predictions” (arXiv:1705.07874). A related paper on consistent feature attribution for tree ensembles is here:

[1802.03888] Consistent Individualized Feature Attribution for Tree Ensembles
Interpreting predictions from tree ensemble methods such as gradient boosting machines and random forests. (arxiv.org)

Here you can see how SHAP can be used to explain the result of a Keras model for the MNIST dataset:
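The notebooks in the repo are the authoritative version; as a rough, self-contained sketch of the same idea (the tiny Keras CNN and training settings below are my own illustrative choices, not the notebook’s, and this assumes a SHAP version whose DeepExplainer supports Keras models), it looks something like this:

import numpy as np
import shap
import tensorflow as tf

# Load MNIST and scale pixels to [0, 1]; add a channel dimension for the CNN
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train[..., np.newaxis].astype("float32") / 255.0
x_test = x_test[..., np.newaxis].astype("float32") / 255.0

# A small CNN, just so there is a model to explain (architecture is illustrative)
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, activation="relu", input_shape=(28, 28, 1)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(x_train, y_train, epochs=1, batch_size=128)

# DeepExplainer needs a set of background samples to estimate expected values
background = x_train[np.random.choice(x_train.shape[0], 100, replace=False)]
explainer = shap.DeepExplainer(model, background)

# Explain a few test images and visualize the per-pixel attributions
shap_values = explainer.shap_values(x_test[:4])
shap.image_plot(shap_values, -x_test[:4])

The image plot then shows, for each class, which pixels pushed the model’s prediction toward or away from that class.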

You can find more examples here:

slundberg/shap
A unified approach to explain the output of any machine learning model. (github.com)

Take a look. You’ll be surprised :)