2018’s Top 7 Python Libraries for Data Science and AI
This is a list of the best libraries that changed our lives this year, compiled from my weekly digests.
Editor's note: This post covers Favio's selections for the top 7 Python libraries of 2018. Tomorrow's post will cover his top 7 R packages of the year. You can also explore additional Python libraries for data science.
Introduction
If you follow me, you know that this year I started a series called Weekly Digest for Data Science and AI: Python & R, where I highlighted the best libraries, repos, packages, and tools that help us be better data scientists for all kinds of tasks.
The great folks at Heartbeat sponsored a lot of these digests, and they asked me to create a list of the best of the best—those libraries that really changed or improved the way we worked this year (and beyond).
If you want to read the past digests, take a look here:
Disclaimer: This list is based on the libraries and packages I reviewed in my personal newsletter. All of them were trending in one way or another among programmers, data scientists, and AI enthusiasts. Some of them were created before 2018, but if they were still trending this year, they were fair game for this list.
Top 7 for Python
7. AdaNet — Fast and flexible AutoML with learning guarantees.
https://github.com/tensorflow/adanet
This package helps you select optimal neural network architectures by implementing an adaptive algorithm that learns a neural architecture as an ensemble of subnetworks.
You will need to know TensorFlow to use the package, because it implements a TensorFlow Estimator; in return, it simplifies your machine learning programming by encapsulating training as well as evaluation, prediction, and export for serving.
You can build an ensemble of neural networks, and the library will help you optimize an objective that balances the trade-offs between the ensemble’s performance on the training set and its ability to generalize to unseen data.
Installation
adanet depends on bug fixes and enhancements not present in TensorFlow releases prior to 1.7. You must install or upgrade your TensorFlow package to at least 1.7:
$ pip install "tensorflow>=1.7.0"
Installing from source
To install from source, you’ll first need to install bazel following their installation instructions.
Next, clone adanet and cd into its root directory:
$ git clone https://github.com/tensorflow/adanet && cd adanet
From the adanet root directory, run the tests:
$ cd adanet
$ bazel test -c opt //...
Once you have verified that everything works well, install adanet as a pip package.
You’re now ready to experiment with adanet:
import adanet
Usage
Here you can find two examples of how to use the package:
tensorflow/adanet: Fast and flexible AutoML with learning guarantees (https://github.com/tensorflow/adanet)
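To give you a feel for the API, here’s a minimal sketch (my own, not one of the official examples) that uses adanet.AutoEnsembleEstimator to learn an ensemble over a pool of standard TensorFlow estimators on a toy regression problem. The feature column, toy data, and hyperparameters are just illustrative:

import adanet
import tensorflow as tf

# Toy regression problem with a single numeric feature (y = 2x).
feature_columns = [tf.feature_column.numeric_column("x")]
head = tf.contrib.estimator.regression_head()

def input_fn():
    features = {"x": tf.constant([[1.0], [2.0], [3.0], [4.0]])}
    labels = tf.constant([[2.0], [4.0], [6.0], [8.0]])
    return features, labels

# AutoEnsembleEstimator searches over a pool of candidate estimators
# and learns how to combine them into an ensemble of subnetworks.
estimator = adanet.AutoEnsembleEstimator(
    head=head,
    candidate_pool=[
        tf.estimator.LinearEstimator(head=head, feature_columns=feature_columns),
        tf.estimator.DNNEstimator(head=head, feature_columns=feature_columns,
                                  hidden_units=[32, 32]),
    ],
    max_iteration_steps=100)

estimator.train(input_fn=input_fn, steps=300)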
You can read more about it in the original blog post:
6. TPOT — An automated Python machine learning tool that optimizes machine learning pipelines using genetic programming.
https://github.com/EpistasisLab/tpot
TPOT stands for Tree-based Pipeline Optimization Tool, and it’s an amazing library: an automated machine learning tool for Python that optimizes machine learning pipelines using genetic programming.
TPOT can automate a lot of stuff like feature selection, model selection, feature construction, and much more. Luckily, if you’re a Python machine learner, TPOT is built on top of Scikit-learn, so all of the code it generates should look familiar.
What it does is automate the most tedious parts of machine learning by intelligently exploring thousands of possible pipelines to find the best one for your data, and then it provides you with the Python code for the best pipeline it found so you can tinker with the pipeline from there.
This is how it works:
For more details, you can read these great articles by Matthew Mayo:
and Randy Olson:
Installation
You actually need to follow some instructions before installing TPOT. Here they are:
After that you can just run:
pip install tpot
Examples:
First let’s start with the basic Iris dataset:
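Something along the lines of the basic Iris example from the TPOT documentation works here; the generations, population size, and output filename are just illustrative choices:

from tpot import TPOTClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import numpy as np

# Load Iris and split it into training and test sets.
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data.astype(np.float64), iris.target.astype(np.float64),
    train_size=0.75, test_size=0.25, random_state=42)

# Let TPOT search for a good pipeline using genetic programming.
tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))

# Export the best pipeline it found as plain Python code.
tpot.export('tpot_iris_pipeline.py')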
So here we built a very basic TPOT pipeline that searches for the best ML pipeline to predict iris.target, and then we export that pipeline. After that, what we have to do is very simple: open the .py file TPOT generated and you’ll see:
import numpy as np
from sklearn.kernel_approximation import RBFSampler
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# NOTE: Make sure that the class is labeled 'class' in the data file
tpot_data = np.recfromcsv('PATH/TO/DATA/FILE', delimiter='COLUMN_SEPARATOR', dtype=np.float64)
features = np.delete(tpot_data.view(np.float64).reshape(tpot_data.size, -1),
                     tpot_data.dtype.names.index('class'), axis=1)
training_features, testing_features, training_classes, testing_classes = \
    train_test_split(features, tpot_data['class'], random_state=42)

exported_pipeline = make_pipeline(
    RBFSampler(gamma=0.8500000000000001),
    DecisionTreeClassifier(criterion="entropy", max_depth=3, min_samples_leaf=4, min_samples_split=9)
)

exported_pipeline.fit(training_features, training_classes)
results = exported_pipeline.predict(testing_features)
And that’s it. You built a classifier for the Iris dataset in a simple but powerful way.
Let’s move on to the MNIST dataset now:
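Again, this is a sketch in the spirit of TPOT’s MNIST example, using scikit-learn’s bundled digits dataset as a stand-in:

from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# scikit-learn's bundled 8x8 digits dataset stands in for MNIST here.
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target,
    train_size=0.75, test_size=0.25, random_state=42)

tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))

# Export the winning pipeline as a standalone .py file.
tpot.export('tpot_mnist_pipeline.py')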
As you can see, we did the same thing! Open the .py file it generated again and you’ll see:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# NOTE: Make sure that the class is labeled 'class' in the data file
tpot_data = np.recfromcsv('PATH/TO/DATA/FILE', delimiter='COLUMN_SEPARATOR', dtype=np.float64)
features = np.delete(tpot_data.view(np.float64).reshape(tpot_data.size, -1),
                     tpot_data.dtype.names.index('class'), axis=1)
training_features, testing_features, training_classes, testing_classes = \
    train_test_split(features, tpot_data['class'], random_state=42)

exported_pipeline = KNeighborsClassifier(n_neighbors=4, p=2, weights="distance")

exported_pipeline.fit(training_features, training_classes)
results = exported_pipeline.predict(testing_features)
Super easy and fun. Check it out, try it, and please give the repo a star!
5. SHAP — A unified approach to explain the output of any machine learning model
https://github.com/slundberg/shap
SHAP (SHapley Additive exPlanations) is a unified approach to explain the output of any machine learning model. SHAP connects game theory with local explanations, uniting several previous methods and representing the only possible consistent and locally accurate additive feature attribution method based on expectations.
Installation
SHAP can be installed from PyPI
pip install shap
or conda-forge
conda install -c conda-forge shap
Usage
There are tons of different models and ways to use the package. Here, I’ll take one example using the DeepExplainer.
Deep SHAP is a high-speed approximation algorithm for SHAP values in deep learning models that builds on a connection with DeepLIFT, as described in the SHAP NIPS paper that you can read here:
Here you can see how SHAP can be used to explain the result of a Keras model for the MNIST dataset:
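A rough sketch of that workflow, assuming a small Keras CNN trained on MNIST (the architecture, the single training epoch, and the background sample of 100 images are choices I made for illustration), looks like this:

import numpy as np
import shap
from tensorflow import keras

# Load and normalize MNIST.
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 28, 28, 1).astype("float32") / 255.0
x_test = x_test.reshape(-1, 28, 28, 1).astype("float32") / 255.0

# A tiny CNN, just enough to have something to explain.
model = keras.Sequential([
    keras.layers.Conv2D(16, 3, activation="relu", input_shape=(28, 28, 1)),
    keras.layers.Flatten(),
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(x_train, y_train, epochs=1, batch_size=128)

# DeepExplainer (Deep SHAP) uses a background sample as the reference
# distribution when approximating SHAP values for a deep model.
background = x_train[np.random.choice(x_train.shape[0], 100, replace=False)]
explainer = shap.DeepExplainer(model, background)
shap_values = explainer.shap_values(x_test[:4])

# Plot per-pixel attributions for the first four test digits.
shap.image_plot(shap_values, -x_test[:4])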
You can find more examples here:
Take a look. You’ll be surprised :)