2018’s Top 7 Python Libraries for Data Science and AI
This is a list of the best libraries that changed our lives this year, compiled from my weekly digests.
Editor's note: This post covers Favio's selections for the top 7 Python libraries of 2018. Tomorrow's post will cover his top 7 R packages of the year. You can also explore additional Python libraries for data science.
If you follow me, you know that this year I started a series called Weekly Digest for Data Science and AI: Python & R, where I highlighted the best libraries, repos, packages, and tools that help us be better data scientists for all kinds of tasks.
The great folks at Heartbeat sponsored a lot of these digests, and they asked me to create a list of the best of the best—those libraries that really changed or improved the way we worked this year (and beyond).
If you want to read the past digests, take a look here:
Weekly Digest for Data Science and AI - Personal newsletter of Favio Vázquez... (www.getrevue.co)
Disclaimer: This list is based on the libraries and packages I reviewed in my personal newsletter. All of them were trending in one way or another among programmers, data scientists, and AI enthusiasts. Some of them were created before 2018, but if they were trending, they could be considered.
Top 7 for Python
7. AdaNet — Fast and flexible AutoML with learning guarantees.
This package helps you select optimal neural network architectures by implementing an adaptive algorithm that learns a neural architecture as an ensemble of subnetworks.
You will need to know TensorFlow to use the package, because it implements a TensorFlow Estimator. In return, the Estimator simplifies your machine learning code by encapsulating training, evaluation, prediction, and export for serving.
You can build an ensemble of neural networks, and the library will help you optimize an objective that balances the trade-offs between the ensemble’s performance on the training set and its ability to generalize to unseen data.
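To make that trade-off concrete, here is a small sketch of a complexity-regularized objective. It is only loosely modeled on the kind of objective described for AdaNet and is not the library's code: the function names, the candidate losses, and the complexity scores are all made-up values for illustration.

```python
def complexity_regularized_objective(train_loss, subnetworks, lam, beta):
    """Training loss plus a penalty that grows with each subnetwork's complexity.

    `subnetworks` is a list of (mixture_weight, complexity) pairs.
    """
    penalty = sum((lam * complexity + beta) * abs(weight)
                  for weight, complexity in subnetworks)
    return train_loss + penalty


# Hypothetical candidates: the deeper one fits the training set slightly
# better but carries a higher (made-up) complexity score.
CANDIDATES = {
    "shallow": {"train_loss": 0.30, "subnetworks": [(1.0, 1.0)]},
    "deep": {"train_loss": 0.25, "subnetworks": [(1.0, 4.0)]},
}


def pick_best(candidates, lam, beta):
    """Return the name of the candidate with the lowest objective."""
    return min(
        candidates,
        key=lambda name: complexity_regularized_objective(
            candidates[name]["train_loss"], candidates[name]["subnetworks"], lam, beta
        ),
    )


print(pick_best(CANDIDATES, lam=0.0, beta=0.0))   # no penalty: "deep" wins on training loss
print(pick_best(CANDIDATES, lam=0.1, beta=0.01))  # with penalty: "shallow" wins
```

With the penalty switched off, the candidate with the lowest training loss is chosen. Once complexity is penalized, the simpler candidate wins, which is the balance between fitting the training set and generalizing that the objective is meant to capture.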
adanet depends on bug fixes and enhancements not present in TensorFlow releases prior to 1.7. You must install or upgrade your TensorFlow package to at least 1.7:
$ pip install "tensorflow>=1.7.0"
Installing from source
To install from source, you’ll first need to install bazel following their installation instructions.
Next, clone adanet and cd into its root directory:
$ git clone https://github.com/tensorflow/adanet && cd adanet
From the adanet root directory, run the tests:
$ cd adanet
$ bazel test -c opt //...
Once you have verified that everything works well, install adanet as a pip package.
You’re now ready to experiment with adanet.
Here you can find two examples on the usage of the package:
tensorflow/adanet: Fast and flexible AutoML with learning guarantees. (github.com)
You can read more about it in the original blog post:
Introducing AdaNet: Fast and Flexible AutoML with Learning Guarantees
Posted by Charles Weill, Software Engineer, Google AI, NYC. Ensemble learning, the art of combining different machine… (ai.googleblog.com)
6. TPOT— An automated Python machine learning tool that optimizes machine learning pipelines using genetic programming.
TPOT stands for Tree-based Pipeline Optimization Tool, and it’s an amazing library. It’s an automated machine learning tool for Python that optimizes machine learning pipelines using genetic programming.
TPOT can automate a lot of work, such as feature selection, model selection, feature construction, and much more. It is built on top of scikit-learn, so if you already do machine learning in Python, all of the code it generates should look familiar.
What it does is automate the most tedious parts of machine learning by intelligently exploring thousands of possible pipelines to find the best one for your data, and then it provides you with the Python code for the best pipeline it found so you can tinker with the pipeline from there.
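To give a feel for this kind of search, here is a toy evolutionary loop over a tiny, made-up pipeline space. It is a simplified sketch of the general idea (keep the fittest candidates, then recombine and mutate them) and not TPOT's actual implementation. The stage names, the options, and the scores that stand in for cross-validated accuracy are all hypothetical.

```python
import random

# Hypothetical search space: one choice per pipeline stage.
SEARCH_SPACE = {
    "scaler": ["none", "standard", "minmax"],
    "selector": ["none", "variance", "k_best"],
    "model": ["tree", "knn", "logreg"],
}

# Made-up per-choice scores standing in for cross-validated accuracy.
TOY_SCORES = {
    "none": 0.0, "standard": 0.2, "minmax": 0.1,
    "variance": 0.1, "k_best": 0.3,
    "tree": 0.2, "knn": 0.4, "logreg": 0.3,
}


def random_pipeline(rng):
    return {stage: rng.choice(options) for stage, options in SEARCH_SPACE.items()}


def fitness(pipeline):
    return sum(TOY_SCORES[choice] for choice in pipeline.values())


def mutate(pipeline, rng):
    """Re-draw the choice for one randomly selected stage."""
    child = dict(pipeline)
    stage = rng.choice(list(SEARCH_SPACE))
    child[stage] = rng.choice(SEARCH_SPACE[stage])
    return child


def crossover(a, b, rng):
    """Take each stage from one of the two parents at random."""
    return {stage: rng.choice([a[stage], b[stage]]) for stage in SEARCH_SPACE}


def evolve(generations=5, population_size=10, seed=0):
    rng = random.Random(seed)
    population = [random_pipeline(rng) for _ in range(population_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: population_size // 2]  # keep the fitter half
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(population_size - len(parents))
        ]
        population = parents + children
    return max(population, key=fitness)


best = evolve()
print(best, fitness(best))
```

Because the fitter half of each generation is carried over unchanged, the best score never gets worse from one generation to the next. A real tool scores each candidate by training and cross-validating it, which is where the compute time goes.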
This is how it works:
For more details, you can read these great articles by Matthew Mayo:
Using AutoML to Generate Machine Learning Pipelines with TPOT
Thus far in this series of posts we have: This post will take a different approach to constructing pipelines. Certainly… (www.kdnuggets.com)
and Randy Olson:
TPOT: A Python Tool for Automating Data Science
By Randy Olson, University of Pennsylvania. Machine learning is often touted as: A field of study that gives computers… (www.kdnuggets.com)
You actually need to follow some instructions before installing TPOT. Here they are:
Installation — TPOT
Optionally, you can install XGBoost if you would like TPOT to use the eXtreme Gradient Boosting models. XGBoost is… (epistasislab.github.io)
After that you can just run:
pip install tpot
First let’s start with the basic Iris dataset:
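A minimal sketch of such a run, assuming the classic TPOT API (the 0.x releases current at the time of writing) and scikit-learn are installed. The function name `run_tpot_iris`, the search settings, and the output file name `tpot_iris_pipeline.py` are illustrative choices:

```python
def run_tpot_iris(generations=5, population_size=50,
                  export_path="tpot_iris_pipeline.py"):
    """Search for a pipeline that predicts iris.target and export the best one."""
    # Imports live inside the function so this file loads even without TPOT installed.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from tpot import TPOTClassifier

    iris = load_iris()
    X_train, X_test, y_train, y_test = train_test_split(
        iris.data, iris.target, train_size=0.75, test_size=0.25, random_state=42
    )

    # A small search budget keeps the run short; larger values explore more pipelines.
    tpot = TPOTClassifier(
        generations=generations,
        population_size=population_size,
        verbosity=2,
        random_state=42,
    )
    tpot.fit(X_train, y_train)

    accuracy = tpot.score(X_test, y_test)
    tpot.export(export_path)  # writes the best pipeline as plain scikit-learn code
    return accuracy
```

Calling `run_tpot_iris()` runs the search, prints progress as it goes, and writes the best pipeline it found to the export path.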
So here we built a very basic TPOT pipeline that searches for the best ML pipeline to predict iris.target, and then we saved that pipeline. After that, what we have to do is very simple: load the .py file you generated and you’ll see:
import numpy as np

from sklearn.kernel_approximation import RBFSampler
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# NOTE: Make sure that the class is labeled 'class' in the data file
tpot_data = np.recfromcsv('PATH/TO/DATA/FILE', delimiter='COLUMN_SEPARATOR', dtype=np.float64)
features = np.delete(tpot_data.view(np.float64).reshape(tpot_data.size, -1), tpot_data.dtype.names.index('class'), axis=1)
training_features, testing_features, training_classes, testing_classes = \
    train_test_split(features, tpot_data['class'], random_state=42)

exported_pipeline = make_pipeline(
    RBFSampler(gamma=0.8500000000000001),
    DecisionTreeClassifier(criterion="entropy", max_depth=3, min_samples_leaf=4, min_samples_split=9)
)

exported_pipeline.fit(training_features, training_classes)
results = exported_pipeline.predict(testing_features)
And that’s it. You built a classifier for the Iris dataset in a simple but powerful way.
Let’s move on to the MNIST dataset now:
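A minimal sketch of the digits run, again assuming the classic TPOT API (0.x releases). Note the assumption that the small 8x8 digits dataset bundled with scikit-learn stands in for MNIST here. The function name `run_tpot_digits` and the file name `tpot_mnist_pipeline.py` are illustrative:

```python
def run_tpot_digits(generations=5, population_size=50,
                    export_path="tpot_mnist_pipeline.py"):
    """Search for a digit-classification pipeline and export the best one."""
    # Imports live inside the function so this file loads even without TPOT installed.
    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from tpot import TPOTClassifier

    # Assumption: scikit-learn's 8x8 digits dataset, a small stand-in for MNIST.
    digits = load_digits()
    X_train, X_test, y_train, y_test = train_test_split(
        digits.data, digits.target, train_size=0.75, test_size=0.25, random_state=42
    )

    tpot = TPOTClassifier(
        generations=generations,
        population_size=population_size,
        verbosity=2,
        random_state=42,
    )
    tpot.fit(X_train, y_train)

    accuracy = tpot.score(X_test, y_test)
    tpot.export(export_path)  # writes the best pipeline as plain scikit-learn code
    return accuracy
```

Only the dataset changes compared with the Iris run; the search, scoring, and export steps are the same.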
As you can see, we did the same! Let’s load the .py file you generated again and you’ll see:
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# NOTE: Make sure that the class is labeled 'class' in the data file
tpot_data = np.recfromcsv('PATH/TO/DATA/FILE', delimiter='COLUMN_SEPARATOR', dtype=np.float64)
features = np.delete(tpot_data.view(np.float64).reshape(tpot_data.size, -1), tpot_data.dtype.names.index('class'), axis=1)
training_features, testing_features, training_classes, testing_classes = \
    train_test_split(features, tpot_data['class'], random_state=42)

exported_pipeline = KNeighborsClassifier(n_neighbors=4, p=2, weights="distance")

exported_pipeline.fit(training_features, training_classes)
results = exported_pipeline.predict(testing_features)
Super easy and fun. Check TPOT out, try it, and please give the project a star!
5. SHAP — A unified approach to explain the output of any machine learning model
SHAP (SHapley Additive exPlanations) is a unified approach to explain the output of any machine learning model. SHAP connects game theory with local explanations, uniting several previous methods and representing the only possible consistent and locally accurate additive feature attribution method based on expectations.
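To see what "locally accurate additive attribution" means, here is a tiny exact Shapley value computation in plain Python. It is a sketch of the underlying game-theory idea and not the shap library's code. As a simplifying assumption, a single baseline input stands in for the expectation that SHAP takes over background data, and `toy_model` is a made-up function.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values for one prediction.

    Features outside a coalition are replaced by their baseline value.
    The cost is exponential in the number of features, so this only
    works for toy problems.
    """
    n = len(x)

    def value(subset):
        mixed = [x[i] if i in subset else baseline[i] for i in range(n)]
        return model(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Weight of every coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = set(subset)
                total += weight * (value(coalition | {i}) - value(coalition))
        phi.append(total)
    return phi


def toy_model(features):
    a, b, c = features
    return 2 * a + b * c  # one additive term and one interaction term


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)  # approximately [2.0, 3.0, 3.0]

# Local accuracy: the attributions add up to the gap between the
# prediction and the baseline prediction.
print(sum(phi), toy_model(x) - toy_model(baseline))
```

The additive feature gets exactly its own contribution, while the two interacting features split their joint effect evenly. The attributions sum to the difference between the prediction and the baseline prediction, which is the local accuracy property.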
SHAP can be installed from either PyPI or conda-forge:
pip install shap
or
conda install -c conda-forge shap
There are tons of different models and ways to use the package. Here, I’ll take one example from the DeepExplainer.
Deep SHAP is a high-speed approximation algorithm for SHAP values in deep learning models that builds on a connection with DeepLIFT, as described in the SHAP NIPS paper that you can read here:
[1802.03888] Consistent Individualized Feature Attribution for Tree Ensembles
Abstract: Interpreting predictions from tree ensemble methods such as gradient boosting machines and random forests is… (arxiv.org)
Here you can see how SHAP can be used to explain the result of a Keras model for the MNIST dataset:
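A minimal sketch of that workflow, assuming you already have a trained Keras MNIST classifier (`model`) and NumPy arrays `x_train` and `x_test` shaped the way the model expects. The wrapper function `explain_with_deep_shap` and its default values are illustrative, and whether DeepExplainer works with your setup can depend on your TensorFlow and shap versions.

```python
def explain_with_deep_shap(model, x_train, x_test,
                           n_background=100, n_explain=4, seed=0):
    """Explain a trained Keras image classifier on a few test images."""
    # Imports live inside the function so this file loads even without shap installed.
    import numpy as np
    import shap

    # Background examples approximate the expectation that the
    # SHAP values are measured against.
    rng = np.random.default_rng(seed)
    background_idx = rng.choice(x_train.shape[0], size=n_background, replace=False)
    background = x_train[background_idx]

    explainer = shap.DeepExplainer(model, background)
    shap_values = explainer.shap_values(x_test[:n_explain])

    # Plot per-pixel attributions for each output class; the images are
    # negated only so the digits display dark on a light background.
    shap.image_plot(shap_values, -x_test[:n_explain])
    return shap_values
```

In the resulting plot, each row is one test image and each column is one output class, with pixels colored by how much they push the prediction toward or away from that class.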
You can find more examples here:
slundberg/shap: A unified approach to explain the output of any machine learning model. (github.com)
Take a look. You’ll be surprised :)