GitHub Python Data Science Spotlight: AutoML, NLP, Visualization, ML Workflows

This post covers a wide spectrum of data science projects, all of which are open source and hosted on GitHub.



This post will spotlight a select group of open source Python data science projects with GitHub repos.

A previous incarnation of this post series detailed "machine learning projects you could no longer overlook." This time around we expand to include a wider spectrum of data science projects, all of which are open source and reside on GitHub. The list is clearly subjective, being composed of code I have come across and found interesting or useful for one reason or another. For each entry I have included links to the respective repo, documentation, a getting started guide or similar, and a descriptive excerpt from the documentation.

Sit back and enjoy these projects which you may or may not be familiar with, and hopefully you find something you can use in your own work.

 

1. Auto-Keras - An automated machine learning (AutoML) package

 
Repository: https://github.com/jhfjhfj1/autokeras
Documentation: http://autokeras.com
Getting started: https://autokeras.com/#example

Auto-Keras is an open source software library for automated machine learning (AutoML). The ultimate goal of AutoML is to make deep learning models easily accessible to domain experts with limited data science or machine learning background. Auto-Keras provides functions to automatically search for architecture and hyperparameters of deep learning models.
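
To make "automatically search for architecture and hyperparameters" a bit more concrete, here is a toy random search written with only the standard library. It is a sketch of the general idea, not Auto-Keras's API or its actual search algorithm. The search space, the `toy_score` function (a stand-in for training a network and measuring validation accuracy), and every name in it are made up for illustration.

```python
import random

# Hypothetical search space: each key is a design choice, each list its options.
SEARCH_SPACE = {
    "num_layers": [1, 2, 3, 4],
    "units": [16, 32, 64, 128],
    "learning_rate": [0.1, 0.01, 0.001],
}


def toy_score(config):
    """Stand-in for 'train the model and return validation accuracy'.

    A real AutoML system would fit a network here. This made-up formula
    simply peaks at 3 layers, 64 units and a learning rate of 0.01.
    """
    penalty = (
        abs(config["num_layers"] - 3) * 0.05
        + abs(config["units"] - 64) / 640
        + abs(config["learning_rate"] - 0.01)
    )
    return 1.0 - penalty


def random_search(space, score_fn, trials=20, seed=0):
    """Sample `trials` random configurations and keep the best-scoring one."""
    rng = random.Random(seed)  # seeded so repeated runs give the same result
    best_config, best_score = None, float("-inf")
    for _ in range(trials):
        config = {name: rng.choice(options) for name, options in space.items()}
        score = score_fn(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


if __name__ == "__main__":
    config, score = random_search(SEARCH_SPACE, toy_score)
    print(config, round(score, 3))
```

An AutoML library takes over both halves of this loop: proposing candidate models more cleverly than uniform sampling, and doing the real training and evaluation that `toy_score` only imitates.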

 

2. Finetune - Scikit-learn style model finetuning for NLP

 
Repository: https://github.com/IndicoDataSolutions/finetune
Documentation: https://finetune.indico.io
Getting started: https://finetune.indico.io

Finetune ships with a pre-trained language model from "Improving Language Understanding by Generative Pre-Training" and builds off the OpenAI/finetune-language-model repository. Huge thanks to Alec Radford for his hard work and quality research.

 

3. GluonNLP - NLP made easy

 
Repository: https://github.com/dmlc/gluon-nlp
Documentation: http://gluon-nlp.mxnet.io
Getting started: https://github.com/dmlc/gluon-nlp#quick-start-guide

GluonNLP is a toolkit that enables easy text preprocessing, datasets loading and neural models building to help you speed up your Natural Language Processing (NLP) research.

 

4. animatplot - A Python package for animating plots, built on matplotlib

 
Repository: https://github.com/t-makaro/animatplot
Documentation: https://animatplot.readthedocs.io/en/latest
Getting started: https://animatplot.readthedocs.io/en/latest/tutorial/getting_started.html

Note: Documentation to pull quotes from for this project is slim, so here's something more appropriate, all things considered:

 

5. MLflow - Open source platform for the machine learning lifecycle

 
Repository: https://github.com/mlflow/mlflow
Documentation: https://mlflow.org/docs/latest/index.html
Getting started: https://mlflow.org/docs/latest/quickstart.html

MLflow is an open source platform for managing the end-to-end machine learning lifecycle. It tackles three primary functions:

  • Tracking experiments to record and compare parameters and results (MLflow Tracking).
  • Packaging ML code in a reusable, reproducible form in order to share with other data scientists or transfer to production (MLflow Projects).
  • Managing and deploying models from a variety of ML libraries to a variety of model serving and inference platforms (MLflow Models).

MLflow is library-agnostic. You can use it with any machine learning library, and in any programming language, since all functions are accessible through a REST API and CLI. For convenience, the project also includes a Python API.
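
As a rough illustration of the REST route mentioned above, the sketch below builds (but does not send) an HTTP request that would log a single metric to an existing run. It uses only the Python standard library. The endpoint path and JSON field names follow the MLflow REST API documentation as I understand it for recent versions. Older releases used `run_uuid` rather than `run_id`, so check the docs for the version you run. The server URL and run ID are placeholders, and `build_log_metric_request` is a hypothetical helper, not part of MLflow.

```python
import json
import time
import urllib.request


def build_log_metric_request(base_url, run_id, key, value, timestamp_ms=None):
    """Build (but do not send) a POST request that logs one metric to a run."""
    if timestamp_ms is None:
        # The API expects a Unix timestamp in milliseconds.
        timestamp_ms = int(time.time() * 1000)
    payload = {
        "run_id": run_id,  # assumption: older MLflow versions call this run_uuid
        "key": key,
        "value": value,
        "timestamp": timestamp_ms,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/2.0/mlflow/runs/log-metric",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder server address and run ID; substitute your own.
    req = build_log_metric_request("http://localhost:5000", "<run-id>", "rmse", 0.78)
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To send it to a running tracking server:
    # urllib.request.urlopen(req)
```

Because the request is plain JSON over HTTP, the same call can be made from any language with an HTTP client. In Python, the bundled API is usually the more convenient choice.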

 