GitHub Python Data Science Spotlight: High Level Machine Learning & NLP, Ensembles, Command Line Viz & Docker Made Easy

This post spotlights 5 open source data science projects hosted on GitHub, ranging from high-level machine learning libraries to low-level support tools.

This post will spotlight a select group of open source Python data science projects with GitHub repos.

A previous post included some libraries covering AutoML, natural language processing, data visualization, and machine learning workflows. This time around we will look at another selection of data science projects and their GitHub repos, focusing on those which provide a helpful layer of abstraction on one end, and those which assist with lower-level supporting tasks on the other.

The list is clearly subjective, being composed of code I have come across and found interesting or useful for one reason or another. For each entry I have included links to the respective repos, documentation, a getting started guide or similar, and a descriptive excerpt from documentation.

Sit back and enjoy these projects, which you may or may not be familiar with. Hopefully you will find something you can use in your own work.


1. fastai

Getting started:

The library sits on top of PyTorch v1 (released today in preview), and provides a single consistent API to the most important deep learning applications and data types. fast.ai's recent research breakthroughs are embedded in the software, resulting in significantly improved accuracy and speed over other deep learning libraries, whilst requiring dramatically less code. You can download it today from conda, pip, or GitHub or use it on Google Cloud Platform. AWS support is coming soon.


2. textacy

Getting started:

textacy is a Python library for performing a variety of natural language processing (NLP) tasks, built on the high-performance spaCy library. With the fundamentals (tokenization, part-of-speech tagging, dependency parsing, etc.) delegated to another library, textacy focuses on the tasks that come before and follow after.
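
To make the "before and after" division of labor concrete, here is a rough standard-library sketch. It does not use textacy's or spaCy's API: the function names preprocess, tokenize, and top_bigrams are hypothetical, and the regex tokenizer is only a stand-in for the step a real pipeline would delegate to spaCy.

```python
import re
from collections import Counter


def preprocess(text):
    """'Before' step: collapse runs of whitespace and lowercase the raw text."""
    return re.sub(r"\s+", " ", text).strip().lower()


def tokenize(text):
    """Stand-in for the delegated step; a real pipeline would call spaCy here."""
    return re.findall(r"[a-z0-9']+", text)


def top_bigrams(tokens, n=3):
    """'After' step: count adjacent token pairs and return the n most common."""
    return Counter(zip(tokens, tokens[1:])).most_common(n)


if __name__ == "__main__":
    raw = "Data  science is fun.\nData science is hard."
    tokens = tokenize(preprocess(raw))
    for bigram, count in top_bigrams(tokens, n=2):
        print(" ".join(bigram), count)
```

The point of the sketch is only the shape of the pipeline: cleaning happens before tokenization, and analysis such as n-gram counting happens on the tokens afterwards.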


3. pycobra

Getting started:

pycobra is a Python library for ensemble learning. It serves as a toolkit for regression and classification using these ensembled machines, and also for visualisation of the performance of the new machine and constituent machines. Here, when we say machine, we mean any predictor or machine learning object - it could be a LASSO regressor, or even a Neural Network. It is scikit-learn compatible and fits into the existing scikit-learn ecosystem.
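
As a minimal illustration of what "ensembling machines" means, the sketch below combines two toy predictors behind a scikit-learn-style fit/predict interface. It uses only the standard library and plain averaging, which is a stand-in chosen for brevity and not necessarily the aggregation rule pycobra applies. All class names here are hypothetical.

```python
from statistics import mean


class ConstantMachine:
    """Toy predictor: always predicts the mean of the training targets."""

    def fit(self, X, y):
        self.value_ = mean(y)
        return self

    def predict(self, X):
        return [self.value_ for _ in X]


class NearestNeighbourMachine:
    """Toy predictor: returns the target of the closest 1-D training point."""

    def fit(self, X, y):
        self.pairs_ = list(zip(X, y))
        return self

    def predict(self, X):
        return [min(self.pairs_, key=lambda p: abs(p[0] - x))[1] for x in X]


class AveragingEnsemble:
    """Combine several machines by averaging their predictions (assumed rule)."""

    def __init__(self, machines):
        self.machines = machines

    def fit(self, X, y):
        for machine in self.machines:
            machine.fit(X, y)
        return self

    def predict(self, X):
        per_machine = [machine.predict(X) for machine in self.machines]
        # zip(*...) regroups the predictions so each tuple holds one sample's votes.
        return [mean(votes) for votes in zip(*per_machine)]


if __name__ == "__main__":
    X, y = [0.0, 1.0, 2.0], [0.0, 2.0, 4.0]
    ensemble = AveragingEnsemble([ConstantMachine(), NearestNeighbourMachine()])
    print(ensemble.fit(X, y).predict([0.1, 1.9]))
```

Because every machine exposes the same fit and predict methods, the ensemble itself can be treated as just another machine, which is the property that lets such a toolkit slot into a scikit-learn-style workflow.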


4. Termgraph

Repository, Documentation & Getting started:

A Python command-line tool which draws basic graphs in the terminal.

Graph types supported:

  • Bar Graphs
  • Color charts
  • Multi-variable
  • Stacked charts
  • Horizontal or Vertical
  • Emoji!

Most results can be copied and pasted wherever you like, since they use standard block characters. However, the color charts will not show, since they use terminal escape codes for color.
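
The idea of drawing a bar graph from standard block characters can be sketched in a few lines of plain Python. This is not Termgraph's implementation or interface. The function render_bars and its parameters are hypothetical, and the sketch assumes non-negative values.

```python
def render_bars(data, width=40, tick="▇"):
    """Return one text line per (label, value) pair, scaled to `width` columns."""
    if not data:
        return []
    max_value = max(value for _, value in data)
    label_width = max(len(label) for label, _ in data)
    lines = []
    for label, value in data:
        # Scale each bar relative to the largest value; all-zero input gives empty bars.
        length = round(value / max_value * width) if max_value > 0 else 0
        lines.append(f"{label:<{label_width}}: {tick * length} {value:.2f}")
    return lines


if __name__ == "__main__":
    sample = [("2016", 10.0), ("2017", 20.0), ("2018", 40.0)]
    print("\n".join(render_bars(sample, width=20)))
```

Since the output is ordinary text with no escape codes, it survives copy and paste, which matches the behavior described above for the non-color charts.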


5. repo2docker

Getting started:

jupyter-repo2docker is a tool to build, run, and push Docker images from source code repositories that run via a Jupyter server.

repo2docker fetches a repository (e.g., from GitHub or other locations) and builds a container image based on the configuration files found in the repository. It can be used to explore a repository locally by building and executing the constructed image of the repository, or as a means of building images that are pushed to a Docker registry.
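
The notion of building "based on the configuration files found in the repository" can be pictured as a scan for well-known file names. The sketch below is a simplified illustration under stated assumptions, not repo2docker's code. The helper find_config_files is hypothetical, the file list is a small subset chosen for illustration, and only the repository root is inspected.

```python
import tempfile
from pathlib import Path

# Illustrative subset of file names that repo2docker treats as configuration;
# this is not the tool's complete or authoritative list.
KNOWN_CONFIG_FILES = {
    "Dockerfile": "custom Docker build",
    "environment.yml": "conda environment",
    "requirements.txt": "pip requirements",
    "apt.txt": "Debian packages",
    "postBuild": "post-build script",
    "runtime.txt": "runtime version pin",
}


def find_config_files(repo_dir):
    """Return {file name: description} for known config files in the repo root."""
    root = Path(repo_dir)
    return {
        name: description
        for name, description in KNOWN_CONFIG_FILES.items()
        if (root / name).is_file()
    }


if __name__ == "__main__":
    # Build a throwaway "repository" so the example runs anywhere.
    with tempfile.TemporaryDirectory() as repo:
        Path(repo, "requirements.txt").write_text("numpy\n")
        Path(repo, "README.md").write_text("demo\n")
        for name, description in find_config_files(repo).items():
            print(f"{name}: {description}")
```

A repository with only a requirements.txt would, in this simplified picture, be treated as a pip-based Python project, while the README is ignored because it is not a recognized configuration file.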

