Top 10 Data Science Resources on GitHub

The top 10 data science projects on GitHub are mostly tutorials and educational resources for learning and doing data science. Have a look at the resources others are using and learning from.



In our latest inspection of GitHub repositories, we focus on "data science" projects. Unlike the other searches we have performed over the past several months, nearly all of the repositories that show up (listed by number of stars* in descending order) are resources for learning data science, as opposed to tools for doing it. As such, this is much less a software listing than a collection of tutorials and educational resources. There are, however, a few software surprises in here as well, such as a data science-oriented IDE and a great notebook-related project.
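For readers who want to repeat this kind of search, here is a minimal Python sketch. The article does not say how the search was actually run, so this assumes GitHub's public repository search endpoint with its `sort` and `order` parameters. The helper names `build_search_url` and `rank_by_stars` are hypothetical, and the sample star counts are taken from three entries in the list.

```python
from urllib.parse import urlencode


def build_search_url(query="data science", per_page=10):
    """Build a repository-search URL sorted by stars, highest first.

    Assumption: the GitHub REST search endpoint; the article does not
    state which interface was used for the original search.
    """
    params = {"q": query, "sort": "stars", "order": "desc", "per_page": per_page}
    return "https://api.github.com/search/repositories?" + urlencode(params)


def rank_by_stars(repos):
    """Return (name, stars) pairs ordered by star count, descending."""
    return sorted(repos, key=lambda repo: repo[1], reverse=True)


# Star counts for three entries, copied from the list in this article.
sample = [
    ("Rodeo", 2540),
    ("Data Science iPython Notebooks", 5169),
    ("Awesome Data Science", 2142),
]

print(build_search_url())
for name, stars in rank_by_stars(sample):
    print(f"{stars:>5}  {name}")
```

The sketch only builds the URL and sorts local data. Actually fetching results would need an HTTP client, and the counts returned today will differ from the snapshot used here.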

As on our previous GitHub Top 10 lists, we include the standard note: according to a recent KDnuggets survey, 73% of data scientists used open source tools in the 12 months prior to the survey. While the following repositories focus mainly on learning resources, previous lists have been software-heavy. Open source learning materials are also the new black, and a main source of learning for data scientists these days.

Image: Research Hubs

1. Data Science iPython Notebooks

Stars: 5169, Forks: 902

Donne Martin has put together a great (and, apparently, wildly popular) resource for those looking for IPython notebook tutorials. The repo describes itself best:

Continually updated data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.

2. The Open Source Data Science Masters

Stars: 4338, Forks: 2624

This is the official repository holding the curriculum of the Data Science Masters, the brainchild of data scientist Clare Corthell, designed as an open source alternative to formal data science education. With that in mind, this repo is a collection of materials for pursuit of this alternative route to data science mastery.

The open-source curriculum for learning Data Science. Foundational in both theory and technologies, the OSDSM breaks down the core competencies necessary to making use of data.


3. Rodeo

Stars: 2540, Forks: 229

Rodeo is a data science IDE developed by yhat, currently at version 1.0. Rodeo's philosophy builds on IPython notebooks:

We originally built Rodeo because we like the Jupyter Notebook for presentations and tutorials, but thought it was a bit clunky for daily work. We wanted a one-stop IDE for Python with a good text editor, a simple plot window and a terminal with autocomplete.

4. Data Science Blogs

Stars: 2307, Forks: 259

This is a simple but extensive list of data science blogs, in alphabetical order. You'll find all the big blogs in here (including KDnuggets, of course), along with many smaller, off-the-beaten-path selections. The repo appears to be updated often, with the most recent updates made only hours prior to this writing.

5. Awesome Data Science

Stars: 2142, Forks: 529

This is another of the Awesome... "brand" of curated lists. Straight to the point:

An open source Data Science repository to learn and apply towards solving real world problems.

Like other Awesome lists around (what, exactly, makes these lists more "awesome" than others?), there are countless resources broken down into several categories.

6. Data Science Specialization

Stars: 1986, Forks: 20800

This is a collection of the resources for the Johns Hopkins Data Science Specialization on Coursera. A wildly popular series of courses with names like Roger Peng, Jeff Leek, and Brian Caffo attached to it, it has taught data science and R to thousands of learners. All of the resources used in the courses are collected here.

7. Data Science Specialization Community Site

Stars: 1153, Forks: 2307

This is a community-curated content companion site for the Johns Hopkins Data Science Specialization on Coursera.

A couple students have created quality content around the subjects we discuss, and many of these materials are so good we feel that they should be shared with all of our students. This site is meant to serve as a central directory for community created content.

If you have a resource which would be useful to others in the program, a pull request can be submitted in order to have it included in the curated knowledge pages list.

8. Spark Notebook

Stars: 1087, Forks: 258

Andy Petrella forked scala-notebook and refactored it for massive dataset analysis with Apache Spark, and this is the result. From the repo:

The tool allows performing reproducible analysis with Scala, Apache Spark and more.

This is achieved through an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner.

9. Learn Data Science

Stars: 993, Forks: 541

Nitin Borwankar has put together another compilation of resources for learning data science. It is a collection of IPython notebooks focusing on machine learning, specifically the topics of:

  • Linear Regression
  • Logistic Regression
  • Random Forests
  • K-Means Clustering

It appears to be a beginner's guide to fundamental concepts in machine learning, but a well-crafted one.

10. Data Science at the Command Line

Stars: 948, Forks: 260

This repository contains the virtual machine, data, scripts, and custom command-line tools used in the book Data Science at the Command Line.

Included is the Data Science Toolbox, a virtual environment for data science. Author Jeroen Janssens' brand of data science includes the interplay of Python, R, numerous packages, and command line utilities. If you have read the book, or if these few lines have captured your interest, give the repo a look.

* As viewed 6:00 PM EST, March 21, 2016.
