Top 10 Open Dataset Resources on Github
The top open dataset repositories on Github include a variety of data, freely available for use by researchers, practitioners, and students alike.
Over the past several months we have had a look at a number of top Github repository collections, such as:
- Top 10 Machine Learning Projects on Github
- Top 10 Deep Learning Projects on Github
- Top 10 Data Visualization Projects on Github
- Top 10 Data Science Resources on Github
- Top 10 IPython Notebook Tutorials for Data Science and Machine Learning
This post will be a bit different, in that we are looking at the top open dataset repositories that Github has to offer. The post was inspired by the Github Open Data Showcase, which is good but not very large. Ideally, I would like to make a list of the top open datasets on Github, period; however, this gets tricky, since a search for "open data," or any variant of that term, returns a great deal of noise on a site set up with the explicit goal of sharing open source projects and their data.
I decided to take the offerings in this showcase that were not explicitly noted as being out of date, add 3 strictly-dataset repos with the highest numbers of stars I could find from a simple search, rank them all by star count, and present them here. We have found at KDnuggets that datasets are one of the most sought-after pieces of the data science puzzle for many readers, and hopefully this fresh batch (at least, fresh from our perspective) is of use to some of our readers.
We are currently conducting our latest Annual KDnuggets Analytics Software Poll, so last year's percentages may change, but we know that open source tools have been used by 73% of data scientists in the past 12 months. While this number reflects software, and not data, it is easy to surmise that open data is a heavily relied-upon commodity in data science and related data-oriented disciplines, for research, practice, and production alike.
So here they are, the open dataset repos with the highest number of stars as of the time of writing.
Stars: 14137, Forks: 1573
Brought to us by Xiaming (Sammy) Chen, this seems to be the undisputed leader of the open dataset collections available on Github. This curated list is organized by such topics as biology, sports, museums, and natural language, and appears to include several hundred datasets. Most are free, but there is a disclaimer at the top of the list that some are not. Xiaming also points out 2 other awesome-branded repo lists that contain more datasets; however, since those lists contain all sorts of other big data/machine learning/data science links, they will not be included in the list below, despite their high number of stars. Feel free to explore them on your own... obviously.
Stars: 529, Forks: 510
This is the official repo of OpenAddresses.io, the free and open global address collection. Why addresses?
Street address data is essential infrastructure. Street names, house numbers and zip codes, when combined with geographic coordinates, are the hub that connects digital to physical places. Precisely because of their connecting role, free and open addresses are rocket fuel for civic and commercial innovation.
Stars: 417, Forks: 187
This repo is summed up by its description:
Members of the United States Congress, 1789-Present, in YAML, as well as committees, presidents, and vice presidents.
Stars: 300, Forks: 88
This is a catalog of all discovered planets outside of our solar system. The database is generally updated within 24 hours of new discoveries, which means this is about as up-to-date as one could imagine; that the repo was last updated 20 days ago is encouraging in this respect. The README also points to a separate repo, should you be interested in a simple CSV of the data.
Stars: 274, Forks: 92
CitySDK is described as a "[u]ser-friendly [J]avascript SDK for US Census Bureau data," which also includes a number of samples detailing integration of the data with other open datasets. It refers to itself as a "toolbox" for civic hackers, and boasts latitude/longitude and ZIP code translation, and a modular architecture which makes integration with other data services straightforward. Use the API to create your own, custom dataset.
Stars: 236, Forks: 53
openFDA is a project by the FDA which aims to bring a collection of FDA public datasets to researchers and developers via APIs, raw data, usage examples, and documentation. The data is noted as not being suited for clinical use, and no particular validity should be assumed for any results it contains. Even with these disclaimers, there is no doubt that the data here would be great practice for those interested in the domain.
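For readers who want to try the API route mentioned above, here is a minimal sketch of how a query URL might be put together in Python. It assumes the public openFDA pattern of a base URL followed by an endpoint path and `search`/`limit` parameters; the helper `build_openfda_url` is a hypothetical name of my own, and the search expression is only illustrative, so check the openFDA documentation before relying on any of it. Nothing is fetched here; the code only builds the URL.

```python
from urllib.parse import urlencode

# Assumption: base URL and ".json" endpoint suffix follow openFDA's public API pattern.
OPENFDA_BASE = "https://api.fda.gov"


def build_openfda_url(endpoint, search=None, limit=5):
    """Build a query URL for an openFDA endpoint such as 'drug/event'.

    Hypothetical helper for illustration; not part of the openFDA repo.
    """
    params = {"limit": limit}
    if search:
        # urlencode percent-escapes characters such as ':' and '"'.
        params["search"] = search
    return f"{OPENFDA_BASE}/{endpoint}.json?{urlencode(params)}"


# Illustrative search expression; the field name is an assumption.
url = build_openfda_url(
    "drug/event",
    search='patient.drug.medicinalproduct:"aspirin"',
    limit=3,
)
print(url)
```

The resulting URL could then be passed to any HTTP client to retrieve JSON results.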
Stars: 100, Forks: 44
In case the name "Chicago Food Inspections Evaluation" didn't give it away, here's what to expect from this repo:
This repository contains the code to generate predictions of critical violations at food establishments in Chicago. It also contains the results of an evaluation of the effectiveness of those predictions.
8. GSA Data
Stars: 92, Forks: 40
This contains various data published by the General Services Administration, which handles the basic functioning of federal agencies (offices, supplies, and the like). Specifically, it contains a collection of over 5000 .gov domains and their data.
Stars: 82, Forks: 21
From the repo's README:
Historic and current US Congressional districts as GeoJSON, versioned within Git
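Since GeoJSON is plain JSON, district files like these can be inspected with nothing more than the Python standard library. The sketch below uses a tiny hand-written sample rather than data from the repo; the `district` property name and the `XX-01` value are made up for illustration, and the actual files may use different property names.

```python
import json

# Hypothetical, hand-written sample; NOT data from the repo.
SAMPLE_GEOJSON = """
{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "properties": {"district": "XX-01"},
      "geometry": {
        "type": "Polygon",
        "coordinates": [[[0, 0], [1, 0], [1, 1], [0, 1], [0, 0]]]
      }
    }
  ]
}
"""


def summarize(geojson_text):
    """Return (district, geometry type, exterior ring point count) per feature."""
    collection = json.loads(geojson_text)
    summary = []
    for feature in collection["features"]:
        geometry = feature["geometry"]
        # For a Polygon, coordinates[0] is the exterior ring.
        ring = geometry["coordinates"][0]
        summary.append(
            (feature["properties"].get("district"), geometry["type"], len(ring))
        )
    return summary


print(summarize(SAMPLE_GEOJSON))
```

Note that this sketch only handles `Polygon` geometries; real district boundaries may well be `MultiPolygon`, which nests the coordinates one level deeper.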
Stars: 79, Forks: 34
This is the source code for the CERN Open Data Portal, described as "the access point to a growing range of data produced through the research performed at CERN."