Top Datasets on Reddit
The most popular dataset posts on Reddit include NFL game metadata, Reddit's top 2.5 million posts, Zillow housing prices, and, of course, a database of cat pictures.
By Gregory Piatetsky, Dec 28, 2013.
The top datasets for December 2013 include:
NFL Game Metadata Since 1980 (CSV file). Reddit user mapItOut explains how to link the metadata with the game results:
- Download the schedule and results as a CSV from Pro Football Reference for each season that you want (example: www.pro-football-reference.com/years/2007/games.htm). Add a year variable to each file.
- Stack up all the CSV files into a single CSV.
- Using the date variable and the year variable that you added, construct an ID variable that looks like the one in the metadata file: yyyymmdd0[home team abbreviation]. You'll probably need to look through the metadata to get all the team abbreviations, but they look pretty self-explanatory ("den" for Denver, "dal" for Dallas, etc.).
- Merge the results data onto the metadata by that ID.
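The steps above can be sketched in Python with pandas. This is only a minimal sketch under stated assumptions: the column names (`date`, `home_team`, `id`), the `TEAM_ABBR` lookup, and the toy rows are hypothetical and are not taken from the actual files, and the rule that January and February games fall in the calendar year after the season year is an assumption about how the year variable should be applied.

```python
import pandas as pd

# Hypothetical lookup; fill in the remaining teams from the metadata file.
TEAM_ABBR = {"Denver Broncos": "den", "Dallas Cowboys": "dal"}


def stack_seasons(frames_by_season):
    """Steps 1-2: add a year variable to each season, then stack them."""
    tagged = [df.assign(season=season) for season, df in frames_by_season.items()]
    return pd.concat(tagged, ignore_index=True)


def add_game_id(results):
    """Step 3: build yyyymmdd0[home team abbreviation] for every game."""
    # Dates are assumed to look like "September 9", with no year attached.
    month = pd.to_datetime(results["date"], format="%B %d").dt.month
    # Assumption: January/February games belong to the next calendar year.
    calendar_year = results["season"] + (month <= 2).astype(int)
    full_date = pd.to_datetime(
        results["date"] + " " + calendar_year.astype(str), format="%B %d %Y"
    )
    abbr = results["home_team"].map(TEAM_ABBR)
    return results.assign(game_id=full_date.dt.strftime("%Y%m%d") + "0" + abbr)


def merge_results(metadata, results):
    """Step 4: merge the results onto the metadata by the shared ID."""
    # "id" is an assumed name for the metadata file's ID column.
    return metadata.merge(results, left_on="id", right_on="game_id", how="left")


# Toy rows for illustration only; these are not real results or metadata.
season_2007 = pd.DataFrame(
    {
        "date": ["September 9", "January 13"],
        "home_team": ["Dallas Cowboys", "Dallas Cowboys"],
        "home_points": [21, 10],
    }
)
metadata = pd.DataFrame(
    {"id": ["200709090dal", "200801130dal"], "note": ["toy A", "toy B"]}
)

results = add_game_id(stack_seasons({2007: season_2007}))
merged = merge_results(metadata, results)
print(merged[["id", "note", "home_points"]])
```

A left merge keeps every metadata row even when no result matches, which makes it easy to spot IDs that were built incorrectly: they show up as rows with missing result columns.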
Top 2.5 Million posts. This is a dataset of the all-time top 1,000 posts from each of the top 2,500 subreddits by subscribers, pulled from Reddit between August 15 and 20, 2013.
The 911Dataset Project: 3TB across 254,822 files.
Average wait times for emergency rooms across the country, from ProPublica/CMMS.
The top reddit dataset posts for 2013 include:
Our DaaS platform Quandl is a free and open index of more than 4 million datasets and is growing daily. We also released a Python package this week to go with our R, MATLAB, and Excel ones. They allow easy API access to every single dataset. www.quandl.com/
Yelp Dataset Challenge: a generous sample of Yelp's data from the greater Phoenix, AZ metropolitan area, including 11,537 businesses, 8,282 check-in sets, 43,873 users, and 229,907 reviews.
Database of Cat Pictures (really, and not /r/pics).
The CAT dataset includes 10,000 cat images. Each image is annotated with nine points on the cat's head: two for the eyes, one for the mouth, and six for the ears.
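As an illustration of that nine-point annotation, here is a minimal Python sketch of how one annotation might be represented and parsed. The text layout (a point count followed by x y pairs) and the point order (eyes first, then mouth, then ears) are assumptions made for this example, and the names `CatHeadAnnotation` and `parse_annotation` are hypothetical; check the dataset's own documentation for the actual format.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[int, int]


@dataclass
class CatHeadAnnotation:
    eyes: List[Point]  # two points
    mouth: Point       # one point
    ears: List[Point]  # six points


def parse_annotation(line: str) -> CatHeadAnnotation:
    """Parse an assumed '9 x1 y1 ... x9 y9' annotation line."""
    values = [int(v) for v in line.split()]
    count, coords = values[0], values[1:]
    if count != 9 or len(coords) != 18:
        raise ValueError("expected nine (x, y) points")
    # Pair up alternating x and y values into (x, y) tuples.
    points = list(zip(coords[0::2], coords[1::2]))
    # Assumed order: two eyes, one mouth, six ear points.
    return CatHeadAnnotation(eyes=points[0:2], mouth=points[2], ears=points[3:9])


# Made-up coordinates, for illustration only.
example = parse_annotation("9 10 20 30 20 20 35 5 15 2 2 12 8 28 8 38 2 35 15")
print(example.mouth)
```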
Zillow Housing Data, including house price, rental rate and sales data for 37,000 locations.
The "gilded" posts include:
- Idilia: a dataset of one million Wikipedia articles with their first 5,000 words sense-annotated
Happy data mining, and also check out KDnuggets.