KDnuggets Home » News :: 2013 :: Dec :: News, Software :: Top Datasets on Reddit ( 14:n01 )

Top Datasets on Reddit

          

Most popular dataset posts on Reddit include NFL Game Metadata, Reddit top 2.5 Million posts, Zillow housing prices, and, of course, a database of cat pictures.

By Gregory Piatetsky, Dec 28, 2013.

Thanks to +RichGillin for a pointer to the Datasets page on Reddit: www.reddit.com/r/datasets/

The top datasets for December 2013 include:

NFL Game Metadata Since 1980 (CSV file). Reddit user mapItOut explains how to link the metadata with the results:

  • Download the schedule and results as a CSV from pro football reference for each season that you want (example: www.pro-football-reference.com/years/2007/games.htm). Add a year variable to each file.
  • Stack up all the CSV files into a single CSV.
  • Using the date variable and the year variable that you added, construct an ID variable that looks like one in the metadata file: yyyymmdd0[home team abbreviation]. You'll probably need to look through the metadata to get all the team abbreviations, but they look pretty self-explanatory ("den" for Denver, "dal" for Dallas, etc.).
  • Merge the results data onto the metadata by that ID.
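
The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not mapItOut's actual code: the column names (`game_id`, `year`, `month`, `day`, `home`, `result`, `meta_field`) and the toy rows are hypothetical, and the real team abbreviations have to be read from the metadata file itself.

```python
def build_game_id(year: int, month: int, day: int, home_abbr: str) -> str:
    """Build an ID shaped like yyyymmdd0[home team abbreviation]."""
    return f"{year:04d}{month:02d}{day:02d}0{home_abbr.lower()}"


def merge_by_id(metadata_rows, result_rows):
    """Attach each result row to the metadata row with the same game_id.

    Metadata rows without a matching result are kept unchanged.
    """
    results_by_id = {row["game_id"]: row for row in result_rows}
    return [
        {**meta, **results_by_id.get(meta["game_id"], {})}
        for meta in metadata_rows
    ]


# Toy rows with made-up values; the real files have different columns.
results = [
    {"year": 2007, "month": 9, "day": 9, "home": "dal", "result": "toy value"},
]
for row in results:
    # Step 3: construct the ID from the date parts and the home team.
    row["game_id"] = build_game_id(row["year"], row["month"], row["day"], row["home"])

metadata = [{"game_id": "200709090dal", "meta_field": "toy value"}]

# Step 4: merge the results onto the metadata by that ID.
merged = merge_by_id(metadata, results)
print(merged[0]["game_id"])
```

One thing to check when adapting this: games played in January or February belong to the previous season's file, so the year variable added in the first step may need adjusting before the ID is built.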

Top 2.5 Million posts. This is a dataset of the all-time top 1,000 posts from each of the top 2,500 subreddits by subscribers, pulled from Reddit between August 15 and 20, 2013.

The 911Dataset Project: 3TB across 254,822 files.

Average wait times for emergency rooms across the country, from [ProPublica/CMMS].

The top reddit dataset posts for 2013 include:

You can haz datasets! We now have over 4M financial, economic, and social datasets available.

Our DaaS platform Quandl is a free and open index of currently over 4 million datasets that is growing daily. We also released a Python package this week to go with our R, MATLAB, and Excel ones. They allow easy API access to every single dataset. www.quandl.com/

173+ publicly available Social Network Datasets

Yelp Dataset Challenge

A generous sample of their data from the greater Phoenix, AZ metropolitan area including: 11,537 businesses - 8,282 checkin sets - 43,873 users - and 229,907 reviews.

Database of Cat Pictures (really, and not /r/pics).

The CAT dataset includes 10,000 cat images. Each image has annotations of the cat head with nine points: two for the eyes, one for the mouth, and six for the ears.

Zillow Housing Data, including house price, rental rate and sales data for 37,000 locations.

The "gilded" posts include:

Happy data mining, and also check KDnuggets

