- Toward Increased k-means Clustering Efficiency with the Naive Sharding Centroid Initialization Method - Mar 13, 2017.
What if a simple, deterministic approach which did not rely on randomization could be used for centroid initialization? Naive sharding is such a method, and its results, though preliminary, are promising for both time savings and efficiency.
- Web Scraping for Dataset Curation, Part 2: Tidying Craft Beer Data - Feb 14, 2017.
This is the second part in a 2-part series on curating data from the web. The first part focused on web scraping, while this post details the process of tidying the scraped data.
- Web Scraping for Dataset Curation, Part 1: Collecting Craft Beer Data - Feb 13, 2017.
This post is the first in a 2-part series on scraping and cleaning data from the web using Python. This first part is concerned with the scraping aspect, while the second part will focus on the cleaning. A concrete example is presented.
- NYC Taxi Hackathon – find privacy risks in public taxi datasets - Sep 19, 2016.
The NYC TLC has been a pioneer in sharing big data since 2010, but earlier data releases have been de-anonymized. TLC is considering releasing taxi data again, subject to a new anonymization method. This hackathon is to help test it.
- Apache Spark Key Terms, Explained - Jun 13, 2016.
An overview of 13 core Apache Spark concepts, presented with focus and clarity in mind. A great beginner's overview of essential Spark terminology.
- Apache Spark: RDD, DataFrame or Dataset? - Feb 3, 2016.
There are now 3 Apache Spark APIs. Here’s how to choose the right one.
- KDnuggets™ News 16:n02, Jan 20: Research Leaders on Key Advances, Top Trends; Top 10 Deep Learning Projects - Jan 20, 2016.
Research Leaders on Data Mining, Data Science and Big Data key advances, top trends; Top 10 Deep Learning Projects on GitHub; Top 100 Big Data Experts to Follow; Yahoo Releases the Largest-ever Machine Learning Dataset.
- Yahoo Releases the Largest-ever Machine Learning Dataset for Researchers - Jan 18, 2016.
Are you interested in massive amounts of data for research? Yahoo has just released the largest-ever machine learning dataset to the research community.
- KDnuggets™ News 15:n25, Aug 5: Largest Dataset Analyzed? Big Data & the Dog Question; Impact of IoT - Aug 5, 2015.
New Poll: Largest Dataset Analyzed/Data Mined?; Cartoon: Big Data and the dog question; Impact of IoT on Big Data Landscape; Data is Ugly - Tales of Data Cleaning.