- Scikit-Learn & More for Synthetic Dataset Generation for Machine Learning - Sep 19, 2019.
While mature algorithms and extensive open-source libraries are widely available for machine learning practitioners, sufficient data to apply these techniques remains a core challenge. Discover how to leverage scikit-learn and other tools to generate synthetic data appropriate for optimizing and fine-tuning your models.
- Amazing consistency: Largest Dataset Analyzed / Data Mined – Poll Results and Trends - Oct 29, 2018.
The poll results show amazing consistency with past years, with median answers still in the 10-100 gigabyte range. Really Big Data Scientists (100 Petabytes and more) continue to stand apart but remain a small segment, one in which Asian data scientists lead for the first time in this poll.
- Financial Entity Identification and Information Integration Challenge 2018 - Mar 13, 2018.
Take part in the Financial Entity Challenges. Sign up to participate, download the data, submit your solution, and come talk about your work at the ACM DSMM 2018 Workshop.
- Toward Increased k-means Clustering Efficiency with the Naive Sharding Centroid Initialization Method - Mar 13, 2017.
What if a simple, deterministic approach which did not rely on randomization could be used for centroid initialization? Naive sharding is such a method, and its time-saving and efficient results, though preliminary, are promising.
- Web Scraping for Dataset Curation, Part 2: Tidying Craft Beer Data - Feb 14, 2017.
This is the second part of a two-part series on curating data from the web. The first part focused on web scraping, while this post details the process of tidying the scraped data after the fact.
- Web Scraping for Dataset Curation, Part 1: Collecting Craft Beer Data - Feb 13, 2017.
This post is the first in a two-part series on scraping and cleaning data from the web using Python. This first part is concerned with the scraping aspect, while the second part will focus on the cleaning. A concrete example is presented.
- NYC Taxi Hackathon – find privacy risks in public taxi datasets - Sep 19, 2016.
The NYC TLC has been a pioneer in sharing big data since 2010, but earlier data releases have been de-anonymized. TLC is considering releasing taxi data again, subject to a new anonymization method. This hackathon is to help test it.
- Apache Spark Key Terms, Explained - Jun 13, 2016.
An overview of 13 core Apache Spark concepts, presented with focus and clarity in mind. A great beginner's overview of essential Spark terminology.
- Apache Spark: RDD, DataFrame or Dataset? - Feb 3, 2016.
There are now 3 Apache Spark APIs. Here’s how to choose the right one.
- KDnuggets™ News 16:n02, Jan 20: Research Leaders on Key Advances, Top Trends; Top 10 Deep Learning Projects - Jan 20, 2016.
Research Leaders on Data Mining, Data Science and Big Data key advances, top trends; Top 10 Deep Learning Projects on Github; Top 100 Big Data Experts to Follow; Yahoo Releases the Largest-ever Machine Learning Dataset.
- Yahoo Releases the Largest-ever Machine Learning Dataset for Researchers - Jan 18, 2016.
Are you interested in massive amounts of data for research? Yahoo has just released the largest-ever machine learning dataset to the research community.
- KDnuggets™ News 15:n25, Aug 5: Largest Dataset Analyzed? Big Data & the Dog Question; Impact of IoT - Aug 5, 2015.
New Poll: Largest Dataset Analyzed/Data Mined?; Cartoon: Big Data and the dog question; Impact of IoT on Big Data Landscape; Data is Ugly - Tales of Data Cleaning.