-
Book: Mining of Massive Datasets, 2nd Edition, free download
The second edition of this landmark book adds Jure Leskovec as a coauthor and has 3 new chapters, on mining large graphs, dimensionality reduction, and machine learning. You can still freely download a PDF version.
-
3 Ways to Test the Accuracy of Your Predictive Models
Three leading analytics experts, Karl Rexer, John Elder, and Dean Abbott, explain three different methods for testing the accuracy of predictive models: lift charts, randomization testing, and bootstrap sampling.
-
Split on Data Science Skills: Individual vs Team Approach
The results of the latest KDnuggets poll show an almost equal split between those who favor the individual approach and those who favor the team approach. See the counterintuitive regional differences and interesting comments.
-
PAN Competition: Plagiarism Detection, Author Identification, Author Profiling
Take part in one of 3 tasks. Plagiarism Detection: given a document, is it an original? Author Identification: given a document, who wrote it? Author Profiling: given a document, what are the author's age and gender?
-
Interpreting Model Performance with Cost Functions
Cost functions are critical for the correct assessment of performance of data mining and predictive models. This series goes deep into the statistical properties and mathematical understanding of each cost function and explores their similarities and differences.
-
MADlib: Big Data Machine Learning in SQL for Data Scientists
MADlib is open source under a commercially usable BSD license. It supports Postgres and the Pivotal Greenplum DBMS, and provides classification, regression, clustering, topic modeling, and other analytics for Big Data.
-
Unicorn Data Scientists vs Data Science Teams
A recent post has generated an intense discussion about whether to look for "unicorn" data scientists who combine all the needed skills, or whether that skillset is best filled by a team. Here are the highlights, including a proposal for how to train well-rounded data scientists.
-
Top stories for Dec 22-29: Data Mining Applications with R; “Data Scientist” catches up with “Statistician”
Data Mining Applications with R; "Data Scientist" catches up with "Statistician", surpasses "Data Miner"; What is Wrong with the Definition of Data Science.
-
Top Datasets on Reddit
Most popular dataset posts on Reddit include NFL Game Metadata, Reddit top 2.5 Million posts, Zillow housing prices, and, of course, a database of cat pictures.
-
What is Wrong with the Definition of Data Science
A veteran statistician argues that 3 different areas usually included in "Data Science" require dramatically different skills, education, and training, with very little overlap.