Data Science Primer: Basic Concepts for Beginners
This collection of concise introductory data science tutorials covers topics including the difference between data mining and statistics, supervised vs. unsupervised learning, and the types of patterns we can mine from data.
What exactly is data science?
Data science is a multifaceted discipline. It encompasses machine learning and other analytic processes, draws on statistics and related branches of mathematics, and increasingly borrows from high performance scientific computing, all in order to extract insight from data and use this new-found information to tell stories.
New to this multifaceted discipline? Not sure where to begin? This is a collection of short, not-too-technical overviews of particular topics of interest to data science newcomers, from basics like supervised vs. unsupervised learning to the importance of power law distributions and cognitive biases.
For data science beginners, 3 elementary issues are given overview treatment: supervised vs. unsupervised learning, decision tree pruning, and training vs. testing datasets.
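To make those three issues concrete, here is a rough sketch that puts them side by side on one toy dataset. It assumes scikit-learn is available; the dataset (iris), the `max_depth` limit standing in for pruning, and all parameter values are illustrative choices, not anything prescribed by this primer.

```python
# Illustrative sketch: supervised vs. unsupervised learning, a train/test
# split, and a depth limit as a crude stand-in for decision tree pruning.
# (scikit-learn assumed; dataset and parameters are arbitrary choices.)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised: the labels y guide training; a held-out test set measures
# how well the model generalizes beyond the data it was fit on.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# max_depth caps tree growth -- a simple pre-pruning strategy.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)
print("test accuracy:", tree.score(X_test, y_test))

# Unsupervised: no labels at all; k-means groups the rows on its own.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("cluster sizes:", sorted((clusters == k).sum() for k in range(3)))
```

Note that only the supervised model gets a train/test split: without labels, the clustering step has no "right answer" to score against in the same way.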
When I was first exposed to data mining and machine learning, I'll admit it: I thought it was magic. Making meaningful predictions with real accuracy? Sorcery! Curiosity, however, quickly leads you to discover that everything is above board, and that sound scientific and statistical methods are doing the work.
But this ends up leading to more questions in the short term. Machine learning. Data mining. Statistics. Data science. The concepts and terminology are overlapping and seemingly repetitive at times. While there are numerous attempts at clarifying much of this (permanently unsettled) uncertainty, this post will tackle the relationship between data mining and statistics.
Data mining functionality can be broken down into 4 main "problems," namely: classification and regression (together: predictive analysis); cluster analysis; frequent pattern mining; and outlier analysis. There are all sorts of other ways you could break down data mining functionality as well, I suppose, e.g. focusing on algorithms, starting with supervised versus unsupervised learning, etc. However, this is a reasonable and accepted approach to identifying what data mining is able to accomplish, and as such these problems are each covered below, with a focus on what can be solved with each "problem."
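Of the four problems above, frequent pattern mining is probably the least familiar to newcomers, so here is a minimal pure-Python sketch of its core idea: counting how often itemsets co-occur across transactions. The tiny transaction database and the support threshold are invented for illustration; real miners (Apriori, FP-growth) are far more efficient than this brute-force pair count.

```python
# Minimal sketch of frequent pattern mining: count the support of item
# pairs in a toy transaction database. (Data and threshold are made up;
# this brute-force count only illustrates the idea of "support".)
from itertools import combinations
from collections import Counter

transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
min_support = 3  # a pattern is "frequent" if it appears in >= 3 transactions

pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

frequent_pairs = {p: c for p, c in pair_counts.items() if c >= min_support}
print(frequent_pairs)  # e.g. ("beer", "diapers") co-occur in 3 baskets
```

Classification, regression, clustering, and outlier analysis can each be sketched just as briefly, but they all follow this same shape: a precisely defined "problem" that many competing algorithms solve.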
This post will provide an overview of bagging, boosting, and stacking, arguably the most used and well-known of the basic ensemble methods. They are not, however, the only options. Random Forests is another example of an ensemble learner, which uses numerous decision trees in a single predictive model, and which is often overlooked and treated as a "regular" algorithm. There are other approaches to selecting effective algorithms as well, treated below.
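As a rough sketch of how the three methods line up in practice, the snippet below fits a bagged tree ensemble, a boosted ensemble, and a stacked ensemble on the same synthetic data. It assumes scikit-learn is available; the generated dataset, estimator counts, and base learners are arbitrary illustrative choices.

```python
# Hedged sketch: bagging, boosting, and stacking side by side in
# scikit-learn. (Assumed library; dataset and parameters are arbitrary.)
from sklearn.datasets import make_classification
from sklearn.ensemble import (BaggingClassifier, AdaBoostClassifier,
                              StackingClassifier, RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    # Bagging: many trees, each fit on a bootstrap sample; votes averaged.
    "bagging": BaggingClassifier(n_estimators=25, random_state=0),
    # Boosting: trees fit sequentially, each focusing on prior mistakes.
    "boosting": AdaBoostClassifier(n_estimators=25, random_state=0),
    # Stacking: a meta-learner combines the base models' predictions.
    "stacking": StackingClassifier(
        estimators=[("tree", DecisionTreeClassifier(random_state=0)),
                    ("rf", RandomForestClassifier(n_estimators=25,
                                                  random_state=0))],
        final_estimator=LogisticRegression()),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, round(model.score(X_test, y_test), 3))
```

Note that `RandomForestClassifier` appears here only as a base learner inside the stack; as the paragraph above says, a Random Forest is itself an ensemble, often used as if it were a "regular" algorithm.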
Also known as scaling laws, power laws essentially imply that a small number of occurrences of some phenomenon are frequent, or very common, while a large number of occurrences of the same phenomenon are infrequent, or very rare; the exact relationship between these relative frequencies differs between power law distributions. Some of the wide array of naturally occurring and man-made phenomena which power laws are able to describe include income disparities, word frequencies in a given language, city sizes, website sizes, magnitudes of earthquakes, book sales rankings, and surname popularity.
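A quick way to build intuition for that skew is Zipf's law, the classic power law for word frequencies, where frequency is proportional to 1 / rank^alpha. The tiny pure-Python sketch below (with alpha = 1 and 1000 ranked items, both arbitrary choices for illustration) shows how heavily the top-ranked items dominate.

```python
# Illustrative sketch of a power law (Zipf's law): frequency proportional
# to 1 / rank**alpha. With alpha = 1, the handful of top-ranked items
# carry a disproportionate share of all occurrences.
# (alpha and the number of items are arbitrary illustrative choices.)
alpha = 1.0
ranks = range(1, 1001)
freqs = [rank ** -alpha for rank in ranks]
total = sum(freqs)

top10_share = sum(freqs[:10]) / total
print(f"top 10 of 1000 items carry {top10_share:.0%} of all occurrences")
```

Plotted on log-log axes, such a rank-frequency relationship appears as a straight line with slope -alpha, which is the usual visual signature of a power law.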
A few specific examples of how cognitive biases can (and do) interfere in the real world include:
- Voters and politicians who don't understand science, but think they do, doubt climate change because it still snows in the winter (Dunning–Kruger effect)
- Confirmation bias prevented many pollsters from believing data showing that Donald Trump could win the 2016 US Presidential election