Statistical Modelling vs Machine Learning
A Statistical Model is the use of statistics to build a representation of the data and then conduct analysis to infer any relationships between variables or discover insights.
Machine Learning is the use of mathematical and/or statistical models to obtain a general understanding of the data in order to make predictions.
To this day, many people in the industry use these two terms interchangeably. While some may think there is no harm in that, a true "Data Scientist" must understand the distinction between the two.
Statistical modelling is a method of mathematically approximating the world. Statistical models contain variables that can be used to explain relationships between other variables. We use hypothesis testing, confidence intervals, etc. to make inferences and validate our hypotheses.
The classic example is regression, in which we take one or more explanatory variables and estimate the effect of each on the dependent variable.
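As a minimal sketch of the regression example, the snippet below fits a one-variable ordinary least squares line in plain Python and reports the slope together with its standard error and t-statistic, which are the quantities a hypothesis test on the slope is built from. The function name `simple_ols` and the hours/score data are made up for illustration; they are not from any particular study or library.

```python
import math

def simple_ols(x, y):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual variance uses n - 2 degrees of freedom (two fitted parameters).
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    sigma2 = sum(r ** 2 for r in residuals) / (n - 2)
    se_slope = math.sqrt(sigma2 / sxx)
    return {
        "slope": slope,
        "intercept": intercept,
        "se_slope": se_slope,
        # t-statistic for the null hypothesis "slope equals zero"
        "t_slope": slope / se_slope,
    }

# Hypothetical data: hours studied (explanatory) vs. exam score (dependent).
hours = [1, 2, 3, 4, 5, 6, 7, 8]
score = [52, 55, 61, 64, 70, 71, 78, 80]
fit = simple_ols(hours, score)
print(fit)
```

A large t-statistic relative to the t distribution with n - 2 degrees of freedom is evidence that the explanatory variable has a real effect. That conclusion rests on the usual regression assumptions, which is why diagnostics are part of the modelling process.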
A statistical model relies on sampling, probability spaces, assumptions, diagnostics, etc. to make inferences.
We use statistical models to find insights given a particular set of data. We can conduct modelling on a relatively small set of data just to try and understand the underlying nature of the data.
Inherently all statistical models are wrong or not perfect. They are used to approximate reality. Sometimes the underlying assumptions of the model are far too strict and not representative of reality.
"Machine learning is the science of getting computers to act without being explicitly programmed." - Andrew Ng
Machine Learning is the method of teaching a computer to learn from data, much as humans learn from experience. Computers can take this learning further than a person can: when the data is too large for a person to comprehend or to spot patterns in, the computational and storage power of computers can outshine a human.
Simply put, we use machine learning to make predictions, and we assess its performance by how well it generalizes to new data that it has not seen yet.
We conduct cross-validation to check how well the model generalizes, making sure it does not overfit (memorize the training data) or underfit (fail to capture the pattern in the data).
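To make cross-validation concrete, here is a small sketch of k-fold cross-validation in plain Python. It compares two hypothetical models on made-up data with a linear trend: a mean-only predictor, which underfits, and a fitted straight line. The helper names (`fit_mean`, `fit_line`, `k_fold_mse`) and the way folds are assigned (index modulo k) are assumptions chosen for brevity, not a standard API.

```python
def fit_mean(xs, ys):
    """Underfitting baseline: always predict the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    """Least-squares straight line through the training points."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def k_fold_mse(xs, ys, fit, k=5):
    """Average held-out mean squared error over k folds."""
    n = len(xs)
    fold_errors = []
    for fold in range(k):
        # Point i belongs to the test fold when i % k == fold.
        test_idx = [i for i in range(n) if i % k == fold]
        train_idx = [i for i in range(n) if i % k != fold]
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        errors = [(ys[i] - model(xs[i])) ** 2 for i in test_idx]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / k

# Made-up data: a linear trend plus a small fixed "noise" pattern.
noise = [0.5, -0.4, 0.1]
xs = list(range(20))
ys = [2 * x + 1 + noise[i % 3] for i, x in enumerate(xs)]

cv_mean = k_fold_mse(xs, ys, fit_mean)
cv_line = k_fold_mse(xs, ys, fit_line)
print("mean-only CV MSE:", cv_mean)
print("line CV MSE:", cv_line)
```

Each model is scored only on points it did not see during fitting, so a model that merely memorized its training data would not be rewarded. Here the mean-only model scores far worse than the line, which is the signature of underfitting.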
The data is cleaned and organized in a form that the machine can understand. Little or no statistics goes into this process.
Machine Learning predictions are assessed differently depending on the type of problem, for example classification, regression or clustering, and on whether the learning is supervised or unsupervised. We can use error measures such as MSE and RMSE for regression, and counts such as true positives and false positives for classification problems.
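The error measures mentioned above can be sketched in a few lines of plain Python. The function names and the tiny example arrays are illustrative assumptions. Precision and recall are included as two common summaries derived from the true/false positive counts.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error for regression predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: same units as the target variable."""
    return math.sqrt(mse(y_true, y_pred))

def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for binary labels."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn}

# Hypothetical regression targets and predictions.
reg_true = [3.0, 5.0, 2.0, 7.0]
reg_pred = [2.0, 5.0, 4.0, 7.0]
print("MSE:", mse(reg_true, reg_pred))    # 1.25
print("RMSE:", rmse(reg_true, reg_pred))

# Hypothetical binary labels and predictions.
cls_true = [1, 0, 1, 1, 0, 0, 1, 0]
cls_pred = [1, 0, 0, 1, 1, 0, 1, 0]
counts = confusion_counts(cls_true, cls_pred)
precision = counts["tp"] / (counts["tp"] + counts["fp"])
recall = counts["tp"] / (counts["tp"] + counts["fn"])
print(counts, precision, recall)
```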
One cannot be done without the other. A true "Data Scientist" needs to have both in their arsenal. The foundations of machine learning come from statistical theory and statistical learning. At times it may seem that Machine Learning can be done these days without a sound statistical background, but people who work that way miss important nuances. Code written to make the work easier does not negate the need for an in-depth understanding of the problem.
There are plenty of instances in which statistical modelling can solve the problem at hand without machine learning coming into the picture.