6 Predictive Models Every Beginner Data Scientist Should Master
Data Science models come in different flavors and techniques. Luckily, most advanced models are based on a couple of fundamentals. Which models should you learn when you want to begin a career as a Data Scientist? This post brings you 6 models that are widely used in the industry, either in standalone form or as a building block for other advanced techniques.
By Ivo Bernardo, Data Scientist
Photo by @barnimages — unsplash.com
As you fall into the hype vortex of Machine Learning and Artificial Intelligence, it seems that only advanced techniques will solve all your problems when you want to build a predictive model. But, as you get your hands dirty in the code, you find out that the truth is very, very different. A lot of the problems you will face as a data scientist are solved with a combination of several models and most of them have been around for ages.
And, even if you solve problems using more advanced models, learning the fundamentals will give you a head start in most discussions. In particular, learning the benefits and shortcomings of simpler models will help you steer a data science project toward success. The truth is that advanced models do one of two things: amplify or amend some of the flaws of the simpler models they are based on.
That being said, let’s jump into the DS world and get to know 6 models that you should learn and master when you want to be a Data Scientist.
Linear Regression
One of the oldest models around (for example, Francis Galton used the term “Regression” in the 19th century) and still one of the most effective at representing linear relationships in data.
Studying linear regression is a staple of econometrics classes all around the world. Learning this linear model will give you a good intuition for solving regression problems (one of the most common problems to solve with ML) and help you understand how you can build a simple line to predict phenomena using math.
There are also other benefits to learning Linear Regression, particularly when you learn the two methods available to find the best-fitting weights:
- Closed-form solution, an almost magical formula that gives you the weights of the variables with a simple algebraic equation.
- Gradient Descent, an optimization method that progresses toward the optimal weights and that is also used to optimize other types of algorithms.
Additionally, the fact that we can visualize Linear Regression in practice using a simple 2-D plot makes this model a really good starting point for understanding algorithms.
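To make the two methods concrete, here is a minimal sketch in plain Python. It assumes a single feature and a tiny made-up dataset; the function names, learning rate and number of steps are illustrative choices of mine, not something prescribed by any library. Both approaches should land on roughly the same line.

```python
# Toy data generated from y = 2x + 1 (illustrative values, not from a real dataset).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]


def fit_closed_form(xs, ys):
    """Ordinary least squares for one feature: slope = cov(x, y) / var(x)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def fit_gradient_descent(xs, ys, learning_rate=0.05, n_steps=5000):
    """Minimize the mean squared error by repeatedly stepping against its gradient."""
    n = len(xs)
    slope, intercept = 0.0, 0.0
    for _ in range(n_steps):
        errors = [slope * x + intercept - y for x, y in zip(xs, ys)]
        grad_slope = 2 / n * sum(e * x for e, x in zip(errors, xs))
        grad_intercept = 2 / n * sum(errors)
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return slope, intercept


closed = fit_closed_form(xs, ys)
descent = fit_gradient_descent(xs, ys)
print("closed form:     ", closed)
print("gradient descent:", descent)
```

On this toy data both fits recover a slope close to 2 and an intercept close to 1. With more features the closed form turns into a matrix equation, while the gradient descent loop keeps the same shape.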
Some resources to learn about it:
- DataCamp’s Linear Regression explanation
- Sklearn’s Regression Implementation
- R For Data Science Udemy Course Linear Regression Section
Logistic Regression
Although named Regression, Logistic Regression is the best model with which to start mastering Classification Problems.
There are several benefits to learning Logistic Regression, namely:
- Getting a first look at classification and multi-class classification problems (a huge part of ML tasks).
- Understanding function transformations such as the one performed by the Sigmoid Function.
- Understanding how Gradient Descent can be used with other functions, and how it is agnostic to the function being optimized.
- Getting a first look at the Log-Loss function.
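The pieces listed above fit together in a few lines. The following is a rough sketch in plain Python, assuming one feature and a small invented dataset (`hours_studied` and `passed` are hypothetical names and values); it applies the sigmoid to a linear score, measures the fit with log-loss, and reuses gradient descent to minimize it.

```python
import math


def sigmoid(z):
    """Squash any real-valued score into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(labels, probs):
    """Average negative log-likelihood of the true labels under the predicted probabilities."""
    total = sum(y * math.log(p) + (1 - y) * math.log(1 - p) for y, p in zip(labels, probs))
    return -total / len(labels)


def fit_logistic(xs, labels, learning_rate=0.1, n_steps=5000):
    """Gradient descent on the log-loss of a one-feature logistic model."""
    n = len(xs)
    weight, bias = 0.0, 0.0
    for _ in range(n_steps):
        probs = [sigmoid(weight * x + bias) for x in xs]
        grad_weight = sum((p - y) * x for p, y, x in zip(probs, labels, xs)) / n
        grad_bias = sum(p - y for p, y in zip(probs, labels)) / n
        weight -= learning_rate * grad_weight
        bias -= learning_rate * grad_bias
    return weight, bias


# Hypothetical data: hours studied and whether the exam was passed (1) or not (0).
# The classes overlap on purpose, so no threshold separates them perfectly.
hours_studied = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
passed = [0, 0, 0, 1, 0, 1, 1, 1]

initial_loss = log_loss(passed, [sigmoid(0.0)] * len(passed))
weight, bias = fit_logistic(hours_studied, passed)
final_loss = log_loss(passed, [sigmoid(weight * x + bias) for x in hours_studied])
print("log-loss before:", initial_loss, "after:", final_loss)
```

Notice that the training loop has the same shape as the one you would write for Linear Regression: only the prediction function and the loss changed, which is what being agnostic to the function means in practice.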
What should you expect to know after studying Logistic Regression? You will be able to understand the mechanism behind Classification Problems and how you can use Machine Learning to separate classes. Some problems that fall into this category:
- Understanding if a transaction is fraudulent or not.
- Understanding if a customer will churn or not.
- Classifying loans according to their probability of default.
Just like Linear Regression, Logistic Regression is also a linear algorithm. After studying both of them, you will get to know the main limitations of linear algorithms and how they fail to represent many real-world complexities.
Some resources to learn about it:
- DataCamp’s Logistic Regression in R explanation
- Sklearn’s Logistic Regression Implementation
- R For Data Science Udemy Course — Classification Problems Section
Decision Trees
The first non-linear algorithm to study should be the Decision Tree. A fairly simple and explainable algorithm based on if-else rules, the Decision Tree will give you a good grasp of non-linear algorithms and their advantages and disadvantages.
Decision Trees are the building block of all tree-based models. By learning them you will also be prepared to study other techniques such as XGBoost or LightGBM (more about them below).
The cool part is that Decision Trees apply to both Regression and Classification problems, with minimal differences between the two. The rationale for choosing the variables that best explain an outcome is roughly the same; you just switch the criterion used to do it, in this case the error measure.
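To illustrate how little changes between the two cases, here is a small sketch in plain Python of the search for a single split. It assumes one feature and made-up data (`ages`, `churned` and `spend` are hypothetical), and uses Gini impurity for classification and variance for regression, which are common choices but not the only ones. The search function is identical; only the impurity function passed in differs.

```python
def gini(labels):
    """Gini impurity: 0.0 when every label in the group is the same."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def variance(values):
    """Mean squared deviation from the group mean (the regression error measure)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def best_split(xs, ys, impurity):
    """Return (threshold, weighted impurity) of the best `x <= threshold` rule."""
    n = len(xs)
    best = None
    # Skip the largest value: splitting there would leave the right side empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        score = (len(left) * impurity(left) + len(right) * impurity(right)) / n
        if best is None or score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical customers: age, whether they churned, and their monthly spend.
ages = [22, 25, 31, 38, 45, 52]
churned = ["yes", "yes", "yes", "no", "no", "no"]
spend = [10.0, 12.0, 11.0, 30.0, 32.0, 31.0]

class_split = best_split(ages, churned, gini)    # classification criterion
reg_split = best_split(ages, spend, variance)    # regression criterion
print("classification split:", class_split)
print("regression split:    ", reg_split)
```

A full tree simply repeats this search on each side of the split until a stopping rule is hit, and those stopping rules are where hyper-parameters come in.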
Although hyper-parameters already exist in regression (such as the regularization parameter), in Decision Trees they are of extreme importance, drawing the line between a good model and one that is absolute garbage. Hyper-parameters will be essential on your journey in ML, and Decision Trees are an excellent opportunity to test them.
Some resources about decision trees:
- LucidChart Decision Tree Explanation
- Sklearn’s Decision Tree Explanation
- My blog post about Classification Decision Trees
- R For Data Science Udemy Course —Tree Based Models Section
Random Forest
Due to their sensitivity to hyper-parameters and fairly simple assumptions, Decision Trees are fairly limited in their results. As you study them, you will understand that they are really prone to over-fitting, creating models that don’t generalize to new data.
The concept of Random Forest is really simple: if Decision Trees are a dictatorship, Random Forests are a democracy. They diversify across different decision trees, which brings robustness to your algorithm. Just like with decision trees, you can configure a ton of hyper-parameters to enhance the performance of this Bagging model. What’s Bagging? A really important concept in ML that brings stability to different models: you train several models on different random samples of the data and use the average or a voting mechanism to combine their results into a single prediction.
In practice, Random Forest trains a fixed number of Decision Trees and (normally) averages the results from all those models. Just like Decision Trees, we have Classification and Regression Random Forests. If you’ve heard about the concept of the Wisdom of the Crowds, bagging models apply that concept to the training of ML models.
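Here is a rough sketch of the bagging idea in plain Python, under a few simplifying assumptions: the base model is a one-rule "stump" rather than a full tree, there is a single feature, and the data is invented. A real Random Forest also samples a random subset of features at each split, which this one-feature sketch cannot show. The names `fit_stump` and `fit_bagged` are my own.

```python
import random
from collections import Counter


def fit_stump(points):
    """Fit a one-rule classifier: `x <= threshold` gets one label, the rest another."""
    best = None
    for threshold in sorted({x for x, _ in points}):
        left = [y for x, y in points if x <= threshold]
        right = [y for x, y in points if x > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        # If nothing falls on the right side, reuse the left label.
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return lambda x: left_label if x <= threshold else right_label


def fit_bagged(points, n_models=25, seed=0):
    """Train each stump on a bootstrap sample and combine them by majority vote."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        # Bootstrap: draw as many points as the original data, with replacement.
        sample = [rng.choice(points) for _ in points]
        models.append(fit_stump(sample))

    def predict(x):
        votes = Counter(model(x) for model in models)
        return votes.most_common(1)[0][0]

    return predict


# Hypothetical (feature, label) pairs with a little label noise around x = 4 and x = 5.
points = [(1, 0), (2, 0), (3, 0), (4, 1), (5, 0), (6, 1), (7, 1), (8, 1)]
predict = fit_bagged(points)
print([predict(x) for x in range(1, 9)])
```

Each stump sees a slightly different dataset and may pick a different threshold; the vote smooths out those individual quirks, which is the "democracy" at work.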
Some resources to learn about the Random Forest algorithm:
- Tony Yiu’s Medium post about Random Forests
- Sklearn’s Random Forest Classifier implementation
- R For Data Science Udemy Course — Tree Based Models Section
XGBoost / LightGBM
Other Decision Tree-based algorithms that bring them stability are XGBoost and LightGBM. These models are boosting algorithms: they work on the errors made by previous weak learners to find patterns that are more robust and generalize better.
This stream of thought regarding Machine Learning models, which gained traction after Michael Kearns’s paper on weak learners and hypothesis boosting, shows that boosting models may be an excellent answer to the bias/variance trade-off that models suffer from. Additionally, these models are some of the favorite choices in Kaggle competitions.
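The idea of working on the errors of previous weak learners can be sketched in plain Python. This is a bare-bones version of gradient boosting for regression with squared error, assuming one feature, regression stumps as weak learners and invented data. It is not how XGBoost or LightGBM are implemented, since both add many refinements on top; `n_rounds` and `learning_rate` are illustrative values.

```python
def fit_stump(xs, targets):
    """Regression stump: one threshold, predicting the mean target on each side."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((t - left_mean) ** 2 for t in left) + sum((t - right_mean) ** 2 for t in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def fit_boosted(xs, ys, n_rounds=50, learning_rate=0.3):
    """Start from the mean, then let each new stump fit what is still unexplained."""
    base = sum(ys) / len(ys)
    stumps = []
    predictions = [base] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]  # errors of the model so far
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x) for p, x in zip(predictions, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


def mse(model, xs, ys):
    """Mean squared error of a model on the given data."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical data with an upward trend and some noise.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.5, 3.2, 2.8, 5.1, 4.9, 7.2, 6.8]

mean_y = sum(ys) / len(ys)
baseline = lambda x: mean_y
single_stump = fit_stump(xs, ys)
boosted = fit_boosted(xs, ys)
print("baseline:", mse(baseline, xs, ys))
print("one stump:", mse(single_stump, xs, ys))
print("boosted:", mse(boosted, xs, ys))
```

These numbers are training errors, so they only show the mechanism: every round chips away at what the previous rounds got wrong. Whether the result generalizes depends on hyper-parameters such as the number of rounds and the learning rate.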
XGBoost and LightGBM are two famous implementations of Boosting algorithms. Some resources to learn about them:
- Microsoft’s LightGBM GitHub page
- Pranjal Khandelwal’s article on XGBoost vs. LightGBM
- Vishal Morde’s Medium Post about XGBoost
Artificial Neural Networks
Finally, the current holy grail of predictive models: Artificial Neural Networks (ANNs).
ANNs are currently one of the best models to find non-linear patterns in data and to build really complex relationships between independent and dependent variables. By learning them you will be exposed to the concepts of activation function, back-propagation and neural network layers — these concepts should give you good foundations to study Deep Learning models.
Additionally, Neural Networks have a ton of different flavors when it comes to their architecture. Studying the most basic ones will give you the building blocks to move on to other types of models such as Recurrent Neural Networks (mostly used in Natural Language Processing) and Convolutional Neural Networks (mostly used in Computer Vision).
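To see activation functions, layers and back-propagation in one place, here is a minimal sketch of a tiny network in plain Python. It assumes one hidden layer of three sigmoid units, a squared-error loss and the classic XOR toy problem, which a linear model cannot solve; the layer size, seed, learning rate and number of epochs are arbitrary choices of mine. In practice you would reach for a library such as Keras instead of writing this by hand.

```python
import math
import random


def sigmoid(z):
    """Activation function: squashes a score into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# XOR: the output is 1 only when exactly one input is 1.
inputs = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
targets = [0.0, 1.0, 1.0, 0.0]

rng = random.Random(42)
n_hidden = 3
# Each hidden unit holds [weight for input 0, weight for input 1, bias].
hidden_weights = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(n_hidden)]
# One weight per hidden unit, followed by the output bias.
output_weights = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]


def forward(x):
    """Forward pass: input -> hidden layer -> single output."""
    hidden = [sigmoid(w[0] * x[0] + w[1] * x[1] + w[2]) for w in hidden_weights]
    score = sum(w * h for w, h in zip(output_weights, hidden)) + output_weights[-1]
    return hidden, sigmoid(score)


def mean_squared_error():
    return sum((forward(x)[1] - t) ** 2 for x, t in zip(inputs, targets)) / len(inputs)


def train(n_epochs=10000, learning_rate=0.5):
    """Per-example gradient descent, with gradients obtained by back-propagation."""
    for _ in range(n_epochs):
        for x, t in zip(inputs, targets):
            hidden, output = forward(x)
            # Gradient of 0.5 * (output - t) ** 2 with respect to the output score.
            delta_out = (output - t) * output * (1 - output)
            for j in range(n_hidden):
                # Propagate the error back through output weight j (before updating it).
                delta_hidden = delta_out * output_weights[j] * hidden[j] * (1 - hidden[j])
                output_weights[j] -= learning_rate * delta_out * hidden[j]
                hidden_weights[j][0] -= learning_rate * delta_hidden * x[0]
                hidden_weights[j][1] -= learning_rate * delta_hidden * x[1]
                hidden_weights[j][2] -= learning_rate * delta_hidden
            output_weights[-1] -= learning_rate * delta_out


initial_loss = mean_squared_error()
train()
final_loss = mean_squared_error()
print("loss before:", initial_loss, "after:", final_loss)
print([round(forward(x)[1], 2) for x in inputs])
```

How close the outputs get to 0, 1, 1, 0 depends on the random starting weights, which is itself a useful first lesson about training neural networks.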
Some extra resources to learn about them:
- IBM “What are Neural Networks” article
- Keras (Neural Network implementation and abstraction) documentation
- Sanchit Tanwar’s article about Building your First Neural Network
And, that’s it! These models should give you a nice head start in Data Science and Machine Learning. By learning them you will be prepared to learn more advanced models and easily grasp the math behind those models.
The good part is that the more advanced stuff is normally based on the 6 models I’ve presented here, so knowing their underlying math and mechanisms will never hurt, even in projects where you need to bring the “big guns”.
Do you think there is something missing? Write it down in the comments below; I would love to hear your opinion.
I’ve set up a Udemy course on learning most of these models. The course is suitable for beginners and I would love to have you around.
Bio: Ivo Bernardo is a Partner & Data Scientist @ DareData Engineering, and a Udemy Bestseller Instructor and Teacher.
Original. Reposted with permission.