Gold Blog 7 Steps to Mastering Intermediate Machine Learning with Python — 2019 Edition

This is the second part of this new learning path series for mastering machine learning with Python. Check out these 7 steps to help master intermediate machine learning with Python!

Are you interested in learning more about machine learning with Python?

I recently wrote 7 Steps to Mastering Basic Machine Learning with Python — 2019 Edition, a first step in an attempt to updated a pair of posts I wrote some time back (7 Steps to Mastering Machine Learning With Python and 7 More Steps to Mastering Machine Learning With Python), a pair of posts which are getting stale at this point, having been around for a few years. It's time to add on to the "basic" post with a set of steps for learning "intermediate" level machine learning with Python.

We're talking "intermediate" in a relative sense, however, so do not expect to be a research-caliber machine learning engineer after getting through this post. The learning path is aimed at those with some understanding of programming, computer science concepts, and/or machine learning in an abstract sense, who are wanting to be able to use the implementations of machine learning algorithms of the prevalent Python libraries to build their own machine learning models.

Header image

This post, as those which came before, will leverage the existing tutorials, videos, and works of a variety of folks, so the thanks for anything included herein should be directed at them.

Instead of having a high number of resources for each topic step (say, dimensionality reduction), I have tried to select a quality tutorial or two, along with an accessible video preliminarily describing the underlying theory, math, or intuition of the given topic when appropriate.

These steps deal with machine learning algorithms, the importance of feature selection and engineering, model training, transfer learning, and more.

So grab a drink, sit back, and read the second installment in the series, and start mastering intermediate machine learning with Python in these 7 steps.


1. Getting Started

It probably goes without saying, but your first step should be to review the previous post in this series, 7 Steps to Mastering Basic Machine Learning with Python — 2019 Edition.

It might also be a good idea to keep Google's Machine Learning Glossary close by for reference along the way, or to have a quick look at beforehand.

The quick start guide from the official documentation of each of the following Python libraries which are used to do a lot of the heavy lifting for machine learning and other data analysis tasks herein are likely a good idea to have a look at as well:

Now, on to the fun stuff.


2. Feature Selection

A feature is a variable from the input dataset which can be used to help make predictions. All features are not created equal, however, and sometimes the raw features provided need to be used to engineer new features which can more useful in this pursuit of prediction.

Read this article on Feature Selection Techniques in Machine Learning with Python by Raheel Shaikh to get an understanding of ways to go about feature selection techniques, and how they are approached in Python.

Next, read Beware Default Random Forest Importances by Terence Parr, Kerem Turgutlu, Christopher Csiszar, and Jeremy Howard, an informative take on why "[t]he scikit-learn Random Forest feature importance and R's default Random Forest feature importance strategies are biased." Random Forest is a common method for selecting features to use in prediction, based on their importance, and this article provides insight into why blindly using any particular method for doing so is not a great idea.

Finally, have a look at this article, Step Forward Feature Selection: A Practical Example in Python, which demonstrates an implementation of step forward feature selection, a disciplined, statistical approach to the task.


3. Feature Engineering

Sometimes all of the raw features, or some subset of them, can be used as is for prediction purposes. Other times, new features can be engineered from the existing in order to facilitate better prediction.

Take, for example, a simple date. This date, as is, could be useless for prediction. Knowing whether that date was a weekday or weekend, or if it were a statutory holiday, might be incredibly helpful, however. Using this raw date to create a new, more useful feature is a simple example of feature engineering.

First, read this article on feature engineering from Google's Machine Learning Crash Course for an overview on the topic.

Then read Will Koehrsen's Feature Engineering: What Powers Machine Learning for more information on the topic, which brings Python into the picture.

Finally, read Automated Feature Engineering in Python, also by Will Koehrsen, for an introduction on how feature engineering can be automated and outsourced to the algorithms.


4. Even More Classification

Next, let's turn our attention to some classification algorithms. The first part of this series considered logistic regression, decision trees, and support vector machines. This time around we will look at a pair of other well-used and time-tested techniques: k-nearest neighbors and Naive Bayes.

First, watch this short video from StatQuest explaining what k-nearest neighbors is (k-NN), and what its approach to classification is.

Then read this article by Sam Grassi, Building & Improving a K-Nearest Neighbors Algorithm in Python, which first uses Scikit-learn's k-NN implementation to perform classification, and then implements k-NN from scratch in Python for comparison.

After learning about k-NN, we turn our attention to Naive Bayes. Watch this video, also from StatQuest, in order to establish the intuition behind the algorithm.

Then it's time to get practical. Follow this tutorial, Naive Bayes Classification With Sklearn, by Martin Müller, to learn more as you use Scikit-learn's implementation to build a classifier.


5. Model Training & Selection

On to model training and selection.

There are 2 major points here. First, without trained models, we can't make predictions. And second, with a number of models, we need to select the "best" among them.

We will first look at the concept of training, testing, and validation sets. Read this article by Tarang Shah, About Train, Validation and Test Sets in Machine Learning, which introduces the concepts.

Then look at this video from StatQuest on the related concept of cross-validation.

These concepts above are integral to model training. But what about once we have a number of models? How do we choose between them?

First, this video from StatQuest explains what a confusion matrix is, and how it helps summarize a machine learning classifier outcomes.

Dig deeper into this topic with another StatQuest video on the topic of model sensitivity and specificity.

After, read up on machine learning classification metrics, and how they can be practically implemented in Python using Scikit-learn. This article from Andrew Long, Data Science Performance Metrics for Everyone, introduces metrics, so start here.

Then follow along in another of Andrew Long's articles, Understanding Data Science Classification Metrics in Scikit-Learn in Python, and learn how classification metrics can be implements in Python.

Finally, get a sense of how to choose among evaluation metrics by reading Alvira Swalin's Choosing the Right Metric for Evaluating Machine Learning Models  –  Part 1, and Choosing the Right Metric for Evaluating Machine Learning Models  –  Part 2, a great pair of resources on how to approach this process.


6. Dimensionality Reduction

What is dimensionality reduction? Well, since you've asked, have a look at this video from Stanford's Jure Leskovec explaining just that.

One of the most used forms of dimensionality reduction is a technique known as principal component analysis (PCA), a type of transformation which converts a dataset of possibly correlated variables to linearly uncorrelated variables; these linearly uncorrelated variables are known as principal components. Watch this video from StatQuest outlining PCA in more detail.

Now have a look at this article by Zichen Wang, PCA and SVD explained with numpy, demonstrating an implementation of PCA — along with singular value decomposition (SVD), another popular dimensionality reduction technique — from scratch in Python using Numpy.

Finally, the excerpt chapter In Depth: Principal Component Analysis from Jake VanderPlas' book "Python Data Science Handbook" gives a thorough treatment to implementing PCA using Scitkit-learn.


7. Transfer Learning

Transfer learning is the repurposed use of a model for a task other than the one it was originally trained for. Of course, transfer learning is much more complex than this simple one sentence explanation, but it conveys the basic idea.

Watch this video from Kaggle, which better describes what transfer learning is, and what it can do.

Then read Sebastian Ruder's overview of transfer learning concepts, Transfer Learning - Machine Learning's Next Frontier. The post is a couple of years old, and transfer learning moves quickly, but the concepts covered are no less valid today than the day it was written.

To see the value of transfer learning let's have a look at how it can be used for image classification (one of its greatest accomplishments) using neural networks created with the deep learning library Keras. If you are not familiar with Keras, have a look at this quick start guide, Introduction to Deep Learning with Keras, written by Gilbert Tanner.

Now put transfer learning to practical use with this tutorial written by George Seif, Transfer Learning for Image Classification using Keras. This should be enough to demonstrate how powerful transfer learning is, not just in image classification but also in natural language processing tasks and beyond.

I hope you have found value in these 7 steps to mastering intermediate machine learning with Python. Join us next time when we will move on to some more advanced topics.