KDnuggets Home » News » 2018 » Jan » Tutorials, Overviews » Learning Curves for Machine Learning ( 18:n04 )

# Learning Curves for Machine Learning

But how do we diagnose bias and variance in the first place? And what actions should we take once we've detected something? In this post, we'll learn how to answer both these questions using learning curves.

By Alex Olteanu, Student Success Specialist at Dataquest.io

When building machine learning models, we want to keep error as low as possible. Two major sources of error are bias and variance. If we managed to reduce these two, then we could build more accurate models.

But how do we diagnose bias and variance in the first place? And what actions should we take once we've detected something?

In this post, we'll learn how to answer both these questions using learning curves. We'll work with a real world data set and try to predict the electrical energy output of a power plant.

We'll generate learning curves while trying to predict the electrical energy output of a power plant. Image source: Pexels.

Some familiarity with scikit-learn and machine learning theory is assumed. If you don't frown when I say cross-validation or supervised learning, then you're good to go. If you're new to machine learning and have never tried scikit, a good place to start is this blog post.

We begin with a brief introduction to bias and variance.

In supervised learning, we assume there's a real relationship between feature(s) and target and estimate this unknown relationship with a model. Provided the assumption is true, there really is a model, which we'll call $\boldsymbol{\mathbf{\mathit{f}}}$, which describes perfectly the relationship between features and target.

In practice, $\boldsymbol{\mathbf{\mathit{f}}}$ is almost always completely unknown, and we try to estimate it with a model $\boldsymbol{\mathbf{\hat{\mathit{f}}}}$ (notice the slight difference in notation between $\boldsymbol{\mathbf{\mathit{f}}}$ and $\boldsymbol{\mathbf{\hat{\mathit{f}}}}$). We use a certain training set and get a certain $\boldsymbol{\mathbf{\hat{\mathit{f}}}}$. If we use a different training set, we are very likely to get a different $\boldsymbol{\mathbf{\mathit{\hat{f}}}}$. As we keep changing training sets, we get different outputs for $\boldsymbol{\mathbf{\mathit{\hat{f}}}}$. The amount by which $\boldsymbol{\mathbf{\mathit{\hat{f}}}}$ varies as we change training sets is called variance.

To estimate the true $\boldsymbol{\mathbf{\mathit{f}}}$, we use different methods, like linear regression or random forests. Linear regression, for instance, assumes linearity between features and target. For most real-life scenarios, however, the true relationship between features and target is complicated and far from linear. Simplifying assumptions give bias to a model. The more erroneous the assumptions with respect to the true relationship, the higher the bias, and vice-versa.

Generally, a model $\boldsymbol{\mathbf{\mathit{\hat{f}}}}$ will have some error when tested on some test data. It can be shown mathematically that both bias and variance can only add to a model's error. We want a low error, so we need to keep both bias and variance at their minimum. However, that's not quite possible. There's a trade-off between bias and variance.

A low-biased method fits training data very well. If we change training sets, we'll get significantly different models $\boldsymbol{\mathbf{\mathit{\hat{f}}}}$.

You can see that a low-biased method captures most of the differences (even the minor ones) between the different training sets. $\boldsymbol{\mathbf{\mathit{\hat{f}}}}$ varies a lot as we change training sets, and this indicates high variance.

The less biased a method, the greater its ability to fit data well. The greater this ability, the higher the variance. Hence, the lower the bias, the greater the variance.

The reverse also holds: the greater the bias, the lower the variance. A high-bias method builds simplistic models that generally don't fit well training data. As we change training sets, the models $\boldsymbol{\mathbf{\mathit{\hat{f}}}}$ we get from a high-bias algorithm are, generally, not very different from one another.

If $\boldsymbol{\mathbf{\mathit{\hat{f}}}}$ doesn't change too much as we change training sets, the variance is low, which proves our point: the greater the bias, the lower the variance.

Mathematically, it's clear why we want low bias and low variance. As mentioned above, bias and variance can only add to a model's error. From a more intuitive perspective though, we want low bias to avoid building a model that's too simple. In most cases, a simple model performs poorly on training data, and it's extremely likely to repeat the poor performance on test data.

Similarly, we want low variance to avoid building an overly complex model. Such a model fits almost perfectly all the data points in the training set. Training data, however, generally contains noise and is only a sample from a much larger population. An overly complex model captures that noise. And when tested on out-of-sample data, the performance is usually poor. That's because the model learns the sample training data too well. It knows a lot about something and little about anything else.

In practice, however, we need to accept a trade-off. We can't have both low bias and low variance, so we want to aim for something in the middle.

Source: Scott Fortmann-Roe - Understanding the Bias-Variance Tradeoff

We'll try to build some practical intuition for this trade-off as we generate and interpret learning curves below.

### Learning curves - the basic idea

Let's say we have some data and split it into a training set and validation set. We take one single instance (that's right, one!) from the training set and use it to estimate a model. Then we measure the model's error on the validation set and on that single training instance. The error on the training instance will be 0, since it's quite easy to perfectly fit a single data point. The error on the validation set, however, will be very large. That's because the model is built around a single instance, and it almost certainly won't be able to generalize accurately on data that hasn't seen before.

Now let's say that instead of one training instance, we take ten and repeat the error measurements. Then we take fifty, one hundred, five hundred, until we use our entire training set. The error scores will vary more or less as we change the training set.

We thus have two error scores to monitor: one for the validation set, and one for the training sets. If we plot the evolution of the two error scores as training sets change, we end up with two curves. These are called learning curves.

In a nutshell, a learning curve shows how error changes as the training set size increases. The diagram below should help you visualize the process described so far. On the training set column you can see that we constantly increase the size of the training sets. This causes a slight change in our models $\boldsymbol{\mathbf{\mathit{\hat{f}}}}$.

In the first row, where n = 1 (n is the number of training instances), the model fits perfectly that single training data point. However, the very same model fits really bad a validation set of 20 different data points. So the model's error is 0 on the training set, but much higher on the validation set.

As we increase the training set size, the model cannot fit perfectly anymore the training set. So the training error becomes larger. However, the model is trained on more data, so it manages to fit better the validation set. Thus, the validation error decreases. To remind you, the validation set stays the same across all three cases.

If we plotted the error scores for each training size, we'd get two learning curves looking similarly to these:

Learning curves give us an opportunity to diagnose bias and variance in supervised learning models. We'll see how that's possible in what follows.

### Introducing the data

The learning curves plotted above are idealized for teaching purposes. In practice, however, they usually look significantly different. So let's move the discussion in a practical setting by using some real-world data.

We'll try to build regression models that predict the hourly electrical energy output of a power plant. The data we use come from Turkish researchers Pınar Tüfekci and Heysem Kaya, and can be downloaded from here. As the data is stored in a .xlsx file, we use pandas' read_excel() function to read it in:

import pandas as pd

print(electricity.info())
electricity.head(3)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9568 entries, 0 to 9567
Data columns (total 5 columns):
AT    9568 non-null float64
V     9568 non-null float64
AP    9568 non-null float64
RH    9568 non-null float64
PE    9568 non-null float64
dtypes: float64(5)
memory usage: 373.8 KB
None

AT V AP RH PE
0 14.96 41.76 1024.07 73.17 463.26
1 25.18 62.96 1020.04 59.08 444.37
2 5.11 39.40 1012.16 92.14 488.56

Let's quickly decipher each column name:

ABBREVIATION FULL NAME
AT Ambiental Temperature
V Exhaust Vacuum
AP Ambiental Pressure
RH Relative Humidity
PE Electrical Energy Output

The PE column is the target variable, and it describes the net hourly electrical energy output. All the other variables are potential features, and the values for each are actually hourly averages (not net values, like for PE).

The electricity is generated by gas turbines, steam turbines, and heat recovery steam generators. According to the documentation of the data set, the vacuum level has an effect on steam turbines, while the other three variables affect the gas turbines. Consequently, we'll use all of the feature columns in our regression models.

At this step we'd normally put aside a test set, explore the training data thoroughly, remove any outliers, measure correlations, etc. For teaching purposes, however, we'll assume that's already done and jump straight to generate some learning curves. Before we start that, it's worth noticing that there are no missing values. Also, the numbers are unscaled, but we'll avoid using models that have problems with unscaled data.

### Deciding upon the training set sizes

Let's first decide what training set sizes we want to use for generating the learning curves.

The minimum value is 1. The maximum is given by the number of instances in the training set. Our training set has 9568 instances, so the maximum value is 9568.

However, we haven't yet put aside a validation set. We'll do that using an 80:20 ratio, ending up with a training set of 7654 instances (80%), and a validation set of 1914 instances (20%). Given that our training set will have 7654 instances, the maximum value we can use to generate our learning curves is 7654.

For our case, here, we use these six sizes:

train_sizes = [1, 100, 500, 2000, 5000, 7654]

An important thing to be aware of is that for each specified size a new model is trained. If you're using cross-validation, which we'll do in this post, k models will be trained for each training size (where k is given by the number of folds used for cross-validation). To save code running time, it's good practice to limit yourself to 5-10 training sizes.