Using Linear Regression for Predictive Modeling in R

In this post, we’ll use linear regression to build a model that predicts cherry tree volume from metrics that are much easier for folks who study trees to measure.

comments

How well does the model fit the data?

Residuals:

This section of the output provides us with a summary of the residuals (recall that these are the distances between our observation and the model), which tells us something about how well our model fit our data. The residuals should have a pretty symmetrical distribution around zero. Generally, we’re looking for the residuals to be normally distributed around zero (i.e. a bell curve distribution), but the important thing is that there’s no visually obvious pattern to them, which would indicate that a linear model is not appropriate for the data.

We can make a histogram to visualize this using ggplot2.

ggplot(data=trees, aes(fit_1$residuals)) + 
  geom_histogram(binwidth = 1, color = "black", fill = "purple4") +
  theme(panel.background = element_rect(fill = "white"),
        axis.line.x=element_line(),
        axis.line.y=element_line()) +
  ggtitle("Histogram for Model Residuals")

hist-1
Our residuals look pretty symmetrical around 0, suggesting that our model fits the data well.

Residual standard error:

This term represents the average amount that our response variable measurements deviate from the fitted linear model (the model error term).

Degrees of freedom (DoF):

Discussion of degrees of freedom can become rather technical. For the purposes of this post, it is sufficient to think of them as the number of independent pieces of information that were used to calculate an estimate. DoF are related to, but not the same as, the number of measurements.

Multiple R-squared:

The R² value is a measure of how close our data are to the linear regression model. R²values are always between 0 and 1; numbers closer to 1 represent well-fitting models. R² always increases as more variables are included in the model, and so adjusted R² is included to account for the number of independent variables used to make the model.

F statistic:

This test statistic tells us if there is a relationship between the dependent and independent variables we are testing. Generally, a large F indicates a stronger relationship.

p-value:

This p-value is associated with the F statistic, and is used to interpret the significance for the whole model fit to our data.

Let’s have a look at our model fitted to our data for width and volume. We can do this by using ggplot() to fit a linear model to a scatter plot of our data:

ggplot(data = trees, aes(x = Girth, y = Volume)) + geom_point()  +
  stat_smooth(method = "lm", col = "dodgerblue3") +
  theme(panel.background = element_rect(fill = "white"),
        axis.line.x=element_line(),
        axis.line.y=element_line()) +
  ggtitle("Linear Model Fitted to Data")

The gray shading around the line represents a confidence interval of 0.95, the default for the stat_smooth() function, which smoothes data to make patterns easier to visualize. This 0.95 confidence interval is the probability that the true linear model for the girth and volume of all black cherry trees will lie within the confidence interval of the regression model fitted to our data.

Even though this model fits our data quite well, there is still variability within our observations. This is because the world is generally untidy. In our model, tree volume is not just a function of tree girth, but also of things we don't necessarily have data to quantify (individual differences between tree trunk shape, small differences in foresters’ trunk girth measurement techniques).

Sometimes, this variability obscures any relationship that may exist between response and predictor variables. But here, the signal in our data is strong enough to let us develop a useful model for making predictions.

Using our simple linear model to make predictions

Our model is suitable for making predictions! Tree scientists everywhere rejoice. Let’s say we have girth, height and volume data for a tree that was left out of the data set. We can use this tree to test our model.

Girth	Height	Volume
18.2 in	72 ft	46.2 ft³

How well will our model do at predicting that tree’s volume from its girth?

We’ll use the predict() function, a generic R function for making predictions from modults of model-fitting functions. predict() takes as arguments our linear regression model and the values of the predictor variable that we want response variable values for.

predict(fit_1, data.frame(Girth = 18.2))

Our volume prediction is 55.2 ft³.

This is close to our actual value, but it's possible that adding height, our other predictive variable, to our model may allow us to make better predictions.

Adding more predictors: multiple linear regression

Maybe we can improve our model's predictive ability if we use all the information we have available (width and height) to make predictions about tree volume. It’s important that the five-step process from the beginning of the post is really an iterative process – in the real world, you’d get some data, build a model, tweak the model as needed to improve it, then maybe add more data and build a new model, and so on, until you’re happy with the results and/or confident that you can’t do any better.

We could build two separate regression models and evaluate them, but there are a few problems with this approach. First, imagine how cumbersome it would be if we had 5, 10, or even 50 predictor variables. Second, two predictive models would give us two separate predictions for volume rather than the single prediction we’re after. Perhaps most importantly, building two separate models doesn’t let us account for relationships among predictors when estimating model coefficients.

In our data set, we suspect that tree height and girth are correlated based on our initial data exploration. As we’ll begin to see more clearly further along in this post, ignoring this correlation between predictor variables can lead to misleading conclusions about their relationships with tree volume.

A better solution is to build a linear model that includes multiple predictor variables. We can do this by adding a slope coefficient for each additional independent variable of interest to our model.

Tree Volume ≈ Intercept + Slope1(Tree Girth) + Slope2(Tree Height) + Error

This is easy to do with the lm() function: We just need to add the other predictor variable.

fit_2 <- lm(Volume ~ Girth + Height, data = trees)
summary(fit_2)

We can see from the model output that both girth and height are significantly related to volume, and that the model fits our data well. Our adjusted R² value is also a little higher than our adjusted R² for model fit_1.

Since we have two predictor variables in this model, we need a third dimension to visualize it. We can create a nice 3d scatter plot using the package scatterplot3d:

First, we make a grid of values for our predictor variables (within the range of our data). The expand.grid() function creates a data frame from all combinations of the factor variables.

Girth <- seq(9,21, by=0.5) ## make a girth vector
Height <- seq(60,90, by=0.5) ## make a height vector
pred_grid <- expand.grid(Girth = Girth, Height = Height)  ## make a grid using the vectors

Next, we make predictions for volume based on the predictor variable grid:

pred_grid$Volume2 <-predict(fit_2, new = pred_grid)

Now we can make a 3d scatterplot from the predictor grid and the predicted volumes:

fit_2_sp <- scatterplot3d(pred_grid$Girth, pred_grid$Height, pred_grid$Volume2, angle = 60, color = "dodgerblue", pch = 1, 
              ylab = "Hight (ft)", xlab = "Girth (in)", zlab = "Volume (ft3)" )

And finally overlay our actual observations to see how well they fit:

fit_2_sp$points3d(trees$Girth, trees$Height, trees$Volume, pch=16)

Let’s see how this model does at predicting the volume of our tree. This time, we include the tree’s height since our model uses Height as a predictive variable:

predict(fit_2, data.frame(Girth = 18.2, Height = 72))

This time, we get a predicted volume of 52.13 ft³. This prediction is closer to our true tree volume than the one we got using our simple model with only girth as a predictor, but, as we’re about to see, we may be able to improve.

Accounting for interactions

While we’ve made improvements, the model we just built still doesn’t tell the whole story. It assumes that the effect of tree girth on volume is independent from the effect of tree height on volume. This is clearly not the case, since tree height and girth are related; taller trees tend to be wider, and our exploratory data visualization indicated as much. Put another way, the slope for girth should increase as the slope for height increases.

To account for this non-independence of predictor variables in our model, we can specify an interaction term, which is calculated as the product of the predictor variables.

Tree Volume ≈ Intercept + Slope1(Tree Girth) + Slope2(Tree Height) + Slope3(Tree Girth x Tree Height)+ Error

Once again, it’s easy to build this model using lm():

fit_3 <- lm(Volume ~ Girth * Height, data = trees)
summary(fit_3)

Note that the "Girth * Height" term is shorthand for "Girth + Height + Girth * Height" in our model.

As we suspected, the interaction of girth and height is significant, suggesting that we should include the interaction term in the model we use to predict tree volume. This decision is also supported by the adjusted R² value close to 1, the large value of F and the small value of p that suggest our model is a very good fit for the data.

Let’s have a look at a scatter plot to visualize the predicted values for tree volume using this model. We can use the same grid of predictor values we generated for the fit_2 visualization:

Girth <- seq(9,21, by=0.5)
Height <- seq(60,90, by=0.5)
pred_grid <- expand.grid(Girth = Girth, Height = Height)

Similarly to how we visualized the fit_2 model, we will use the fit_3 model with the interaction term to predict values for volume from the grid of predictor variables:

pred_grid$Volume3 <-predict(fit_3, new = pred_grid)

Now we make a scatter plot of the predictor grid and the predicted volumes:

fit_3_sp <- scatterplot3d(pred_grid$Girth, pred_grid$Height, pred_grid$Volume3, angle = 60, color = "dodgerblue", pch = 1, ylab = "Hight (ft)", xlab = "Girth (in)", zlab = "Volume (ft3)")

Finally, we overlay our observed data:

fit_3_sp$points3d(trees$Girth, trees$Height, trees$Volume, pch=16)

It’s a little hard to see in this picture, but this time our predictions lie on some curved surface instead of a flat plane. Now for the moment of truth: let’s use this model to predict our tree’s volume.

predict(fit_3, data.frame(Girth = 18.2, Height = 72))

Our predicted value using this third model is 45.89, the closest yet to our true value of 46.2 ft³.

Some cautionary notes about predictive models

Keep the range of your data in mind

When using a model to make predictions, it’s a good idea to avoid trying to extrapolate to far beyond the range of values used to build the model. To illustrate this point, let’s try to estimate the volume of a small sapling (a young tree):

predict(fit_3, data.frame(Girth = 0.25, Height = 4))

We get a predicted volume of 62.88 ft³, more massive than the tall trees in our data set. Of course this doesn’t make sense. Keep in mind that our ability to make accurate predictions is constrained by the range of the data we use to build our models.

Avoid making a model that's too specific to your data set

In the simple example data set we investigated in this post, adding a second variable to our model seemed to improve our predictive ability. However, when trying a variety of multiple linear regression models with many difference variables, choosing the best model becomes more challenging. If too many terms that don't improve the model's predictive ability are added, we risk "overfitting" our model to our particular data set. A model that is overfit to a particular data set loses functionality for predicting future events or fitting different data sets and therefore isn't terribly useful.

While methods we used for assessing model validity in this post (adjusted R², residual distributions) are useful for understanding how well your model fits your data, applying your model to different subsets of your data set can provide information about how well your model will perform in practice. This method, known as "cross-validation", is commonly used to test predictive models. In our example, we used each of our three models to predict the volume of a single tree. If we were building more complex models, however, we would want to withold a subset of the data for cross-validation.

Next Steps

We used linear regression to build models for predicting continuous response variables from two continuous predictor variables, but linear regression is a useful predictive modeling tool for many other common scenarios.

As a next step, try building linear regression models to predict response variables from more than two predictor variables. Think about how you may decide which variables to include in a regression model; how can you tell which are important predictors? How might the relationships among predictor variables interfere with this decision? Data sets in R that are useful for working on multiple linear regression problems include: airquality, iris, and mtcars.

Another important concept in building models from data is augmenting your data with new predictors computed from the existing ones. This is called feature engineering, and it’s where you get to use your own expert knowledge about what else might be relevant to the problem. For example, if you were looking at a database of bank transactions with timestamps as one of the variables, it’s possible that day of the week might be relevant to the question you wanted to answer, so you could compute that from the timestamp and add it to the database as a new variable. This is a complicated topic, and adding more predictor variables isn’t always a good idea, but it’s something you should keep in mind as you learn more about modeling.

In the trees data set used in this post, can you think of any additional quantities you could compute from girth and height that would help you predict volume? (Hint: think back to when you learned the formula for the volumes of various geometric shapes, and think about what a tree looks like.)

Finally, although we focused on continuous data, linear regression can be extended to make predictions from categorical variables, too. Try using linear regression models to predict response variables from categorical as well as continuous predictor variables. There are a few data sets in R that lend themselves especially well to this exercise: ToothGrowth, PlantGrowth, and npk.

Original. Reposted with permission.

Related: