A Brief Primer on Linear Regression – Part III
This third part of an introduction to linear regression moves past the topics covered in the first to discuss linearity, normality, outliers, and other topics of interest.
By Pushpa Makhija, CleverTap.
In Part I, we learnt the basics of Linear Regression and in Part II, we have seen that testing the assumptions in simple and multiple regression before building a regression model is analogous to knowing the rules upfront before playing a fair game.
Building a regression model involves collecting predictor and response values for common samples, exploring data, cleaning and preparing data and then fitting a suitable mathematical relationship to the collected data to answer: Which factors matter the most? Which factors we can ignore?
In the linear regression model, the dependent variable is expressed as a linear function of two or more independent variables plus an error introduced to account for all other factors, as mentioned below:
Y = a + b1x1 + b2x2 + b3x3 + b4x4….+ bnxn + ϵ
Linear Regression Model Building
Prior to building any predictive model, the data exploration and preparation stage ensures that every variable is in the form as desired by model. The next step is to build a suitable model and then interpret the model output to assess whether the built model is a good fit for the given data.
As a quick recap of our height – weight dataset, a sample of 10 rows of this dataset has been displayed below:Let’s work on fitting a linear model to the above dataset by using height, calorie intake, and exercise level as predictors for predicting the response variable – weight of an individual and then derive valuable insights by interpreting the model output.
Interpreting the output
This output includes a conventional table with parameter estimates and their standard errors, t-value, p-value, F-statistic, as well as the residual standard error and multiple R-squared.
Now we define and explain briefly each component – marked as 1 – 10 in the above model output.
1 Residuals – are essentially the errors in prediction – for our example, the difference between the actual observed “Weight” values and the model’s predicted “Weight” values.
The Residuals section of the model output breaks down into 5 summary point measures, to help us assess whether the distribution of residuals is normal i.e. bell shaped curve across the mean value zero (0).
In our case, we see that the distribution of the residuals do not appear to be strongly normal as median is to the left of 0. That means the model predicts certain points that fall far away from the actual points.
Residuals can be thought of as similar to a dart board. A good model is the one which will hit the bull’s-eye some of the time. When it doesn’t hit the bull’s-eye, the miss should be close enough to the bull’s-eye than on the outer edges of the dart board.
2 Estimated Coefficients – are the unknown constants that represent the intercept and slope terms in the linear model. The estimated coefficient is the value of slope calculated by the regression.
As in the above table, the coefficient Estimate column contains two or more rows: the first one is the intercept and the rest rows are for each of the independent variable(s).
- The intercept is the base value for DV (weight) when all IVs (height, calorie, exercise level in our case) are zero. In this context, the intercept value is relatively meaningless since weight of 0 lbs is unlikely to occur for even an infant. Hence, we cannot draw any further interpretation from this coefficient.
- From the second row onwards there are slopes for each of the independent variables considered for building predictive model. In short, the size of the coefficient for each IV gives the size of the effect that variable has on the DV and the sign on the coefficient (positive or negative) gives the direction of the effect. For example, the effect height has in predicting weight of a person. The slope term of height indicates that for every 1 inch increase in the height of a person, the weight goes up by 3.891 pounds (lbs), holding all other IVs constant.
3 & 8 Standard Error of the Coefficient Estimate & Residual Standard Error – are just the standard deviations of Coefficient Estimate and Residuals. The standard error of the estimate measures the dispersion (variability) of the coefficient, the amount it varies across cases and Residual standard error reflects the quality of a linear regression fit. Lower the standard error, the better it is, for accuracy of predictions. Therefore, you would expect to observe most of the actual values to cluster fairly closely to the regression line.
For example, to decide among the 2 datasets of 10 heights having same mean of 69 inches but different standard deviations (SD) – one with σ = 2.7 and the other one with σ = 6.3, we should select the dataset with σ = 2.7 to use height as one of predictor for creating predictive model.
In our example, the std error of the height variable is 0.262 which is far less than the height Coefficient – Estimate (3.891). Also, the actual weight can deviate from the true regression line by approximately 8.968 lbs, on an average.