A Brief Primer on Linear Regression – Part III

This third part of an introduction to linear regression moves past the topics covered in the first to discuss linearity, normality, outliers, and other topics of interest.

9 Multiple R-squared & Adjusted R-squared – The R-squared statistic (R2), also known as the coefficient of determination, measures how well the model fits the observed data.

R2 is the proportion of the variance in the response variable that is accounted for by the model.

For a least-squares model with an intercept, R2 always lies between 0 and 1. A value near 0 means the regression explains very little of the variability in the response variable, while a value close to 1 means it explains most of the observed variance.
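As a rough illustration of the definition above, R2 can be written as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below assumes that form; the weights and fitted values are made-up numbers for illustration only, not the data behind this article's model.

```python
def r_squared(actual, predicted):
    """Return R2 = 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_residual / ss_total


# Hypothetical observed weights (kg) and hypothetical fitted values.
observed_weights = [60.0, 72.0, 68.0, 80.0, 75.0]
fitted_weights = [62.0, 70.0, 69.0, 78.0, 76.0]

r2 = r_squared(observed_weights, fitted_weights)
print(f"R2 = {r2:.3f}")
```

Predictions that match the data exactly give R2 = 1, and predicting the mean for every observation gives R2 = 0.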

R2 tends to overestimate the success of the model, since it never decreases, and usually increases, when extra explanatory variables are added, whether or not they are truly related to the response. Adj. R2 corrects this value to provide a better estimate of the true population value by taking into account the number of variables and the number of observations that go into building the model.

Adj. R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)

where p is the total number of variables in the model (excluding the constant, i.e. the Intercept term), and n is the sample size.

Unlike R2, which never decreases as more variables are included in the model, adjusted R2 increases only if the new term improves the model more than would be expected by chance, and decreases when a predictor improves the model by less than expected by chance. It is never higher than R2. For this reason, adj. R2 is the preferred measure for evaluating model fit, as it adjusts for the number of variables considered.
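To make the penalty concrete, here is a minimal sketch that assumes the usual textbook formula, adj. R2 = 1 - (1 - R2)(n - 1)/(n - p - 1), with p predictors (excluding the intercept) and n observations. The R2 value and predictor counts are illustrative, not taken from the article's model.

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted R2 for n observations and p predictors (intercept excluded)."""
    if n - p - 1 <= 0:
        raise ValueError("need more observations than estimated parameters")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# Same raw R2 and sample size, but different numbers of predictors.
same_r2 = 0.80
adj_few = adjusted_r_squared(same_r2, n=100, p=2)
adj_many = adjusted_r_squared(same_r2, n=100, p=20)

print(f"p = 2:  adj. R2 = {adj_few:.4f}")
print(f"p = 20: adj. R2 = {adj_many:.4f}")
```

With the raw R2 held fixed, the model that uses more predictors gets the lower adjusted value, which is the behaviour described above.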

When choosing between two models for the same response variable, it is generally better to pick the one with the higher adj. R2. However, a higher value does not by itself guarantee accurate predictions or an adequate regression model.

In our case, ~80% of the variance in the response variable (weight) is explained by the predictors (height and calorie intake). This also matches intuition: knowing a person's height and calorie intake, we would expect to predict their weight quite well, which is reflected in the relatively strong R2 value.

4 – 7; 10 The t-value of the Coefficient Estimate, the Variable p-value with Significance Stars and Codes, and the F-statistic with Degrees of Freedom and p-value are the terms used to assess the model fit and the significance of the model or its components through statistical tests.

t-value of the Coefficient Estimate

The t-value is a score that measures whether the regression coefficient for a variable is meaningful for the model, i.e. whether the coefficient is significantly different from zero.

The t-statistic value is computed as:

t-value = Coefficient Estimate / Standard Error of the Coefficient Estimate

In our example, the coefficient estimates for height and calorie intake are large relative to their standard errors, so their t-values are far from zero, which indicates that a relationship is likely to exist.
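The ratio of a coefficient estimate to its standard error can be sketched for the simplest case, a regression with a single predictor. The article's model has several predictors, so this is a simplified illustration; the heights and weights below are made-up values, and `slope_t_value` is a hypothetical helper name.

```python
import math


def slope_t_value(x, y):
    """Return (slope, standard error, t-value) for a one-predictor OLS fit."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # The residual variance uses n - 2 degrees of freedom
    # (two estimated parameters: intercept and slope).
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    residual_variance = sum(r ** 2 for r in residuals) / (n - 2)

    std_error = math.sqrt(residual_variance / sxx)
    return slope, std_error, slope / std_error


# Hypothetical heights (cm) and weights (kg).
heights = [150.0, 160.0, 165.0, 170.0, 180.0, 185.0]
weights = [52.0, 58.0, 63.0, 66.0, 77.0, 79.0]

slope, std_error, t_value = slope_t_value(heights, weights)
print(f"slope = {slope:.3f}, std. error = {std_error:.4f}, t = {t_value:.2f}")
```

A slope that is many standard errors away from zero produces a large t-value, as in the summary output discussed here.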

Variable p-value & Significance Stars and Codes

The p-value, shown as Pr(>|t|) in the model output, is the probability of observing a t-value at least as extreme as the one obtained if the true coefficient were zero, i.e. if the variable had no relationship with the response. A small p-value therefore indicates that the observed relationship between a predictor (say, height) and the response (weight) is unlikely to be due to chance alone. Generally, a p-value of 5% or less is used as the cut-off point.

In our example, the p-values for height and calorie intake are very close to zero (indicated by ‘***’ in the table), suggesting that a significant relationship is likely to exist between height, calorie intake and weight, with coefficient estimates that differ from zero.
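For readers curious about where Pr(>|t|) comes from, the sketch below approximates the two-sided tail area of Student's t distribution by integrating its density with Simpson's rule. It uses only the standard library and is meant as an illustration under those assumptions. In practice a statistics package would supply this value, and `two_sided_p_value` is a hypothetical helper name.

```python
import math


def t_density(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p_value(t_value, df, steps=2000):
    """Approximate Pr(>|t|) using Simpson's rule (steps must be even)."""
    upper = abs(t_value)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_density(0.0, df) + t_density(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    # The density is symmetric, so double the area between 0 and |t|.
    central_area = 2 * (h / 3) * total
    return max(0.0, 1.0 - central_area)


# 2.228 is the familiar two-sided 5% critical value for 10 degrees of freedom.
p_at_critical = two_sided_p_value(2.228, df=10)
# A t-value far from zero gives a p-value very close to zero.
p_far_from_zero = two_sided_p_value(10.0, df=95)

print(f"t = 2.228, df = 10: p = {p_at_critical:.4f}")
print(f"t = 10.0,  df = 95: p = {p_far_from_zero:.2e}")
```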

F-statistic, Degrees of Freedom and Resulting p-value

These metrics evaluate the overall fit of the model to the data. The F-statistic indicates whether there is a relationship between the dependent variable and the independent variables taken together. The further the F-statistic is above 1, the stronger the evidence that such a relationship exists.

To explain Degrees of Freedom, consider a scenario where we know 9 of 10 data points and the mean of all 10. We have no freedom to choose the value of the 10th observation, because it can be calculated as (mean * 10 – sum of the 9 known observations). One piece of information is used up by the mean, which leaves 9 degrees of freedom (d.f.): only 9 of the 10 values are free to vary.
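The arithmetic in this scenario is easy to check with a few lines of code. The nine values and the mean below are arbitrary illustrative numbers.

```python
# Nine known observations and the known mean of all ten (illustrative values).
known_values = [4.0, 7.0, 5.0, 9.0, 6.0, 8.0, 3.0, 5.0, 7.0]
mean_of_ten = 6.0

# The tenth value is fully determined: mean * 10 - sum of the nine known values.
tenth_value = mean_of_ten * 10 - sum(known_values)
print(f"The 10th observation must be {tenth_value}")

# Sanity check: the completed data set reproduces the stated mean.
all_values = known_values + [tenth_value]
print(f"Mean of all ten values: {sum(all_values) / len(all_values)}")
```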

In our example, the F-statistic is 98.53, which is much larger than 1 for a dataset of 100 observations. The degrees of freedom are 4 (the number of variables used in the model, 5 including the Intercept, minus 1) and 95 (the number of observations in the dataset, 100, minus the number of variables used in the model, 5). The p-value is also low and close to 0. Together, the large F value and the small p-value indicate that the model as a whole is significant.
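For a least-squares model with an intercept, the overall F-statistic and R2 are linked by F = (R2 / df1) / ((1 - R2) / df2), where df1 and df2 are the two degrees of freedom. The sketch below uses that identity to cross-check the figures quoted above; the function names are hypothetical.

```python
def f_statistic(r2, df_model, df_residual):
    """Overall F-statistic implied by R2 and the two degrees of freedom."""
    return (r2 / df_model) / ((1 - r2) / df_residual)


def r2_from_f(f_value, df_model, df_residual):
    """Invert the identity above to recover R2 from the F-statistic."""
    return f_value * df_model / (f_value * df_model + df_residual)


# Figures quoted in the text: F = 98.53 on 4 and 95 degrees of freedom.
implied_r2 = r2_from_f(98.53, df_model=4, df_residual=95)
round_trip_f = f_statistic(implied_r2, df_model=4, df_residual=95)

print(f"Implied R2 = {implied_r2:.4f}")
print(f"F recomputed from that R2 = {round_trip_f:.2f}")
```

The implied R2 comes out at roughly 0.81, in line with the ~80% of variance explained reported earlier.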

Closing Thoughts

It is often trickier to spot a bad model than to identify and select a good one.

Multiple regression analysis is not only the most widely used tool but also the most abused one. Sensible use of linear regression requires one to check for errors in the variables; treat outliers and missing values; validate the underlying assumptions for any violations; determine the goodness of fit and accuracy of the model through statistical tests; and deal with potential problems in the model, including the difficulty of rigorously evaluating the quality and robustness of the fit. Linear regression is important because it is the baseline model that many analysts use for comparison with more complex models when generating insights from data.