A Brief Primer on Linear Regression – Part 2
This second part of an introduction to linear regression moves past the topics covered in the first to discuss linearity, normality, outliers, and other topics of interest.
Absence of Significant Outliers Among Variables
There should be no significant outliers among the IVs or on the DV. Outliers are points that lie outside the overall pattern of the data. Removing these influential observations can change the regression equation considerably and may improve the correlation.
Potential outliers can be identified from plots of each IV and of the DV, as in the weight-height example.
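As a rough illustration, the snippet below flags potential outliers in a made-up height (IV) and weight (DV) sample using the common 1.5 x IQR rule; the data, variable names, and cut-off are assumptions for the sketch, not values from the example above.

```python
# A minimal sketch: flag potential outliers with the 1.5 * IQR rule.
# The height/weight values below are invented for illustration.
import numpy as np

height = np.array([150, 152, 155, 160, 163, 165, 168, 170, 172, 210])  # cm (IV)
weight = np.array([52, 54, 55, 58, 60, 61, 64, 66, 68, 150])           # kg (DV)

def iqr_outliers(x):
    """Return a boolean mask marking values outside 1.5 * IQR of the quartiles."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return (x < lower) | (x > upper)

print("Potential height outliers:", height[iqr_outliers(height)])
print("Potential weight outliers:", weight[iqr_outliers(weight)])
```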
The remedial measures for treating outliers could be:
- An outlier in a particular IV can be handled by deleting the entire observation, by treating the extreme value as missing and then applying a missing-value method, or by retaining the observation while reducing its extremity, i.e. assigning the variable a high value that is not too different from the rest of the cluster of scores.
- Outliers on the DV can be identified readily on residual plots, since they are cases with very large positive or negative residuals (errors). In practice, standardized residual values greater than 3.3 in absolute value (i.e. above 3.3 or below -3.3) are considered outliers, as in the sketch after this list.
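Here is a minimal sketch of that check, assuming a simple least-squares fit with NumPy on simulated data; the 3.3 cut-off is the one quoted above, everything else is illustrative.

```python
# A rough sketch of flagging DV outliers via standardized residuals.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(150, 190, 30)                       # heights in cm (IV)
y = 0.9 * x - 80 + rng.normal(0, 2, size=x.size)    # weights in kg (DV)
y[-1] += 60                                          # inject one extreme DV value

slope, intercept = np.polyfit(x, y, 1)               # fit y = slope * x + intercept
residuals = y - (slope * x + intercept)              # observed minus predicted
standardized = (residuals - residuals.mean()) / residuals.std(ddof=1)

print("Cases with |standardized residual| > 3.3:",
      np.where(np.abs(standardized) > 3.3)[0])
```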
Normality, Linearity, Homoscedasticity and Independence of Residuals
Residuals are the errors in prediction: the difference between observed and predicted DV scores.
These characteristics of the residuals reflect the nature of the underlying relationship between the variables and can be examined through residual scatter-plots.
The residual scatter-plots allow you to check:
- Normality: The residuals should be normally distributed, though in practice a distribution of errors that is close to normal is acceptable.
The normality of errors can be gauged through:
(i) Histogram of errors: should be mound-shaped around 0.
(ii) Normal probability plot (Q-Q plot): a scatter-plot created by plotting two sets of quantiles (often termed "percentiles") against one another. For example, the 0.3 (or 30%) quantile is the point below which 30% of the data fall and above which the remaining 70% fall. A Q-Q plot helps us assess whether a dataset plausibly came from some theoretical distribution, such as the Normal.
(iii) Statistical tests such as the correlation test or the Shapiro-Wilk test.
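The short sketch below runs all three checks on simulated stand-in residuals: a histogram, a normal Q-Q plot via SciPy's probplot, and the Shapiro-Wilk test. The residuals are generated only for illustration, not taken from the article's example.

```python
# A hedged sketch of the three normality checks on residuals.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(1)
residuals = rng.normal(0, 2, size=100)                 # stand-in residuals

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
axes[0].hist(residuals, bins=15)                       # (i) mound-shaped around 0
axes[0].set_title("Histogram of residuals")
stats.probplot(residuals, dist="norm", plot=axes[1])   # (ii) points should hug the line
axes[1].set_title("Normal Q-Q plot")
plt.tight_layout()
plt.show()

stat, p_value = stats.shapiro(residuals)               # (iii) H0: residuals are normal
print(f"Shapiro-Wilk: W = {stat:.3f}, p = {p_value:.3f}")
```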
- Linearity: The residual plot should show a random scatter of points. A non-random pattern suggests that a linear model is inappropriate and that the data may require a transformation of the response or predictor variables, or the addition of a quadratic or higher-order term to the equation.
In the residual plots above, the first shows a pattern, i.e. the relationship between the IVs and DV is not linear. The results of the regression analysis would therefore underestimate the true relationship.
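To make the linearity check concrete, here is a small sketch on simulated, deliberately curved data: residuals from a straight-line fit show a clear pattern, which largely disappears once a quadratic term is added. The data and coefficients are invented for the example.

```python
# Residuals-vs-fitted plots before and after adding a quadratic term.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 80)
y = 1.0 + 0.5 * x + 0.3 * x**2 + rng.normal(0, 1, size=x.size)  # truly quadratic

linear_fit = np.poly1d(np.polyfit(x, y, 1))     # straight-line model
quad_fit = np.poly1d(np.polyfit(x, y, 2))       # adds the quadratic term

fig, axes = plt.subplots(1, 2, figsize=(8, 3), sharey=True)
axes[0].scatter(linear_fit(x), y - linear_fit(x), s=10)
axes[0].set_title("Linear model: curved pattern")
axes[1].scatter(quad_fit(x), y - quad_fit(x), s=10)
axes[1].set_title("Quadratic term added: random scatter")
for ax in axes:
    ax.axhline(0, linewidth=1)
    ax.set_xlabel("Fitted values")
axes[0].set_ylabel("Residual")
plt.tight_layout()
plt.show()
```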
- Homoscedasticity: A scatter-plot is a good way to check whether homoscedasticity holds (i.e. the error terms along the regression line have equal, constant variance across IV values).
Plots of homoscedastic and heteroscedastic data reveal either no pattern or a visible pattern, respectively.
Heteroscedasticity, i.e. non-constant variance of the errors, can lead to serious distortion of the findings, weaken the analysis, and increase prediction error. A non-linear transformation of the variables might fix this problem.
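One common way to formalize this visual check is the Breusch-Pagan test; the sketch below applies the statsmodels implementation to simulated data whose error spread fans out with the IV. The data are illustrative assumptions, not the article's example.

```python
# A hedged sketch of testing for heteroscedasticity with the Breusch-Pagan test.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(3)
x = np.linspace(1, 10, 200)
y = 5 + 2 * x + rng.normal(0, 0.5 * x)   # error spread grows with x (fanning pattern)

X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(fit.resid, X)
print(f"Breusch-Pagan LM statistic = {lm_stat:.2f}, p-value = {lm_pvalue:.4f}")
# A small p-value is evidence of non-constant error variance (heteroscedasticity).
```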
- Independence: The residuals should be independently distributed, i.e. there should be no correlation between consecutive errors. In other words, each error is independent of the values of the other errors.
A random pattern of errors, as in the plot above, indicates independence of the errors.
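A quick numerical check of independence is the Durbin-Watson statistic, sketched below on simulated residuals; values near 2 suggest uncorrelated errors, while values toward 0 or 4 point to positive or negative autocorrelation. The residuals here are stand-ins, not the article's data.

```python
# A minimal sketch of checking error independence with the Durbin-Watson statistic.
import numpy as np
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(4)
independent_resid = rng.normal(0, 1, size=200)               # no correlation
correlated_resid = np.cumsum(rng.normal(0, 1, size=200))     # strongly autocorrelated

print("DW, independent errors:", round(durbin_watson(independent_resid), 2))
print("DW, correlated errors :", round(durbin_watson(correlated_resid), 2))
```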
Closing Thoughts:
It is easy to get fascinated by the insights arising from your linear regression model, but you should force yourself to probe the validity of the key assumptions underlying it, so that the model can be applied to unseen or new data with similar results.
In the concluding part, we will learn how to build the regression model and interpret the model output to evaluate the quality of the model.
Original. Reposted with permission.