A Brief Primer on Linear Regression – Part 2
This second part of an introduction to linear regression moves past the topics covered in the first to discuss linearity, normality, outliers, and other topics of interest.
By Pushpa Makhija, CleverTap.
In the first part, we had discussed that the main task for building a multiple linear regression model is to fit a straight line through a scatter-plot of data points in multidimensional space, that best estimates the observed trend.
While building models to analyze the data, the foremost challenge is, the correct application of the techniques– how well analysts can apply the techniques to formulate appropriate statistical models to solve real problems.
Furthermore, before proceeding to analyze the data using multiple regression, part of the process encompasses to ensure that data you want to analyze, can actually be analyzed using multiple regression. Therefore, it is only appropriate to use multiple regression if you understand the key assumptions underlying regression analysis and check whether your data “passes” the required assumptions to give a valid result.
Usually, it’s plausible for one or more of the assumptions being violated, while analyzing real-world data. Even when the data fails certain assumptions, there is often a solution to overcome this. First, let’s look at the assumptions, and then learn how to check / validate the assumptions and also discuss about the proposed solutions for correcting these violations, if any.
We would be using IVs for independent variables and DV for dependent variable interchangeably while going through listing and validating assumptions, exploring data, building the model and interpreting the model output.
Assumptions of Regression:
Number of Cases/Sample Size
When conducting regression analysis, the cases-to-Independent Variables ratio should ideally be 20 cases for every independent variable in the model. For instance, the simplest case with two IVs – would require that n>40. However, for qualitative i.e. categorical variables with many levels of values, we might require more than ideal 20 cases for this variable to have sufficient data points for each level of categorical variable.
In this age of Big Data, we don’t need to worry about dealing with small samples. But, this assumption violation does result in generalizability issue of not being able to apply the model’s valuable insights and recommendations to other similar samples or situations.
Type of the Variables
The dependent variable should be measured on a continuous scale (i.e. an interval or ratio variable). Examples include revision time (measured in hours), intelligence (measured using IQ score), exam performance (measured from 0 to 100), weight (measured in kg or pounds), and so on.
- Examples of ordinal variables include Likert items – a 7-point scale from “strongly agree” to “strongly disagree” or other way of ranking categories – a 3-point scale to explain the liking of the product, from “Yes”, “No” and “May be”.
- Examples of nominal variables include gender (2 groups: male and female), ethnicity (3 groups: Caucasian, African-American and Hispanic), physical activity level (5 groups: sedentary, slightly active, moderately active, active, and extremely active), profession (5 groups: surgeon, doctor, nurse, dentist, therapist) and so forth.
Revisiting our weight–height example, we notice two of the independent variables to be continuous and one as categorical – exercise level with 3 levels. Hence, for carrying out regression analysis, we need to create new variable(s) or recode the categorical variable – exercise level – into numerical values as the regression algorithm doesn’t work with non-numeric variables. Exercise level for each person can be recoded as (1=Sedentary, 2=Moderately Active, 3=Very Active) based on their lifestyle and attitude towards exercise.
Multiple regression technique does not test whether the data is linear. Instead, it requires the existence of a linear relationship between – the dependent variable and each of the independent variables, and the dependent variable and the independent variables collectively (assessed from the model fit or from 3rdscatterplot as shown below).
The above plots help us to visually answer: Are the two variables linearly related? We infer that each of the IVs (height, calorie intake) in first 2 plots, plotted one at a time, with the dependent variable (weight) and even the last plot (effect of IVs collectively via Predicted values of DV) signifies linear relationship between the variables.
Multiple Regression Analysis requires that variables are normally distributed. In practice, the distribution of the variables, close to normal distribution is acceptable. There are various ways to check the normality assumption. Histogram is a quick way to check normality.
The above histogram plot includes a density curve that closely depicts the bell-shaped curve of normal distribution.
Absence of MultiCollinearity
Multicollinearity pertains to the relationship among IVs. exists when the IVs are highly correlated with each other or when one IV is a combination of one or more of the other IVs.
For example, when we look into the pricing of house flat, both the variables – area in square feet and area in square cm or square inches doesn’t contribute much to the price prediction as these 2 variables give the same information, though in a different way and are highly correlated as evident from the conversion formula.
As indicated in the above plot, height and calorie intake used for predicting weight reflects no discernible pattern i.e. the data demonstrates an absence of multicollinearity.
The other criteria that could be used to detect multicollinearity are Tolerance, Variance Inflation Factor (VIF), or Condition Index.