How Bad is Multicollinearity?
In regression you should always check for correlations between your predictor variables, because strong correlations will undermine your inference later. Thresholds vary: for some people anything below 60% is acceptable, while for others even a correlation of 30% to 40% is too high, because one variable may end up exaggerating the apparent performance of the model or badly distorting the parameter estimates. Multicollinearity fills your model with redundant variables and inflates the standard errors of the regression coefficients.
- Based on domain knowledge, you may believe a certain predictor should have a certain type of relationship with your dependent variable. If you fit a regression model and see results to the contrary, check for correlations between your independent variables. Multicollinearity can affect both the sign of a coefficient (i.e. positive or negative) and its magnitude.
- When adding or deleting a variable, the regression coefficients can change dramatically if multicollinearity is present.
- Using pairwise correlation, you can inspect the correlations between your variables in tabular or visual format. There is no universal cutoff, but a common heuristic is that multicollinearity is a concern when absolute correlations exceed 0.5. In real-life situations, even 0.5 may be too high and the threshold may have to be lowered.
- Using the Variance Inflation Factor (VIF): a VIF of 1 indicates no multicollinearity, and values above 1 indicate some degree of it; a common rule of thumb treats VIF > 5 (or 10) as problematic. The VIF only flags which coefficients are inflated by correlated predictors; the decision to remove variables is in the user's hands. Because the VIF is independent of the scale of the variables, high values can be compared directly across predictors.
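Both diagnostics above can be sketched in Python with pandas and NumPy. The data and variable names here are invented for illustration; the VIFs are read off the diagonal of the inverse of the predictor correlation matrix, which is equivalent to the usual 1/(1 - R²) definition:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)  # nearly duplicates x1
x3 = rng.normal(size=n)                  # independent of the others
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

# Pairwise correlations in tabular form
corr = X.corr()
print(corr.round(2))

# VIF_j is the j-th diagonal entry of the inverse of the
# predictor correlation matrix
vif = pd.Series(np.diag(np.linalg.inv(corr.values)), index=X.columns)
print(vif.round(1))
```

Here the near-duplicate pair x1/x2 shows a pairwise correlation close to 1 and very large VIFs, while the independent x3 has a VIF near 1.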
Types of Multicollinearity
There are 2 types of multicollinearity:
- Data-based Multicollinearity - This is the most common type. It occurs when, after sampling and observing the variables, we find during analysis that some or all of the variables are correlated with each other to some extent. It is typically seen in observational studies.
- Structural Multicollinearity - This occurs when we create new features from the data itself rather than from the data actually sampled. For example, when you square one of your variables, or combine variables arithmetically to make a new one, there will be some correlation between the new variable and the originals.
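A minimal sketch of structural multicollinearity, using invented synthetic data: squaring a variable produces a term strongly correlated with the original, and centering the variable before squaring largely removes that correlation for roughly symmetric distributions:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(1, 5, size=1000)

# Raw squared term is strongly correlated with the original variable
x_sq = x ** 2
r_raw = np.corrcoef(x, x_sq)[0, 1]

# Centering x before squaring largely removes the correlation
# when the distribution is roughly symmetric
xc = x - x.mean()
r_centered = np.corrcoef(xc, xc ** 2)[0, 1]
print(round(r_raw, 2), round(r_centered, 2))
```

Centering is a common remedy for this kind of multicollinearity because it changes only the coding of the model, not its fit.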
Multicollinearity is not a cause for concern when:
- The correlated variables are control variables
- Control variables are held constant throughout the entire experiment so that the effect of the actual variables of interest on the dependent variable can be observed. Because control variables are not of interest to the user, correlations among them will not have a major effect on inference
- Specifying a regression model with a term that is derived arithmetically from another variable is expected to produce a high degree of multicollinearity. The p-values for the new terms are not affected in this case, so if the user understands how to interpret this scenario, it is not a great cause for concern.
- Dummy-coded categorical variables can also induce multicollinearity. If there were three categories with proportions of 50%, 40%, and 10%, and the 10% category were used as the reference category, there is likely to be a degree of multicollinearity that affects the model. In this case it is best to use a category with a higher proportion as the reference category. In cases with more categories, dropping some sparse categories may need to be considered.
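The reference-category effect can be illustrated with a small synthetic sketch (the category labels and 50/40/10 proportions are invented to match the example above): dummy-coding with the rare 10% category as the reference inflates the VIFs of the remaining dummies, while using the most common category keeps them near 1:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
cats = rng.choice(["A", "B", "C"], size=2000, p=[0.5, 0.4, 0.1])
df = pd.DataFrame({"cat": cats})

def dummy_vif(reference):
    # Dummy-code with the given reference category dropped,
    # then read VIFs off the inverse of the dummy correlation matrix
    d = pd.get_dummies(df["cat"]).drop(columns=reference).astype(float)
    corr = d.corr().values
    return pd.Series(np.diag(np.linalg.inv(corr)), index=d.columns)

print(dummy_vif("C"))  # rare (10%) reference: inflated VIFs
print(dummy_vif("A"))  # common (50%) reference: VIFs near 1
```

With the rare category dropped, the two remaining dummies are strongly negatively correlated, which is exactly what inflates their VIFs.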
Handling Multicollinearity in R and Python
- From Data Pre-processing to Optimizing a Regression Model Performance
- How do you check the quality of your regression model in Python?
- Regression Analysis: A Primer