Modeling Price with Regularized Linear Model & XGBoost
We are going to implement regularization techniques for linear regression of house pricing data. Our goal in price modeling is to model the pattern and ignore the noise.
By Susan Li, Sr. Data Scientist
We would like to model the price of a house, we know that the price depends on the location of the house, square footage of a house, year built, year renovated, number of bedrooms, number of garages, etc. So those factors contribute to the pattern — premium location would typically lead to a higher price. However, all houses within the same area and have same square footage do not have the exact same price. The variation in price is the noise. Our goal in price modeling is to model the pattern and ignore the noise. The same concepts apply to modeling hotel room prices too.
Therefore, to start, we are going to implement regularization techniques for linear regression of house pricing data.
There is an excellent house prices data set can be found here.
The good news is that we have many features to play with (81), the bad news is that 19 features have missing values, and 4 of them have over 80% missing values. For any feature, if it is missing 80% of values, it can’t be that important, therefore, I decided to remove these 4 features.
Target feature distribution
The target feature - SalePrice is right skewed. As linear models like normally distributed data , we will transform SalePrice and make it more normally distributed.
Correlation between numeric features
There exists strong correlations between some of the features. For example, GarageYrBlt and YearBuilt, TotRmsAbvGrd and GrLivArea, GarageArea and GarageCars are strongly correlated. They actually express more or less the same thing. I will let ElasticNetCV to help reduce redundancy later.
Correlation between SalePrice and the other numeric features
The correlation of SalePrice with OverallQual is the greatest (around 0.8). Also GrLivArea presents a correlation of over 0.7, and GarageCars presents a correlation of over 0.6. Let’s look at these 4 features in more detail.
- Log transform features that have a highly skewed distribution (skew > 0.75)
- Dummy coding categorical features
- Fill NaN with the mean of the column.
- Train and test sets split.
- Ridge and Lasso regression are regularized linear regression models.
- ElasticNet is essentially a Lasso/Ridge hybrid, that entails the minimization of an objective function that includes both L1 (Lasso) and L2(Ridge) norms.
- ElasticNet is useful when there are multiple features which are correlated with one another.
- The class ElasticNetCV can be used to set the parameters
l1_ratio(ρ) by cross-validation.
- ElasticNetCV: ElasticNet model with best model selection by cross-validation.
Let’s see what ElasticNetCV is going to select for us.
0< The optimal l1_ratio <1 , indicating the penalty is a combination of L1 and L2, that is, the combination of Lasso and Ridge.
The RMSE here is actually RMSLE ( Root Mean Squared Logarithmic Error). Because we have taken the log of the actual values. Here is a nice write up explaining the differences between RMSE and RMSLE.
A reduction of 58.91% features looks productive. The top 4 most important features selected by ElasticNetCV are Condition2_PosN, MSZoning_C(all), Exterior1st_BrkComm & GrLivArea. We are going to see how these features compare with those selected by Xgboost.
The first Xgboost model, we start from default parameters.
It is already way better an the model selected by ElasticNetCV!
The second Xgboost model, we gradually add a few parameters that suppose to add model’s accuracy.
There was again an improvement!
The third Xgboost model, we add a learning rate, hopefully it will yield a more accurate model.
Unfortunately, there was no improvement. I concluded that xgb_model2 is the best model.
The top 4 most important features selected by Xgboost are LotArea, GrLivArea, OverallQual & TotalBsmtSF.
There is only one feature GrLivArea was selected by both ElasticNetCV and Xgboost.
So now we are going to select some relevant features and fit the Xgboost again.
Another small improvement!
Bio: Susan Li is changing the world, one article at a time. She is a Sr. Data Scientist, located in Toronto, Canada.
Original. Reposted with permission.
- Unveiling Mathematics Behind XGBoost
- Multi-Class Text Classification with Scikit-Learn
- An Introduction on Time Series Forecasting with Simple Neural Networks & LSTM