Modeling Price with Regularized Linear Model & XGBoost

We are going to implement regularization techniques for linear regression of house pricing data. Our goal in price modeling is to model the pattern and ignore the noise.

By Susan Li, Sr. Data Scientist

We would like to model the price of a house, we know that the price depends on the location of the house, square footage of a house, year built, year renovated, number of bedrooms, number of garages, etc. So those factors contribute to the pattern — premium location would typically lead to a higher price. However, all houses within the same area and have same square footage do not have the exact same price. The variation in price is the noise. Our goal in price modeling is to model the pattern and ignore the noise. The same concepts apply to modeling hotel room prices too.

Therefore, to start, we are going to implement regularization techniques for linear regression of house pricing data.

The Data

There is an excellent house prices data set can be found here.

import warnings
def ignore_warn(*args, **kwargs):
pass
warnings.warn = ignore_warn
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.stats import norm, skew
from sklearn import preprocessing
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.linear_model import ElasticNetCV, ElasticNet
from xgboost import XGBRegressor, plot_importance
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import StratifiedKFold
pd.set_option('display.float_format', lambda x: '{:.3f}'.format(x))
df.shape

Figure 1

(df.isnull().sum() / len(df)).sort_values(ascending=False)[:20]

Figure 2

The good news is that we have many features to play with (81), the bad news is that 19 features have missing values, and 4 of them have over 80% missing values. For any feature, if it is missing 80% of values, it can’t be that important, therefore, I decided to remove these 4 features.

df.drop(['PoolQC', 'MiscFeature', 'Alley', 'Fence', 'Id'], axis=1, inplace=True)

Explore Features

Target feature distribution

sns.distplot(df['SalePrice'] , fit=norm);

# Get the fitted parameters used by the function
(mu, sigma) = norm.fit(df['SalePrice'])
print( '\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))

# Now plot the distribution
plt.legend(['Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],
loc='best')
plt.ylabel('Frequency')
plt.title('Sale Price distribution')

# Get also the QQ-plot
fig = plt.figure()
res = stats.probplot(df['SalePrice'], plot=plt)
plt.show();

Figure 3

The target feature - SalePrice is right skewed. As linear models like normally distributed data , we will transform SalePrice and make it more normally distributed.

sns.distplot(np.log1p(df['SalePrice']) , fit=norm);

# Get the fitted parameters used by the function
(mu, sigma) = norm.fit(np.log1p(df['SalePrice']))
print( '\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))

# Now plot the distribution
plt.legend(['Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],
loc='best')
plt.ylabel('Frequency')
plt.title('log(Sale Price+1) distribution')

# Get also the QQ-plot
fig = plt.figure()
res = stats.probplot(np.log1p(df['SalePrice']), plot=plt)
plt.show();

Figure 4

Correlation between numeric features

pd.set_option('precision',2)
plt.figure(figsize=(10, 8))
sns.heatmap(df.drop(['SalePrice'],axis=1).corr(), square=True)
plt.suptitle("Pearson Correlation Heatmap")
plt.show();

Figure 5

There exists strong correlations between some of the features. For example, GarageYrBlt and YearBuilt, TotRmsAbvGrd and GrLivArea, GarageArea and GarageCars are strongly correlated. They actually express more or less the same thing. I will let ElasticNetCV to help reduce redundancy later.

Correlation between SalePrice and the other numeric features

corr_with_sale_price = df.corr()["SalePrice"].sort_values(ascending=False)
plt.figure(figsize=(14,6))
corr_with_sale_price.drop("SalePrice").plot.bar()
plt.show();

Figure 6

The correlation of SalePrice with OverallQual is the greatest (around 0.8). Also GrLivArea presents a correlation of over 0.7, and GarageCars presents a correlation of over 0.6. Let’s look at these 4 features in more detail.

sns.pairplot(df[['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars']])
plt.show();

Figure 7

Feature Engineering

• Log transform features that have a highly skewed distribution (skew > 0.75)
• Dummy coding categorical features
• Fill NaN with the mean of the column.
• Train and test sets split.

feature_engineering_price.py

ElasticNet

• Ridge and Lasso regression are regularized linear regression models.
• ElasticNet is essentially a Lasso/Ridge hybrid, that entails the minimization of an objective function that includes both L1 (Lasso) and L2(Ridge) norms.
• ElasticNet is useful when there are multiple features which are correlated with one another.
• The class ElasticNetCV can be used to set the parameters alpha (α) and l1_ratio (ρ) by cross-validation.
• ElasticNetCVElasticNet model with best model selection by cross-validation.

Let’s see what ElasticNetCV is going to select for us.

ElasticNetCV.py

Figure 8

0< The optimal l1_ratio <1 , indicating the penalty is a combination of L1 and L2, that is, the combination of Lasso and Ridge.

Model Evaluation

ElasticNetCV_evaluation.py

Figure 9

The RMSE here is actually RMSLE ( Root Mean Squared Logarithmic Error). Because we have taken the log of the actual values. Here is a nice write up explaining the differences between RMSE and RMSLE.

Feature Importance

feature_importance = pd.Series(index = X_train.columns, data = np.abs(cv_model.coef_))

n_selected_features = (feature_importance>0).sum()
print('{0:d} features, reduction of {1:2.2f}%'.format(
n_selected_features,(1-n_selected_features/len(feature_importance))*100))

feature_importance.sort_values().tail(30).plot(kind = 'bar', figsize = (12,5));

Figure 10

A reduction of 58.91% features looks productive. The top 4 most important features selected by ElasticNetCV are Condition2_PosNMSZoning_C(all)Exterior1st_BrkComm & GrLivArea. We are going to see how these features compare with those selected by Xgboost.

Xgboost

The first Xgboost model, we start from default parameters.

xgb_model1.py

Figure 11

It is already way better an the model selected by ElasticNetCV!

xgb_model2.py

Figure 12

There was again an improvement!

The third Xgboost model, we add a learning rate, hopefully it will yield a more accurate model.

xgb_model3.py

Figure 13

Unfortunately, there was no improvement. I concluded that xgb_model2 is the best model.

Feature Importance

from collections import OrderedDict
OrderedDict(sorted(xgb_model2.get_booster().get_fscore().items(), key=lambda t: t[1], reverse=True))

Figure 14

The top 4 most important features selected by Xgboost are LotAreaGrLivAreaOverallQual & TotalBsmtSF.

There is only one feature GrLivArea was selected by both ElasticNetCV and Xgboost.

So now we are going to select some relevant features and fit the Xgboost again.

xgb_model5.py

Figure 15

Another small improvement!

Jupyter notebook can be found on Github. Enjoy the rest of the week!

Bio: Susan Li is changing the world, one article at a time. She is a Sr. Data Scientist, located in Toronto, Canada.

Original. Reposted with permission.

Related: