Improving Zillow Zestimate with 36 Lines of Code

We built this project as a quick and easy way to leverage some of the amazing technologies that are being built by the data science community!

By Eduardo Ariño de la Rubia.

Zillow and Kaggle recently started a $1 million competition to improve the Zestimate. We are releasing a public Domino project that uses H2O’s AutoML to generate a solution.

The new Kaggle Zillow Prize competition received a significant amount of press, and for good reason. Zillow has put $1 million on the line if you can improve the accuracy of their Zestimate feature, which is Zillow’s estimate of the value of a home. As they state in the contest description, improving this estimate can more accurately reflect the value of the nearly 110 million homes in the US!

We built this project as a quick and easy way to leverage some of the amazing technologies that are being built by the data science community! The project contains a script, take_my_job.R, which uses the amazing H2O AutoML framework.

H2O’s machine learning library is an industry leader, and their latest foray into bringing AI to the masses is the AutoML functionality. With a single function call, it trains many models in parallel, ensembles them together, and builds a powerful predictive model.

The script is just 36 lines:

properties_file = file.path(data_path, "properties_2016.csv")
train_file = file.path(data_path, "train_2016.csv")
properties = fread(properties_file, header=TRUE, stringsAsFactors=FALSE,
                   colClasses = list(character = 50))
train      = fread(train_file)
properties_train = merge(properties, train, by="parcelid",all.y=TRUE)

In these first 12 lines, we set up our environment and import the data as R data.table objects. Line 4 uses Domino’s environment variable functionality so that no paths have to be hardcoded in the script, as hardcoded paths often cause significant challenges.
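The opening lines of the script are not reproduced above, so what follows is only a minimal sketch of the environment-variable pattern. The variable name ZILLOW_DATA_PATH is a hypothetical placeholder, not the name used in the actual project, and the fallback to the current directory is likewise an assumption.

```r
# Hypothetical variable name; the real project may use a different one.
# Sys.getenv() returns the `unset` value when the variable is not defined,
# so the script falls back to the current directory.
data_path = Sys.getenv("ZILLOW_DATA_PATH", unset = ".")

# Build the file locations relative to that base path.
properties_file = file.path(data_path, "properties_2016.csv")
train_file      = file.path(data_path, "train_2016.csv")

print(properties_file)
print(train_file)
```

With this pattern, pointing the script at a different data location only requires changing the environment variable, not the code.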

On line 12, we are creating the training set by merging the properties file with the training dataset, which contains the logerror column we will be predicting.
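To make the effect of all.y=TRUE concrete, here is a small sketch on toy data. The rows and the bedroomcnt values are made up for illustration, and plain data.frame objects are used so the example needs no extra packages; data.table’s merge method accepts the same by and all.y arguments.

```r
# Toy stand-ins for the two files; the values are made up for illustration.
properties = data.frame(parcelid = c(1, 2, 3), bedroomcnt = c(2, 3, 4))
train      = data.frame(parcelid = c(2, 3, 3), logerror = c(0.01, -0.02, 0.03))

# all.y = TRUE keeps every row of `train` (a right join): a parcel that
# appears twice in `train` appears twice in the result, and a parcel that
# is absent from `train` (parcelid 1 here) is dropped.
properties_train = merge(properties, train, by = "parcelid", all.y = TRUE)
print(properties_train)
```

Every row of the result therefore carries a logerror value to train on, alongside the property attributes for that parcel.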

h2o.init(nthreads = -1)
Xnames = names(properties_train)[which(names(properties_train)!="logerror")]
Y = "logerror"
dx_train = as.h2o(properties_train)
dx_predict = as.h2o(properties)
md = h2o.automl(x = Xnames, y = Y,
                stopping_metric = "RMSE",
                training_frame = dx_train,
                leaderboard_frame = dx_train)

This block of code is all it takes to leverage H2O’s AutoML infrastructure!

On line 14 we are initializing H2O to use as many threads as the machine has cores. Lines 16 and 17 are for setting up the names of the predictor and response variables. On lines 19 and 20 we upload our data.table objects to H2O (which could have been avoided with h2o.importFile in the first place). In lines 22-25 we are telling H2O to build us the very best model it can, using RMSE as the early stopping metric, on the training dataset.
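The predictor-name selection on line 16 can be written more compactly. The following is a minimal sketch on a toy data frame with made-up columns, showing that base R’s setdiff gives the same result as the which-based form used in the script. Note that both forms keep every column other than the response, identifiers included, exactly as the script does.

```r
# Toy frame with made-up values, standing in for properties_train.
df = data.frame(parcelid = 1:2, bedroomcnt = c(2, 3), logerror = c(0.01, -0.02))
Y = "logerror"

# Form used in the script: index the names that are not the response.
Xnames_original = names(df)[which(names(df) != "logerror")]

# Equivalent, more compact form: every column name except the response.
Xnames = setdiff(names(df), Y)

print(Xnames)
print(identical(Xnames, Xnames_original))
```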

properties_target = h2o.predict(md@leader, dx_predict)
predictions = round(as.vector(properties_target$predict), 4)
result = data.frame(cbind(properties$parcelid, predictions, predictions * .99,
                          predictions * .98, predictions * .97, predictions * .96,
                          predictions * .95))
colnames(result) = c("parcelid","201610","201611","201612","201710","201711","201712")
options(scipen = 999)
write.csv(result, file = "submission_automl.csv", row.names = FALSE )


Lines 27-36 are our final bit of prediction and book-keeping.

On line 27 we are predicting our responses using the leader model from the trained AutoML object. We then round the answer to four decimal places, build the result data.frame, set the names, and write it out.

The only bit of trickery that I added was to shrink the predicted logerror by a further 1% in each successive column, with the assumption that Zillow’s team is always making their models a little bit better.
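As a worked illustration of that shrinkage, the sketch below applies the same six factors to two made-up predictions. It uses outer() as a compact stand-in for the explicit cbind of six products in the script. The factors step down by one percentage point of the original prediction per column; they do not compound.

```r
# Made-up predictions, for illustration only.
predictions = c(0.0100, -0.0200)

# The same multipliers the script applies, one per submission column.
shrink = c(1, .99, .98, .97, .96, .95)

# outer() multiplies every prediction by every factor, giving one row per
# prediction and one column per submission period.
shrunk = outer(predictions, shrink)
colnames(shrunk) = c("201610", "201611", "201612", "201710", "201711", "201712")
print(shrunk)
```

Each row moves toward zero from left to right, whatever the sign of the prediction.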

With no input from me whatsoever, this package builds a model which provides a public leaderboard score of 0.0673569. Not amazing, but remarkable considering I haven’t even looked at the data. Bringing together H2O’s algorithms along with flexible scalable compute and easy environment configuration on Domino made this project quick and easy!

Try It Yourself

You are welcome to fork this public project, use it as a starting point, and manipulate it however you want. Both the code and the environment can be used on Domino with just a few clicks.

Unfortunately we are unable to provide you the data, per Kaggle’s strict rules. In order to use the data you will have to:

  1. Go to the Kaggle data page and download it.
  2. Upload it either to a Domino data project or right into your forked project.
  3. Modify line 4 of take_my_job.R to set the base path for your files. If you drop the data into your project, set the path to “./”. If you are using a data project, modify the environment variable I reference.

While hand-built solutions are scoring significantly better than this one on the Kaggle leaderboard, it’s still exciting that a fully automated solution does reasonably well. The future of fully automated data science is exciting, and we can’t wait to keep supporting the amazing tools the community develops!
Original. Reposted with permission.

Bio: Eduardo Ariño de la Rubia is a Chief Data Scientist at Domino Data Lab.