Doing Data Science: A Kaggle Walkthrough Part 2 – Understanding the Data
This is the second post in a fantastic 6 part series covering the process of data science, and the application of the process to a Kaggle competition. Read on for a great overview of practicing data science.
By Brett Romero, Open Data Kosovo.
This article on understanding the data is Part II in a series looking at data science and machine learning by walking through a Kaggle competition. Part I can be found here.
Continuing on the walkthrough of data science via a Kaggle competition entry, in this part we focus on understanding the data provided for the Airbnb Kaggle competition.
Reviewing the Data
In any process involving data, the first goal should always be understanding the data. This involves looking at the data and answering a range of questions including (but not limited to):
- What features (columns) does the dataset contain?
- How many records (rows) have been provided?
- What format is the data in (e.g. in what format are the dates provided, are there numerical values, what do the different categorical values look like)?
- Are there missing values?
- How do the different features relate to each other?
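As a rough illustration, most of these questions can be answered with a few pandas calls. The sketch below assumes pandas is being used (the text above does not name a library), and both the `summarize` helper and the tiny `toy` frame are hypothetical, made up purely for illustration.

```python
import pandas as pd

def summarize(df: pd.DataFrame) -> dict:
    """Collect quick answers to the review questions for a DataFrame."""
    return {
        "columns": list(df.columns),                 # which features exist
        "n_rows": len(df),                           # how many records
        "dtypes": df.dtypes.astype(str).to_dict(),   # what format each column is in
        "missing": df.isnull().sum().to_dict(),      # missing values per column
    }

# Tiny made-up frame purely for illustration; not competition data.
toy = pd.DataFrame({
    "age": [34.0, None, 27.0],
    "gender": ["FEMALE", "MALE", None],
})

summary = summarize(toy)
print(summary)
```

How features relate to each other usually needs more targeted checks (for example, cross-tabulating two columns), so it is left out of this helper.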
For this competition, Airbnb have provided 6 different files. Two of these files provide background information (countries.csv and age_gender_bkts.csv), while sample_submission_NDF.csv provides an example of how the submission file containing our final predictions should be formatted. The three remaining files are the key ones:
- train_users_2.csv – This dataset contains data on Airbnb users, including the destination countries. Each row represents one user, with the columns containing various information such as the users’ ages and when they signed up. This is the primary dataset that we will use to train the model.
- test_users.csv – This dataset also contains data on Airbnb users, in the same format as train_users_2.csv, except without the destination country. These are the users for which we will have to make our final predictions.
- sessions.csv – This supplementary dataset can be used to train the model and make the final predictions. It contains information about the actions (e.g. clicked on a listing, updated a wish list, ran a search etc.) taken by the users in both the testing and training datasets above.
With this information in mind, an easy first step in understanding the data is reviewing the information provided by the data provider – Airbnb. For this competition, the information can be found here. The main points (aside from the descriptions of the columns) are as follows:
- All the users in the data provided are from the USA.
- There are 12 possible outcomes of the destination country: ‘US’, ‘FR’, ‘CA’, ‘GB’, ‘ES’, ‘IT’, ‘PT’, ‘NL’, ‘DE’, ‘AU’, ‘NDF’ (no destination found), and ‘other’.
- ‘other’ means there was a booking, but in a country not included in the list, while ‘NDF’ means there was not a booking.
- The training and test sets are split by dates. In the test set, you will predict the destination country for all the new users with first activities after 7/1/2014.
- In the sessions dataset, the data only dates back to 1/1/2014, while the training dataset dates back to 2010.
After absorbing this information, we can start looking at the actual data. For now we will focus on the train_users_2.csv file only.
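A minimal sketch of loading the file and previewing a few records, assuming pandas. The `load_users` helper is hypothetical, and so that the example runs without the competition file, it reads a stand-in CSV whose rows are made up. Only the `age`, `date_first_booking` and `timestamp_first_active` column names come from this article; the `id` column is an assumption.

```python
import io
import pandas as pd

def load_users(source) -> pd.DataFrame:
    """Read a users CSV from a file path or a file-like object."""
    return pd.read_csv(source)

# In practice this would be: users = load_users("train_users_2.csv")
# Stand-in CSV with made-up rows so the sketch runs without the real file.
sample_csv = io.StringIO(
    "id,timestamp_first_active,date_first_booking,age\n"
    "u1,20090609231247,,\n"
    "u2,20100101120000,2010-08-02,38\n"
    "u3,20100102130000,,56\n"
)

users = load_users(sample_csv)
print(users.shape)    # (rows, columns)
print(users.head(3))  # preview the first three records
```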
Looking at the sample of three records above provides us with a few key pieces of information about this dataset. The first is that at least two columns have missing values – the age column and date_first_booking column. This tells us that before we use this data for training a model, these missing values need to be filled or the rows excluded altogether. These options will be discussed in more detail in the next part of this series.
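The two options mentioned here (filling or excluding) might look roughly like the following in pandas. This is only a sketch: the rows are made up, and filling with the median is an arbitrary choice for illustration, not necessarily the approach taken later in this series.

```python
import pandas as pd

# Made-up rows for illustration only; only the column names come from the text.
users = pd.DataFrame({
    "age": [None, 38.0, 56.0, 42.0],
    "date_first_booking": [None, "2010-08-02", None, "2012-09-08"],
})

# Count missing values in each column.
missing_counts = users.isnull().sum()
print(missing_counts)

# Option 1: exclude rows where age is missing.
dropped = users.dropna(subset=["age"])

# Option 2: fill missing ages; the median is an arbitrary choice for this sketch.
filled = users.assign(age=users["age"].fillna(users["age"].median()))
```

Dropping rows is simpler but discards data, while filling keeps every row at the cost of introducing values that were never observed.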
Secondly, most of the columns provided contain categorical data (i.e. the values represent one of some fixed number of categories). In fact 11 of the 16 columns provided appear to be categorical. Most of the algorithms that are used in classification do not handle categorical data like this very well, and so when it comes to the data transformation step, we will need to find a way to change this data into a form that is more suited for classification.
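One common way to make categorical data usable by classifiers is one-hot encoding, where each category becomes its own indicator column. The sketch below uses pandas' `get_dummies` on a made-up frame; the `gender` column name and its values are assumptions for illustration, and this is not necessarily the transformation used later in the series.

```python
import pandas as pd

# Made-up rows for illustration; "gender" and its values are assumed here.
toy = pd.DataFrame({
    "age": [34, 27, 45],
    "gender": ["FEMALE", "MALE", "FEMALE"],
})

# Replace the categorical column with one indicator column per category.
encoded = pd.get_dummies(toy, columns=["gender"])
print(encoded.columns.tolist())
```

Each row then has a true value in exactly one of the new `gender_*` columns, which gives the algorithm numeric-style inputs instead of text labels.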
Thirdly, the timestamp_first_active column looks to be a full timestamp, but in the format of a number. For example 20090609231247 looks like it should be 2009-06-09 23:12:47. This formatting will need to be corrected if we are to use the date values.
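As a sketch of how this conversion could be done, assuming pandas, the numeric value can be cast to a string and parsed with an explicit format. The example uses the value quoted above.

```python
import pandas as pd

# The example value from the text, stored as a number as in the raw file.
raw = pd.Series([20090609231247], name="timestamp_first_active")

# Cast to string, then parse as year, month, day, hour, minute, second.
parsed = pd.to_datetime(raw.astype(str), format="%Y%m%d%H%M%S")
print(parsed.iloc[0])  # 2009-06-09 23:12:47
```

Giving the format explicitly avoids relying on pandas guessing how a 14-digit number should be interpreted.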