Doing Data Science: A Kaggle Walkthrough Part 2 – Understanding the Data
This is the second post in a six-part series covering the process of data science and the application of that process to a Kaggle competition. Read on for an overview of practicing data science.
Now that we have gained a basic understanding of the data by looking at a few example records, the next step is to start looking at the structure of the data.
Country Destination Values
Arguably, the most important column in the dataset is the one the model will try to predict: country_destination. Looking at the number of records that fall into each category can provide some insight into how the model should be constructed, as well as pitfalls to avoid.
Looking at the breakdown of the data, one thing that immediately stands out is that almost 90% of users fall into two categories, that is, they are either yet to make a booking (NDF) or they made their first booking in the US. What’s more, breaking down these percentage splits by year reveals that the percentage of users yet to make a booking increases each year and reached over 60% in 2014.
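The breakdown described above can be reproduced with a few lines of pandas. What follows is a minimal sketch rather than the method used to produce the original figures: the function names are hypothetical, the rows are made-up toy data, and the training file name in the comment is an assumption.

```python
import pandas as pd

def destination_shares(users: pd.DataFrame) -> pd.Series:
    """Share of users in each country_destination category."""
    return users["country_destination"].value_counts(normalize=True)

def destination_shares_by_year(users: pd.DataFrame) -> pd.DataFrame:
    """Share of each destination within each account-creation year (rows sum to 1)."""
    year = pd.to_datetime(users["date_account_created"]).dt.year.rename("year")
    return pd.crosstab(year, users["country_destination"], normalize="index")

# Toy rows for illustration only. The real data would be loaded with
# something like pd.read_csv("train_users_2.csv") (file name assumed).
toy = pd.DataFrame({
    "date_account_created": ["2012-03-01", "2012-07-15", "2013-01-20",
                             "2013-05-02", "2014-02-11", "2014-06-30"],
    "country_destination": ["US", "NDF", "US", "FR", "NDF", "NDF"],
})

overall = destination_shares(toy)
by_year = destination_shares_by_year(toy)
print(overall)
print(by_year)
```

Normalizing by row makes the year-on-year shift in the NDF share easy to read directly from the table.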
For modeling purposes, this type of split means a couple of things. Firstly, the spread of categories has changed over time. Considering that our final predictions will be made against user data from July 2014 onwards, this change provides us with an incentive to focus on more recent data for training purposes, as it is more likely to resemble the test data.
Secondly, because the vast majority of users fall into two categories, there is a risk that a model that is too generalized (in other words, not sensitive enough) will select one of those two categories for every prediction. A key step will be making sure the training data contains enough information for the model to predict the other categories as well.
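One way to keep an eye on this risk later is to compare any model against a trivial baseline and to check how many of the categories its predictions actually use. The sketch below assumes the labels and predictions are available as pandas Series; both helper names are hypothetical and the values are toy data.

```python
import pandas as pd

def majority_baseline_accuracy(labels: pd.Series) -> float:
    """Accuracy achieved by predicting the most frequent label for every user."""
    # value_counts sorts in descending order, so the first entry is the largest share.
    return float(labels.value_counts(normalize=True).iloc[0])

def predicted_class_coverage(predictions: pd.Series, labels: pd.Series) -> float:
    """Fraction of the categories present in `labels` that appear in `predictions`."""
    seen = set(labels.dropna())
    return len(seen & set(predictions.dropna())) / len(seen)

# Toy labels, plus a "lazy" model that always predicts NDF.
labels = pd.Series(["NDF", "NDF", "NDF", "US", "US", "FR"])
lazy_predictions = pd.Series(["NDF"] * len(labels))

baseline = majority_baseline_accuracy(labels)
coverage = predicted_class_coverage(lazy_predictions, labels)
print(f"baseline accuracy: {baseline:.2f}, class coverage: {coverage:.2f}")
```

A model whose accuracy is close to the baseline and whose coverage is low is probably just echoing the dominant categories.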
Account Creation Dates
Let’s now move onto the date_account_created column to see how the values have changed over time.
Chart 1 provides clear evidence of the explosive growth of Airbnb, with new accounts growing by more than 10% per month on average. In the year to June 2014, the number of new accounts created was 125,884, a 132% increase over the year before.
But aside from showing how quickly Airbnb has grown, this data also provides another important insight: the majority of the training data comes from the latest two years. In fact, if we limited the training data to accounts created from January 2013 onwards, we would still be including over 70% of all the data. This matters because, referring back to the notes provided by Airbnb, if we want to use the data in sessions.csv we would be limited to data from January 2014 onwards. Again looking at the numbers, this means that even though the sessions.csv data covers only 11% of the time period (6 out of 54 months), it still covers over 30% of the training data, or 76,466 users.
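The cut-off calculations above are simple to check in pandas. This is a minimal sketch with hypothetical helper names and toy dates, not the original analysis; on the real data the two shares would be expected to come out at roughly the 70% and 30% quoted above.

```python
import pandas as pd

def share_created_since(users: pd.DataFrame, cutoff: str) -> float:
    """Fraction of accounts created on or after `cutoff` (an ISO date string)."""
    created = pd.to_datetime(users["date_account_created"])
    return float((created >= pd.Timestamp(cutoff)).mean())

def accounts_per_month(users: pd.DataFrame) -> pd.Series:
    """Number of new accounts per calendar month, in chronological order."""
    created = pd.to_datetime(users["date_account_created"])
    return created.dt.to_period("M").value_counts().sort_index()

# Toy dates for illustration only.
toy = pd.DataFrame({"date_account_created": [
    "2011-05-01", "2012-08-10", "2013-02-03", "2013-11-21",
    "2014-01-15", "2014-01-28", "2014-03-09", "2014-06-30",
]})

share_2013 = share_created_since(toy, "2013-01-01")
share_2014 = share_created_since(toy, "2014-01-01")
monthly = accounts_per_month(toy)
print(share_2013, share_2014)
print(monthly)
```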
Age
Looking at the breakdown by age, we can see a good example of another issue that anyone working with data (whether a Data Scientist or not) faces regularly – data quality issues. As can be seen from Chart 2, there are a significant number of users that have reported their ages as well over 100. In fact, a significant number of users reported their ages as over 1000.
So what is going on here? Firstly, it appears that a number of users have reported their birth year instead of their age. This would help to explain why there are a lot of users with ‘ages’ between 1924 and 1953. Secondly, we also see significant numbers of users reporting their age as 105 and 110. This is harder to explain but it is likely that some users intentionally entered their age incorrectly for privacy reasons. Either way, these values would appear to be errors that will need to be addressed.
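Before deciding how to handle these values, it helps to count how many fall into each group. The sketch below only labels the values and does not change them. The thresholds (100 as the oldest plausible age, 1900 as the point where a value looks like a calendar year) are assumptions chosen for illustration, and the ages are toy data.

```python
import pandas as pd

def classify_age(age: pd.Series,
                 max_plausible: int = 100,
                 min_year: int = 1900) -> pd.Series:
    """Label each age as plausible, implausible, looks_like_year or missing.

    Both thresholds are illustrative assumptions, not values from the dataset.
    """
    kind = pd.Series("plausible", index=age.index)
    kind[age > max_plausible] = "implausible"   # e.g. 105 or 110
    kind[age >= min_year] = "looks_like_year"   # e.g. 1949 entered instead of an age
    kind[age.isna()] = "missing"                # comparisons with NaN are False, so set last
    return kind

# Toy ages for illustration only.
ages = pd.Series([25, 34, 105, 110, 1949, 2014, None, 41])
counts = classify_age(ages).value_counts()
print(counts)
```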
Additionally, as we saw in the example data provided above, another issue with the age column is that sometimes age has not been reported at all. In fact, if we look across all the training data provided, we can see a large number of missing values in all years.
When we clean the data, we will have to decide what to do with these missing values.
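To see how the share of missing ages varies from year to year, the missing-value flag can be grouped by account-creation year. This is a minimal sketch on toy rows, with a hypothetical function name.

```python
import pandas as pd

def missing_age_share_by_year(users: pd.DataFrame) -> pd.Series:
    """Fraction of users with no reported age, per account-creation year."""
    year = pd.to_datetime(users["date_account_created"]).dt.year
    # The mean of a boolean flag is the share of True values in each group.
    return users["age"].isna().groupby(year).mean()

# Toy rows for illustration only.
toy = pd.DataFrame({
    "date_account_created": ["2012-01-05", "2012-09-17", "2013-03-02",
                             "2013-06-11", "2013-12-25", "2014-04-08"],
    "age": [30, None, None, None, 28, 45],
})

missing_share = missing_age_share_by_year(toy)
print(missing_share)
```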
First Device Type
Finally, the last column we will look at is the first_device_type column.
The interesting thing about the data in this column is how the types of devices used have changed over time. Windows users have increased significantly as a percentage of all users. iPhone users have tripled their share, while users on ‘Other/unknown’ devices have gone from the second largest group to less than 5% of users. Further, the majority of these changes occurred between 2011 and 2012, suggesting that there may have been a change in the way devices were classified.
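A table of device shares per year makes this kind of shift visible at a glance. The sketch below assumes the column is named first_device_type, as in the section heading; the function name is hypothetical and the rows are toy data rather than real counts.

```python
import pandas as pd

def device_share_by_year(users: pd.DataFrame) -> pd.DataFrame:
    """Share of each first device type within each account-creation year."""
    year = pd.to_datetime(users["date_account_created"]).dt.year.rename("year")
    shares = users["first_device_type"].groupby(year).value_counts(normalize=True)
    # One row per year, one column per device; devices absent in a year get 0.
    return shares.unstack(fill_value=0)

# Toy rows for illustration only.
toy = pd.DataFrame({
    "date_account_created": ["2011-02-01", "2011-04-12", "2011-08-30", "2011-11-05",
                             "2012-01-19", "2012-05-23", "2012-07-04", "2012-10-31"],
    "first_device_type": ["Other/Unknown", "Other/Unknown", "Mac Desktop", "iPhone",
                          "Windows Desktop", "Mac Desktop", "iPhone", "iPhone"],
})

device_shares = device_share_by_year(toy)
print(device_shares)
```

A sharp jump in one column between two adjacent years, as in the toy output, is the kind of pattern that would hint at a change in classification rather than a change in user behaviour.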
As with the other columns we have reviewed above, this change over time reinforces the presumption that recent data is likely to be the most useful for building our model.
Although we have not covered all of the columns here, having some understanding of all the data provided in a dataset is important for building an accurate classification model. In some cases, this may not be possible because there are a very large number of columns, or because the data has been abstracted (that is, converted into a different form). However, in this particular case, the number of columns is relatively small and the information is easily understandable.
Now that we have taken the first step – understanding the data – in the next piece, we will start cleaning the data to get it into a form that will help to optimize the model’s performance.
Bio: Brett Romero is a data analyst with experience working in a number of countries and industries, including government, management consulting and finance. Currently based in Pristina, Kosovo, he is working as a data consultant with development agencies such as UNDP and Open Data Kosovo.
Original. Reposted with permission.