Doing Data Science: A Kaggle Walkthrough Part 5 – Adding New Data

Here is part 5 of the weekly 6 part series on doing data science in the context of a Kaggle competition, which concentrates on adding in new data.

By Brett Romero, Open Data Kosovo.

This article is Part V in a series looking at data science and machine learning by walking through a Kaggle competition. If you have not done so already, you are strongly encouraged to go back and read the earlier parts – (Part IPart IIPart III and Part IV).

Continuing on the walkthrough, in this part we take the data from sessions.csv that we left aside initially and add it to the transformed and expanded data from Part IV.  This part will cover, in brief, all the steps in Parts II – IV.

The Matrix

Understanding the Data

As we did for the user data in training.csv, the first step here is to understand what the data insessions.csv looks like. Although this file, with over 10 million rows, is too large to display in entirety in Excel[1], we can still open the file using Excel to get an understanding of what columns we have and what at least the first million rows of data looks like. Some sample rows are provided below:

Table 1

As can be seen, the dataset contains records of user actions, with each row representing one action a user took. Every time a user reviewed search results, updated a wish list or updated their account information, a new row was created in this dataset. Although this data is likely to be very useful for our goal of predicting which country a user will make their first booking in, it also complicates the process of combining this data with the data from training.csv, as it will have to be aggregated so that there is one row per user (as opposed to many rows for each user, currently).

Aside from details of the actions taken, there are a couple of interesting fields in this data. The first is device_type – this field contains the type of device used for the specified action. The second interesting field is the secs_elapsed field. This shows us how long (in seconds) was spent on a particular action.

Both of these fields provide us with potentially important information that could help to more accurately predict which country a user will make a first booking in. For example, it is not difficult to imagine that people spending relatively little time to make a booking on a phone are likely to be making bookings in locations closer to home (i.e. the US) than someone spending more time to make a booking on a desktop computer. Of course this is just a theory that needs to be proven, but it is a good reason to ensure we are capturing this information in our final training dataset.

Cleaning and Transforming the Data

Now that we have a basic understanding of the data, we need to undertake the cleaning and transformation steps. Because of the structure of this data (and for the sake of brevity), we are going to do both of these things at the same time.

The first step is to import the data:

# Import sessions data
s_filepath = "./sessions.csv"
sessions = pd.read_csv(s_filepath, header=0, index_col=False)

Extract the primary and secondary devices for each user

Remembering that we need to get the final data into a format that can be merged with the data created in Part IV (i.e. a dataset where one row equals one user), the first piece of information we are going to extract is the primary and secondary device for each user. How do we determine what a user’s primary and secondary devices are? We look at how much time they spent on each device. In short we are going to make the following changes to the data:


One thing to note as we make these transformations is that by aggregating the data this way, we are also implicitly removing the missing values. The code to do this transformation is shown below:

# Determine primary device
print("Determing primary device...")
sessions_device = sessions.loc[:, ['user_id', 'device_type', 'secs_elapsed']]
aggregated_lvl1 = sessions_device.groupby(['user_id', 'device_type'], as_index=False, sort=False).aggregate(np.sum)
idx = aggregated_lvl1.groupby(['user_id'], sort=False)['secs_elapsed'].transform(max) == aggregated_lvl1['secs_elapsed']
df_primary = pd.DataFrame(aggregated_lvl1.loc[idx , ['user_id', 'device_type', 'secs_elapsed']])
df_primary.rename(columns = {'device_type':'primary_device', 'secs_elapsed':'primary_secs'}, inplace=True)
df_primary = convert_to_binary(df=df_primary, column_to_convert='primary_device')
df_primary.drop('primary_device', axis=1, inplace=True)

# Determine Secondary device
print("Determing secondary device...")
remaining = aggregated_lvl1.drop(aggregated_lvl1.index[idx])
idx = remaining.groupby(['user_id'], sort=False)['secs_elapsed'].transform(max) == remaining['secs_elapsed']
df_secondary = pd.DataFrame(remaining.loc[idx , ['user_id', 'device_type', 'secs_elapsed']])
df_secondary.rename(columns = {'device_type':'secondary_device', 'secs_elapsed':'secondary_secs'}, inplace=True)
df_secondary = convert_to_binary(df=df_secondary, column_to_convert='secondary_device')
df_secondary.drop('secondary_device', axis=1, inplace=True)</div>