Doing Data Science: A Kaggle Walkthrough Part 5 – Adding New Data
Here is Part 5 of the weekly six-part series on doing data science in the context of a Kaggle competition; this installment concentrates on adding new data.
Determine Counts of Actions
The next thing we are going to do is take counts of how many times each action was taken by each user. This is a two-step process. The first step is to determine the count of each action type for each user:
Step 1: for each user, count how many times each value appears in the action column (one row per user and action combination).
Step 2: pivot those counts so that each user becomes one row, with one column per action value.
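As a rough sketch of what these two steps look like in pandas (using a small made-up frame for illustration, not the actual competition data), the idea is:

import pandas as pd

# Made-up toy sessions data - for illustration only
toy = pd.DataFrame({
    'user_id': ['u1', 'u1', 'u1', 'u2', 'u2'],
    'action':  ['search', 'search', 'book', 'search', 'login']
})

# Step 1: count how many times each user took each action
toy['count'] = 1
counts = toy.groupby(['user_id', 'action'], as_index=False).sum()

# Step 2: pivot so each user is one row and each action is a column
pivoted = counts.pivot(index='user_id', columns='action', values='count').fillna(0)
print(pivoted)

The result is a table with one row per user, one column per action, and the counts as values.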
For you Excel buffs out there, the second step might strike you as something that could be achieved using a pivot table – and you would be right. In fact, the custom function that we use to make this transformation uses a pandas method called ‘pivot’. This is important to note for a couple of reasons. The first is that, with all the talk about new data technologies, people who have worked with data mostly (or entirely) using ‘old technology’ like Excel and SQL are often given the impression that their skills are redundant or not useful in modern data science. As this example shows, the ways of thinking about data that you develop working with Excel and SQL are not only relevant, but often extremely useful.
The second reason is that for people (like me) who do not know all the methods available for pandas dataframes off by heart, being able to identify techniques you have used in other programs and languages provides you with a way to find corresponding methods in new languages. I discovered this method by searching for “pandas pivot”, knowing that this way of manipulating data was likely to have some equivalent in pandas.
Looping Through the Actions Columns
Looking at the examples above, you may have realized that the transformation as shown only works for one action column at a time, but in the data we have three action columns: action, action_type and action_detail.
To handle the multiple action columns, we repeat these steps for each column individually, effectively creating three separate tables. Because we have now created tables where each row represents one user, we can now join (another concept SQL users will be very familiar with) these three tables together on the basis of the user id. The full code for these steps is shown below:
# Count occurrences of value in a column
def convert_to_counts(df, id_col, column_to_convert):
    id_list = df[id_col].drop_duplicates()

    df_counts = df.loc[:, [id_col, column_to_convert]]
    df_counts['count'] = 1
    df_counts = df_counts.groupby(by=[id_col, column_to_convert], as_index=False, sort=False).sum()

    new_df = df_counts.pivot(index=id_col, columns=column_to_convert, values='count')
    new_df = new_df.fillna(0)

    # Rename columns
    categories = list(df[column_to_convert].drop_duplicates())
    for category in categories:
        cat_name = str(category).replace(" ", "_").replace("(", "").replace(")", "").replace("/", "_").replace("-", "").lower()
        col_name = column_to_convert + '_' + cat_name
        new_df.rename(columns={category: col_name}, inplace=True)

    return new_df

# Aggregate and combine actions taken columns
print("Aggregating actions taken...")
session_actions = sessions.loc[:, ['user_id', 'action', 'action_type', 'action_detail']]
columns_to_convert = ['action', 'action_type', 'action_detail']
session_actions = session_actions.fillna('not provided')
first = True

for column in columns_to_convert:
    print("Converting " + column + " column...")
    current_data = convert_to_counts(df=session_actions, id_col='user_id', column_to_convert=column)

    # If first loop, current data becomes existing data, otherwise merge existing and current
    if first:
        first = False
        actions_data = current_data
    else:
        actions_data = pd.concat([actions_data, current_data], axis=1, join='inner')
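After the loop completes, a quick optional sanity check along these lines can confirm the counts came out as expected (the column names you will see are determined by the values present in sessions.csv, so the exact output will vary):

# Optional sanity check on the aggregated counts
print(actions_data.shape)               # one row per user with sessions data
print(list(actions_data.columns)[:5])   # action_*, action_type_* and action_detail_* count columns
print(actions_data.head())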
Combine Data Sets
The final steps are to combine the various datasets we have created into one large dataset. First, we combine the two device dataframes (df_primary and df_secondary) to create a single device dataframe. Then we combine the device dataframe with the actions dataframe to create a sessions dataframe with all the features we extracted from sessions.csv. Finally, we combine the sessions dataframe with the user data dataframe from Part 4. The code for these combinations is shown below:
# Merge device datasets
print("Combining results...")
df_primary.set_index('user_id', inplace=True)
df_secondary.set_index('user_id', inplace=True)
device_data = pd.concat([df_primary, df_secondary], axis=1, join="outer")

# Merge device and actions datasets
combined_results = pd.concat([device_data, actions_data], axis=1, join='outer')
df_sessions = combined_results.fillna(0)

# Merge user and session datasets
df_all.set_index('id', inplace=True)
df_all = pd.concat([df_all, df_sessions], axis=1, join='inner')
A Note on Joins
For those who can read a little bit of code and are familiar with joins in SQL, you may be wondering why I am using (full) outer joins for the first two combinations, but an inner join for the final step[1].
The first step requires an outer join because not all users have a secondary device. That is, some users only logged onto Airbnb using one device (or at least one type of device). Doing an outer join here ensures that our dataset includes all users regardless of this fact.
The second step could use an inner or an outer join, as both the device and actions datasets should contain all users. In this case we use an outer join just to ensure that if a user is missing from one of the datasets (for whatever reason), we will still capture them. You may also notice that after the second step we fill any missing values with 0s to ensure we do not have any NULL values that may have been generated by these outer joins.
For the third step we use an inner join for a key reason – we want our final training dataset to only include users that also have sessions data. Using an inner join here is an easy way to join the datasets and filter for the users with sessions data in one step.
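To make the difference concrete, here is a minimal sketch (with made-up frames, not the real df_primary, actions_data or df_all) of how join='outer' and join='inner' behave in pd.concat:

import pandas as pd

# Made-up frames indexed by user id, for illustration only
devices = pd.DataFrame({'device_hours': [10, 5]}, index=['u1', 'u2'])
actions = pd.DataFrame({'action_search': [3, 7]}, index=['u2', 'u3'])

# Outer join keeps every user appearing in either frame; users missing from
# one frame get NaN, which is why the script fills missing values with 0
outer = pd.concat([devices, actions], axis=1, join='outer').fillna(0)

# Inner join keeps only users present in both frames - an easy way to filter
# the final training data down to users that also have sessions data
inner = pd.concat([devices, actions], axis=1, join='inner')

print(outer)  # rows for u1, u2 and u3
print(inner)  # row for u2 only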
Wrapping Up
In the first four parts of this series, we looked in detail at some of the various steps in the process of building a model. Although these steps should be distinct thought processes for each model you build, hopefully this article provides some insight into how several of them can be combined if planned out carefully. In relatively few steps, we have taken a dataset containing 10 million rows of user actions data[2], cleaned it, extracted a bunch of important information, and added it to our user data, ready for training a model.
The other important thing to take away from this article is how useful ‘old school’ ways of thinking about data still are. For all the talk about unstructured data and NoSQL databases, the fact is that knowing how to work with and manipulate old-fashioned columns and rows is still as important as ever. Whether it is joins and aggregation in SQL, pivot tables and VLOOKUPs in Excel, or just the general concept of relational data, not only is that knowledge relevant, but it is often extremely useful.
Next Time
In the next piece, we will finally get to the good stuff and train the algorithm to make the final predictions.
[1] For those who do not understand what I mean by inner and outer joins (and are interested in knowing), Stack Overflow comes to the rescue again with this great illustrated answer.
[2] Nope, still doesn’t qualify as ‘Big Data’...
Bio: Brett Romero is a data analyst with experience working in a number of countries and industries, including government, management consulting and finance. Currently based in Pristina, Kosovo, he is working as a data consultant with development agencies such as UNDP and Open Data Kosovo.
Original. Reposted with permission.