Doing Data Science: A Kaggle Walkthrough Part 4 – Data Transformation and Feature Extraction
Part 4 of this fantastic 6-part series covering the process of data science and its application to a Kaggle competition focuses on feature extraction and data transformation.
Feature Extraction
Feature extraction is often broken down into the sub-steps of feature construction and feature selection; here we will focus on feature construction. Below are a couple of ways additional features can be constructed and added to your dataset.
Using Hierarchical Information
It will sometimes be the case that data in your dataset represents one level of a particular hierarchy, and that extracting the other implied levels of that hierarchy will provide the model with useful information.
For example, imagine a dataset with a column containing countries. This column allows the algorithm to look for patterns (in combination with all other columns) at the country level. However, by adding a new ‘region’ column based on the country column (Europe, South Asia, North Africa etc.), you may be providing information to the algorithm that allows it to look for patterns across countries.
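As a quick sketch of what this might look like in pandas, a country-to-region mapping can be added with a few lines of code. The country list and region groupings below are made up for the example and are not taken from the competition data.

import pandas as pd

# Hypothetical lookup from country to region -- groupings are illustrative only
country_to_region = {
    'Portugal': 'Europe',
    'Spain': 'Europe',
    'India': 'South Asia',
    'Morocco': 'North Africa',
}

df = pd.DataFrame({'country': ['Portugal', 'India', 'Morocco', 'Spain']})

# Add the higher-level 'region' column derived from the existing 'country' column
df['region'] = df['country'].map(country_to_region)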
One of the most common ways to do this is with date fields. Take the date fields in the dataset we are working with as an example. By extracting the day of the week, the month of the year or the hour of the day, we could add important information for the algorithm to use. Maybe people who create their accounts in summer months are more likely to make a booking in a warmer country. Maybe people who were first active late at night are more disorganized travelers and are therefore more likely to make a domestic first booking. Additionally, it could be any combination of these factors that makes the difference (e.g. users first active late at night, in the summer months, on a weekday are more likely to travel to Portugal). The point is not to be able to explain why a factor may be important, but to think of as many factors as possible to test, and allow the algorithm to determine what is important and not important.
Adding External Data
One of the aspects of feature extraction that often gets overlooked is how data can be enriched through the addition of new external data. Using techniques such as record linkage, existing datasets can be greatly expanded by adding new data points for a given record. This new data often provides valuable new information that the algorithm can use to make more accurate predictions.
For example, a training dataset that contains a column with countries could be enriched with demographic data about the country such as population, income per capita or land area – all factors that may allow the algorithm to draw conclusions across similar groups of countries on any of those measures.
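As a rough sketch of how such an enrichment could be joined onto a dataset with pandas, here is a left join against a hypothetical country_stats table (the column names and values below are placeholders for illustration, not real data):

import pandas as pd

# Hypothetical external dataset with one row per country (values are placeholders)
country_stats = pd.DataFrame({
    'country': ['Portugal', 'Spain'],
    'population': [10_000_000, 47_000_000],   # illustrative only
    'income_per_capita': [24_000, 30_000],    # illustrative only
})

users = pd.DataFrame({'user_id': [1, 2, 3],
                      'country': ['Portugal', 'Spain', 'Portugal']})

# A left join keeps every user record and attaches the country-level features
users = users.merge(country_stats, on='country', how='left')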
Relating this concept to the competition we are working through, consider how much more accurately we could predict a first booking country of a user if we could link the data from their Airbnb profile to data from one of their social media profiles (Facebook, Twitter etc.) or even better, from a Tripadvisor or Expedia account.
The key point here is that it is worth investing time looking for ways to add new and useful data to your existing dataset before moving on to the modeling step. Expanding your dataset in this manner will often produce far bigger improvements in prediction accuracy than the choice of algorithm or the tuning of algorithm parameters.
The Importance of Domain Knowledge
One of the things that may have occurred to you as you read through the various ways to modify and expand a dataset is: how are you supposed to know what will help and what will not?
This is where knowledge about the data you are using and what it represents becomes so important. This knowledge – referred to as domain knowledge – helps guide this entire process, including what was covered in Part III, cleaning the data.
Understanding how the data was collected helps to provide insight into potential errors in the data that might need to be addressed or shortcomings in the way the data was sampled (sample selection bias/errors). Understanding the relevant industry or market can also provide a range of insights including:
- what additional information is available to expand your dataset
- what information may help to increase prediction accuracy and what is likely to be irrelevant
- if the model makes intuitive sense (e.g. can you predict the likelihood of waking up with a headache based on whether someone slept with their shoes on?[1]), and
- if the industry or market is changing in such a way that it is likely to make the model redundant in the near future.
In practical terms, where does this leave aspiring data scientists?
The first thing is to realize that, obviously, it is not possible to be a domain expert for every domain. Acknowledging this limitation is important, as it forces a second realization: you will almost always need to seek out this expertise. For most of us, that means involving and drawing on people who are domain experts when constructing the dataset and model. Having access to that expertise is likely to be the difference between a model that gets thrown out in six months and one that fundamentally improves a business and/or fulfills a customer need.
Step by Step
After all the theory, let’s put some of these techniques into practice.
Transforming Categorical Data
The first step we are going to undertake is some One Hot Encoding: replacing each categorical field in the dataset with multiple binary columns, one for each value that field can take.
To do this, the Scikit-Learn library provides a OneHotEncoder transformer that we could use, but it is often instructive to write your own function, particularly if it is a relatively simple one like this. The code snippet below creates a simple function to do the encoding for a specified column, and then uses that function in a loop to convert all the categorical columns (dropping each original column afterwards).
# Home made One Hot Encoding function
def convert_to_binary(df, column_to_convert):
    categories = list(df[column_to_convert].drop_duplicates())

    for category in categories:
        cat_name = str(category).replace(" ", "_").replace("(", "").replace(")", "").replace("/", "_").replace("-", "").lower()
        col_name = column_to_convert[:5] + '_' + cat_name[:10]
        df[col_name] = 0
        df.loc[(df[column_to_convert] == category), col_name] = 1

    return df

# One Hot Encoding
print("One Hot Encoding categorical data...")
columns_to_convert = ['gender', 'signup_method', 'signup_flow', 'language', 'affiliate_channel',
                      'affiliate_provider', 'first_affiliate_tracked', 'signup_app',
                      'first_device_type', 'first_browser']

for column in columns_to_convert:
    df_all = convert_to_binary(df=df_all, column_to_convert=column)
    df_all.drop(column, axis=1, inplace=True)
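As an aside, pandas also ships with a built-in get_dummies function that achieves much the same result in a single call, although the generated column names will differ from those produced by the hand-rolled function above (no truncation or character cleaning is applied):

# Built-in alternative to the function above; it also drops the original columns,
# but the generated column names will differ slightly
df_all = pd.get_dummies(df_all, columns=columns_to_convert)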
Creating New Features
From Part II of this series, one of the things we observed about the training (and test) datasets is that there is not a huge number of columns to work with. This limits what new features we can add based on the existing data. However, two fields that can be used to create some new features are the two date fields – date_account_created and timestamp_first_active. We want to extract all the information we can out of these two date fields that could potentially differentiate which country someone will make their first booking in. The code for extracting a range of different data points from these two date columns (and then deleting the original date columns) is shown below:
# Add new date related fields
print("Adding new fields...")
df_all['day_account_created'] = df_all['date_account_created'].dt.weekday
df_all['month_account_created'] = df_all['date_account_created'].dt.month
df_all['quarter_account_created'] = df_all['date_account_created'].dt.quarter
df_all['year_account_created'] = df_all['date_account_created'].dt.year
df_all['hour_first_active'] = df_all['timestamp_first_active'].dt.hour
df_all['day_first_active'] = df_all['timestamp_first_active'].dt.weekday
df_all['month_first_active'] = df_all['timestamp_first_active'].dt.month
df_all['quarter_first_active'] = df_all['timestamp_first_active'].dt.quarter
df_all['year_first_active'] = df_all['timestamp_first_active'].dt.year
df_all['created_less_active'] = (df_all['date_account_created'] - df_all['timestamp_first_active']).dt.days

# Drop unnecessary columns
columns_to_drop = ['date_account_created', 'timestamp_first_active', 'date_first_booking']
for column in columns_to_drop:
    if column in df_all.columns:
        df_all.drop(column, axis=1, inplace=True)
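One thing to note is that the snippet above assumes both date columns have already been converted to pandas datetime types, which is how they were left after the cleaning in Part III. If you are working from the raw CSVs instead, a conversion along the following lines would be needed first (the formats shown are assumptions based on how the raw files are structured):

# Assumes raw formats: date_account_created as a 'YYYY-MM-DD' string and
# timestamp_first_active as a numeric value in YYYYMMDDHHMMSS form
df_all['date_account_created'] = pd.to_datetime(df_all['date_account_created'], format='%Y-%m-%d')
df_all['timestamp_first_active'] = pd.to_datetime(df_all['timestamp_first_active'].astype(str),
                                                  format='%Y%m%d%H%M%S')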
Wrapping Up
In two relatively simple steps, we have changed our training dataset from 14 columns to 163 columns. Although this seems like a lot more information, most of this expansion was caused by the One Hot Encoding, which is not adding more information, but simply expanding out the existing information. We have not added any external data, and I didn’t even really investigate what information we could have extracted from the other non-date columns.
Again, this process is open ended, so there is an almost unlimited range of possibilities that we have not even really begun to explore. As such, if you see an additional transformation or have an idea for the addition of a new feature, please feel free to let me know in a comment!
Next Time
In the next piece, we will look at the data in sessions.csv that we left aside initially and see how we can add that data to our training dataset.
[1] This is an example of a confounding factor. A model predicting whether someone will wake up with a headache based on whether they slept with their shoes on ignores the more logical explanation for the headaches: both the headaches and sleeping with shoes on are caused by a third factor, going to bed drunk.
Bio: Brett Romero is a data analyst with experience working in a number of countries and industries, including government, management consulting and finance. Currently based in Pristina, Kosovo, he is working as a data consultant with development agencies such as UNDP and Open Data Kosovo.