Doing Data Science: A Kaggle Walkthrough – Cleaning Data

Gain insight into the process of cleaning data for a specific Kaggle competition, including a step-by-step overview.



Step by Step

With that general overview out of the way, let’s start cleaning the Airbnb data. In relation to the datasets provided for the Airbnb Kaggle competition, we will focus our cleaning efforts on two files – train_users_2.csv and test_users.csv – and leave aside sessions.csv.

Loading in the Data

The first step is to load the data from the CSV files using Python. To do this we will use the Pandas library and load the data from the two files, train_users_2.csv and test_users.csv. After loading, we will combine them into one dataset so that any cleaning (and later any other changes) will be done to all the data at once [1].

import pandas as pd
# Import data
print("Reading in data...")
tr_filepath = "./train_users_2.csv"
df_train = pd.read_csv(tr_filepath, header=0, index_col=None)
te_filepath = "./test_users.csv"
df_test = pd.read_csv(te_filepath, header=0, index_col=None)

# Combine into one dataset
df_all = pd.concat((df_train, df_test), axis=0, ignore_index=True)
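
One practical note: because the test and training data will need to be separated again before making predictions, it can help to record the size of the training set now. This is a minimal sketch – the variable name n_train is ours, not part of the original code:

# Remember where the training rows end so the combined data can be split apart later
n_train = len(df_train)
# Later: df_all[:n_train] recovers the training rows, df_all[n_train:] the test rows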


Clean the Timestamps

Once the data has been loaded and combined, the first cleaning step we will undertake is fixing the format of the dates – as we saw in Part II, at least one of the date columns looks like it is formatted as one long number. You may be wondering why this is necessary – after all, can’t we all see what the dates are supposed to represent when we look at the data?

The reason we need to convert the values in the date columns is that, if we want to do anything with those dates (e.g. subtract one date from another, extract the month of the year from each date etc.), it will be far easier if Python recognizes the values as dates. This will become much clearer next week when we start adding various new features to the training data based on this date information.
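
To illustrate what becomes possible after the conversion, here is a minimal sketch using made-up example values – once pd.to_datetime has been applied, date arithmetic and component extraction are one-liners:

# After conversion, dates support subtraction and component extraction
account_created = pd.to_datetime("2014-01-01", format="%Y-%m-%d")
first_active = pd.to_datetime("20131225040936", format="%Y%m%d%H%M%S")
print((account_created - first_active).days)  # number of days between the two events
print(account_created.month)                  # month as an integer (here, 1)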

Luckily, fixing date formats is relatively easy. Pandas has a simple function, to_datetime, that will allow us to input a column and get the correctly formatted dates as a result. When using this function we also provide a parameter called ‘format’ that is like a regular expression for dates. In simpler terms, we are providing the function with a generalized form of the date so that it can interpret the data in the column. For example, for the date_account_created column we are telling the function to expect a four-digit year (%Y) followed by a ‘-’, then a two-digit month (%m), then ‘-’, then a two-digit day (%d) – altogether the expression would be ‘%Y-%m-%d’ (for the full list of directives that can be used, see the strftime/strptime section of the Python datetime documentation). For the timestamp_first_active column, the date format provided is different, so we adjust our expression accordingly.

Once we have fixed the date formats, we simply replace the existing date columns with the corrected data. Finally, because the date_account_created column is sometimes empty, we replace the empty values with the value in the timestamp_first_active column using the fillna function. The code for this step is provided below:

# Change Dates to consistent format
print("Fixing timestamps...")
df_all['date_account_created'] = pd.to_datetime(df_all['date_account_created'], format='%Y-%m-%d')
df_all['timestamp_first_active'] = pd.to_datetime(df_all['timestamp_first_active'], format='%Y%m%d%H%M%S')
# Fill any missing creation dates with the first-active timestamp
df_all['date_account_created'] = df_all['date_account_created'].fillna(df_all['timestamp_first_active'])


Remove booking date field

Those following along and/or paying attention may have noticed that in the original dataset, there are three date fields, but we have only covered two above. The remaining date field, date_first_booking, we are going to drop (remove) from the training data altogether. The reason is that this field is only populated for users who have made a booking. For the data in train_users_2.csv, all the users that have a first booking country have a value in the date_first_booking column, and for those that have not made a booking (country_destination = NDF) the value is missing. However, for the data in test_users.csv, the date_first_booking column is empty for all the records.
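
For those who want to verify this pattern themselves, a quick check along the following lines works (a minimal sketch, assuming the df_train and df_test dataframes loaded earlier; each count should be 0 if the pattern holds):

# In the training data, missing date_first_booking should line up exactly with NDF
print(df_train.loc[df_train['country_destination'] == 'NDF', 'date_first_booking'].notnull().sum())
print(df_train.loc[df_train['country_destination'] != 'NDF', 'date_first_booking'].isnull().sum())
# In the test data, the column should be empty for every record
print(df_test['date_first_booking'].notnull().sum())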


This means that this column is not going to be useful for predicting in which country a booking will be made. What is more, if we leave it in the training dataset when building the model, it will likely increase the chance that the model predicts NDF, as those are the records without dates in the training dataset. The code for removing the column is provided below:

# Remove date_first_booking column
df_all.drop('date_first_booking', axis=1, inplace=True)


Clean the Age column

As identified in Part II, there are several age values that are clearly incorrect (unreasonably high or low). In this step, we replace these incorrect values with ‘NaN’, which literally stands for Not a Number, and here simply indicates that we do not know the user’s age. In other words, we are converting the incorrect values into missing values. To do this, we create a simple function that takes a dataframe (table), a column name, a minimum acceptable value (15) and a maximum acceptable value (90), and replaces the values in the specified column that fall outside the acceptable range with NaN.

Again from Part II, we know there was also a significant number of users who did not provide their age at all – these users also show up as NaN in the dataset. After we have converted the incorrect age values to NaN, we then change all the NaN values to -1.

The code for these steps is shown below:

import numpy as np

# Function to replace out-of-range values in a column with NaN
def remove_outliers(df, column, min_val, max_val):
    col_values = df[column].values
    df[column] = np.where(np.logical_or(col_values <= min_val, col_values >= max_val), np.nan, col_values)
    return df

# Fixing age column
print("Fixing age column...")
df_all = remove_outliers(df=df_all, column='age', min_val=15, max_val=90)
df_all['age'] = df_all['age'].fillna(-1)


As mentioned earlier, there are several more complicated ways to fill in the missing values in the age column (a sketch of one alternative is shown after this list). We are selecting this simple method for two main reasons:

  • Clarity – this series of articles is going to be long enough without adding the complication of a complex methodology for imputing missing ages.
  • Questionable results – in my testing during the actual competition, I did test several more complex imputation methodologies. However, none of the methods I tested actually produced a better end result than the methodology outlined above.
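
For readers curious what a more involved approach might look like, below is a minimal sketch of group-based median imputation. This is purely illustrative – it is not necessarily one of the methods tested during the competition – and it assumes the df_all dataframe from above, with missing ages still recorded as NaN:

# Hypothetical alternative: fill missing ages with the median age of each gender group
age_medians = df_all.groupby('gender')['age'].transform('median')
df_all['age'] = df_all['age'].fillna(age_medians)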

Identify and fill additional columns with missing values

From more detailed analysis of the data, you may have also realized there is one more column that has missing values – the first_affiliate_tracked column. In the same way as for the other columns, we now fill in the missing values in this column.
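
One quick way to spot any remaining columns with missing values is to count nulls per column (a simple check using standard pandas):

# Count missing values in each column to find anything left to clean
missing_counts = df_all.isnull().sum()
print(missing_counts[missing_counts > 0])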

# Fill first_affiliate_tracked column
print("Filling first_affiliate_tracked column...")
df_all['first_affiliate_tracked'] = df_all['first_affiliate_tracked'].fillna(-1)


Is that all?

Those more experienced with working with data may be thinking that we have not done all that much cleaning with this data – and you would be right. One of the nice things about Kaggle competitions is that the data provided does not require much cleaning, as that is not what the providers of the data want participants to focus on. Many of the problems that would be found in real world data (as covered earlier) do not exist in this dataset, saving us significant time.

However, what this relatively easy cleaning process also tells us is that even when datasets are provided with the intention of needing no or minimal cleaning, there is always something that needs to be done.

Next Time

In the next piece, we will focus on transforming the data and feature extraction, allowing us to create a training dataset that will hopefully allow the model to make better predictions.

[1] For those with more data mining experience, you may realize that combining the test and training data at this stage is not best practice. Best practice would be to avoid using the test dataset in any of the data preprocessing or model tuning/validation steps to avoid overfitting. However, in the context of this competition, because we are only trying to create a model to classify one unchanging dataset, simply maximizing the accuracy of the model for that dataset is the primary concern.

Bio: Brett Romero is a data analyst with experience working in a number of countries and industries, including government, management consulting and finance. Currently based in Pristina, Kosovo, he is working as a data consultant with development agencies such as UNDP and Open Data Kosovo.

Original. Reposted with permission.
