Doing Data Science: A Kaggle Walkthrough Part 4 – Data Transformation and Feature Extraction
Part 4 of this fantastic 6-part series covering the process of data science, and its application to a Kaggle competition, focuses on feature extraction and data transformation.
By Brett Romero, Open Data Kosovo.
This article on data transformation and feature extraction is Part IV in a series looking at data science and machine learning by walking through a Kaggle competition. If you have not done so already, you are strongly encouraged to go back and read Part I, Part II and Part III.
Continuing on the walkthrough, in this part we focus on getting the data we cleaned in Part III ready for use in the classification algorithm. These steps are often referred to as data transformation and feature extraction.
Data Transformation and Feature Extraction as a Concept
The main purpose of data transformation and feature extraction is to enhance the data in such a way that it increases the likelihood that the classification algorithm will be able to make meaningful predictions. Unlike the steps taken during cleaning, which are designed to address problems with the raw data (missing and erroneous values, formatting issues etc.), these steps change the values and/or structure of the data (data transformation) and add additional features (feature extraction).
As you might imagine, this is quite an open-ended process, and hence a lot of the value that data scientists provide comes in these steps. There is no textbook or walkthrough that can tell you exactly what steps you should take for a given dataset; that knowledge can come only from experience, curiosity and trial and error. However, we can take a look at some common methods to provide a sense of what is possible. Please keep in mind this is not an exhaustive list of options.
Data transformation covers the steps taken to modify the data, with the aim of improving the ability of the classification algorithm to extract information from it. Below are a few common data transformation methods.
A common method for manipulating numeric data, binning or bucketing is when the numerical values in a particular column are converted from a continuous series into fixed ranges. For example, instead of using the age value of all our users, we could place them into buckets such as 15-20 years old, 21-25 years old and so on.
Typically this technique is used to manage ‘noisy data’. To understand what this means, think of the movements of the stock market over time: it goes up and down on an almost daily basis. However, if you are trying to predict the overall direction of the stock market over the next 6 months, these daily movements become kind of irrelevant – what you really want your model to focus on are the movements over longer periods of time. What is more, the essentially random daily movements in stock prices may actually confuse your prediction model – causing less accurate predictions. In this example, the daily movements are the noise and what you want to extract (the longer term direction of the market) is ‘the signal’.
The same logic can be applied to any numerical field in your dataset. If you are concerned that small changes in a given value may simply be representing random ‘noise’, you may want to consider bucketing/binning to remove that noise.
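As a rough sketch of the age example above, binning can be done with `pandas.cut`. The ages, bin edges and labels below are made up for illustration and are not taken from the competition data.

```python
import pandas as pd

# Hypothetical ages, for illustration only.
ages = pd.Series([16, 19, 22, 25, 31, 47], name="age")

# Assumed bin edges. pd.cut uses right-closed intervals by default,
# so the first bin (14, 20] covers ages 15 to 20 inclusive.
bins = [14, 20, 25, 30, 40, 100]
labels = ["15-20", "21-25", "26-30", "31-40", "41+"]

# Replace each continuous age with the label of the bucket it falls into.
age_bucket = pd.cut(ages, bins=bins, labels=labels)

print(age_bucket.tolist())
```

Where the edges go is a judgment call: wider buckets remove more noise but also discard more detail, so it is worth trying a few options.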
Although normalization can take on a large number of meanings depending on the context, the type of normalization being referred to here is the statistical type – converting the values of a column into a ‘normalized’ range. This could be translating heights in centimeters, ranging anywhere from 100cm to 220cm, to a scale where 0 represents the average (mean) height for your dataset and -1/+1 represent one standard deviation from that average. It could be translating those heights into a range of values from 0 to 1, where 0 is the lowest value in your dataset and 1 is the maximum value. There are a number of other methods that can be used here as well.
This type of transformation is more important for certain types of algorithms than others. For some algorithms – like the one we will be using – this type of transformation is not typically necessary. But for other algorithms, the magnitude of the values in each column will impact the calculations. In these cases, it is usually best to convert (‘normalize’) the values in each column onto the same scale to ensure each column is treated equally. For a more detailed explanation on this subject, this answer from Quora is a good place to start.
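The two height rescalings described above can be sketched in a few lines of pandas. The heights are invented for illustration, and using the population standard deviation (`ddof=0`) is an assumption; the sample version (`ddof=1`) is just as common.

```python
import pandas as pd

# Hypothetical heights in centimeters, for illustration only.
heights_cm = pd.Series([150.0, 160.0, 170.0, 180.0, 190.0])

# Standardization (z-score): 0 is the mean height and
# -1/+1 are one standard deviation below/above the mean.
z_scores = (heights_cm - heights_cm.mean()) / heights_cm.std(ddof=0)

# Min-max scaling: 0 is the lowest height in the data, 1 is the highest.
min_max = (heights_cm - heights_cm.min()) / (heights_cm.max() - heights_cm.min())

print(z_scores.round(3).tolist())
print(min_max.tolist())
```

In a real project the mean, standard deviation, minimum and maximum would normally be calculated on the training data only and then reused to transform the test data.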
Other Mathematical Transformations
In a similar manner to normalization, there is an almost unlimited number of ways that the numerical values of a given column can be transformed such that they are more suitable for the algorithm being used.
To provide one example, arguably the most common transformation (other than normalization) is to use a logarithm function. This transformation is a commonly used method of dealing with exponential data series (i.e. a column where there are a lot of low values and relatively few high values). For those wanting to understand this transformation better, the Wikipedia page on this topic has a great illustrated example.
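To give a feel for what the logarithm does to this kind of column, here is a minimal sketch using NumPy. The values are invented, and base 10 is used only because it makes the result easy to read.

```python
import numpy as np
import pandas as pd

# Hypothetical column with many small values and a few very large ones.
values = pd.Series([1, 10, 100, 1000, 10000])

# After the transformation, each tenfold increase becomes a step of 1,
# so the large values no longer dwarf the small ones.
log_values = np.log10(values)

print(log_values.tolist())
```

Note that the logarithm is only defined for positive numbers. If a column contains zeros, one common workaround is `np.log1p`, which calculates log(1 + x).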
As I am hemorrhaging readers at this point, I won’t go into detail on the various other transformations possible – the key point is to be aware that there is a large range of possibilities here depending on your needs.
One Hot Encoding
One more example, and the most relevant one for our Kaggle competition, is one hot encoding, a transformation used for categorical data. It takes one column with x categories (x must be greater than 2 for this to make sense) and converts it into x columns, where each column represents one category in the original column. An illustrated example is shown below:
For those familiar with regression modeling, you may recognize this as the same process of creating dummy variables.
Again there are a few reasons for doing this type of transformation. Some algorithms are structured in such a way that they do not handle categorical data very well – particularly when the categories do not have an inherent order (this answer on Stack Overflow does a good job of explaining why). Some other types of algorithms require numerical data to function. The only way to work out whether this transformation will be beneficial is to either read through the documentation for the algorithm you are using or to test it yourself.
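For readers who want to try one hot encoding themselves, pandas provides `get_dummies`. The sketch below uses a made-up `browser` column with three categories; the column names and values are hypothetical and not taken from the competition data.

```python
import pandas as pd

# Hypothetical data: one categorical column with three categories.
df = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "browser": ["Chrome", "Firefox", "Safari", "Chrome"],
})

# Replace 'browser' with one 0/1 column per category.
# dtype=int requests integers rather than booleans.
encoded = pd.get_dummies(df, columns=["browser"], dtype=int)

print(encoded)
```

Each row ends up with a 1 in exactly one of the new `browser_` columns and a 0 in the others. For regression models, one of the new columns is often dropped (`drop_first=True`) because it can be inferred from the rest.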