Gold Blog, Jun 2017
7 Steps to Mastering Data Preparation with Python

Follow these 7 steps for mastering data preparation, covering the concepts, the individual tasks, as well as different approaches to tackling the entire process from within the Python ecosystem.

Step 4: Dealing with Outliers

This is not a tutorial on drafting a strategy to deal with outliers in your data when modeling; there are times when including outliers in modeling is appropriate, and there are times when it is not (regardless of what anyone tries to tell you). This is situation-dependent, and no one can make sweeping assertions as to whether your situation belongs in column A or column B.

Figure: Can you find the outlier?

Some discussions for dealing with outliers:

Outliers can be the result of poor data collection, or they can be genuinely good, anomalous data. These are 2 different scenarios and must be approached differently, so no "one size fits all" advice is applicable here, much as with missing values. A particularly good insight from the Analysis Factor article mentioned above is as follows:

One option is to try a transformation. Square root and log transformations both pull in high numbers. This can make assumptions work better if the outlier is a dependent variable and can reduce the impact of a single point if the outlier is an independent variable.
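To see what "pulling in high numbers" looks like in practice, here is a minimal sketch using NumPy. The sample values and the helper `max_to_median_ratio` are made up for illustration and are not taken from the quoted article.

```python
import numpy as np

# Hypothetical values: mostly small, with one very large observation.
values = np.array([2.0, 3.0, 4.0, 5.0, 400.0])

sqrt_values = np.sqrt(values)  # square root transformation
log_values = np.log(values)    # natural log transformation (needs values > 0)


def max_to_median_ratio(x):
    """Return how many times larger the largest value is than the median."""
    return x.max() / np.median(x)


ratio_raw = max_to_median_ratio(values)
ratio_sqrt = max_to_median_ratio(sqrt_values)
ratio_log = max_to_median_ratio(log_values)

print(ratio_raw, ratio_sqrt, ratio_log)
```

In the raw data the largest value is 100 times the median; after the square root it is 10 times the median, and after the log it is smaller still. The extreme point is still the largest one, it just dominates far less.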

We will leave the decision as to whether or not to keep outliers in your dataset up to you. However, if your model does call for dealing with outlier data in some manner, here are a few discussions on approaches:
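Before deciding anything, it helps to know which points are candidates. One widely used rule of thumb (an assumption on my part, not something prescribed by this guide) flags values lying more than 1.5 times the interquartile range beyond the quartiles. A rough sketch with Pandas, using made-up numbers:

```python
import pandas as pd

# Hypothetical measurements with one suspicious value.
s = pd.Series([10, 12, 11, 13, 12, 11, 95])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1

# Fences 1.5 * IQR beyond the quartiles (a convention, not a law).
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

is_outlier = (s < lower) | (s > upper)
outliers = s[is_outlier]

print(outliers)
```

Flagging is not the same as removing. Whether a flagged value gets dropped, transformed, or kept as-is remains the situation-dependent call discussed above.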


Step 5: Dealing with Imbalanced Data

So, what if your otherwise robust dataset -- lacking both missing values and outliers -- is made up of 2 classes: one which includes 95 percent of the instances, and the other which includes a mere 5 percent? Or worse -- 99.8 vs 0.2 percent?

If so, your dataset is imbalanced, at least as far as the classes are concerned. This can be problematic, in ways which I'm sure do not need to be pointed out. But there is no need to toss the data to the side yet; there are, of course, strategies for dealing with this.

Note that, while this may not genuinely be a data preparation task, such a dataset characteristic will make itself known early in the data preparation stage (the importance of EDA), and the validity of such data can certainly be assessed preliminarily during this preparation stage.
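Spotting this during EDA can be as simple as looking at class proportions. A minimal sketch, assuming a Pandas DataFrame with a hypothetical `label` column holding the class:

```python
import pandas as pd

# Hypothetical dataset: 950 "normal" rows and 50 "fraud" rows.
df = pd.DataFrame({
    "amount": range(1000),
    "label": ["normal"] * 950 + ["fraud"] * 50,
})

# Share of rows belonging to each class.
class_share = df["label"].value_counts(normalize=True)

print(class_share)
```

Here the split comes out at 95 percent versus 5 percent, the first scenario described above.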

First, have a look at this discussion by Tom Fawcett on how to approach the problem:

Next, take a look at this discussion on techniques for handling class imbalance:

Figure: Imbalanced data. Recognizing and dealing with imbalance is important.

A good explanation of why we can run into imbalanced data, and why we can do so in some domains much more frequently than in others (from 7 Techniques to Handle Imbalanced Data, linked above):

Data used in these areas often have less than 1% of rare, but “interesting” events (e.g. fraudsters using credit cards, user clicking advertisement or corrupted server scanning its network). However, most machine learning algorithms do not work very well with imbalanced datasets. The following seven techniques can help you, to train a classifier to detect the abnormal class.
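Resampling is one family of techniques commonly used here. As a rough sketch (not necessarily how the linked article does it), the following randomly oversamples the minority class with plain Pandas. The column names and class sizes are made up for illustration.

```python
import pandas as pd

# Hypothetical imbalanced dataset: 95 rows of class 0, 5 rows of class 1.
df = pd.DataFrame({
    "feature": range(100),
    "label": [0] * 95 + [1] * 5,
})

majority = df[df["label"] == 0]
minority = df[df["label"] == 1]

# Draw minority rows with replacement until both classes are the same size.
minority_upsampled = minority.sample(
    n=len(majority), replace=True, random_state=0
)

# Combine, shuffle, and reset the index.
balanced = (
    pd.concat([majority, minority_upsampled])
    .sample(frac=1, random_state=0)
    .reset_index(drop=True)
)

print(balanced["label"].value_counts())
```

If you go this route, resample only the training portion of your data. Oversampling before splitting puts copies of the same rows in both the training and testing sets, which inflates your evaluation scores.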


Step 6: Data Transformations

Wikipedia defines data transformation as:

In statistics, data transformation is the application of a deterministic mathematical function to each point in a data set -- that is, each data point z_i is replaced with the transformed value y_i = f(z_i), where f is a function. Transforms are usually applied so that the data appear to more closely meet the assumptions of a statistical inference procedure that is to be applied, or to improve the interpretability or appearance of graphs.

Transforming data is one of the most important aspects of data preparation, and one which requires more finesse than most others. When missing values manifest themselves in data, they are generally easy to find, and can be (at least, superficially) dealt with by one of the common methods outlined above -- or by more complex measures gained from insight over time in a domain. However, when and if data transformations are required -- to say nothing of the type of transformation required -- is often not as easily identifiable.

A plethora of transformations exist; instead of trying to generalize when and why transformations are useful, let's look at a few specific transformations in order to get a better handle on them.

This overview from Scikit-learn's documentation gives some rationale for some of the most important preprocessing transformations, namely standardization, normalization, and binarization (with a few others thrown in as well):
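For a feel of what these three do, here is a small sketch using `sklearn.preprocessing`. The toy matrix and the binarization threshold are arbitrary choices of mine, not values from the documentation.

```python
import numpy as np
from sklearn.preprocessing import Binarizer, StandardScaler, normalize

# Toy feature matrix: 3 samples (rows), 2 features (columns).
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# Standardization: each column is rescaled to zero mean and unit variance.
X_standardized = StandardScaler().fit_transform(X)

# Normalization: each row (sample) is rescaled to unit L2 norm.
X_normalized = normalize(X, norm="l2")

# Binarization: values above the threshold become 1, all others 0.
X_binarized = Binarizer(threshold=2.5).fit_transform(X)

print(X_standardized)
print(X_normalized)
print(X_binarized)
```

Note the difference in direction: standardization works per feature (column), while normalization as implemented here works per sample (row).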

Figure: One-hot encoding. Sample results of a one-hot encoding transformation.

One-hot encoding "transforms categorical features to a format that works better with classification and regression algorithms" (taken from the first link below). See a discussion of the one-hot transformation below, as well as an approach using Pandas:
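As a quick illustration of the Pandas route, `pd.get_dummies` will do the encoding in one call. The `color` column below is a hypothetical example.

```python
import pandas as pd

# Hypothetical categorical feature.
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One new 0/1 column per category; the original column is replaced.
encoded = pd.get_dummies(df, columns=["color"], dtype=int)

print(encoded)
```

Each row ends up with exactly one 1 across the new `color_*` columns, marking which category it belonged to.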

The log distribution transformation can be useful if "you assume a model form that is non-linear but can be transformed to a linear model" (taken from below). Read a bit more about an under-appreciated type of transformation below:
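To make the "transformed to a linear model" idea concrete, consider an exponential relationship y = a * exp(b * x). Taking logs gives log(y) = log(a) + b * x, which is linear in x. A minimal sketch with NumPy, using noise-free synthetic data and parameter values I picked arbitrarily:

```python
import numpy as np

# Synthetic data following y = a * exp(b * x), without noise.
a_true, b_true = 2.0, 0.3
x = np.arange(1, 11, dtype=float)
y = a_true * np.exp(b_true * x)

# Fit a straight line to (x, log(y)): slope estimates b, intercept log(a).
slope, intercept = np.polyfit(x, np.log(y), deg=1)

a_est = np.exp(intercept)
b_est = slope

print(a_est, b_est)
```

With real, noisy data the recovered parameters will only be approximate, and the fit minimizes error on the log scale rather than the original scale, which may or may not be what you want.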

As stated above, numerous transformations are possible, depending on the data and your requirements. I would like to take a closer look at data transformation in the future, and leave a more in-depth discussion for that time.

Note that this entire discussion is also fully and intentionally skipping any mention of feature selection for a specific reason: it deserves far more than a few simple sentences in this much broader discussion. A similar guide specifically for feature selection is upcoming, and will be linked here once complete.


Step 7: Finishing Touches & Moving Ahead

Alright. Your data is "clean." For our purposes, this means that you have a valid and usable Pandas DataFrame at this point. But what do you do with it?

If you want to go right to feeding your data into a machine learning algorithm in order to attempt building a model, you probably need your data in a more appropriate representation. In the Python ecosystem, that would generally be a numpy ndarray (or matrix). You can have a look at the following for some preliminary ideas on getting there (from an elementary point of view):
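In current versions of Pandas, `DataFrame.to_numpy()` is the usual way to get there. A minimal sketch, assuming a hypothetical DataFrame whose `label` column is the target and whose remaining columns are features:

```python
import numpy as np
import pandas as pd

# Hypothetical cleaned dataset.
df = pd.DataFrame({
    "height": [1.6, 1.8, 1.7],
    "weight": [60.0, 80.0, 70.0],
    "label": [0, 1, 0],
})

# Feature matrix (2-D ndarray) and target vector (1-D ndarray).
X = df[["height", "weight"]].to_numpy()
y = df["label"].to_numpy()

print(type(X), X.shape, y.shape)
```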

Figure: ML pipeline. Very simple data preparation process.

Once you have clean data in a proper representation for machine learning in Python, why not check out the following pair of articles meant to cover the very ground you are now ready for:

What if you don't want to move on to modeling quite yet? Or, what if you do, but you want to output this data into some storage form more suitable to your situation? Here is some information on Pandas DataFrame storage:
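Two of the simplest options are CSV (portable, human-readable) and pickle (Python-only, preserves dtypes exactly). The sketch below writes both to a temporary directory and reads them back; the file names and the sample data are placeholders of mine.

```python
import os
import tempfile

import pandas as pd

# Hypothetical cleaned dataset.
df = pd.DataFrame({"id": [1, 2, 3], "score": [0.5, 0.75, 1.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "clean_data.csv")
    pickle_path = os.path.join(tmp_dir, "clean_data.pkl")

    # index=False keeps the row index out of the CSV file.
    df.to_csv(csv_path, index=False)
    df.to_pickle(pickle_path)

    # Read both files back while the temporary directory still exists.
    from_csv = pd.read_csv(csv_path)
    from_pickle = pd.read_pickle(pickle_path)

print(from_csv.equals(df), from_pickle.equals(df))
```

Only unpickle files you trust, since loading a pickle can execute arbitrary code.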

Don't forget that there are additional dataset-specific and -related considerations before moving forward, including (especially?) splitting the dataset into a training and a testing set, a process which is applicable to all sorts of machine learning tasks:
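Scikit-learn's `train_test_split` handles this in one call. Here is a minimal sketch on synthetic arrays; the 25 percent test size and the use of `stratify` (to keep class proportions the same in both sets, which ties back to Step 5) are my choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 20 samples, 2 features, with a 16-vs-4 class split.
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 16 + [1] * 4)

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.25,   # hold out a quarter of the rows for testing
    stratify=y,       # preserve the class ratio in both sets
    random_state=42,  # make the split reproducible
)

print(X_train.shape, X_test.shape)
```

Anything that learns from the data (scalers, encoders, resampling) should be fit on the training set only and then applied to the testing set.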

And, as pure punishment, here are some additional takes on data preparation in general: