Dealing with categorical features in machine learning
Many machine learning algorithms require that their input is numerical and therefore categorical features must be transformed into numerical features before we can use any of these algorithms.
By Hugo Ferreira, Data Scientist | Machine Learning Enthusiast | Physicist
How to easily implement one-hot encoding in Python
Categorical data are commonplace in many Data Science and Machine Learning problems but are usually more challenging to deal with than numerical data. In particular, many machine learning algorithms require that their input is numerical and therefore categorical features must be transformed into numerical features before we can use any of these algorithms.
One of the most common ways to make this transformation is to one-hot encode the categorical features, especially when there does not exist a natural ordering between the categories (e.g. a feature ‘City’ with names of cities such as ‘London’, ‘Lisbon’, ‘Berlin’, etc.). For each unique value of a feature (say, ‘London’) one column is created (say, ‘City_London’) where the value is 1 if for that instance the original feature takes that value and 0 otherwise.
Even though this type of encoding is used very frequently, it can be frustrating to try to implement it using scikit-learn in Python, as there isn’t currently a simple transformer to apply, especially if you want to use it as a step of your machine learning pipeline. In this post, I’m going to describe how you can still implement it using only scikit-learn and pandas (but with a bit of effort). But, after that, I’ll also show you how you can use the category encoders library to achieve the same thing in a much easier fashion.
Load the data
The data we’re going to use is the Breast Cancer Data Set from the UCI Machine Learning Repository. This data set is small and contains several categorical features, which will allow us to quickly explore a few ways to implement the one-hot encoding using Python, pandas and scikit-learn.
After downloading the data from the repository, we read it into a pandas dataframe
This data set has 286 instances with 9 features and one target (‘Class’). The target and features are described in the data set description as follows:
We’ll take all columns to be of ‘object’ type and split the training and test sets using the
train_test_split of scikit-learn.
There are several ways to encode categorical features (see, for example, here). In this post, we will focus on one of the most common and useful ones, one-hot encoding. After the transformation, each column of the resulting data set corresponds to one unique value of each original feature.
For example, suppose we have the following categorical feature with three different unique values.
After one-hot encoding, the data set looks like:
We want to implement the one-hot encoding to the breast cancer data set, in such a way that the resulting sets are suitable to use in machine learning algorithms. Note that for many of the features of this data set there is a natural ordering between the categories (e.g. the tumour size) and, therefore, other types of encoding might be more appropriate, but for concreteness we will focus only on one-hot encoding in this post.
Let’s see how we would implement one-hot encoding using scikit-learn. There is a transformer conveniently named
OneHotEncoder which, at first glance, seems to be exactly what we’re looking for.
If we try to apply the above code, we obtain an
OneHotEncoder requires that all values are integers, and not strings as we have. This means we first have to encode all the possible values as integers: for a given feature, if it has n possible values (given by n different strings), we encode them with integers between 0 and n-1. Thankfully, there is another transformer in scikit-learn, called
LabelEncoder, which does just that!
And… we obtain another
ValueError! In reality,
LabelEncoder is only intended to be used for the target vector, and as such it doesn’t work with more than one column. Unfortunately, in version 0.19 of scikit-learn, there is no transformer which can deal with several columns (there is some hope for version 0.20).
One solution is to make our own transformer, which we creatively call
MultiColumnLabelEncoder, which applies the
LabelEncoder in each of the features.
Will this work this time?
It worked! To better understand what happened, let’s check the original training set.
We see, for instance, that the age group 30–39 was given the label 0, 40–49 was given the label 1, etc, and analogously for the other features.
After applying the
MultiColumnLabelEncoder, we can (finally!) use the
OneHotEncoder to implement the one-hot encoding to both the training and test sets.
OneHotEncoderis fitted to the training set, which means that for each unique value present in the training set, for each feature, a new column is created. We have 39 columns after the encoding.
- The output is a numpy array (when the option
sparse=Falseis used), which has the disadvantage of losing all the information about the original column names and values.
- When we try to transform the test set, after having fitted the encoder to the training set, we obtain (again!) a
ValueError. This is because the there are new, previously unseen unique values in the test set and the encoder doesn’t know how to handle these values. In order to use both the transformed training and test sets in machine learning algorithms, we need them to have the same number of columns.
This last problem can be solved by using the option
handle_unknown='ignore' of the
OneHotEncoder, which, as the name suggests, will ignore previously unseen values when transforming the test set.
And that’s it! Both the training and test sets have 39 columns and are now in a suitable form to be used in machine learning algorithms which require numerical data.
However, the procedure shown above is quite inelegant and we lose the dataframe format for the data. Is there an easier way to implement all of this?
Using category encoders
Category Encoders is a library of scikit-learn-compatible categorical variable encoders, such as:
- Helmert Contrast
- Sum Contrast
- Polynomial Contrast
- Backward Difference Contrast
- Target Encoding
The Ordinal, One-Hot and Hashing encoders are improved versions of the ones present in scikit-learn with the following advantages:
- Support for pandas dataframes as an input and, optionally, as output;
- Can explicitly configure which columns in the data are encoded by name or index, or infer non-numeric columns regardless of input type;
- Compatibility with scikit-learn pipelines.
For our purposes, we’re going to use the improved
OneHotEncoder and see how much we can simplify the workflow. First, we import the category encoders library.
Then, let’s try to apply the
OneHotEncoder directly to both the training and test sets.
It worked immediately! No need to use
- The outputs are dataframes, which allows us to more easily check how the several features were transformed. As before, each transformed set has 39 columns and it can be checked that the matrices of 0’s and 1’s are identical to the ones obtained previously.
- I used the option
use_cat_names=Trueso that the possible values of each feature are added to the feature name in each new column (e.g.
age_60-69). This was done for clarity, but by default it simply adds an index (e.g.
- We can now more easily understand why we previously obtained an error when we didn’t require that unseen values in the test set were ignored. Below, we show all the columns of the test set after one-hot encoding, when the encoder is fitted to the training set (first) and when is fitted to the test set (second). We see that, in the second case, there are now 42 columns, as 3 new columns were created:
breast-quad_unknown. These corresponds to three values in the test set (20–29 in
age, 5–9 in
tumor-sizeand unknown in
breast-quad) that are not present in the training set.
Comparing the two approaches above, it’s clear that we should use the
OneHotEncoder from the category encoders library if we intend to one-hot encode the categorical features in our data set.
As remarked in the beginning, one-hot encoding is not the only possible way to encode categorical features and the category encoders library has several encoders which you should explore, as others might be more appropriate for different categorical features and machine learning problems.
You can find my LinkedIn in:
Bio: Hugo Ferreira is a Data Scientist with a background in Mathematics and Physics and spent several years doing research in the framework and computational tools to analyse previously unexplored complex systems of interest to Astrophysics and Quantum Physics.
Original. Reposted with permission.
- The Hitchhiker’s Guide to Feature Extraction
- 7 Steps to Mastering Data Preparation for Machine Learning with Python — 2019 Edition
- A Quick Guide to Feature Engineering