Dealing with categorical features in machine learning

Many machine learning algorithms require that their input is numerical and therefore categorical features must be transformed into numerical features before we can use any of these algorithms.



By Hugo Ferreira, Data Scientist | Machine Learning Enthusiast | Physicist

How to easily implement one-hot encoding in Python


Photo by Max Nelson on Unsplash

Categorical data are commonplace in many Data Science and Machine Learning problems but are usually more challenging to deal with than numerical data. In particular, many machine learning algorithms require that their input is numerical and therefore categorical features must be transformed into numerical features before we can use any of these algorithms.

One of the most common ways to make this transformation is to one-hot encode the categorical features, especially when there is no natural ordering between the categories (e.g. a feature ‘City’ with names of cities such as ‘London’, ‘Lisbon’, ‘Berlin’, etc.). For each unique value of a feature (say, ‘London’) one column is created (say, ‘City_London’), whose value is 1 if the original feature takes that value for that instance, and 0 otherwise.

Even though this type of encoding is used very frequently, it can be frustrating to try to implement it using scikit-learn in Python, as there isn’t currently a simple transformer to apply, especially if you want to use it as a step of your machine learning pipeline. In this post, I’m going to describe how you can still implement it using only scikit-learn and pandas (but with a bit of effort). But, after that, I’ll also show you how you can use the category encoders library to achieve the same thing in a much easier fashion.

To illustrate the whole process, I’m going to use the Breast Cancer Data Set from the UCI Machine Learning Repository, which has many categorical features on which to implement one-hot encoding.

 

Load the data

 
The data we’re going to use is the Breast Cancer Data Set from the UCI Machine Learning Repository. This data set is small and contains several categorical features, which will allow us to quickly explore a few ways to implement the one-hot encoding using Python, pandas and scikit-learn.

After downloading the data from the repository, we read it into a pandas dataframe df.

import pandas as pd

# names of columns, as per the data set description
cols_names = ['Class', 'age', 'menopause', 'tumor-size',
              'inv-nodes', 'node-caps', 'deg-malig', 'breast',
              'breast-quad', 'irradiat']

# read the data; missing values (NaN) are represented by '?' in the raw file
df = (pd.read_csv('breast-cancer.data',
                  header=None, names=cols_names)
        .replace({'?': 'unknown'}))


This data set has 286 instances with 9 features and one target (‘Class’). The target and features are described in the data set description as follows:

 1. Class: no-recurrence-events, recurrence-events.
 2. age: 10-19, 20-29, 30-39, 40-49, 50-59, 60-69, 70-79, 80-89, 90-99.
 3. menopause: lt40, ge40, premeno.
 4. tumor-size: 0-4, 5-9, 10-14, 15-19, 20-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59.
 5. inv-nodes: 0-2, 3-5, 6-8, 9-11, 12-14, 15-17, 18-20, 21-23, 24-26, 27-29, 30-32, 33-35, 36-39.
 6. node-caps: yes, no.
 7. deg-malig: 1, 2, 3.
 8. breast: left, right.
 9. breast-quad: left-up, left-low, right-up, right-low, central.
10. irradiat: yes, no.


We’ll take all columns to be of ‘object’ type and split the data into training and test sets using scikit-learn’s train_test_split.

from sklearn.model_selection import train_test_split

X = df.drop(columns='Class')
y = df['Class'].copy()

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42)


 

One-hot encoding

 
There are several ways to encode categorical features (see, for example, here). In this post, we will focus on one of the most common and useful ones, one-hot encoding. After the transformation, each column of the resulting data set corresponds to one unique value of each original feature.

For example, suppose we have the following categorical feature with three different unique values.

+---------+
| Feature |
+---------+
| value_1 |
| value_2 |
| value_3 |
+---------+


After one-hot encoding, the data set looks like:

+-----------------+-----------------+-----------------+
| Feature_value_1 | Feature_value_2 | Feature_value_3 |
+-----------------+-----------------+-----------------+
|               1 |               0 |               0 |
|               0 |               1 |               0 |
|               0 |               0 |               1 |
+-----------------+-----------------+-----------------+


We want to apply one-hot encoding to the breast cancer data set, in such a way that the resulting sets are suitable for use in machine learning algorithms. Note that for many of the features of this data set there is a natural ordering between the categories (e.g. the tumour size), and therefore other types of encoding might be more appropriate, but for concreteness we will focus only on one-hot encoding in this post.

 

Using scikit-learn

 
Let’s see how we would implement one-hot encoding using scikit-learn. There is a transformer conveniently named OneHotEncoder which, at first glance, seems to be exactly what we’re looking for.

from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(sparse=False)
X_train_ohe = ohe.fit_transform(X_train)


If we try to apply the above code, we obtain a ValueError, as OneHotEncoder (in the scikit-learn version used here) requires all values to be integers, not strings as we have. This means we first have to encode all the possible values as integers: for a given feature, if it has n possible values (given by n different strings), we encode them with integers between 0 and n-1. Thankfully, there is another transformer in scikit-learn, called LabelEncoder, which does just that!

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
X_train_le = le.fit_transform(X_train)


And… we obtain another ValueError! In reality, LabelEncoder is only intended to be used for the target vector, and as such it doesn’t work with more than one column. Unfortunately, in version 0.19 of scikit-learn, there is no transformer which can deal with several columns (there is some hope for version 0.20).

One solution is to make our own transformer, creatively called MultiColumnLabelEncoder, which applies the LabelEncoder to each of the features.

class MultiColumnLabelEncoder:

    def __init__(self, columns=None):
        self.columns = columns  # list of columns to encode

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        '''
        Transforms columns of X specified in self.columns using
        LabelEncoder(). If no columns specified, transforms all
        columns in X.
        '''
        output = X.copy()

        if self.columns is not None:
            for col in self.columns:
                output[col] = LabelEncoder().fit_transform(output[col])
        else:
            # no columns specified: encode every column
            for colname in output.columns:
                output[colname] = LabelEncoder().fit_transform(output[colname])

        return output

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


Will this work this time?
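A minimal sketch of that step, applying the MultiColumnLabelEncoder defined above to the training set from earlier:

# label-encode every categorical column of the training set
mcle = MultiColumnLabelEncoder()
X_train_le = mcle.fit_transform(X_train)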

It worked! To better understand what happened, let’s check the original training set.
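For example, one way to inspect this (a sketch, using the variables from above):

# compare original and encoded values for one feature, e.g. 'age'
comparison = pd.DataFrame({'original': X_train['age'],
                           'encoded': X_train_le['age']})
print(comparison.drop_duplicates().sort_values('encoded'))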

We see, for instance, that the age group 30–39 was given the label 0, 40–49 was given the label 1, etc, and analogously for the other features.

After applying the MultiColumnLabelEncoder, we can (finally!) use the OneHotEncoder to one-hot encode both the training and test sets.
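A sketch of this step on the training set, with the variable names from above (note that sparse=False is the parameter name in the scikit-learn versions discussed here; newer releases renamed it sparse_output):

# one-hot encode the label-encoded training set;
# sparse=False returns a dense numpy array instead of a sparse matrix
ohe = OneHotEncoder(sparse=False)
X_train_ohe = ohe.fit_transform(X_train_le)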

Some comments:

  • The OneHotEncoder is fitted to the training set, which means that for each unique value present in the training set, for each feature, a new column is created. We have 39 columns after the encoding.
  • The output is a numpy array (when the option sparse=False is used), which has the disadvantage of losing all the information about the original column names and values.
  • When we try to transform the test set, after having fitted the encoder to the training set, we obtain (again!) a ValueError. This is because there are new, previously unseen unique values in the test set and the encoder doesn’t know how to handle them. In order to use both the transformed training and test sets in machine learning algorithms, we need them to have the same number of columns.

This last problem can be solved by using the option handle_unknown='ignore' of the OneHotEncoder, which, as the name suggests, will ignore previously unseen values when transforming the test set.
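Putting it all together, a sketch of the full (if inelegant) procedure:

# fit on the label-encoded training set,
# ignoring unseen values at transform time
ohe = OneHotEncoder(sparse=False, handle_unknown='ignore')
X_train_ohe = ohe.fit_transform(mcle.fit_transform(X_train))
# note: mcle refits the label encoding on the test set,
# which is part of what makes this approach inelegant
X_test_ohe = ohe.transform(mcle.fit_transform(X_test))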

And that’s it! Both the training and test sets have 39 columns and are now in a suitable form to be used in machine learning algorithms which require numerical data.

However, the procedure shown above is quite inelegant and we lose the dataframe format for the data. Is there an easier way to implement all of this?

There is!

 

Using category encoders

 
Category Encoders is a library of scikit-learn-compatible categorical variable encoders, such as:

  • Ordinal
  • One-Hot
  • Binary
  • Helmert Contrast
  • Sum Contrast
  • Polynomial Contrast
  • Backward Difference Contrast
  • Hashing
  • BaseN
  • LeaveOneOut
  • Target Encoding

The Ordinal, One-Hot and Hashing encoders are improved versions of the ones present in scikit-learn with the following advantages:

  • Support for pandas dataframes as an input and, optionally, as output;
  • Can explicitly configure which columns in the data are encoded by name or index, or infer non-numeric columns regardless of input type;
  • Compatibility with scikit-learn pipelines.

For our purposes, we’re going to use the improved OneHotEncoder and see how much we can simplify the workflow. First, we import the category encoders library.

import category_encoders as ce


Then, let’s try to apply the OneHotEncoder directly to both the training and test sets.
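A minimal sketch of this, fitting on the training set and reusing the same encoding for the test set (the use_cat_names option is explained below):

# one-hot encode directly, no label encoding needed
ohe = ce.OneHotEncoder(use_cat_names=True)
X_train_ohe = ohe.fit_transform(X_train)
X_test_ohe = ohe.transform(X_test)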

It worked immediately! No need to use LabelEncoder first!

Some observations:

  • The outputs are dataframes, which allows us to more easily check how the several features were transformed. As before, each transformed set has 39 columns and it can be checked that the matrices of 0’s and 1’s are identical to the ones obtained previously.
  • I used the option use_cat_names=True so that the possible values of each feature are added to the feature name in each new column (e.g. age_60-69). This was done for clarity, but by default it simply adds an index (e.g. age_1, age_2, etc).
  • We can now more easily understand why we previously obtained an error when we didn’t require that unseen values in the test set be ignored. Below, we show all the columns of the test set after one-hot encoding, when the encoder is fitted to the training set (first) and when it is fitted to the test set (second). We see that, in the second case, there are now 42 columns, as 3 new columns were created: age_20-29, tumor-size_5-9 and breast-quad_unknown. These correspond to three values in the test set (20–29 in age, 5–9 in tumor-size and unknown in breast-quad) that are not present in the training set.
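A sketch of how to make that comparison, with the variables from above:

# columns of the test set when the encoder was fitted to the training set
cols_train_fit = set(ohe.transform(X_test).columns)

# columns when a fresh encoder is fitted to the test set itself
ohe_test = ce.OneHotEncoder(use_cat_names=True)
cols_test_fit = set(ohe_test.fit_transform(X_test).columns)

print(len(cols_train_fit), len(cols_test_fit))  # 39 42
print(sorted(cols_test_fit - cols_train_fit))
# ['age_20-29', 'breast-quad_unknown', 'tumor-size_5-9']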

Comparing the two approaches above, it’s clear that we should use the OneHotEncoder from the category encoders library if we intend to one-hot encode the categorical features in our data set.
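A further benefit is the pipeline compatibility mentioned earlier, which was the pain point at the start of this post. A minimal sketch, with LogisticRegression standing in as a placeholder model:

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# one-hot encoding as a pipeline step, followed by a placeholder classifier
pipe = Pipeline([
    ('encode', ce.OneHotEncoder(use_cat_names=True)),
    ('classify', LogisticRegression()),
])
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))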

As remarked in the beginning, one-hot encoding is not the only possible way to encode categorical features and the category encoders library has several encoders which you should explore, as others might be more appropriate for different categorical features and machine learning problems.

You can find me on LinkedIn:

Hugo Ferreira - Researcher - Instituto Superior Técnico | LinkedIn

 
Bio: Hugo Ferreira is a Data Scientist with a background in Mathematics and Physics who spent several years doing research on frameworks and computational tools to analyse previously unexplored complex systems of interest to Astrophysics and Quantum Physics.

Original. Reposted with permission.
