Encoding Categorical Features with MultiLabelBinarizer

Transform multi-label data into a binary matrix for multi-label classification.



Image by Author

 

In the past, you might have converted categorical features into numerical ones using one-hot, label, or ordinal encoders. In those cases, you were working with data that has only one label per sample. But how do you deal with samples that have multiple labels?

In this mini tutorial, you will learn the difference between multi-class and multi-label classification. We will then use Scikit-Learn’s MultiLabelBinarizer class to convert an iterable of iterables (for example, a list of lists or a list of sets) into a binary multi-label matrix.

 

Multi-Class vs. Multi-Label

 

In machine learning, multi-class classification data consists of more than two classes, and each sample is assigned exactly one label. In multi-label classification, by contrast, each sample can be assigned several labels at once.

 

Image from Thamme Gowda 

 

Let's look at an example of each type of classification task.

 

Multi-Class

 

In the multi-class case, every student record has exactly one label (Major), and there are more than two classes. Each student can have only one of Math, Science, or English as a major.
 

Image by Author
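As a point of comparison, here is a minimal sketch of how single-label data like this could be one-hot encoded with Scikit-Learn's LabelBinarizer. The list of majors is made up for illustration and is not the data from the figure.

```python
from sklearn.preprocessing import LabelBinarizer

# Hypothetical data: exactly one major per student (multi-class).
majors = ["Math", "Science", "English", "Math"]

lb = LabelBinarizer()
one_hot = lb.fit_transform(majors)

print(list(lb.classes_))  # class labels, sorted alphabetically
print(one_hot)            # every row contains exactly one 1
```

Because each student has a single major, every row of the result sums to 1. That is the property that no longer holds in the multi-label case.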

 

Multi-Label

 

In the multi-label case, a student can have more than one major. For example, Nisha has selected English, Law, and History as her majors.

The length of the label array also varies from student to student: some have two majors, and some have three.

In general, each student can have anywhere from 0 to N majors.
 

Image by Author
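The variable-length nature of multi-label data can be sketched in code as a list of lists. The records below are hypothetical (they are not the exact rows from the figure) and include a student with no major to illustrate the "0 to N" case.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical records: each student has 0 to N majors.
student_majors = [
    ["English", "Law", "History"],  # three majors
    ["Math", "Science"],            # two majors
    [],                             # no major declared
]

mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform(student_majors)

print(list(mlb.classes_))    # one column per distinct major
print(encoded.sum(axis=1))   # number of majors per student
```

Rows of different lengths all map to fixed-width binary rows, and the student with no major becomes a row of zeros.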

 

Scikit-Learn MultiLabelBinarizer Examples

 

We will now use Scikit-Learn's MultiLabelBinarizer to convert an iterable of iterables, where each inner iterable holds the labels of one sample, into a binary encoding.

 

Example 1

 

In the first example, we transform a list of lists into a binary encoding using the MultiLabelBinarizer class. The fit_transform method first learns the set of unique labels from the data and then applies the transformation.

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
print(mlb.fit_transform([["Abid", "Matt"], ["Nisha"]]))

 

Output:

We get a binary array of 1s and 0s, with one column per class in sorted order (Abid, Matt, Nisha).

[[1 1 0]
 [0 0 1]]
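To check which label each column stands for, the encoding can be reversed. The following is a small sketch using the transformer's inverse_transform method on the same input as Example 1.

```python
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
y = mlb.fit_transform([["Abid", "Matt"], ["Nisha"]])

# inverse_transform maps each binary row back to a tuple of labels.
labels = mlb.inverse_transform(y)
print(labels)
```

Each row comes back as a tuple containing the labels whose columns were set to 1.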

 

Example 2

 

We can also convert a list of sets to a binary matrix indicating the presence of each class label.

After the transformation, you can view the class labels through the .classes_ attribute.

y = mlb.fit_transform(
    [
        {"Abid", "Matt"},
        {"Nisha", "Abid", "Matt"},
        {"Nisha", "Abid", "Sara", "Matt"},
        {"Matt", "Sara"},
    ]
)
print(list(mlb.classes_))

 

Output:

['Abid', 'Matt', 'Nisha', 'Sara']
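By default, the classes are sorted. If a specific column order is needed, one option is the classes parameter of MultiLabelBinarizer. The sketch below uses an arbitrary order chosen for illustration.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Passing `classes` fixes the column order instead of sorting the labels.
# The order used here is arbitrary and only for illustration.
fixed = MultiLabelBinarizer(classes=["Sara", "Nisha", "Matt", "Abid"])
y_fixed = fixed.fit_transform([{"Abid", "Matt"}, {"Matt", "Sara"}])

print(list(fixed.classes_))
print(y_fixed)
```

The columns of y_fixed follow the order given in classes, which can help keep encodings consistent across datasets.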

 

To make the binary matrix easier to read, we will convert the output into a Pandas DataFrame that uses the class labels as column names.

res = pd.DataFrame(y, columns=mlb.classes_)
res

 

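Just like one-hot encoding, it represents labels as 1s and 0s. The difference is that a row can contain more than one 1, because a sample can belong to several classes.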

 


 

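MultiLabelBinarizer is commonly used to encode targets in image and news classification, where a single sample can carry several tags. Once the labels are encoded, the binary matrix can be passed as the target to models that support multi-label output, such as a random forest or a neural network.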
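As a rough sketch of that last step, the example below fits a RandomForestClassifier on a binarized multi-label target. The features, tags, and hyperparameters are made-up toy values chosen only to keep the example small; they are not from a real dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Toy, made-up data: 6 samples with 2 numeric features each.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [0, 2], [2, 0]])
tags = [["news"], ["news", "sports"], ["tech"], ["sports", "tech"], ["news"], ["tech"]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(tags)  # shape (6, 3): one column per tag

# Random forests accept a binary indicator matrix as the target.
clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X, Y)

pred = clf.predict(X[:2])                         # binary matrix, one row per sample
print(mlb.inverse_transform(pred.astype(int)))    # back to readable tags
```

The predictions come back in the same binary format as the training target, so inverse_transform can turn them into tag tuples again.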
 
 
Abid Ali Awan (@1abidaliawan) is a certified data scientist professional who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a Master's degree in Technology Management and a bachelor's degree in Telecommunication Engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.