Encoding Categorical Features with MultiLabelBinarizer

Transform multi-label data into a binary matrix for multi-label classification.

By Abid Ali Awan, KDnuggets Assistant Editor on January 20, 2023 in Natural Language Processing


Image by Author

In the past, you might have converted categorical features into numerical ones using one-hot, label, or ordinal encoders. Those encoders assume that each sample carries only one label. But how do you deal with samples that have multiple labels?

In this mini tutorial, you will learn the difference between multi-class and multi-label classification. We will then apply Scikit-learn's MultiLabelBinarizer class to convert an iterable of iterables of labels into a multi-label binary matrix.

Multi-Class vs. Multi-Label

In machine learning, multi-class classification involves more than two classes, and each sample is assigned exactly one label. In multi-label classification, by contrast, each sample can be assigned several labels at once.

Image from Thamme Gowda

Let's review an example of each type of classification task.

Multi-Class

In the multi-class case, every student record has only one label (Major), and there are more than two classes. Each student has exactly one major: Math, Science, or English.

Image by Author
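As a point of comparison, here is a minimal sketch of the multi-class case using Scikit-learn's LabelBinarizer. The student majors below are made-up illustration data, not the values from the figure.

```python
from sklearn.preprocessing import LabelBinarizer

# Hypothetical records: exactly one major per student.
majors = ["Math", "Science", "English", "Math"]

lb = LabelBinarizer()
one_hot = lb.fit_transform(majors)

print(list(lb.classes_))  # ['English', 'Math', 'Science'] (sorted)
print(one_hot)

# Each row contains exactly one 1 because each student has one major.
row_sums = one_hot.sum(axis=1)
print(row_sums)           # [1 1 1 1]
```

The fact that every row sums to 1 is what separates this encoding from the multi-label case.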

Multi-Label

In the multi-label case, a student can have more than one major. For example, Nisha has selected English, Law, and History as her majors.

As we can also see, the length of the label list varies: some students have two majors, and some have three.

In general, a student can have anywhere from 0 to N majors.

Image by Author
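The variable-length nature of multi-label data can be sketched in plain Python. Only Nisha's majors come from the text above; the other two students and their majors are assumptions made for illustration.

```python
# Each student maps to a list of majors; list lengths differ.
student_majors = {
    "Nisha": ["English", "Law", "History"],  # from the example above
    "Abid": ["Math", "Science"],             # hypothetical
    "Matt": [],                              # hypothetical: no major yet
}

# The number of labels differs from student to student.
label_counts = {name: len(m) for name, m in student_majors.items()}
print(label_counts)  # {'Nisha': 3, 'Abid': 2, 'Matt': 0}

# The label vocabulary is the union of all per-student lists.
all_majors = sorted({m for ms in student_majors.values() for m in ms})
print(all_majors)    # ['English', 'History', 'Law', 'Math', 'Science']
```

Because the lists are ragged, they cannot be fed to a model directly; they first need a fixed-width representation with one column per label.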

Scikit-Learn MultiLabelBinarizer Examples

We will now use Scikit-learn's MultiLabelBinarizer to convert an iterable of iterables of labels into a binary indicator matrix.

Example 1

In the first example, we transform a list of lists into a binary encoding using MultiLabelBinarizer. The fit_transform method first learns the set of labels from the data and then applies the transformation.

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
print(mlb.fit_transform([["Abid", "Matt"], ["Nisha"]]))

Output:

We got an array of 1s and 0s, with one row per sample and one column per label in sorted order (Abid, Matt, Nisha).

[[1 1 0]
 [0 0 1]]
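To make the two stages of fit_transform explicit, the sketch below calls fit and transform separately on the same data and then reverses the encoding with inverse_transform. The variable names are chosen for illustration only.

```python
from sklearn.preprocessing import MultiLabelBinarizer

samples = [["Abid", "Matt"], ["Nisha"]]

binarizer = MultiLabelBinarizer()
binarizer.fit(samples)                 # learn the sorted set of labels
print(list(binarizer.classes_))        # ['Abid', 'Matt', 'Nisha']

encoded = binarizer.transform(samples) # one column per learned label
print(encoded)

# inverse_transform maps each binary row back to a tuple of labels.
decoded = binarizer.inverse_transform(encoded)
print(decoded)                         # [('Abid', 'Matt'), ('Nisha',)]
```

Splitting the steps is useful when the labels are learned on training data and the same fitted object is later applied to new samples.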

Example 2

We can also convert a list of sets into a binary matrix indicating the presence of each class label.

After the transformation, you can view the class labels through the .classes_ attribute.

y = mlb.fit_transform(
    [
        {"Abid", "Matt"},
        {"Nisha", "Abid", "Matt"},
        {"Nisha", "Abid", "Sara", "Matt"},
        {"Matt", "Sara"},
    ]
)
print(list(mlb.classes_))

Output:

['Abid', 'Matt', 'Nisha', 'Sara']

To make the binary matrix easier to read, we will convert the output into a Pandas DataFrame with the classes as column names.

res = pd.DataFrame(y, columns=mlb.classes_)
res

Just like one-hot encoding, it represents labels as 1s and 0s, except that a row can contain more than one 1.

MultiLabelBinarizer is commonly used in image and news classification. After the transformation, you can train a model such as a Random Forest or a neural network on the binary targets.
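For reference, the following self-contained sketch rebuilds the DataFrame from Example 2, shows its printed form, and adds two simple summaries. The names labels_per_sample and samples_per_label are introduced here for illustration.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

samples = [
    {"Abid", "Matt"},
    {"Nisha", "Abid", "Matt"},
    {"Nisha", "Abid", "Sara", "Matt"},
    {"Matt", "Sara"},
]

mlb = MultiLabelBinarizer()
res = pd.DataFrame(mlb.fit_transform(samples), columns=mlb.classes_)
print(res)
#    Abid  Matt  Nisha  Sara
# 0     1     1      0     0
# 1     1     1      1     0
# 2     1     1      1     1
# 3     0     1      0     1

labels_per_sample = res.sum(axis=1)  # how many labels each row carries
samples_per_label = res.sum(axis=0)  # how often each label occurs
```

Row sums greater than 1 confirm that this is a multi-label encoding rather than a one-hot one.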
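As a rough sketch of that last step, the binary matrix can be passed directly as the target of Scikit-learn's RandomForestClassifier, which accepts a 2-D indicator matrix for multi-label problems. The features and topic labels below are randomly generated, hypothetical toy data, so the predictions themselves carry no meaning.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy data: 6 samples with 4 random features each.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
topics = [
    {"sports"}, {"politics", "economy"}, {"economy"},
    {"sports", "politics"}, {"politics"}, {"sports", "economy"},
]

topic_mlb = MultiLabelBinarizer()
Y = topic_mlb.fit_transform(topics)  # shape (6, 3) indicator matrix

# The forest is trained on all label columns at once.
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X, Y)

pred = clf.predict(X[:2])            # binary rows, one column per label
print(topic_mlb.inverse_transform(pred))
```

Calling inverse_transform on the predictions turns the binary rows back into readable label tuples.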

Abid Ali Awan (@1abidaliawan) is a certified data scientist professional who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a Master's degree in Technology Management and a bachelor's degree in Telecommunication Engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.
