Pandas: How to One-Hot Encode Data

In this article, we will explore how to use Pandas to one-hot encode categorical data.




 

What is One-Hot Encoding

 

One-hot encoding is a data preprocessing step that converts categorical values into a numerical representation that machine learning algorithms can work with.

categorical_column bool_col col_1 col_2 label
value_A True 9 4 0
value_B False 7 2 0
value_D True 9 5 0
value_D False 8 3 1
value_D False 9 0 1
value_D False 5 4 1
value_B True 8 1 1
value_D True 6 6 1
value_C True 0 5 0
value_D True 1 8 0

 

For example, in this dummy dataset, the categorical column holds several distinct string values. Many machine learning algorithms require their input data to be numerical, so we need a way to convert this attribute into a compatible form. To do so, we break the categorical column down into multiple binary-valued columns, one per distinct value.
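To make the idea concrete before bringing in Pandas, here is a minimal pure-Python sketch of the same transformation. The names `values`, `categories`, and `encoded` are hypothetical and chosen only for illustration; the category strings come from the dummy dataset.

```python
# A few category values taken from the dummy dataset.
values = ["value_A", "value_B", "value_D", "value_C"]

# One output column per distinct category, in sorted order.
categories = sorted(set(values))

# For each row, put a 1 in the column matching its category and 0 elsewhere.
encoded = [[1 if value == category else 0 for category in categories]
           for value in values]

for value, row in zip(values, encoded):
    print(value, row)
```

Each row ends up with exactly one 1, which is where the name "one-hot" comes from.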

 

How to use Pandas Library for One-Hot Encoding

 

First, read the .csv file (or any other supported file format) into a Pandas data frame.

import pandas as pd
df = pd.read_csv("data.csv")
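If data.csv is not at hand, the dummy dataset can be rebuilt inline so the remaining steps are reproducible. The sketch below assumes the file is comma-separated with a header row; `csv_text` is a hypothetical name, and the rows are copied from the dummy table above.

```python
import io

import pandas as pd

# Rows copied from the dummy table; the comma-separated layout is an assumption.
csv_text = """categorical_column,bool_col,col_1,col_2,label
value_A,True,9,4,0
value_B,False,7,2,0
value_D,True,9,5,0
value_D,False,8,3,1
value_D,False,9,0,1
value_D,False,5,4,1
value_B,True,8,1,1
value_D,True,6,6,1
value_C,True,0,5,0
"""

# read_csv accepts any file-like object, so StringIO stands in for a file on disk.
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())
print(df.dtypes)
```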

 

To check the unique values and better understand our data, we can use the following Pandas methods.

df['categorical_column'].nunique()
df['categorical_column'].unique()

 

For this dummy data, the functions return the following output:

>>> 4
>>> array(['value_A', 'value_B', 'value_D', 'value_C'], dtype=object)
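Beyond listing the distinct values, it can also help to see how often each one occurs, since rare categories still produce a full column after encoding. The following is a small sketch using `Series.value_counts()`; the Series is built by hand from the category values in the dummy table, and the name `categories` is hypothetical.

```python
import pandas as pd

# Category values copied from the dummy table, in row order.
categories = pd.Series(
    ["value_A", "value_B", "value_D", "value_D", "value_D",
     "value_D", "value_B", "value_D", "value_C"],
    name="categorical_column",
)

# Number of distinct categories, and the categories in order of first appearance.
n_unique = categories.nunique()
unique_values = list(categories.unique())

# Frequency of each category, most common first.
counts = categories.value_counts()

print(n_unique)
print(unique_values)
print(counts)
```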

 

We can break the categorical column down into multiple columns. For this, we use the pandas.get_dummies() function. It takes the following arguments, among others:

Arguments
data (array-like, Series, or DataFrame): the original Pandas data frame object
columns (list-like, default None): list of categorical columns to one-hot encode
drop_first (bool, default False): whether to remove the first level of each encoded column
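The effect of these arguments is easiest to see on a tiny frame. The sketch below uses a hypothetical three-row data frame; it relies on the documented behavior that, when `columns` is left as None, `get_dummies` encodes the string and category columns and passes numeric columns through unchanged.

```python
import pandas as pd

# Hypothetical three-row frame with one string column and one numeric column.
df = pd.DataFrame({
    "categorical_column": ["value_A", "value_B", "value_A"],
    "col_1": [9, 7, 8],
})

# columns=None (the default): string columns are encoded, numeric ones are kept.
encoded_default = pd.get_dummies(df)

# Naming the column explicitly gives the same result for this frame.
encoded_explicit = pd.get_dummies(df, columns=["categorical_column"])

# drop_first=True removes the first level (value_A), leaving one column.
encoded_dropped = pd.get_dummies(df, columns=["categorical_column"], drop_first=True)

print(list(encoded_default.columns))
print(list(encoded_dropped.columns))
```

Passing `columns` explicitly is usually the safer choice, because it keeps the encoding from silently changing when a new string column is added to the data.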

 

To better understand the function, let us work on one-hot encoding the dummy dataset.

 

One-Hot Encoding the Categorical Columns

 

We use the get_dummies method and pass the original data frame as data input. In columns, we pass a list containing only the categorical_column header. 

df_encoded = pd.get_dummies(df, columns=['categorical_column'])

 

This command drops categorical_column and creates a new column for each unique value. The single categorical column is therefore converted into 4 new columns, where in every row exactly one of the 4 columns holds a 1 and the other 3 hold a 0. This is why it is called one-hot encoding.

categorical_column_value_A categorical_column_value_B categorical_column_value_C categorical_column_value_D
1 0 0 0
0 1 0 0
0 0 0 1
0 0 0 1
0 0 0 1
0 0 0 1
0 1 0 0
0 0 0 1
0 0 1 0
0 0 0 1
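One detail worth knowing when reproducing this output: in pandas 2.0 and later, `get_dummies` returns boolean True/False columns by default, so getting 0/1 integers as in the table above requires passing `dtype=int`. The sketch below assumes a hypothetical four-row frame with one row per category and checks the one-hot property directly.

```python
import pandas as pd

# Hypothetical four-row frame: one row for each category in the dummy dataset.
df = pd.DataFrame({
    "categorical_column": ["value_A", "value_B", "value_D", "value_C"],
    "col_1": [9, 7, 9, 0],
})

# dtype=int asks for 0/1 integers instead of the boolean default in pandas 2.x.
df_encoded = pd.get_dummies(df, columns=["categorical_column"], dtype=int)

# Select only the newly created indicator columns.
indicator_cols = df_encoded.filter(like="categorical_column_")

# One-hot property: every row has exactly one indicator set to 1.
row_sums = indicator_cols.sum(axis=1)

print(df_encoded)
print(row_sums.tolist())
```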

 

A problem occurs when we one-hot encode the boolean column: it is also split into two new columns.

 

One-Hot Encoding Binary Columns

 

df_encoded = pd.get_dummies(df, columns=['bool_col'])

 

bool_col_False bool_col_True
0 1
1 0
0 1
1 0

 

Two columns are more than we need here: a single column with True encoded as 1 and False encoded as 0 carries the same information. To get that, we use the drop_first argument.

df_encoded = pd.get_dummies(df, columns=['bool_col'], drop_first=True)

 

bool_col_True
1
0
1
0
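As a cross-check, the single column produced by `drop_first=True` should match a plain cast of the boolean column to integers. The sketch below uses a hypothetical four-row frame mirroring the output above; `astype(int)` is shown only as an alternative for comparison, not as the method used in this article.

```python
import pandas as pd

# Hypothetical frame mirroring the four boolean rows shown above.
df = pd.DataFrame({"bool_col": [True, False, True, False]})

# drop_first=True drops the first sorted level (False), keeping bool_col_True.
# dtype=int requests 0/1 integers rather than booleans.
dropped = pd.get_dummies(df, columns=["bool_col"], drop_first=True, dtype=int)

# Alternative for comparison: cast the boolean column to integers directly.
as_int = df["bool_col"].astype(int)

print(dropped)
print(as_int.tolist())
```

For a column that is already boolean, the cast is the simpler route; `drop_first` becomes more useful when the binary column holds two string labels.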

 

Conclusion

 

The dummy dataset is now one-hot encoded, and the final result looks like this:

col_1 col_2 bool A B C D label
9 4 1 1 0 0 0 0
7 2 0 0 1 0 0 0
9 5 1 0 0 0 1 0
8 3 0 0 0 0 1 1
9 0 0 0 0 0 1 1
5 4 0 0 0 0 1 1
8 1 1 0 1 0 0 1
6 6 1 0 0 0 1 1
0 5 1 0 0 1 0 0
1 8 1 0 0 0 1 0

 

The categorical values and boolean values have been converted to numerical values that can be used as input to machine learning algorithms. 
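Putting the steps together, one way to arrive at a fully numerical table is to call `get_dummies` twice: once keeping every level of the categorical column, and once with `drop_first=True` for the boolean column. This is a sketch under assumptions, using a hypothetical four-row sample of the dummy dataset and `dtype=int` to obtain 0/1 values.

```python
import pandas as pd

# Hypothetical four-row sample of the dummy dataset.
df = pd.DataFrame({
    "categorical_column": ["value_A", "value_B", "value_D", "value_C"],
    "bool_col": [True, False, True, True],
    "col_1": [9, 7, 9, 0],
    "col_2": [4, 2, 5, 5],
    "label": [0, 0, 0, 0],
})

# Step 1: expand the categorical column into one indicator per category.
df_encoded = pd.get_dummies(df, columns=["categorical_column"], dtype=int)

# Step 2: collapse the boolean column into a single 0/1 column.
df_encoded = pd.get_dummies(
    df_encoded, columns=["bool_col"], drop_first=True, dtype=int
)

# Check that every remaining column is numeric.
all_numeric = all(pd.api.types.is_numeric_dtype(t) for t in df_encoded.dtypes)

print(df_encoded)
print(all_numeric)
```

The two calls are kept separate because `drop_first` applies to every column being encoded in that call; a single call with `drop_first=True` would also drop the value_A indicator.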
 
 
Muhammad Arham is a Deep Learning Engineer working in Computer Vision and Natural Language Processing. He has worked on the deployment and optimizations of several generative AI applications that reached the global top charts at Vyro.AI. He is interested in building and optimizing machine learning models for intelligent systems and believes in continual improvement.