How to Manage Categorical Data Effectively with Pandas

Let's learn about categorical data in Pandas.



 

Preparation

 
Before we start, we need the Pandas and NumPy packages installed. You can install them with the following command:

pip install pandas numpy

 

With the packages installed, let's jump into the main part of the article.

 

Manage Categorical Data in Pandas

 

Categorical is a Pandas data type for a column that takes a fixed, limited set of distinct values (categories). It differs from the string or object data type mainly in how Pandas stores the data.

Categorical data is more memory-efficient because each distinct category is stored only once, and every row holds a small integer code that points to its category. In contrast, the object data type stores each value as a separate Python string, which requires much more memory when values repeat.
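To make the storage difference concrete, here is a minimal sketch that inspects what a categorical column holds internally. The Series name fruits is only illustrative and is not part of the example that follows.

```python
import pandas as pd

# A small categorical Series; the values are illustrative.
fruits = pd.Series(
    ['apple', 'kiwi', 'watermelon', 'kiwi', 'apple', 'kiwi'],
    dtype='category',
)

# Each distinct value is kept once (sorted by default for strings).
categories = list(fruits.cat.categories)

# Every row stores only a small integer pointing into the categories.
codes = fruits.cat.codes.tolist()

print(categories)  # ['apple', 'kiwi', 'watermelon']
print(codes)       # [0, 1, 2, 1, 0, 1]
```

With only three categories, the codes fit in an 8-bit integer per row, which is where most of the memory saving comes from.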

Let’s try out the categorical data type with an example. Below is how we can create categorical columns with Pandas.

import pandas as pd

df = pd.DataFrame({
    'fruits': pd.Categorical(['apple', 'kiwi', 'watermelon', 'kiwi', 'apple', 'kiwi']),
    'size': pd.Categorical(['small', 'large', 'large', 'small', 'large', 'small'])
})
df.info()

 

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype   
---  ------  --------------  -----   
 0   fruits  6 non-null      category
 1   size    6 non-null      category
dtypes: category(2)
memory usage: 396.0 bytes

 

You can see that the data type of both the fruits and size columns is category rather than the object type we usually get for strings.

We can try to compare the memory usage for the categorical and object data types with the following code:

import numpy as np

n = 100000

df_object = pd.DataFrame({
    'fruit': np.random.choice(['apple', 'banana', 'orange'], size=n)
})

print('Memory usage with object type:')
print(df_object['fruit'].memory_usage(deep=True))


df_category = pd.DataFrame({
    'fruit': pd.Categorical(np.random.choice(['apple', 'banana', 'orange'], size=n))
})

print('Memory usage with categorical type:')
print(df_category['fruit'].memory_usage(deep=True))

 

Output:

Memory usage with object type:
6267209
Memory usage with categorical type:
100424

 

In this run, the object column uses roughly 60 times more memory than the categorical one, and the absolute saving grows with the number of rows.
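In practice, the data often already exists as a string column, so the conversion is done after loading. A minimal sketch, assuming a hypothetical DataFrame df_demo with a low-cardinality fruit column, uses astype('category') and compares memory before and after:

```python
import numpy as np
import pandas as pd

# Seeded only so the sketch is repeatable; df_demo is a hypothetical example frame.
rng = np.random.default_rng(0)
df_demo = pd.DataFrame({
    'fruit': rng.choice(['apple', 'banana', 'orange'], size=10_000)
})

# Memory used by the column while it holds plain strings.
before = df_demo['fruit'].memory_usage(deep=True)

# Convert the existing column in place to the categorical data type.
df_demo['fruit'] = df_demo['fruit'].astype('category')
after = df_demo['fruit'].memory_usage(deep=True)

print(df_demo['fruit'].dtype)
print(before, after)
```

The saving depends on how often values repeat. A column where almost every value is unique gains little from the conversion and can even use more memory.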

Next, we will look at the methods specific to the categorical data type, which are available through the .cat accessor. For example, you can get the categories:

df['fruits'].cat.categories

 

Output:

Index(['apple', 'kiwi', 'watermelon'], dtype='object')

 

Also, we can rename the categories:

df['fruits'] = df['fruits'].cat.rename_categories(['fruit_apple', 'fruit_kiwi', 'fruit_watermelon'])
print(df['fruits'].cat.categories)

 

Output:

Index(['fruit_apple', 'fruit_kiwi', 'fruit_watermelon'], dtype='object')
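When a list is passed to rename_categories, the new names are matched to the existing categories by position, so it is easy to pair a name with the wrong category. As a hedged alternative, the method also accepts a dict-like mapping from old to new names. The sketch below uses a separate, illustrative Series rather than the df above:

```python
import pandas as pd

# Illustrative Series, independent of the DataFrame used in the article.
fruits = pd.Series(pd.Categorical(['apple', 'kiwi', 'watermelon', 'kiwi']))

# Map old names to new ones explicitly; categories missing from the
# mapping (here 'watermelon') are left unchanged.
renamed = fruits.cat.rename_categories({
    'apple': 'fruit_apple',
    'kiwi': 'fruit_kiwi',
})

new_categories = list(renamed.cat.categories)
new_values = renamed.tolist()

print(new_categories)  # ['fruit_apple', 'fruit_kiwi', 'watermelon']
print(new_values)      # ['fruit_apple', 'fruit_kiwi', 'watermelon', 'fruit_kiwi']
```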

 

A categorical can also be ordered, which lets us compare values by their category order rather than alphabetically.

df['size'] = pd.Categorical(df['size'], categories=['small', 'medium', 'large'], ordered=True)
df['size'] < 'large' 

 

Output:

0     True
1    False
2    False
3     True
4    False
5     True
Name: size, dtype: bool
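The same ordering also drives sorting and aggregations such as min and max. A small sketch, using an illustrative Series named sizes that is not part of the DataFrame above:

```python
import pandas as pd

# Ordered categorical: small < medium < large (not alphabetical order).
sizes = pd.Series(pd.Categorical(
    ['small', 'large', 'large', 'small', 'medium'],
    categories=['small', 'medium', 'large'],
    ordered=True,
))

# Sorting follows the declared category order.
sorted_sizes = sizes.sort_values().tolist()

# min and max are defined only because the categorical is ordered.
smallest, largest = sizes.min(), sizes.max()

# Comparisons against a category label also use the declared order.
at_least_medium = (sizes >= 'medium').tolist()

print(sorted_sizes)      # ['small', 'small', 'medium', 'large', 'large']
print(smallest, largest) # small large
print(at_least_medium)   # [False, True, True, False, True]
```

With a plain string column, the same sort would put large before medium and small, since strings sort alphabetically.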

 

Mastering the categorical data type will give you an edge in data analysis.

 


 

 
 

Cornellius Yudha Wijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he loves to share Python and data tips via social media and writing media. Cornellius writes on a variety of AI and machine learning topics.

