Anonymizing Production Data for Data Science with Mimesis

Learn how to use Python's Mimesis library to anonymize sensitive production data, with a step-by-step example you can try yourself.




Introduction

 
Production data is typically subject to notable privacy and compliance constraints. For this reason, anonymizing such data becomes critical in virtually every real-world data science project involving the launch of a data-driven product, service, or solution.

Mimesis is an open-source Python library for generating realistic "fake" data quickly. It runs locally and is free to use, which makes it easy to embed in a data pipeline. This article shows how to use the library to anonymize sensitive production data, following a step-by-step example you can easily try in your IDE or a notebook environment.

 

Step-by-Step Procedure

 
Assuming you are new to Mimesis, you may need to install it in your Python environment with a command like:

pip install mimesis

 

Remember to add ! at the beginning of the pip command if you are working in a Google Colab notebook environment or similar.
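
To confirm the installation worked before moving on, a quick check like the one below can help. It is a minimal sketch that uses only the standard library; the helper name installed_version is ours and not part of Mimesis.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


mimesis_version = installed_version("mimesis")
if mimesis_version is None:
    print("mimesis is not installed; run: pip install mimesis")
else:
    print(f"mimesis {mimesis_version} is installed")
```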

Now we are ready to start! We will consider a scenario built around a software product's tier-based subscription system. For simplicity, we will create a small toy dataset of customers and their subscription types. Some of its columns hold highly sensitive data, as you can see below:

import pandas as pd

# Creation of a mock "production" customer dataset
production_data = {
    'user_id': [101, 102, 103, 104],
    'real_name': ['Alice Smith', 'Bob Jones', 'Charlie Brown', 'Diana Prince'],
    'email': ['alice.smith@corp.com', 'bjones@startup.io', 'cbrown@domain.org', 'diana@amazon.com'],
    'phone': ['555-0100', '555-0101', '555-0102', '555-0103'],
    'subscription_tier': ['Premium', 'Basic', 'Basic', 'Enterprise']
}

df = pd.DataFrame(production_data)
print("--- Original Sensitive Data ---")
print(df.head())

 

While subscription tiers are not necessarily sensitive data in our example, user names, emails, and phone numbers are. Mimesis organizes its generators into providers: classes that each produce fake values for one kind of data. Since our records describe people, we import the Person class, a provider that, given a locale such as English and an optional random seed, generates fake substitutes for real, sensitive personal data:

from mimesis import Person
from mimesis.locales import Locale

# Initializing a Person provider for the English locale, seeded for reproducibility
person = Person(locale=Locale.EN, seed=42)

 

From this point onwards, anonymizing personally identifiable information (PII) is quite simple. All it takes is replacing the sensitive columns we specify with freshly generated values from the Person provider. For each column, a list comprehension calls the matching Mimesis method once per row, so every record receives its own realistic substitute:

# 1. Replacing real names with fake, realistic names
df['real_name'] = [person.full_name() for _ in range(len(df))]

# 2. Replacing real emails with fake ones
df['email'] = [person.email() for _ in range(len(df))]

# 3. Replacing real phone numbers
df['phone'] = [person.telephone() for _ in range(len(df))]

# 4. Renaming the column to reflect that it is no longer the real name
df.rename(columns={'real_name': 'anon_name'}, inplace=True)

 

Notice how the Person class provides dedicated methods for generating full names, emails, and telephone numbers, among others. We also rename the name column to reflect that it no longer holds real names but anonymized ones.

We now verify the results by inspecting the transformed DataFrame. The sensitive PII fields have been overwritten with realistic-looking synthetic data, while the dataset's structure and the columns needed for downstream analyses, such as subscription_tier, remain intact.

print("\n--- Anonymized Data for Data Science Analyses ---")
print(df.head())

 

Output:

--- Anonymized Data for Data Science Analyses ---
   user_id         anon_name                    email            phone  \
0      101    Anthony Reilly    archived1911@duck.com     +13312271333   
1      102           Kai Day    suspect2087@yahoo.com  +1-205-759-3586   
2      103  Cleveland Osborn     urgent1912@yahoo.com     +13691067988   
3      104       Zack Holder  johnson1881@example.com  +1-574-481-3676   

  subscription_tier  
0           Premium  
1             Basic  
2             Basic  
3        Enterprise  
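
Beyond inspecting the printed frame by eye, the same verification can be automated. The sketch below is one possible approach using plain pandas. The helper name leaked_values and the two tiny frames are hypothetical stand-ins for the original and anonymized datasets.

```python
import pandas as pd


def leaked_values(original, anonymized, column_pairs):
    """Return, per column pair, the original values that still appear after anonymization."""
    leaks = {}
    for original_column, anonymized_column in column_pairs.items():
        overlap = set(original[original_column]) & set(anonymized[anonymized_column])
        if overlap:
            leaks[original_column] = overlap
    return leaks


# Hypothetical stand-ins: a snapshot taken before anonymization and the result after it
before = pd.DataFrame({
    'real_name': ['Alice Smith', 'Bob Jones'],
    'email': ['alice.smith@corp.com', 'bjones@startup.io'],
    'subscription_tier': ['Premium', 'Basic'],
})
after = pd.DataFrame({
    'anon_name': ['Anthony Reilly', 'Kai Day'],
    'email': ['archived1911@duck.com', 'suspect2087@yahoo.com'],
    'subscription_tier': ['Premium', 'Basic'],
})

# Map each original column to its name in the anonymized frame
leaks = leaked_values(before, after, {'real_name': 'anon_name', 'email': 'email'})
print(leaks)  # an empty dict means no original value survived

# Non-sensitive columns should be untouched
print(before['subscription_tier'].equals(after['subscription_tier']))
```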

 

Fantastic! We have just applied a few simple steps to anonymize several sensitive data fields typically found in real-world, production data science projects and analyses — all for free, thanks to Mimesis being open-source.

To finalize, here are some best practices and observations for conducting the anonymization process we just covered:

  • We replaced the columns directly in the DataFrame. Depending on your context, consider whether this is the right approach, or whether you may want to store the new information in a separate DataFrame if there is a risk of losing the original data.
  • Mimesis generates values in the format expected for each field (names, email addresses, phone numbers), so the replaced columns keep their expected data types.
  • Seeding helps keep generated information consistent across different runs and facilitates reproducibility.

 

Wrapping Up

 
In this article, we have shown how to use Mimesis, a Python library for generating fake data, to transform a sensitive production dataset into a version that can be used for further analysis without exposing private information like real people's PII.
 
 

Iván Palomares Carrascosa is a leader, writer, speaker, and adviser in AI, machine learning, deep learning & LLMs. He trains and guides others in harnessing AI in the real world.

