Clean and Validate Your Data Using Pandera

Stop wasting time on dirty data! Learn how to clean it up in minutes with Pandera.




When working with data, it's important to perform checks to make sure it isn't dirty or invalid: nulls, missing values, or numbers that aren't allowed for a specific column can all slip in. These checks are essential because bad data can lead to wrong analysis, failed models, and a lot of wasted time and resources.

You’ve probably already seen the usual way of cleaning and validating data with plain old pandas, but in this tutorial, I want to show you something better: a powerful Python library called Pandera. Pandera offers a flexible and expressive API for performing data validation on DataFrame-like objects, and it's a much faster and more scalable approach than checking everything manually. You create schemas that define how your data is supposed to look: the structure, data types, and rules. Pandera then checks your data against those schemas and points out anything that doesn't fit, so you can catch and fix issues early instead of running into problems later.
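
To give you a taste of the pattern before we dive in, here is the whole idea in miniature (the column name and rule here are just placeholders):

import pandas as pd
import pandera as pa
from pandera import Column, Check

# A one-column schema: "score" must be an integer greater than zero
mini_schema = pa.DataFrameSchema({"score": Column(int, Check.greater_than(0))})

# Valid data passes through unchanged; invalid data raises a SchemaError
mini_schema.validate(pd.DataFrame({"score": [10, 20]}))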

This guide assumes you already know a bit of Python and Pandas. Let’s walk through the step-by-step process of using Pandera in your workflows.

 

Step 1: Setting Up Your Environment

 
First, you need to install the necessary packages:

pip install pandera pandas

 
After installation, import the libraries and verify the setup:

import pandas as pd
import pandera as pa

print("pandas version:", pd.__version__)
print("pandera version:", pa.__version__)

 
This should display the versions of pandas and Pandera, confirming they’re installed correctly:

pandas version: 2.2.2
pandera version: 0.0.0+dev0

 

Step 2: Creating a Sample Dataset

 
Let’s create a sample dataset of customer information with intentional errors to demonstrate cleaning and validation:

import pandas as pd

# Customer dataset with errors
data = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, "invalid"],  # "invalid" is not an integer
    "name": ["Maryam", "Jane", "", "Alice", "Bobby"],  # Empty name
    "age": [25, -5, 30, 45, 35],  # Negative age is invalid
    "email": ["mrym@gmail.com", "jane.s@yahoo.com", "invalid_email", "alice@google.com", None]  # Invalid email and None
})

print("Original DataFrame:")
print(data)

 

Output:

Original DataFrame:
  customer_id    name  age             email
0           1  Maryam   25    mrym@gmail.com
1           2    Jane   -5  jane.s@yahoo.com
2           3           30     invalid_email
3           4   Alice   45  alice@google.com
4     invalid   Bobby   35              None

 
Issues in the dataset:

  • customer_id: Contains a string ("invalid") instead of integers.
  • name: Has an empty string.
  • age: Includes a negative value (-5).
  • email: Has an invalid format (invalid_email) and a missing value (None).

 

Step 3: Defining a Pandera Schema

 
A Pandera schema defines the expected structure and constraints for the DataFrame. We’ll use DataFrameSchema to specify rules for each column:

import pandera as pa
from pandera import Column, Check, DataFrameSchema

# Define the schema
schema = DataFrameSchema({
    "customer_id": Column(
        dtype="int64",  # Use int64 for consistency
        checks=[
            Check.isin(range(1, 1000)),  # IDs between 1 and 999
            Check(lambda x: x > 0, element_wise=True)  # IDs must be positive
        ],
        nullable=False
    ),
    "name": Column(
        dtype="string",
        checks=[
            Check.str_length(min_value=1),  # Names cannot be empty
            Check(lambda x: x.strip() != "", element_wise=True)  # No empty strings
        ],
        nullable=False
    ),
    "age": Column(
        dtype="int64",
        checks=[
            Check.greater_than(0),  # Age must be positive
            Check.less_than_or_equal_to(120)  # Age must be reasonable
        ],
        nullable=False
    ),
    "email": Column(
        dtype="string",
        checks=[
            Check.str_matches(r"^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$")  # Email regex
        ],
        nullable=False
    )
})
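
A side note before we validate: if you'd rather have Pandera attempt type conversion for you, you can set coerce=True on a column (or on the whole schema), and validate will try to cast values to the declared dtype before running the checks, saving a manual astype step. A minimal sketch of that option, reusing the customer_id rule:

# With coerce=True, Pandera casts the column to int64 before checking;
# values that cannot be cast are reported as coercion failures
coercing_schema = DataFrameSchema({
    "customer_id": Column("int64", Check.isin(range(1, 1000)), coerce=True)
})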

 

Step 4: Initial Validation

 
Now, let’s validate our DataFrame against the schema. Pandera provides the validate method to check if the data conforms to the schema. Set lazy=True to collect all errors at once instead of stopping at the first failure:

print("\nInitial Validation:")
try:
    validated_df = schema.validate(data, lazy=True)
    print("Data is valid!")
    print(validated_df)
except pa.errors.SchemaErrors as e:
    print("Validation failed with these problems:")
    print(e.failure_cases[['column', 'check', 'failure_case', 'index']])

 

The validation will fail because of the issues in our dataset. The error message will look something like this:

Output:

Initial Validation:
Validation failed with these problems:
        column                                              check  \
0  customer_id                               isin(range(1, 1000))   
1         name                                str_length(1, None)   
2         name                                   <Check <lambda>>   
3          age                                    greater_than(0)   
4        email                                       not_nullable   
5        email  str_matches('^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\....   
6  customer_id                                     dtype('int64')   
7  customer_id                                   <Check <lambda>>   
8         name                            dtype('string[python]')   
9        email                            dtype('string[python]')   

                                        failure_case index  
0                                            invalid     4  
1                                                        2  
2                                                        2  
3                                                 -5     1  
4                                               None     4  
5                                      invalid_email     2  
6                                             object  None  
7  TypeError("'>' not supported between instances...  None  
8                                             object  None  
9                                             object  None 
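
Since failure_cases is a regular pandas DataFrame, you can slice and aggregate it like any other. For instance, a quick sketch for counting how many checks failed per column:

try:
    schema.validate(data, lazy=True)
except pa.errors.SchemaErrors as e:
    # Tally failed checks by column to see where most problems are
    print(e.failure_cases["column"].value_counts())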

 

Step 5: Cleaning the Data

 
Now that we’ve identified the issues, let’s clean the data to make it conform to the schema. We’ll handle each issue step by step:

  • customer_id: Remove rows with non-integer or invalid IDs
  • name: Remove rows with empty names
  • age: Remove rows with negative or unreasonable ages
  • email: Remove rows with invalid or missing emails

# Step 5: Clean the data

# Step 5a: Clean customer_id (convert to numeric and keep valid IDs)
data["customer_id"] = pd.to_numeric(data["customer_id"], errors="coerce")  # Convert to numeric, invalid to NaN
data = data[data["customer_id"].notna()]  # Remove NaNs first
data = data[data["customer_id"].isin(range(1, 1000))]  # Filter valid IDs
data["customer_id"] = data["customer_id"].astype("int64")  # Force int64

# Step 5b: Clean name (remove empty or whitespace-only names)
data = data[data["name"].str.strip() != ""]
data["name"] = data["name"].astype("string[python]")

# Step 5c: Clean age (keep positive and reasonable ages)
data = data[data["age"] > 0]
data = data[data["age"] <= 120]

# Step 5d: Clean email (remove invalid or missing emails)
data = data[data["email"].notna()]
data = data[data["email"].str.match(r"^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$")]
data["email"] = data["email"].astype("string[python]")

# Display cleaned data
print("Cleaned DataFrame:")
print(data)

 

After cleaning, the DataFrame should look like this:

Output:
Cleaned DataFrame:
   customer_id    name  age             email
0            1  Maryam   25    mrym@gmail.com
3            4   Alice   45  alice@google.com

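If you'd rather not hand-write a filter for every column, one alternative worth knowing (a sketch applied to the original dirty DataFrame, not a built-in Pandera feature) is to drop the failing rows reported by failure_cases directly. Note that dtype-level failures carry no row index, so column dtypes still need fixing separately:

try:
    schema.validate(dirty_data, lazy=True)  # dirty_data: the original DataFrame
except pa.errors.SchemaErrors as e:
    # Row-level failures carry an index; dtype-level failures show index None
    bad_rows = e.failure_cases["index"].dropna().astype(int).unique()
    cleaned = dirty_data.drop(index=bad_rows)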
 

Step 6: Re-Validating the Data

 
Let’s re-validate the cleaned DataFrame to ensure it now conforms to the schema:

print("\nFinal Validation:")
try:
    validated_df = schema.validate(data, lazy=True)
    print("Cleaned data is valid!")
    print(validated_df)
except pa.errors.SchemaErrors as e:
    print("Validation failed after cleaning. Errors:")
    print(e.failure_cases[['column', 'check', 'failure_case', 'index']])

 

Output:
Final Validation:
Cleaned data is valid!
   customer_id    name  age             email
0            1  Maryam   25    mrym@gmail.com
3            4   Alice   45  alice@google.com

 
The validation passes, confirming that our cleaning steps resolved all issues.

 

Step 7: Building a Reusable Pipeline

 
To make your workflow reusable, you can encapsulate the cleaning and validation in a pipeline like this:

def process_data(df, schema):
    """
    Process and validate a DataFrame using a Pandera schema.
    Args:
        df: Input pandas DataFrame
        schema: Pandera DataFrameSchema
    Returns:
        Validated and cleaned DataFrame, or None if validation fails
    """
    # Create a copy for cleaning
    data_clean = df.copy()
    
    # Clean customer_id
    data_clean["customer_id"] = pd.to_numeric(data_clean["customer_id"], errors="coerce")
    data_clean = data_clean[data_clean["customer_id"].notna()]
    data_clean = data_clean[data_clean["customer_id"].isin(range(1, 1000))]
    data_clean["customer_id"] = data_clean["customer_id"].astype("int64")
    
    # Clean name
    data_clean = data_clean[data_clean["name"].str.strip() != ""]
    data_clean["name"] = data_clean["name"].astype("string")
    
    # Clean age
    data_clean = data_clean[data_clean["age"] > 0]
    data_clean = data_clean[data_clean["age"] <= 120]
    data_clean["age"] = data_clean["age"].astype("int64")
    
    # Clean email
    data_clean = data_clean[data_clean["email"].notna()]
    data_clean = data_clean[data_clean["email"].str.match(r"^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$")]
    data_clean["email"] = data_clean["email"].astype("string")
    
    # Reset index
    data_clean = data_clean.reset_index(drop=True)
    
    # Validate
    try:
        validated_df = schema.validate(data_clean, lazy=True)
        print("Data processing successful!")
        return validated_df
    except pa.errors.SchemaErrors as e:
        print("Validation failed after cleaning. Errors:")
        print(e.failure_cases[['column', 'check', 'failure_case', 'index']])
        return None

# Test the pipeline (our data is already clean here; raw input is handled the same way)
print("\nTesting Pipeline:")
final_df = process_data(data, schema)
print("Final Processed DataFrame:")
print(final_df)

 

Output:
Testing Pipeline:
Data processing successful!
Final Processed DataFrame:
   customer_id    name  age             email
0            1  Maryam   25    mrym@gmail.com
1            4   Alice   45  alice@google.com

 
The same pipeline can now be applied to any other dataset that is expected to follow this schema.
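
For instance, applying the pipeline to a hypothetical second batch of customers looks like this:

new_batch = pd.DataFrame({
    "customer_id": [10, 11],
    "name": ["Sam", "Lee"],
    "age": [29, 41],
    "email": ["sam@example.com", "lee@example.com"]
})

processed = process_data(new_batch, schema)  # prints "Data processing successful!"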

 

Conclusion

 
Pandera is a powerful tool for ensuring data quality in your pandas workflows. By defining schemas, you can catch errors early, enforce consistency, and automate data cleaning. In this article, we:

  1. Installed Pandera and set up a sample dataset
  2. Defined a schema with rules for data types and constraints
  3. Validated the data and identified issues
  4. Cleaned the data to conform to the schema
  5. Re-validated the cleaned data
  6. Built a reusable pipeline for processing data

Pandera also offers advanced features for complex validation scenarios, such as class-based schemas, cross-field validation, partial validation, and more, which you can explore in the official Pandera documentation.
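
As a small taste of the class-based API, the schema from Step 3 could be expressed roughly like this (a sketch; the model name is my own):

from pandera.typing import Series

class CustomerModel(pa.DataFrameModel):
    customer_id: Series[int] = pa.Field(isin=range(1, 1000))
    name: Series[str] = pa.Field(str_length={"min_value": 1})
    age: Series[int] = pa.Field(gt=0, le=120)
    email: Series[str] = pa.Field(str_matches=r"^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$")

# Validates just like a DataFrameSchema
CustomerModel.validate(final_df)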
 
 

Kanwal Mehreen is a machine learning engineer and a technical writer with a profound passion for data science and the intersection of AI with medicine. She co-authored the ebook "Maximizing Productivity with ChatGPT". As a Google Generation Scholar 2022 for APAC, she champions diversity and academic excellence. She's also recognized as a Teradata Diversity in Tech Scholar, Mitacs Globalink Research Scholar, and Harvard WeCode Scholar. Kanwal is an ardent advocate for change, having founded FEMCodes to empower women in STEM fields.

