Data Cleaning with Pandas

This step-by-step tutorial is for beginners to guide them through the process of data cleaning and preprocessing using the powerful Pandas library.

Data Cleaning with Pandas
Image by Author 




If you are into Data Science, then data cleaning might sound like a familiar term to you. If not, let me explain that to you. Our data often comes from multiple resources and is not clean. It may contain missing values, duplicates, wrong or undesired formats, etc.  Running your experiments on this messy data leads to incorrect results. Therefore, it is necessary to prepare your data before it is fed to your model. This preparation of the data by identifying and resolving the potential errors, inaccuracies, and inconsistencies is termed as Data Cleaning

In this tutorial, I will walk you through the process of cleaning the data using Pandas. 




I will be working with the famous Iris dataset. The Iris dataset contains measurements of four features of three species of Iris flowers: sepal length, sepal width, petal length, and petal width.  We will be using the following libraries:

  • Pandas: Powerful library for data manipulation and analysis
  • Scikit-learn: Provides tools for data preprocessing and machine learning


Steps for Data Cleaning


1. Loading the Dataset


Load the Iris dataset using Pandas' read_csv() function:

column_names = ['id', 'sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
iris_data = pd.read_csv('data/Iris.csv', names= column_names, header=0)



id sepal_length sepal_width petal_length petal_width species
1 5.1 3.5 1.4 0.2 Iris-setosa
2 4.9 3.0 1.4 0.2 Iris-setosa
3 4.7 3.2 1.3 0.2 Iris-setosa
4 4.6 3.1 1.5 0.2 Iris-setosa
5 5.0 3.6 1.4 0.2 Iris-setosa


The header=0 parameter indicates that the first row of the CSV file contains the column names (header).


2. Explore the dataset


To get insights about our dataset, we will print some basic information using the built-in functions in pandas




RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   id            150 non-null    int64  
 1   sepal_length  150 non-null    float64
 2   sepal_width   150 non-null    float64
 3   petal_length  150 non-null    float64
 4   petal_width   150 non-null    float64
 5   species       150 non-null    object 
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB


Data Cleaning with Pandas
Output for iris_data.describe()


The info() function is useful to understand the overall structure of the data frame, the number of non-null values in each column, and the memory usage. While the summary statistics provide an overview of numerical features in your dataset.


3. Checking Class Distribution


This is an important step in understanding how the classes are distributed in categorical columns, which is an important task for classification. You can perform this step using the value_counts() function in pandas.




Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
Name: species, dtype: int64


Our results show that the dataset is balanced with an equal number of representations of each species. This sets the base for a fair evaluation and comparison across all 3 classes.


4. Removing Missing Values


Since it is evident from the info() method that we have 5 columns with no missing values, we will skip this step. But if you encounter any missing values, use the following command to handle them:



5. Removing Duplicates


Duplicates can distort our analysis so we remove them from our dataset. We will first check their existence using the below-mentioned command:

duplicate_rows = iris_data.duplicated()
print("Number of duplicate rows:", duplicate_rows.sum())



Number of duplicate rows: 0


We do not have any duplicates for this dataset. Nonetheless, the duplicates can be removed via the drop_duplicates() function.



6. One-Hot Encoding


For categorical analysis, we will perform one-hot encoding on the species column. This step is performed due to the tendency of Machine Learning algorithms to work better with numerical data. The one-hot encoding process transforms categorical variables into a binary (0 or 1) format.

encoded_species = pd.get_dummies(iris_data['species'], prefix='species', drop_first=False).astype('int')
iris_data = pd.concat([iris_data, encoded_species], axis=1)
iris_data.drop(columns=['species'], inplace=True)


Data Cleaning with Pandas
Image by Author 


7. Normalization of Float Value Columns


Normalization is the process of scaling numerical features to have a mean of 0 and a standard deviation of 1. This process is done to ensure that the features contribute equally to the analysis. We will normalize the float value columns for consistent scaling.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
cols_to_normalize = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
scaled_data =[cols_to_normalize])
iris_data[cols_to_normalize] = scaler.transform(iris_data[cols_to_normalize])


Data Cleaning with Pandas
Output for iris_data.describe() after normalization


8. Save the cleaned dataset


Save the cleaned dataset to the new CSV file.

iris_data.to_csv('cleaned_iris.csv', index=False)


Wrapping Up


Congratulations! You have successfully cleaned your first dataset using pandas. You may encounter additional challenges while dealing with complex datasets. However, the fundamental techniques mentioned here will help you get started and prepare your data for analysis.

Kanwal Mehreen is an aspiring software developer with a keen interest in data science and applications of AI in medicine. Kanwal was selected as the Google Generation Scholar 2022 for the APAC region. Kanwal loves to share technical knowledge by writing articles on trending topics, and is passionate about improving the representation of women in tech industry.