How to Handle Missing Data with Scikit-learn’s Imputer Module

In this article, you will learn how to use Scikit-Learn’s imputer module to handle missing data and streamline your data science project.




 

Let’s learn how to use Scikit-learn’s imputer for handling missing data.
 

Preparation

 

Ensure that NumPy, Pandas, and Scikit-Learn are installed in your environment. If not, you can install them via pip with the following command:

 

pip install numpy pandas scikit-learn

 

Then, import the packages into your environment:

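import numpy as np
import pandas as pd

# enable_iterative_imputer must be imported before IterativeImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer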

 

 

Handle Missing Data with Imputer

 

A Scikit-Learn imputer is a class that replaces missing data with substitute values according to a chosen strategy. It can streamline your data preprocessing. We will explore several strategies for handling missing data.

Let’s create a small dataset for our example:

sample_data = {'First': [1, 2, 3, 4, 5, 6, 7, np.nan,9], 'Second': [np.nan, 2, 3, 4, 5, 6, np.nan, 8,9]}
df = pd.DataFrame(sample_data)
print(df)

 

    First  Second
0    1.0     NaN
1    2.0     2.0
2    3.0     3.0
3    4.0     4.0
4    5.0     5.0
5    6.0     6.0
6    7.0     NaN
7    NaN     8.0
8    9.0     9.0
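
Before imputing, it can help to confirm where the gaps are. A quick check, assuming the same sample data as above (the names `missing_counts` and `rows_with_gaps` are only illustrative), might look like this:

```python
import numpy as np
import pandas as pd

sample_data = {
    'First': [1, 2, 3, 4, 5, 6, 7, np.nan, 9],
    'Second': [np.nan, 2, 3, 4, 5, 6, np.nan, 8, 9],
}
df = pd.DataFrame(sample_data)

# Count the missing values in each column
missing_counts = df.isna().sum()
print(missing_counts)

# Row labels that contain at least one missing value
rows_with_gaps = df.index[df.isna().any(axis=1)].tolist()
print(rows_with_gaps)
```

Here 'First' has one gap and 'Second' has two, spread over rows 0, 6 and 7.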

 

You can fill each column’s missing values with that column’s mean by using Scikit-Learn’s SimpleImputer with strategy='mean'.
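
The call for the mean strategy is not shown here. A minimal sketch that mirrors the median example later in the article, assuming the same sample data (the names `mean_imputer` and `df_mean` are illustrative), could be:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    'First': [1, 2, 3, 4, 5, 6, 7, np.nan, 9],
    'Second': [np.nan, 2, 3, 4, 5, 6, np.nan, 8, 9],
})

# Replace each NaN with the mean of its own column
mean_imputer = SimpleImputer(strategy='mean')
df_mean = pd.DataFrame(
    mean_imputer.fit_transform(df), columns=df.columns
).round(2)

print(df_mean)
```

The imputed frame looks like this: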

    First  Second
0   1.00    5.29
1   2.00    2.00
2   3.00    3.00
3   4.00    4.00
4   5.00    5.00
5   6.00    6.00
6   7.00    5.29
7   4.62    8.00
8   9.00    9.00

 

Note that we round the result to 2 decimal places.

It’s also possible to impute the missing data with the median using SimpleImputer.

from sklearn.impute import SimpleImputer

# Replace each NaN with the median of its column
imputer = SimpleImputer(strategy='median')
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns).round(2)

print(df_imputed)
   First  Second
0    1.0     5.0
1    2.0     2.0
2    3.0     3.0
3    4.0     4.0
4    5.0     5.0
5    6.0     6.0
6    7.0     5.0
7    4.5     8.0
8    9.0     9.0

 

Mean and median imputation are simple, but they shrink the variance of the imputed column and can weaken or bias its relationship with other features.
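
One way to see this effect is to compare the spread of a column before and after mean imputation. The following is only a small illustration on the sample data, with `std_before` and `std_after` as illustrative names:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    'First': [1, 2, 3, 4, 5, 6, 7, np.nan, 9],
    'Second': [np.nan, 2, 3, 4, 5, 6, np.nan, 8, 9],
})

imputed = pd.DataFrame(
    SimpleImputer(strategy='mean').fit_transform(df), columns=df.columns
)

# Standard deviation of the observed values (pandas skips NaN by default)
std_before = df.std()
# Standard deviation after the gaps are filled with the column mean
std_after = imputed.std()

print(pd.DataFrame({'before': std_before, 'after': std_after}))
```

Every imputed value sits exactly at the column mean, so the spread after imputation is smaller than the spread of the observed values.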

It is also possible to use a K-NN imputer, which fills in the missing data using a nearest-neighbour approach.

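from sklearn.impute import KNNImputer

# Fill each NaN from the 2 nearest rows that have the feature observed
knn_imputer = KNNImputer(n_neighbors=2)
knn_imputed_data = knn_imputer.fit_transform(df)
knn_imputed_df = pd.DataFrame(knn_imputed_data, columns=df.columns)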

print(knn_imputed_df)

 

    First  Second
0    1.0     2.5
1    2.0     2.0
2    3.0     3.0
3    4.0     4.0
4    5.0     5.0
5    6.0     6.0
6    7.0     5.5
7    7.5     8.0
8    9.0     9.0

 

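The KNN imputer fills each missing value with the mean of that feature across the k nearest neighbours that have the feature observed. The mean is uniform by default and can be distance-weighted through the weights parameter. Distances are computed only over the features that both rows have observed.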
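
To make the neighbour logic concrete, the imputed value for row 0 can be reproduced by hand. This is a rough sketch of the idea rather than the library's internal code; it assumes the same sample data and uniform weights, and the names `donors`, `nearest` and `manual` are illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.metrics.pairwise import nan_euclidean_distances

df = pd.DataFrame({
    'First': [1, 2, 3, 4, 5, 6, 7, np.nan, 9],
    'Second': [np.nan, 2, 3, 4, 5, 6, np.nan, 8, 9],
})
X = df.to_numpy()

# Distances from row 0 to every row, ignoring missing coordinates
dist = nan_euclidean_distances(X[[0]], X)[0]

# Only rows with an observed 'Second' value can donate one
donors = np.flatnonzero(~np.isnan(X[:, 1]))
# Drop donors that share no observed feature with row 0 (distance is NaN)
donors = donors[~np.isnan(dist[donors])]

# Average 'Second' over the two closest donors
nearest = donors[np.argsort(dist[donors])[:2]]
manual = X[nearest, 1].mean()

# Compare with what KNNImputer computes for the same cell
sk_value = KNNImputer(n_neighbors=2).fit_transform(df)[0, 1]
print(manual, sk_value)
```

Rows 1 and 2 are the closest donors to row 0, so the filled value is the average of 2 and 3.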

Lastly, there is iterative imputation, which models each feature with missing values as a function of the other features. IterativeImputer is still an experimental feature in Scikit-Learn, so it must be enabled explicitly by importing enable_iterative_imputer before it is used.

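from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Model each feature with missing values from the other feature
iterative_imputer = IterativeImputer(max_iter=10, random_state=0)
iterative_imputed_data = iterative_imputer.fit_transform(df)
iterative_imputed_df = pd.DataFrame(iterative_imputed_data, columns=df.columns).round(2)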

print(iterative_imputed_df)

 

    First  Second
0    1.0     1.0
1    2.0     2.0
2    3.0     3.0
3    4.0     4.0
4    5.0     5.0
5    6.0     6.0
6    7.0     7.0
7    8.0     8.0
8    9.0     9.0
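
To see why the gaps are filled so neatly, a single round of the idea can be sketched by hand. This is a simplification under stated assumptions: it uses a plain LinearRegression fitted on the complete rows as a stand-in, whereas IterativeImputer defaults to a BayesianRidge estimator, starts from a simple initial fill, and cycles over the features for several rounds. The names `second_model`, `first_model`, `pred_second` and `pred_first` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    'First': [1, 2, 3, 4, 5, 6, 7, np.nan, 9],
    'Second': [np.nan, 2, 3, 4, 5, 6, np.nan, 8, 9],
})

# Rows where both features are observed serve as training data
complete = df.dropna()

# Model 'Second' as a function of 'First', then predict its gaps
second_model = LinearRegression().fit(complete[['First']], complete['Second'])
need_second = df['Second'].isna()
pred_second = second_model.predict(df.loc[need_second, ['First']])

# Model 'First' as a function of 'Second' in the same way
first_model = LinearRegression().fit(complete[['Second']], complete['First'])
need_first = df['First'].isna()
pred_first = first_model.predict(df.loc[need_first, ['Second']])

print(pred_second.round(2))
print(pred_first.round(2))
```

Because the two columns are identical wherever both are observed, the fitted relationship is essentially "Second equals First", and the predictions land at approximately 1 and 7 for 'Second' and 8 for 'First'.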

 

Choosing an imputer that fits your data can make the preprocessing step of your data science project more reliable.

 


 

 
 

Cornellius Yudha Wijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he loves to share Python and data tips via social media and writing media. Cornellius writes on a variety of AI and machine learning topics.

