The Optimal Way to Input Missing Data with Pandas fillna()

Missing data is common in real-life datasets. To fill in the missing data, Pandas provides several options through fillna that are worth learning.

By Cornellius Yudha Wijaya, KDnuggets Technical Content Specialist on February 13, 2023 in Data Science


Image by catalyststuff on Freepik

During the data exploration phase, we often encounter variables with missing data. The data could be missing for various reasons: sampling mistakes, deliberate omission, or random chance. Whatever the cause, we need to analyse why the data is missing. An article on missing data types by Yogita Kinha is a good start.

After appropriate analysis, one way to solve the missing data problem is to fill in the data. Luckily, Pandas makes filling in missing data easy. How do we do that, and what is the optimal way to fill in the missing data? Let’s learn together.

Pandas Fillna Function

According to the Pandas documentation, fillna is a Pandas method that fills NA/NaN values with a specified value or method. In a Pandas DataFrame, missing data is represented by the NaN object. Using fillna, we replace these NaN values with another value chosen from our analysis.

Let’s try out the function with a dataset example. This article will use the Local Epidemics of Dengue Fever train dataset from Kaggle (License: CC0: Public Domain).

import pandas as pd
df = pd.read_csv('dengue_features_train.csv')
df.head(10)

As we see in the dataset above, there is missing data in the ‘ndvi_ne’ column. Using the Pandas fillna function, we can easily replace the missing data with another value. Let me give you an example.

df.fillna(0).head(10)

With the fillna function, we replace the missing data with the value 0. You can pass any kind of value to fillna. For example, here I replace the missing values with the string ‘zero’.

df.fillna('zero').head(10)
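To make the effect of the fill value concrete, here is a minimal sketch on a small made-up DataFrame (the `toy` data is hypothetical and only borrows two column names from the dengue dataset). It suggests that filling numeric columns with a number keeps them numeric, while filling them with a string turns them into object columns, which is worth keeping in mind before any later calculation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the dengue dataset.
toy = pd.DataFrame({
    "ndvi_ne": [0.12, np.nan, 0.03],
    "ndvi_nw": [np.nan, 0.14, 0.17],
})

filled_zero = toy.fillna(0)       # NaN replaced by the number 0
filled_text = toy.fillna("zero")  # NaN replaced by the string 'zero'

# Compare the column dtypes after each fill.
print(filled_zero.dtypes)
print(filled_text.dtypes)
```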

Or I could even replace the missing values with a function object, which you can do but which is not useful.

df.fillna(pd.isna).head(10)

On a side note, fillna does not change the original DataFrame when you execute it; it returns a new one. If you want the DataFrame itself to be modified when you execute the function, run the following code.

df.fillna(0, inplace = True)

The code above produces no output, but your DataFrame is modified. Avoid the inplace parameter while you are still experimenting with the data.
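The difference between the two behaviours can be checked with a short sketch. The `toy` DataFrame below is made up for illustration; the code counts the missing values before and after an in-place fill.

```python
import numpy as np
import pandas as pd

# Hypothetical one-column example.
toy = pd.DataFrame({"ndvi_ne": [0.12, np.nan, 0.03]})

# Without inplace, fillna returns a new DataFrame and leaves toy alone.
preview = toy.fillna(0)
still_missing = int(toy["ndvi_ne"].isna().sum())

# With inplace=True, toy itself is modified.
toy.fillna(0, inplace=True)
missing_after_inplace = int(toy["ndvi_ne"].isna().sum())

print(still_missing, missing_after_inplace)
```

Reassigning the result, as in `toy = toy.fillna(0)`, is a common alternative that achieves the same end state without the inplace parameter.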

Replace missing values on multiple columns

You must be careful when using the fillna function. If we call it on the whole DataFrame, it fills every missing value in every column with the passed value, even if that is not your intention. Let’s see what I mean with an example.

df[df['ndvi_ne'].isna()]

I selected all the observations where the ‘ndvi_ne’ column is missing. In the output above, we can see that several other columns also contain missing data. Let’s try to use the fillna function to fill them up.

df[df['ndvi_ne'].isna()].fillna('zero')

All the missing data is now replaced with the string ‘zero’. Often, this is not what we want. If we want to replace the missing data only in certain columns, we can select the column first before calling fillna.

df['ndvi_ne'].fillna(0)
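Since fillna on a selected column returns a new Series, the result has to be assigned back if we want to keep it. A minimal sketch, assuming a made-up two-column `toy` DataFrame:

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values in both columns.
toy = pd.DataFrame({
    "ndvi_ne": [0.12, np.nan],
    "ndvi_nw": [np.nan, 0.14],
})

# Fill only 'ndvi_ne' and write the result back into the DataFrame.
toy["ndvi_ne"] = toy["ndvi_ne"].fillna(0)

# 'ndvi_nw' was not selected, so its missing value is still there.
print(toy)
```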

There is also a more convenient way to fill the missing data: pass a dictionary with the column names as keys and the replacement values as values. Let’s try it out with a code example.

df[df['ndvi_ne'].isna()].fillna({'ndvi_ne':0,
                                 'ndvi_nw':'zero', 
                                 'ndvi_se': df['ndvi_se'].mean()})

With the code above, we replace the missing values in ‘ndvi_ne’ with 0, in ‘ndvi_nw’ with ‘zero’, and in ‘ndvi_se’ with the column mean. The rest were untouched as we didn’t specify them in the function.
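The same idea can be tried on a small scale. The sketch below uses a made-up `toy` DataFrame and a hypothetical `fill_values` dictionary; it fills one column with a constant and another with its own mean, and leaves the third column alone.

```python
import numpy as np
import pandas as pd

# Hypothetical data; each column has exactly one missing value.
toy = pd.DataFrame({
    "ndvi_ne": [0.1, np.nan, 0.3],
    "ndvi_se": [0.2, np.nan, 0.4],
    "ndvi_sw": [np.nan, 0.5, 0.6],
})

# Column name -> replacement value. Columns not listed are not filled.
fill_values = {
    "ndvi_ne": 0,
    "ndvi_se": toy["ndvi_se"].mean(),  # mean() skips NaN by default
}

filled = toy.fillna(fill_values)
print(filled)
```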

Fill n-th consecutive missing data

The Pandas fillna function also allows the user to limit how many missing values are replaced. With the limit parameter and a fill value, fillna replaces at most n missing values along the axis, starting from the first one. Let’s try it with a code example.

df[df['ndvi_ne'].isna()].fillna(0, limit = 3).head()

We can see from the output above that only three of the five rows shown had their missing data replaced. If we change the limit parameter, we can see a different result.

df[df['ndvi_ne'].isna()].fillna(0 , limit = 2).head()

Only two of the five rows shown were filled. The missing values do not need to be adjacent. They can be in different rows, and with the limit parameter set to two, only the first two missing values are replaced.
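A small Series makes it easier to see which values the limit reaches. This is a sketch on made-up numbers, where the missing values are deliberately not all next to each other:

```python
import numpy as np
import pandas as pd

# Hypothetical Series with four missing values at positions 0, 2, 3 and 5.
s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0, np.nan])

# Fill at most two missing values, counted from the start of the Series.
limited = s.fillna(0, limit=2)

# Positions 0 and 2 are filled; positions 3 and 5 stay missing.
print(limited)
```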

Forward and backward fill

What is good about the Pandas fillna function is that we can fill in the missing data from the preceding or the succeeding observation. Let’s try to fill in the data from the preceding observation. As a reminder, we have missing data in the following column.

df['ndvi_ne'].head(10)

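Then, we replace the missing data with the value from the previous row. This was originally written as fillna(method = 'ffill'), but the method argument is deprecated since pandas 2.1, so the equivalent ffill method is used here.

df['ndvi_ne'].head(10).ffill()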

The missing data is now replaced with values from the previous rows, which we call forward fill. Let’s try the reverse: the backward fill, or filling in missing data from the succeeding rows. As with forward fill, the bfill method is the current equivalent of fillna(method = 'bfill').

df['ndvi_ne'].head(10).bfill()

We can see from the output above that the last value is still missing. Because there is no observation after that row, the function leaves it as it is.

Forward and backward fill are good choices when you know that neighbouring observations are related, such as in time series data. Imagine stock data; the previous day's data might still be applicable the day after.
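Here is a minimal sketch of both directions on a made-up Series, using the ffill and bfill methods. The values are hypothetical; the point is to show where each direction takes its replacement from and why a trailing missing value survives a backward fill.

```python
import numpy as np
import pandas as pd

# Hypothetical Series that ends with a missing value.
s = pd.Series([0.12, np.nan, 0.03, np.nan], name="ndvi_ne")

forward = s.ffill()   # each NaN takes the last valid value before it
backward = s.bfill()  # each NaN takes the next valid value after it

print(forward.tolist())
print(backward.tolist())  # the last entry stays NaN: nothing follows it
```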

Conclusion

Missing data is a typical occurrence during data preprocessing and exploration. One thing to do with the missing data is to replace it with another value. To do that, we can use the Pandas function called fillna. Using the function is simple, but there are a few methods to optimally fill up our data, including replacing missing data in multiple columns, limiting the imputation, and using other rows to fill the data.

Cornellius Yudha Wijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he loves to share Python and Data tips via social media and writing media.
