Approaches to Data Imputation


Real-world data sets are seldom perfect and often contain missing values or incomplete information. These gaps may be due to the human element (incorrectly filled or unfilled surveys) or to technology (malfunctioning sensors).

Of course, this presents a problem. With values missing, the entire data set may be deemed unusable. But since it takes considerable time, effort, and (in many cases) money to acquire high-quality data, disposing of the flawed data and starting again may not be a viable option. Instead, we must find a way to work around or replace the missing values. This is where data imputation comes in.

This guide discusses what data imputation is, as well as the main approaches it supports.

Addressing Missing Data

While we cannot recover the missing or corrupted data itself, there are methods we can employ to keep the data set usable. Data imputation is one of the most reliable techniques for achieving this. First, however, we must identify what type of data is missing and why.

In statistics and data science, there are three main types of missing data:

  • Missing at random (MAR), where the probability that a value is missing depends on other, observed variables, so the missingness can be explained by the data you do have. For instance, respondents in a certain age group may be more likely to skip a particular survey question or to remove tracking systems from their devices at certain times.
  • Missing completely at random (MCAR), where the missingness is unrelated to any variable, observed or not. It’s nearly impossible to discern why the data is missing.
  • Missing not at random (MNAR, also written NMAR), where the missingness depends on the value that is missing itself. This can occur when, for example, subjects with extreme values decline to report them. Unlike the other two types, this missingness cannot safely be ignored.

Dealing With Missing Data

Currently, you have three primary options to deal with missing data values:

  • Deletion
  • Imputation
  • Disregard

Instead of disposing of the entire data set, you can use what is known as listwise deletion: deleting any record that contains missing values. Listwise deletion can be applied to all three categories of missing data, though it only avoids introducing bias when the data is missing completely at random.

However, this may result in substantial data loss. It is best to reserve listwise deletion for records that are missing more values than they have observed ones, since there is too little information left to infer the missing entries reliably.
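Listwise deletion is straightforward in practice. A minimal sketch using pandas on a small, made-up survey table:

```python
import numpy as np
import pandas as pd

# Hypothetical survey responses; NaN marks a missing value
df = pd.DataFrame({
    "age":    [25, 31, np.nan, 42, 38],
    "income": [48000, np.nan, 52000, 61000, 58000],
})

# Listwise deletion: drop every record with at least one missing value
complete_cases = df.dropna()
print(complete_cases)  # only the fully observed rows remain
```

Note how two of the five records are discarded even though each is only missing a single value, which is exactly the data loss described above.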

If the missing data is ignorable and only a few values are absent, you can simply ignore them and work with what you have. However, this isn’t always possible. Data imputation offers a third, and often more viable, solution.

What Is Data Imputation?

Data imputation involves replacing absent values so that the data set remains usable. Imputation approaches fall into two categories:

  • Single
  • Multiple

Mean imputation (MI) is one of the best-known forms of single imputation.

 

Mean Imputation (MI)

MI is a form of single imputation. It involves calculating the mean of a variable’s observed values and using it to fill in that variable’s missing values. Unfortunately, this method has well-known drawbacks: it distorts the variable’s distribution and can lead to biased estimates of variances and correlations, even when the data is missing completely at random. The quality of the estimates also degrades as the number of missing values grows.

For instance, if a large share of a variable’s values is missing, mean imputation will underestimate its variance. It is therefore better suited to data sets and variables with only a few missing values.
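A minimal sketch of mean imputation with pandas, on a made-up variable:

```python
import numpy as np
import pandas as pd

# Hypothetical variable with one missing entry
col = pd.Series([10.0, 12.0, np.nan, 14.0])

# Replace the missing entry with the mean of the observed values (12.0)
imputed = col.fillna(col.mean())
```

Comparing `imputed.std()` with `col.std()` over the observed values alone shows the imputed series has the smaller spread, illustrating the variance shrinkage described above.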

 

Manual Replacement

In this situation, an operator uses prior knowledge of the data set to fill in the missing values. It’s a single imputation method that relies on the operator’s memory or domain knowledge and is sometimes referred to as prior knowledge of an ideal number. Accuracy hinges on how well the operator can recall or reconstruct the values, so this method may be more suitable for data sets with only a few missing values.
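As a sketch, assume the operator knows from the data-collection context that every respondent in this hypothetical table came from Norway:

```python
import pandas as pd

# Hypothetical records; the country field was left blank for one respondent
df = pd.DataFrame({
    "respondent": ["A", "B", "C"],
    "country":    ["Norway", None, "Norway"],
})

# The operator fills the gap from prior knowledge of the sample
df_fixed = df.fillna({"country": "Norway"})
```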

 

K-Nearest Neighbors (K-NN)

K-nearest neighbors is a technique famously used in machine learning to address regression and classification problems. As an imputation method, it fills in a missing value with the mean (or, for categorical data, the mode) of that variable’s values among the k records most similar on the remaining, observed features. K-NN imputation is generally far more effective than simple mean imputation and is well suited to MCAR and MAR values.
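A minimal sketch using scikit-learn’s KNNImputer on made-up data; it measures similarity on the features that are present and averages the neighbors’ values:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical data: row 2 is missing its first feature
X = np.array([
    [1.0, 2.0],
    [3.0, 4.0],
    [np.nan, 6.0],
    [5.0, 8.0],
])

# Impute using the mean of the 2 nearest neighbors,
# with distances computed on the observed features only
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)
# The two rows closest to row 2 on the second feature have
# first-feature values 3.0 and 5.0, so the gap becomes their mean, 4.0
```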

 

Substitution

Substitution involves finding a new individual or subject to survey or test. This should be a subject who was not selected in the original sample.

 

Regression Imputation

Regression attempts to model the strength of the relationship between a dependent variable (usually denoted Y) and a collection of independent variables (usually denoted X). Linear regression is the most well-known form: it uses the line of best fit, estimated from the complete cases, to predict the missing values. An added benefit is that the fitted model can be plotted, making the imputation easy to inspect visually.

Plain linear regression is a form of deterministic imputation: an exact relationship between the observed and missing values is assumed, and each missing value is replaced with the model’s point prediction. This has a limitation, however. Because every imputed value falls exactly on the regression line, deterministic regression overstates how closely the variables are related.

Stochastic linear regression compensates for this over-preciseness by adding a random error term to each prediction, reflecting the fact that two variables are seldom perfectly correlated. This generally makes filling in missing values with regression more appropriate.
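A sketch of stochastic regression imputation on made-up data, using numpy: fit a line on the complete cases, then add noise scaled to the observed residuals.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: y is unobserved where x = 3
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, np.nan, 8.2, 9.8])

observed = ~np.isnan(y)
missing = np.isnan(y)

# Fit the regression line on the complete cases
slope, intercept = np.polyfit(x[observed], y[observed], 1)

# Deterministic imputation would stop at the point prediction
point_prediction = slope * x[missing] + intercept

# Stochastic imputation adds a random error term sized by
# the spread of the observed residuals
residual_std = np.std(y[observed] - (slope * x[observed] + intercept))
y_imputed = y.copy()
y_imputed[missing] = point_prediction + rng.normal(0.0, residual_std, missing.sum())
```

Unlike the deterministic version, repeated runs with different seeds produce slightly different imputed values, which better preserves the variable’s natural spread.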

 

Hot Deck Sampling

This approach fills in a missing value with a value drawn at random from a donor: another subject whose other, observed values are similar to those of the subject with the missing value. It requires you to search for such subjects and then fill in the missing data using their values.

Hot deck sampling limits the range of attainable values. For instance, if the donors are restricted to an age group between 20 and 25, the imputed value will always fall within that group’s observed range, which increases the plausibility of the replacement value. The donors for this method of imputation are chosen at random.
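A sketch of hot deck imputation with pandas, drawing a random donor value from within the same (hypothetical) age group:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical data: income is missing for one subject in each age group
df = pd.DataFrame({
    "age_group": ["20-25", "20-25", "20-25", "26-30", "26-30"],
    "income":    [30000.0, 32000.0, np.nan, 45000.0, np.nan],
})

def hot_deck(group):
    """Fill each missing value with a randomly chosen donor value
    from the same group."""
    donors = group.dropna().values
    return group.apply(lambda v: rng.choice(donors) if pd.isna(v) else v)

df["income"] = df.groupby("age_group")["income"].transform(hot_deck)
```

Each imputed income is guaranteed to be a value actually observed within that subject’s own age group.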

 

Cold Deck Sampling

This method involves searching for a donor subject with similar or identical values for all the other variables in the data set. For example, the donor may have the same height, cultural background, and age as the subject whose value is missing. Cold deck sampling differs from hot deck sampling in that the donors are chosen systematically and reused, rather than drawn at random.
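Cold deck imputation can be sketched similarly, except the donor is selected by a fixed rule; here, as an illustrative assumption, the first complete record that matches on all other variables:

```python
import numpy as np
import pandas as pd

# Hypothetical data: subjects 0 and 1 match on every other variable
df = pd.DataFrame({
    "height": [170, 170, 182],
    "age":    [30, 30, 45],
    "score":  [7.5, np.nan, 9.0],
})

def cold_deck(group):
    """Systematically reuse the first complete donor in the group."""
    donors = group.dropna()
    if donors.empty:
        return group  # no donor available; leave the value missing
    return group.fillna(donors.iloc[0])

df["score"] = df.groupby(["height", "age"])["score"].transform(cold_deck)
```

Because the donor is chosen deterministically, rerunning the imputation always produces the same result, unlike hot deck sampling.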

 

Conclusion

While there are many options and techniques for dealing with missing data, prevention is better than cure. Researchers should plan experiments and studies rigorously, with a clear mission statement or goal in mind.

Often, researchers overcomplicate a study or fail to plan for impediments, which results in missing or insufficient data. It’s always best to simplify the design of the study while placing a precise focus on data collection.

Collect only the data you need to meet the study’s goals and nothing more. You should also ensure that all instruments and sensors involved in the study or experiments are fully functional at all times. Consider creating regular backups of your data/responses as the study progresses. 

Missing data is a common occurrence. Even if you follow best practices, you may still end up with incomplete data. Fortunately, there are ways to address this problem after the fact.

Nahla Davies is a software developer and tech writer. Before devoting her work full time to technical writing, she managed — among other intriguing things — to serve as a lead programmer at an Inc. 5,000 experiential branding organization whose clients include Samsung, Time Warner, Netflix, and Sony.