Importance of Pre-Processing in Machine Learning
Learn how pre-processing improves the performance of machine learning models.
ML teams developing new models or algorithms expect the model to perform well on test data.
But many times that just doesn’t happen.
The reasons could be many, but the top culprits are:
- Lack of sufficient data
- Poor quality data
- Bad choice of algorithm
- Poor hyperparameter tuning
- Bias in the dataset
The above list is not exhaustive though.
In this article, we’ll discuss a process that can address several of the problems above, and that ML teams should be very mindful of while executing.
It’s pre-processing of data.
It is widely accepted in the machine learning community that preprocessing data is an important step in the ML workflow and it can improve the performance of the model.
There are many studies and articles that have shown the importance of preprocessing data in machine learning, such as:
"A study by Bezdek et al. (1984) found that preprocessing the data improved the accuracy of several clustering algorithms by up to 50%."
"A study by Chollet (2018) found that data preprocessing techniques such as data normalization and data augmentation can improve the performance of deep learning models."
It's also worth mentioning that preprocessing techniques are not only important for improving the performance of the model but also for making the model more interpretable and robust.
For example, handling missing values, removing outliers, and scaling the data can help to reduce overfitting, so that models generalize better to new data.
In any case, it's important to note that the specific preprocessing techniques and the extent of preprocessing that are required for a given dataset will depend on the nature of the data and the specific requirements of the algorithm.
It's also important to keep in mind that in some cases, preprocessing the data may not be necessary or may even harm the performance of the model.
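As a small illustration of why the choice of technique matters, here is a minimal sketch that compares two ways of handling missing values. The toy values and column names are made up for the example. Dropping incomplete rows is simple but can throw away most of a small dataset, while median imputation keeps every row at the cost of inventing values. Neither is right for every dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: three of the five rows have one missing value.
df = pd.DataFrame({
    "alcohol":   [9.4, np.nan, 9.8, 10.0, 9.5],
    "chlorides": [0.076, 0.098, np.nan, 0.075, 0.080],
    "density":   [0.9978, 0.9968, 0.9970, np.nan, 0.9975],
})

# Option 1: drop every row that contains a missing value.
dropped = df.dropna()

# Option 2: fill each missing value with the median of its column.
imputer = SimpleImputer(strategy="median")
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print("original rows:", len(df))
print("rows after dropna:", len(dropped))
print("rows after imputation:", len(imputed))
```

Here dropna keeps only two of the five rows, while imputation keeps all five. Which trade-off is acceptable depends on how much data you have and why the values are missing.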
Preprocessing helps to ensure that the data is in a format that the algorithm can understand and that it is free of errors or outliers that can hurt the model's performance.
In the rest of this article, we discuss some of the advantages of preprocessing data and walk through a code example of how to preprocess data using the popular Python library Pandas.
Advantages of Preprocessing Data
One of the main advantages of preprocessing data is that it helps to improve the accuracy of the model. By cleaning and formatting the data, we can ensure that the algorithm is only considering relevant information and that it is not being influenced by any irrelevant or incorrect data.
This can lead to a more accurate and robust model.
Another advantage of preprocessing data is that it can help to reduce the time and resources required to train the model. By removing irrelevant or redundant data, we reduce the amount of data that the algorithm needs to process.
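A minimal sketch of what removing redundant data can look like in Pandas is shown below. The toy table is hypothetical: it has one repeated row and one column ("source") that holds the same value everywhere, so neither adds information for a model.

```python
import pandas as pd

# Hypothetical toy data with one repeated row and one constant column.
df = pd.DataFrame({
    "alcohol": [9.4, 9.8, 9.8, 10.0],
    "density": [0.9978, 0.9968, 0.9968, 0.9970],
    "source":  ["lab_a", "lab_a", "lab_a", "lab_a"],
})

# Remove exact duplicate rows.
deduped = df.drop_duplicates()

# Remove columns that hold a single value and so carry no signal.
constant_cols = [col for col in deduped.columns if deduped[col].nunique() <= 1]
reduced = deduped.drop(columns=constant_cols)

print(df.shape, "->", reduced.shape)
```

On a table this small the saving is trivial, but the same two checks applied to a large dataset can noticeably shrink what the algorithm has to process.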
Preprocessing data can also help to prevent overfitting. Overfitting occurs when a model learns the noise and quirks of its training data rather than the underlying pattern, and as a result, it performs well on the training data but poorly on new, unseen data.
By preprocessing the data and removing irrelevant or redundant information, we can help to reduce the risk of overfitting and improve the model's ability to generalize to new data.
Preprocessing data can also improve the interpretability of the model. By cleaning and formatting the data, we can make it easier to understand the relationships between different variables and how they are influencing the model's predictions.
This can help us to better understand the model's behavior and make more informed decisions about how to improve it.
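One concrete case where preprocessing helps interpretation is comparing the coefficients of a linear model. The sketch below uses synthetic data (the feature names and numbers are assumptions made for the example) in which two features contribute about equally to the target but are measured on very different scales.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical features on very different scales.
small = rng.uniform(0, 1, size=200)       # values between 0 and 1
large = rng.uniform(0, 1000, size=200)    # values between 0 and 1000
X = np.column_stack([small, large])

# Synthetic target: both features contribute a similar amount overall.
y = 3.0 * small + 0.003 * large

# Fit once on the raw features and once on standardized features.
raw_coef = LinearRegression().fit(X, y).coef_
scaled_coef = LinearRegression().fit(StandardScaler().fit_transform(X), y).coef_

print("raw coefficients:   ", raw_coef)
print("scaled coefficients:", scaled_coef)
```

On the raw data the second coefficient is about a thousand times smaller than the first, which could be misread as the feature being unimportant. After standardization the two coefficients are of similar size, which matches how the target was built.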
Now, let's see an example of preprocessing data using Pandas. We will use a dataset that contains information about wine quality. The dataset has several features, such as alcohol, chlorides, and density, and a target variable, the quality of the wine.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the data (this example assumes every column is numeric)
data = pd.read_csv("winequality.csv")

# Check for missing values, then drop rows that contain any
print(data.isnull().sum())
data = data.dropna()

# Check for duplicate rows, then drop them
print(data.duplicated().sum())
data = data.drop_duplicates()

# Remove rows that contain an outlier in any column (IQR rule)
Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1
is_outlier = ((data < (Q1 - 1.5 * IQR)) | (data > (Q3 + 1.5 * IQR))).any(axis=1)
data = data[~is_outlier]

# Separate the features from the target
X = data.drop(columns="quality")
y = data["quality"]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale the features: fit on the training set only, then apply to both sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In this example, we first load the data using the read_csv function from Pandas and then count the missing values in each column using the isnull function. We then remove the rows with missing values using the dropna function.
Next, we count duplicate rows using the duplicated function and remove them using the drop_duplicates function.
We then remove outliers using the interquartile range (IQR) method. The IQR is the difference between the third and first quartiles. Any row with a value more than 1.5 times the IQR below the first quartile or above the third quartile is treated as an outlier and removed from the dataset.
After handling missing values, duplicate rows, and outliers, we separate the features from the target column, quality, so that the target is not fed to the model as an input. We then split the data into training and testing sets using the train_test_split function from the sklearn.model_selection library. This step is necessary for evaluating the model's performance on unseen data.
Finally, we scale the features using StandardScaler from the sklearn.preprocessing library. The scaler is fitted on the training set only and then applied to both sets, so that no information from the test set leaks into training. Scaling puts all features on a comparable scale, which matters for algorithms that are sensitive to feature magnitudes, such as distance-based and gradient-based methods. Tree-based models are largely unaffected by it.
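The steps described above can also be bundled into a scikit-learn Pipeline, which fits every preprocessing step on the training split only and then reuses the fitted steps at prediction time. The following is a minimal sketch under stated assumptions: it uses synthetic data rather than the wine dataset, imputes missing values instead of dropping rows, and picks LogisticRegression purely as a placeholder model.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in for a real dataset: 300 rows, 4 features on mixed scales.
X = rng.normal(size=(300, 4)) * np.array([1.0, 10.0, 100.0, 0.1])
y = (X[:, 0] + X[:, 1] / 10.0 > 0).astype(int)

# Knock out about 5% of the values to mimic missing entries.
X[rng.random(X.shape) < 0.05] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Calling fit() fits the imputer, the scaler, and the model on X_train only.
model = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("classify", LogisticRegression()),
])
model.fit(X_train, y_train)

print("test accuracy:", model.score(X_test, y_test))
```

Because the test rows only ever pass through transform and predict, the medians and scaling statistics come from the training data alone.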
What if I ignored it?
Not preprocessing data before applying it to a machine learning algorithm can have several negative consequences. Some of the main issues that can arise are:
- Poor model performance: If the data is not cleaned and formatted correctly, the algorithm may not be able to understand it correctly, which can lead to poor model performance. This can be caused by missing values, outliers, or irrelevant data that is not removed from the dataset.
- Overfitting: If the dataset is not cleaned and preprocessed, it may contain irrelevant or redundant information that the model can latch onto. Overfitting occurs when a model learns the noise and quirks of its training data rather than the underlying pattern, and as a result, it performs well on the training data but poorly on new, unseen data.
- Longer training times: Not preprocessing data can lead to longer training times, as the algorithm may need to process more data than is necessary, which can be time-consuming.
- Difficulty in understanding the model: If the data is not preprocessed, it can be difficult to understand the relationships between different variables and how they are influencing the model's predictions. This can make it harder to identify errors or areas for improvement in the model.
- Biased results: If the data is not preprocessed, it may contain errors or biases that can lead to unfair or inaccurate results. For example, if rows with missing values are silently dropped and the values are not missing at random, the algorithm ends up working with a biased sample of the data, which can lead to incorrect conclusions.
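To make the effect of uncleaned data concrete, here is a minimal sketch of how a single bad value distorts a simple statistic, and how the 1.5 × IQR rule flags it. The readings are hypothetical, with one assumed data-entry error (95.0 typed instead of 9.5).

```python
import pandas as pd

# Hypothetical readings with one data-entry error (95.0 instead of 9.5).
values = pd.Series([9.4, 9.8, 10.0, 9.5, 10.2, 9.9, 95.0])

# Compute the quartiles and the IQR fences.
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag values outside the fences and drop them.
is_outlier = (values < lower) | (values > upper)
cleaned = values[~is_outlier]

print("mean with outlier:   ", round(values.mean(), 2))
print("mean without outlier:", round(cleaned.mean(), 2))
print("flagged values:      ", values[is_outlier].tolist())
```

The single bad entry pulls the mean from 9.8 up to roughly 22, and any model or scaler that relies on that mean inherits the distortion.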
In general, not preprocessing data can lead to models that are less accurate, less interpretable, and more difficult to work with. Preprocessing data is an important step in the machine learning workflow that should not be skipped.
In conclusion, preprocessing data before applying it to a machine learning algorithm is a crucial step in the ML workflow. It helps to improve accuracy, reduce the time and resources required to train the model, reduce the risk of overfitting, and improve the interpretability of the model.
The above code example demonstrates how to preprocess data using the popular Python libraries Pandas and Scikit-learn, but other libraries, such as NumPy, can also be used for preprocessing depending on the specific needs of your project.
Sumit Singh is a serial entrepreneur working towards Data Centric AI. He co-founded Labellerr, a next-gen training data platform. Labellerr’s platform allows AI-ML teams to automate their data preparation pipeline with ease.