6 Common Mistakes in Data Science and How To Avoid Them
As a novice or seasoned Data Scientist, your work depends on the data, which is rarely perfect. Properly handling the typical issues with data quality and completeness is crucial, and we review how to avoid six of these common scenarios.
Photo by chuttersnap on Unsplash.
Introduction
In data science and machine learning, we use data for descriptive analytics, drawing meaningful conclusions from the data at hand, or for predictive purposes, building models that can make predictions on unseen data. The reliability of any model depends on the level of expertise of the data scientist: it is one thing to build a machine learning model, and another to ensure the model is optimal and of the highest quality. This article discusses six common mistakes that can adversely influence the quality or predictive power of a machine learning model, with several case studies included.
6 Common Mistakes in Data Science
In this section, we discuss six common mistakes that can severely impact the quality of a data science model. Links to several real applications are included.
1. We often assume that our dataset is of good quality and reliable
Data is key to any data science and machine learning task. Data comes in different flavors such as numerical data, categorical data, text data, image data, voice data, and video data. The predictive power of a model depends on the quality of data used in building the model. It is therefore extremely important that before performing any data science task such as exploratory data analysis or building a model, you check the source and reliability of your data because even datasets that appear perfect may contain errors. There are several factors that could diminish the quality of your data:
- Wrong Data
- Missing Data
- Outliers in Data
- Redundancy in Data
- Unbalanced Data
- Lack of Variability in Data
- Dynamic Data
- Size of Data
For more information, please see the following article: Data is Always Imperfect.
From my personal experience working on an industrial data science project, my team had to work with system engineers, electrical engineers, mechanical engineers, field engineers, and technicians over a period of 3 months just to understand the available dataset and how we could use it to frame the right questions to be answered with the data. Ensuring that your data is error-free and of high quality will help improve the accuracy and reliability of your model.
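As a quick illustration of these checks, here is a minimal sketch, using pandas and a small made-up dataset, of how missing values, duplicate rows, and outliers might be flagged before any modeling begins (the column names and values are purely hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset with typical quality problems:
# a missing value and one extreme outlier in the income column.
df = pd.DataFrame({
    "income": [52000, 48000, np.nan, 51000, 50000, 49000, 1_000_000],
    "credit_score": [700, 650, 710, 650, 690, 640, 720],
})

# 1. Count missing values per column
print(df.isna().sum())

# 2. Count exact duplicate rows
print(df.duplicated().sum())

# 3. Flag outliers with the interquartile-range (IQR) rule
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)]
print(outliers)
```

Checks like these take minutes to run and can surface problems that would otherwise silently degrade a model.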
2. Don’t focus on using the entire dataset
Sometimes as a data science aspirant, you may be tempted to use the entire dataset provided for a project. However, as mentioned above, a dataset can have several imperfections, such as outliers, missing values, and redundant features. If the fraction of your dataset containing imperfections is very small, you may simply drop the imperfect subset. If the proportion of improper data is significant, however, techniques such as data imputation can be used to approximate the missing values.
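A common form of data imputation is to replace each missing value with the column mean. The sketch below, assuming scikit-learn and a made-up feature matrix, shows one way this could look:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical feature matrix with missing entries (np.nan).
X = np.array([[25.0, 700.0],
              [np.nan, 650.0],
              [30.0, np.nan],
              [35.0, 690.0]])

# Replace each missing value with the mean of its column.
imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```

Mean imputation is only one option; median or model-based imputation may fit better when the data is skewed.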
Before implementing a machine learning algorithm, it is necessary to select only the relevant features in the training dataset. The process of transforming a dataset to keep only the features necessary for training is called dimensionality reduction. Feature selection and dimensionality reduction are important for three main reasons:
a) Prevents Overfitting: A high-dimensional dataset with too many features can sometimes lead to overfitting (the model captures both real and random effects).
b) Simplicity: An overly complex model with too many features can be hard to interpret, especially when features are correlated with each other.
c) Computational Efficiency: A model trained on a lower-dimensional dataset is computationally efficient (the algorithm requires less computation time).
For more information about dimensionality reduction techniques, please see the following articles:
 Feature Selection and Dimensionality Reduction Using Covariance Matrix Plot
 Machine Learning: Dimensionality Reduction via Principal Component Analysis
Using dimensionality reduction techniques to remove unnecessary correlations between features could help improve the quality and predictive power of your machine learning model.
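To make this concrete, here is a small illustrative sketch (scikit-learn, synthetic data) in which PCA collapses two strongly correlated features into a single component; the dataset and correlation structure are invented for the example:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic dataset: 3 features, two of which are strongly correlated.
x1 = rng.normal(size=200)
x2 = 0.95 * x1 + rng.normal(scale=0.1, size=200)  # nearly duplicates x1
x3 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3])

# Keep enough principal components to explain 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)
```

Because x1 and x2 carry almost the same information, PCA keeps only two components, removing the redundant correlation without discarding real signal.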
3. Scale your data before using it for model building
Scaling your features will help improve the quality and predictive power of your model. For example, suppose you would like to build a model to predict creditworthiness based on predictor variables such as income and credit score. Because credit scores range from roughly 300 to 850 while annual income could range from $25,000 to $500,000, a model trained without scaling would be biased toward the income feature: its much larger numerical values would dominate the fit, and the model would end up predicting creditworthiness based almost entirely on income.
To bring features to the same scale, we can use either normalization or standardization. Most often, we assume the data is normally distributed and default to standardization, but that is not always the right choice. Before deciding between the two, first look at how each feature is statistically distributed. If a feature is roughly uniformly distributed, normalization (MinMaxScaler) may be appropriate; if it is approximately Gaussian, standardization (StandardScaler) is the better fit. Note that both are approximate methods and contribute to the overall error of the model.
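In practice, the two scalers look like this (a minimal scikit-learn sketch with made-up income and credit-score values):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical features on very different scales.
income = np.array([[25_000.0], [60_000.0], [120_000.0], [500_000.0]])
credit_score = np.array([[300.0], [580.0], [700.0], [850.0]])

# Normalization: squeeze each feature into the range [0, 1].
income_norm = MinMaxScaler().fit_transform(income)

# Standardization: rescale to zero mean and unit variance.
score_std = StandardScaler().fit_transform(credit_score)

print(income_norm.ravel())
print(score_std.mean(), score_std.std())
```

After either transform, no single feature dominates purely because of its units.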
4. Tune hyperparameters in your model
Using the wrong hyperparameter values in your model can lead to a suboptimal, low-quality model. It is important to tune your model across its relevant hyperparameters in order to find the configuration with optimal performance. A good example of how the predictive power of a model depends on hyperparameters can be found in the figure below (source: Bad and Good Regression Analysis).
Figure 1. Regression analysis using different values of the learning rate parameter. Source: Bad and Good Regression Analysis, Published in Towards AI, February 2019, by Benjamin O. Tayo.
Keep in mind that using default hyperparameters will not always lead to an optimal model. For more information about hyperparameters, please see this article: Model Parameters and Hyperparameters in Machine Learning — What is the difference.
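A standard way to tune hyperparameters is a cross-validated grid search. Below is a minimal illustrative sketch using scikit-learn's GridSearchCV on the built-in Iris dataset; the parameter grid shown is only an example, not a recommendation:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Search over the n_neighbors hyperparameter with 5-fold cross-validation.
grid = GridSearchCV(KNeighborsClassifier(),
                    param_grid={"n_neighbors": [1, 3, 5, 7, 9]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```

The same pattern extends to any estimator; for large grids, RandomizedSearchCV trades exhaustiveness for speed.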
5. Compare different algorithms
It is important to compare the predictive power of several different algorithms before selecting your final model. For example, if you are building a classification model, you may try the following algorithms:
- Logistic Regression classifier
- Support Vector Machines (SVM)
- Decision tree classifier
- K-nearest neighbor classifier
- Naive Bayes classifier
If you are building a regression model, you may compare the following algorithms:
- Linear regression
- K-neighbors regression (KNR)
- Support Vector Regression (SVR)
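One simple way to compare candidate algorithms is cross-validation. The sketch below, using scikit-learn and the built-in Iris dataset, scores the classification algorithms listed above with the same 5-fold cross-validation so the comparison is apples-to-apples:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),
}

# Evaluate every model with identical 5-fold cross-validation.
scores = {}
for name, model in models.items():
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {scores[name]:.3f}")
```

Keeping the evaluation protocol fixed across models is what makes the comparison meaningful.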
For more information about comparing different algorithms, please see the following articles:
6. Quantify random error and uncertainties in your model
Every machine learning model has an inherent random error. This error can arise from the inherent randomness of the dataset; from the random way the dataset is partitioned into training and testing sets during model building; or from randomization of the target column (a method used for detecting overfitting). It is important to always quantify how random error affects the predictive power of your model, as this helps improve its reliability and quality. For more information about random error quantification, please see the following article: Random Error Quantification in Machine Learning.
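One simple way to quantify the error introduced by random partitioning is to repeat the train/test split with different random seeds and report the spread of the resulting scores. A minimal sketch, using scikit-learn and the built-in Iris dataset (the model and split size are just examples):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Repeat the split with 30 different seeds and record the test score each time.
scores = []
for seed in range(30):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    scores.append(model.score(X_te, y_te))

scores = np.array(scores)
# Report the score with its spread: mean +/- one standard deviation.
print(f"accuracy = {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting a mean with a spread, rather than a single score from one lucky split, gives a far more honest picture of the model's predictive power.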
Summary
In summary, we have discussed six common mistakes that can influence the quality or predictive power of a machine learning model. It is useful to always ensure that your model is optimal and of the highest quality. Avoiding the mistakes discussed above can enable a data science aspirant to build reliable and trustworthy models.