9 Deadly Sins of Machine Learning Dataset Selection

Avoid endless pain in model debugging by focusing on datasets upfront.

By Sandeep Uttamchandani, Ph.D., Both a Product/Software Builder (VP of Engg) & Leader in operating enterprise-wide Data/AI initiatives (CDO)

Let’s start with an obvious fact: ML models can only be as good as the datasets that were used to build them! While there is a lot of emphasis on ML model building and algorithm selection, teams often do not pay enough attention to dataset selection!


In my experience, investing time upfront in dataset selection saves endless hours later during model debugging and production rollout.


Nine Deadly Sins of ML Dataset Selection


1. Not handling outliers in datasets properly

Depending on the ML model being built, outliers can either be noise to ignore or signal that must be taken into account. Outliers arising from collection errors are the ones to discard. Machine learning algorithms also differ in their sensitivity to outliers: AdaBoost is more sensitive than XGBoost, which in turn is more sensitive than a decision tree, which would simply count an outlier as a single misclassification. Proper handling of outliers requires understanding whether they can be ignored, and then picking the appropriate algorithm based on its sensitivity.
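As an illustrative sketch, a 1.5×IQR filter (a common rule of thumb, not the only choice) can flag collection-error outliers before you pick an algorithm; the sensor readings below are hypothetical:

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR] as outliers."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Hypothetical sensor feed with one obvious collection error (1e6)
readings = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 1e6])
mask = iqr_outlier_mask(readings)
clean = readings[~mask]  # drops only the collection error
```

Whether to drop or down-weight the flagged points still depends on the model and the domain, as noted above.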


2. Using Normalization instead of Standardization for scaling feature values

To bring features to the same scale, use normalization (MinMaxScaler) when the data is uniformly distributed, and standardization (StandardScaler) when the feature is approximately Gaussian. Before using a dataset, also verify that it is i.i.d. and stationary (not changing over time), and that the training and test sets follow the same distribution. Seasonality is often missed, and it is a classic violation of stationarity.
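A minimal NumPy sketch of the two scalers (scikit-learn's MinMaxScaler and StandardScaler perform the same arithmetic, plus fit/transform bookkeeping); the feature arrays are made up for illustration:

```python
import numpy as np

def min_max_scale(x):
    """Normalization: map values into [0, 1]; suits uniform-ish features."""
    return (x - x.min()) / (x.max() - x.min())

def standardize(x):
    """Standardization: zero mean, unit variance; suits Gaussian-ish features."""
    return (x - x.mean()) / x.std()

uniform_feature = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
gaussian_feature = np.random.default_rng(0).normal(loc=50, scale=10, size=1000)

scaled = min_max_scale(uniform_feature)  # values now span [0, 1]
z = standardize(gaussian_feature)        # mean ~0, std ~1
```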


3. Not verifying for duplicates in the training dataset

Oftentimes, teams have been excited by really high accuracy numbers, and double-checking reveals that many of the examples in the test set are duplicates of examples in the training set. In such scenarios, measurements of model generalization are meaningless. A related aspect is randomization of the training set: without randomization, we may end up with all the fall data in training and all the summer data in the test set, which leads to confusing loss-epoch graphs and unnecessary debugging.
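A quick leakage check, sketched here with hypothetical rows, catches test examples that duplicate training examples; shuffling before the split guards against seasonal blocks:

```python
import random

def train_test_leakage(train_rows, test_rows):
    """Return test rows that also appear verbatim in the training set."""
    train_set = {tuple(row) for row in train_rows}
    return [row for row in test_rows if tuple(row) in train_set]

# Hypothetical rows: one test example duplicates a training example
train = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
test = [[3.0, 4.0], [7.0, 8.0]]
leaks = train_test_leakage(train, test)  # [[3.0, 4.0]]

# Shuffle before splitting so e.g. fall and summer records land in both sets
records = list(range(100))
random.Random(0).shuffle(records)
train_split, test_split = records[:80], records[80:]
```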


4. Not verifying inherent dataset bias

Datasets do not capture the ultimate truth from a statistical standpoint. They only capture the attributes that the application owners required at the time for their use case. It is important to analyze datasets for bias and dropped data, and understanding the context of the dataset is critical. Datasets often have one or more error patterns. If these errors are random, they are less harmful to model training. But if there is a bug such that a specific row or column is systematically missing, it can bias the dataset. For instance, if device details of customer clicks are missing for Android users due to a bug, the dataset will be biased toward iPhone user activity.
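One way to catch systematic missingness is to break the missing-value rate out per group; a rate near 100% for a single group, as in this hypothetical click log, points to a bug rather than random noise:

```python
# Hypothetical click log; `device` is None whenever the logging bug fires
clicks = [
    {"user": 1, "platform": "android", "device": None},
    {"user": 2, "platform": "android", "device": None},
    {"user": 3, "platform": "ios", "device": "iPhone 12"},
    {"user": 4, "platform": "ios", "device": "iPhone 13"},
]

def missing_rate_by_group(rows, group_key, field):
    """Fraction of rows with a missing `field`, broken out per group."""
    totals, missing = {}, {}
    for row in rows:
        group = row[group_key]
        totals[group] = totals.get(group, 0) + 1
        if row[field] is None:
            missing[group] = missing.get(group, 0) + 1
    return {group: missing.get(group, 0) / n for group, n in totals.items()}

rates = missing_rate_by_group(clicks, "platform", "device")
# rates == {"android": 1.0, "ios": 0.0} — systematic, not random, missingness
```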


5. No unit tests for validating input data

In traditional software development projects, it is a best practice to write unit tests to validate code dependencies. In ML projects, a similar best practice needs to be applied to continuously test, verify, and monitor all the input datasets. This includes ensuring that test sets yield statistically meaningful results and are representative of the dataset as a whole.
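A sketch of such a data "unit test", assuming a hypothetical schema of column name → (type, min, max); production pipelines often use dedicated validation libraries, but the idea is the same:

```python
def validate_dataset(rows, schema):
    """Minimal input-data checks: required columns, types, value ranges."""
    errors = []
    for i, row in enumerate(rows):
        for col, (col_type, lo, hi) in schema.items():
            if col not in row:
                errors.append(f"row {i}: missing column '{col}'")
            elif not isinstance(row[col], col_type):
                errors.append(f"row {i}: '{col}' has wrong type")
            elif not (lo <= row[col] <= hi):
                errors.append(f"row {i}: '{col}'={row[col]} out of range")
    return errors

# Hypothetical schema and rows
schema = {"age": (int, 0, 120), "income": (float, 0.0, 1e7)}
good_rows = [{"age": 34, "income": 55000.0}]
bad_rows = [{"age": 300, "income": 55000.0}]
good_errors = validate_dataset(good_rows, schema)  # []
bad_errors = validate_dataset(bad_rows, schema)    # one out-of-range error
```

Running checks like these on every ingest, not just once, is what makes them the ML analogue of unit tests.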


6. Incorrect assumptions about data attribute meaning

Data attributes are typically never documented. Prior to the big data era, data was curated before being added to the central data warehouse; this is known as schema-on-write. Today, the data lake approach is to first aggregate the data and then infer its meaning at the time of consumption; this is known as schema-on-read. A related issue is the existence of multiple definitions for a given business metric, i.e., a lack of business metrics standardization. There can be multiple sources of truth and business definitions associated with even the simplest of metrics. For instance, a basic metric such as “new customer count” can have different definitions depending on whether it is calculated by the sales, finance, marketing, or customer support team.


7. Uncoordinated changes at the data source

Changes at the source are often uncoordinated with downstream processing teams. They range from schema changes (breaking existing pipelines) to difficult-to-detect semantic changes to the data attributes (very ugly when your model unexpectedly starts going nuts!). Also, when business metrics change, their definitions are rarely versioned.
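Semantic changes are hard to automate away, but breaking schema changes can at least be caught at ingestion by diffing incoming columns against an expected snapshot (the column names below are hypothetical):

```python
def schema_diff(expected_cols, incoming_cols):
    """Flag added/dropped columns before they break downstream pipelines."""
    expected, incoming = set(expected_cols), set(incoming_cols)
    return {"added": sorted(incoming - expected),
            "dropped": sorted(expected - incoming)}

# The source team renamed `amount` to `amount_usd` without telling anyone
diff = schema_diff(["user_id", "ts", "amount"],
                   ["user_id", "ts", "amount_usd"])
# diff == {"added": ["amount_usd"], "dropped": ["amount"]}
```

Alerting on a non-empty diff turns a silent pipeline break into a coordination conversation.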


8. Using non-representative data

Data has an expiry date: records of customer behavior from 10 years back may no longer be representative. Additionally, ensure data is IID (Independent and Identically Distributed) for model training and take into account the seasonality of the data. Also, datasets are constantly evolving. Analysis of the data distribution is not a one-time activity required only at the time of model creation; instead, datasets need to be continuously monitored for drift, especially for online training. Oftentimes, given the siloed nature of data, different datasets are managed and cataloged by different teams, and a lot of tribal knowledge is needed to locate them. Without the right due diligence, teams jump at the first available dataset and make the classic mistake of assuming that all datasets are equally reliable. Some are updated and closely managed by source teams, while others are abandoned, irregularly updated, or fed by flaky ETL pipelines.
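As a simple sketch of drift monitoring, the live mean can be compared against the training mean in units of the training standard deviation; real systems often use distribution-level tests such as Kolmogorov–Smirnov, but the principle is the same. All numbers below are made up:

```python
import statistics

def mean_drift_score(train_values, live_values):
    """Shift of the live mean from the training mean, in training-std units."""
    mu = statistics.mean(train_values)
    sigma = statistics.stdev(train_values)
    return abs(statistics.mean(live_values) - mu) / sigma

train = [10.0, 11.0, 9.0, 10.5, 9.5]      # feature values at training time
live_ok = [10.2, 9.8, 10.1]               # fresh data, same regime
live_drifted = [15.0, 16.0, 14.5]         # fresh data after a shift

ok_score = mean_drift_score(train, live_ok)          # well under 1
drift_score = mean_drift_score(train, live_drifted)  # several stds out
```

A score that creeps past a chosen threshold is the trigger to re-examine the dataset, not just the model.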


9. Arbitrary sample selection within a large dataset

Given very large datasets, sampling is typically arbitrary; oftentimes, teams simply decide to use all the data for training. While more data helps to build an accurate model, some datasets are huge, with billions of records. Training on a larger dataset takes both time and resources, and each training iteration takes longer, slowing down overall project completion. There is a need to use data sampling effectively, paying special attention to techniques such as importance sampling.
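A small sketch of one principled alternative to arbitrary slicing, stratified sampling (a simpler cousin of importance sampling), with a hypothetical `label` field: draw a capped number of rows per class so rare classes stay represented:

```python
import random

def stratified_sample(rows, label_key, n_per_class, seed=0):
    """Draw up to n_per_class rows from each class, instead of an arbitrary slice."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    sample = []
    for group in by_class.values():
        rng.shuffle(group)
        sample.extend(group[:n_per_class])
    return sample

# 1000 hypothetical rows; "fraud" is a rare class (2% base rate)
rows = [{"label": "fraud" if i % 50 == 0 else "ok", "x": i} for i in range(1000)]
sample = stratified_sample(rows, "label", n_per_class=20)
# The rare class is now represented far above its base rate
```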

In summary, make sure to incorporate this checklist in your dataset selection. While these steps add to the effort and potentially slow down initially, they pay for themselves multiple times later on in the ML lifecycle!

To safeguard against the ML pitfalls listed in this blog, follow to get notified about the upcoming blog “The AI Checklist.” For strategies on managing Data+AI in production, check out Unravel Data.

Bio: Sandeep Uttamchandani, Ph.D.: Data + AI/ML -- Both a Product/Software Builder (VP of Engg) & Leader in operating enterprise-wide Data/AI initiatives (CDO) | O'Reilly Book Author | Founder - DataForHumanity (non-profit)

Original. Reposted with permission.