Don’t Touch a Dataset Without Asking These 10 Questions

Selecting the right dataset is critical for the success of your AI project.



By Sandeep Uttamchandani, Ph.D., Product/Software Builder (VP of Engineering) and leader of enterprise-wide Data/AI initiatives (CDO)


Data is the heart of an AI product. There is a growing emphasis on tuning the data instead of tuning the models, an approach Andrew Ng has termed data-centric AI. In my experience, the success or failure of an AI project can be predicted from the datasets being used.

If you are a Data Scientist or AI Engineer building a new model, or a Data Engineer building pipelines for an AI project, ask the following questions about every dataset you shortlist to avoid headaches and missed expectations later in the AI lifecycle.

 

1. Is the meaning of dataset attributes documented?

 
Prior to the big data era, data was curated before being added to the central data warehouse. This is known as schema-on-write. Today, the data lake approach is to first aggregate the data and then infer its meaning at the time of consumption. This is known as schema-on-read.

Data attributes are seldom documented correctly or kept up to date. While writing documentation can be seen as a step that slows the project down, it becomes extremely critical down the line during model debugging. Identify the Data Steward who owns the dataset and ensure they can provide the most accurate documentation.
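As a lightweight starting point, the documentation can live next to the pipeline code as a machine-readable data dictionary. Below is a minimal sketch; the dictionary entries and the `orders` DataFrame are hypothetical, not from any specific system. It flags attributes that appear in the data but have no documented meaning:

```python
import pandas as pd

# Hypothetical data dictionary: attribute -> meaning, owned by the Data Steward.
DATA_DICTIONARY = {
    "order_id": "Unique identifier for the order",
    "order_ts": "UTC timestamp when the order was placed",
    "amount_usd": "Order total in US dollars, after discounts",
}

def undocumented_columns(df: pd.DataFrame, dictionary: dict) -> list:
    """Return columns present in the dataset but missing from the documentation."""
    return [col for col in df.columns if col not in dictionary]

orders = pd.DataFrame(columns=["order_id", "order_ts", "amount_usd", "channel"])
print(undocumented_columns(orders, DATA_DICTIONARY))  # ['channel'] -> ask the Data Steward
```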

 

2. Are the aggregate/derived metrics in the dataset standardized?

 
Derived data and metrics can have multiple sources of truth and competing business definitions. Ensure that each metric has a clear, documented business definition (the definition is sometimes implicit within the ETL logic).
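One way to avoid competing definitions is to encode each metric exactly once, in shared code, and have every pipeline import it. A minimal sketch, where the metric name and column names are illustrative assumptions:

```python
import pandas as pd

def net_revenue_usd(orders: pd.DataFrame) -> float:
    """Business definition of net revenue: gross amount minus refunds, in USD.
    Every consumer imports this one function instead of re-deriving the metric."""
    return float((orders["gross_usd"] - orders["refund_usd"]).sum())

orders = pd.DataFrame({"gross_usd": [100.0, 250.0], "refund_usd": [0.0, 50.0]})
print(net_revenue_usd(orders))  # 300.0
```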

 

3. Does the dataset comply with data rights regulations (such as GDPR, CCPA, etc.)?

 
Data rights regulations are now becoming critical, and it is important to track and enforce them during model training and re-training. There is a growing number of data rights regulations, including GDPR, CCPA, Brazil's General Data Protection Law (LGPD), and India's Personal Data Protection Bill. These laws require customer data to be collected, used, and deleted according to customer preferences. Data rights span several aspects: collection, use, deletion, and access.
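In practice, enforcing use and deletion rights at training time often comes down to filtering on per-record consent flags before data reaches the model. A hedged sketch, assuming hypothetical `consent_use` and `deletion_requested` columns (your consent model will likely be richer):

```python
import pandas as pd

def training_eligible(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only records whose owners consented to this use and
    have not requested deletion."""
    mask = df["consent_use"] & ~df["deletion_requested"]
    return df[mask]

customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "consent_use": [True, True, False],
    "deletion_requested": [False, True, False],
})
print(training_eligible(customers))  # only customer 1 survives
```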

 

4. Is there a clear change management process so that dataset schema/definition changes are communicated to all consumers?

 
It's very common for schema changes at the source to be uncoordinated with downstream processing. The changes range from schema changes that break existing pipelines to difficult-to-detect semantic changes in the data attributes. Also, when business metrics change, their definitions are rarely versioned.
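Even without a formal change management process, a pipeline can at least fail loudly when the upstream schema drifts. Here is a minimal sketch comparing the incoming schema against a pinned, version-controlled expectation; the column names and types are illustrative:

```python
import pandas as pd

# Pinned expectation, checked into version control alongside the pipeline.
EXPECTED_SCHEMA = {"order_id": "int64", "order_ts": "datetime64[ns]", "amount_usd": "float64"}

def assert_schema(df: pd.DataFrame, expected: dict) -> None:
    """Raise early if columns were added, dropped, or changed type upstream."""
    actual = {col: str(dtype) for col, dtype in df.dtypes.items()}
    if actual != expected:
        raise ValueError(f"Schema drift detected:\nexpected={expected}\nactual={actual}")
```

Note that this catches structural changes only; semantic changes (the same column silently meaning something new) still require coordination with the data producer.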

 

5. What is the context in which the dataset was collected?

  
Datasets seldom capture the ultimate truth from a statistical standpoint. They capture only the attributes that the application owners required at the time for their use case. It is important to analyze datasets for bias and dropped data; understanding the context in which a dataset was collected is critical.
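A quick way to make that context visible is to compare the dataset's composition against what you believe the real population looks like. A sketch under stated assumptions: the `segment` column is hypothetical, and the baseline shares would come from an external source such as your overall user base:

```python
import pandas as pd

# Hypothetical baseline shares from an external source (e.g., overall user base).
POPULATION_SHARE = {"mobile": 0.6, "desktop": 0.35, "tablet": 0.05}

def representation_gap(df: pd.DataFrame, col: str, baseline: dict) -> pd.Series:
    """Difference between the dataset's segment shares and the expected
    population shares; large negative values flag under-represented segments."""
    observed = df[col].value_counts(normalize=True)
    expected = pd.Series(baseline)
    return observed.reindex(expected.index, fill_value=0.0) - expected
```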

 

6. Is the data IID? 

  
The implicit assumption in model training is that the data is IID (independent and identically distributed). Also, data has an expiry date: records of customer behavior from 10 years ago may not be representative of customers today.
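Neither assumption has to be taken on faith. The sketch below (column names are hypothetical) drops stale records and then runs a two-sample Kolmogorov-Smirnov test between an early and a late time slice of a feature; a very small p-value is a hint that the data is not identically distributed over time:

```python
import pandas as pd
from scipy.stats import ks_2samp

def check_temporal_iid(df: pd.DataFrame, ts_col: str, feature: str,
                       max_age_days: int = 365) -> float:
    """Drop records older than max_age_days, then compare the feature's
    distribution in the first vs. second half of the remaining time range."""
    cutoff = df[ts_col].max() - pd.Timedelta(days=max_age_days)
    fresh = df[df[ts_col] >= cutoff].sort_values(ts_col)
    half = len(fresh) // 2
    stat, p_value = ks_2samp(fresh[feature].iloc[:half], fresh[feature].iloc[half:])
    return p_value  # a tiny p-value suggests the distribution shifts over time
```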

 

7. Is the dataset tested/validated for systematic errors in data collection?

 
If errors in the dataset are random, they are less harmful to model training. But if a bug causes a specific row or column to be systematically missing, it can bias the dataset. For instance, if device details of customer clicks are missing for one user category due to a bug, the dataset will be non-representative of reality.
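Systematic gaps like this are easy to surface by slicing missingness by segment rather than looking at one overall missing rate. A minimal sketch, with hypothetical `user_category` and `device` columns:

```python
import pandas as pd

def missing_rate_by_group(df: pd.DataFrame, group_col: str, value_col: str) -> pd.Series:
    """Fraction of missing values per group; one group near 1.0 while the others
    are near 0.0 points to a systematic collection bug, not random noise."""
    return df.groupby(group_col)[value_col].apply(lambda s: s.isna().mean())

clicks = pd.DataFrame({
    "user_category": ["new", "new", "returning", "returning"],
    "device": [None, None, "ios", "android"],
})
print(missing_rate_by_group(clicks, "user_category", "device"))  # new: 1.0, returning: 0.0
```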

 

8. Is the dataset monitored for sudden distribution changes?

 
Datasets are constantly evolving. Analyzing the data distribution is not a one-time activity required only at model creation time. Instead, datasets need to be continuously monitored for drift, especially with online training.
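One common lightweight monitor is the Population Stability Index (PSI), computed between a reference snapshot (for example, the training data) and each new batch. A sketch follows; the usual rule of thumb is that PSI above roughly 0.2 deserves investigation, though the threshold is a convention rather than a guarantee:

```python
import numpy as np

def psi(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between a reference and a current sample.
    Bins are derived from the reference distribution's quantiles."""
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # cover values outside the reference range
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_pct = np.histogram(current, bins=edges)[0] / len(current)
    # Clip to avoid log(0) for empty bins.
    ref_pct, cur_pct = np.clip(ref_pct, 1e-6, None), np.clip(cur_pct, 1e-6, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))
```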

 

9. How are outliers handled in the dataset?

 
Outliers are not necessarily bad and are sometimes essential to building the model correctly. It is important to understand whether outliers are filtered during collection and, if so, what the logic and criteria are.
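Rather than having outliers silently dropped upstream, one option is to flag them and keep the filtering decision explicit in the training code. A sketch using the common 1.5×IQR rule (the multiplier is a convention, not a law):

```python
import pandas as pd

def flag_outliers_iqr(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Mark values outside [Q1 - k*IQR, Q3 + k*IQR]; keep them in the data
    so the decision to drop or keep them stays auditable downstream."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

amounts = pd.Series([10, 12, 11, 13, 500])
print(flag_outliers_iqr(amounts))  # only the 500 is flagged
```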

 

10. Does the dataset have an assigned Data Steward? (applicable to larger teams)

  
Datasets are useless if they cannot be understood. Trying to reverse engineer the meaning of columns is often a losing battle. The key is to ensure that a Data Steward is responsible for each dataset, updating and evolving its documentation.

 
In my experience, the answers to these questions help proactively uncover the known knowns, known unknowns, and unknown unknowns in a dataset. Not every question needs an affirmative answer. Rather, taking the responses into account can speed up the AI lifecycle and help avoid blind spots.

 
Bio: Sandeep Uttamchandani, Ph.D.: Data + AI/ML. Both a Product/Software Builder (VP of Engineering) and a leader operating enterprise-wide Data/AI initiatives (CDO) | O'Reilly Book Author | Founder, DataForHumanity (non-profit)
