Data-Centric AI: The Latest Research You Need to Know




While the vast majority of research efforts today are preoccupied solely with ML models and algorithms, the data itself tends to be secondary and is treated as fixed. This tendency is potentially detrimental: there’s a big risk of favoring theory over practice as models become more divorced from the ground truth. There’s a need to combat this trend by giving researchers and practitioners alike the incentive and information to work with the data instead.


Let’s take a closer look

Two interesting ideas to examine:

  • When faced with the problem of model improvement, one should address it not by reworking the algorithms, but rather by seeking to improve data quality.
  • When working with the data, one should look at the process as a whole (which includes a careful selection of labelers), not just the technical aspect of it (for example, aggregation).

Some useful resources and contests to check out:

  • Data-Centric AI Competition was organized by Andrew Ng and his team, who invited participants from all over the globe to improve the contest’s data without touching the algorithm, thereby bypassing the common model-centric approach. The dataset consisted of Roman numerals that had to be improved by applying any number of data-centric techniques: fixing incorrect labels, adding one’s own data, augmenting the data, etc.
  • A paper by Koch et al., supported by a short talk by Koch, focuses on how dataset usage patterns differ across ML subcommunities. The main conclusion of this research is that, aside from NLP, researchers tend to reuse the very same datasets for different tasks. As a result, fewer benchmarks represent real-world data science problems.
  • MLCommons Foundation is dedicated to accelerating ML innovation and has a variety of handy materials and techniques listed on its page.
  • DataPerf is an initiative from several universities, companies, and respected researchers to create benchmarks for training sets, test sets, and a range of data-centric algorithms.

All of the articles are available on the Proceedings page, as are those from Datasets and Benchmarks and the Data-Centric AI Workshop.


Data quality

It’s extremely important to measure the quality of data sets; however, there’s currently no universally agreed-upon method of how to do it. While the accuracy of models can be gauged using different metrics, a reliable approach to data evaluation is yet to be found. Data-related problems can result in poor performance of the model when noisy sets are involved. At the same time, improperly labeled data can also lead to a seemingly healthy ML model learning to adopt erroneous and even potentially dangerous patterns.

A number of studies look at various dataset errors – the first step to boosting data quality:

  • Northcutt, Athalye, and Mueller studied errors in ten of the most commonly used CV, NLP, and audio datasets to figure out how these mistakes affect benchmarks. The result of their efforts is Cleanlab, an open-source solution that helps identify label errors; the code is available on GitHub.
  • Kang et al. introduced a paper, which formed the basis for a lightning talk, proposing learned observation assertions, a probabilistic technique designed to test ML pipelines.

Data collection procedures are also worth discussing. The main question is again how to maintain high quality and, ideally, move toward the kind of universal standards and methodologies found in mainstream software engineering.


Data collection and labeling

Another question about data collection is how to capture enough data variability to accurately represent less common real-world occurrences. Some of the proposed solutions center on setting data collection parameters by hand (for example, weather patterns and the number of pedestrians for autonomous vehicles, or AV), using learning iterations to continuously add unusual cases to the set, and utilizing synthetic data. The following research efforts explore them:

  • A comprehensive talk about data collection in the context of AV by Raquel Urtasun, University of Toronto Professor and Founder of Waabi (from minute 57).
  • A research paper titled Data Augmentation for Intent Classification by Chen and Yin with a corresponding lightning talk. The researchers used mixed augmentation techniques to generate pseudo-labeled training examples from seed data within the airline and telecommunications industries.
  • A paper by Pavlichenko et al. on CrowdSpeech contributes a large dataset to address one of the most common problems in crowdsourcing: audio transcription aggregation. The dataset contains transcribed recordings from a well-known speech recognition dataset, LibriSpeech. Interestingly, the best aggregation model was initially designed for text summarization but outperformed other models, including the task-specific ones. Also, to allow processing of other languages, the authors created a pipeline called Vox DIY that builds a similar dataset for virtually any natural language.

The issues of supervised data labeling, inadequate instructions, and labeler training are also quite interesting.

Data management is another challenge the AI community is facing. How do we manage data sets embedded within complex pipelines when numerous actions and edits are required over extended periods? Is there a need for version control and keeping a changelog? These two talks focus specifically on data documentation:

  • Data Cards is a template suggested by Pushkarna and Zaldivar of Google Research. The method is used to summarize critical information about datasets. A lightning talk is available.
  • DAG Cards by Tagliabue et al. is one of the latest proposals for flexible pipeline documentation. This lightning talk summarizes the logic behind it along with its main advantages.
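As a rough illustration of what version control and a changelog for data could look like, the sketch below fingerprints a dataset and records an entry only when its content actually changes. It is a toy example under stated assumptions, not a method from either talk: the helper names `fingerprint` and `log_change`, the changelog fields, and the sample records are all hypothetical.

```python
import hashlib
import json


def fingerprint(records):
    """Order-independent SHA-256 fingerprint of JSON-serializable records."""
    row_hashes = sorted(
        hashlib.sha256(json.dumps(r, sort_keys=True).encode("utf-8")).hexdigest()
        for r in records
    )
    return hashlib.sha256("".join(row_hashes).encode("utf-8")).hexdigest()


def log_change(changelog, records, note):
    """Append a changelog entry if the data differs from the last version.

    Returns True when a new entry was added, False when nothing changed.
    """
    digest = fingerprint(records)
    if changelog and changelog[-1]["fingerprint"] == digest:
        return False
    changelog.append({
        "version": len(changelog) + 1,
        "fingerprint": digest,
        "rows": len(records),
        "note": note,
    })
    return True


changelog = []
data = [{"text": "book a flight", "label": "booking"},
        {"text": "cancel my plan", "label": "booking"}]

added_initial = log_change(changelog, data, "initial import")
# Reordering rows does not change the content, so no new version is logged.
added_reorder = log_change(changelog, list(reversed(data)), "shuffled rows")

# Fixing a label does change the content.
fixed = [dict(data[0]), {"text": "cancel my plan", "label": "cancellation"}]
added_fix = log_change(changelog, fixed, "fixed label on row 2")
```

Hashing each row separately and sorting the hashes keeps the fingerprint stable under row reordering, which is usually what you want when the same data passes through several pipeline stages.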



The question of ethics is a highly contentious topic. Some of the key aspects to bear in mind:

  • Privacy, including but not limited to GDPR guidelines.
  • Biases – avoiding dataset stereotypes that often negatively affect training models.
  • Minorities – a lack of dataset diversification with many social groups often being underrepresented or excluded altogether.
  • Little can be done to solve these problems by modifying algorithms; consequently, it’s the data-collection and data-labeling processes that need to be refined.
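A first, very simple check for the underrepresentation problem is to measure how large a share of the dataset each group actually has. The sketch below does just that; the function name `underrepresented_groups`, the `min_share` threshold, and the sample records are assumptions chosen for illustration, and a real audit would need far more than counting.

```python
from collections import Counter


def underrepresented_groups(records, key, expected_groups, min_share=0.1):
    """Return {group: share} for groups whose share is below min_share.

    Groups listed in expected_groups but absent from the data get share 0.0,
    so fully excluded groups are reported too.
    """
    if not records:
        raise ValueError("records must not be empty")
    counts = Counter(r[key] for r in records)
    total = len(records)
    groups = set(expected_groups) | set(counts)
    shares = {g: counts.get(g, 0) / total for g in groups}
    return {g: s for g, s in shares.items() if s < min_share}


# Toy dataset: 18 records from group "a", one each from "b" and "c",
# and none at all from the expected group "d".
records = (
    [{"group": "a"}] * 18 + [{"group": "b"}] + [{"group": "c"}]
)
flagged = underrepresented_groups(records, "group", ["a", "b", "c", "d"])
```

Passing the list of expected groups explicitly matters: a group that is missing altogether would otherwise never show up in the counts.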

A solid starting point for this is to identify the causes of these problems and their frequency. Check out these three talks dedicated to this endeavor:

  • A talk by Bao et al., supported by their research paper titled It’s COMPASlicated. The authors criticize Risk Assessment Instrument (RAI) datasets used by law enforcement, namely COMPAS, that are inherently biased and racist.
  • A research paper and talk by Peng, Mathur, and Narayanan that focus on ethically problematic facial recognition datasets and the implications of their use, based on an analysis of 1000 papers that cited these sets.
  • A data-centric AI workshop paper that proposes IMDB-WIKI-SbS, a new large-scale dataset for learning from subjective human opinions on people's ages, which allows designing better recommender and ranking algorithms. It is worth noting that although the dataset is balanced with respect to age and gender groups, it is currently difficult for machine learning models to handle photos in all the groups equally well.

While the ethical problems are apparent, the solutions are not. One of the few research examples that attempt to repair an existing dataset is Feminist Curation of Text for Data-Centric AI by Bartl and Leavy. Bartl’s lightning talk outlines how the researchers leveraged feminist linguistics for the assessment and mitigation of gender biases in text datasets and language models.

Lastly, a useful paper worth checking out is The Disagreement Deconvolution. This paper, co-authored by Bernstein, focuses on how majority voting in aggregation can sometimes lead to poor results, as showcased by ROC-AUC metrics.
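To see why majority voting can be lossy, consider the toy sketch below. It is not the method from the paper, only an illustration: the helper names `majority_vote` and `label_distribution` and the example votes are assumptions. Majority voting collapses five annotator votes into a single label, while the label distribution keeps the fact that two out of five annotators disagreed.

```python
from collections import Counter


def majority_vote(votes):
    """Return the most frequent label among the annotators' votes."""
    label, _ = Counter(votes).most_common(1)[0]
    return label


def label_distribution(votes):
    """Return the share of annotators who chose each label."""
    counts = Counter(votes)
    total = len(votes)
    return {label: n / total for label, n in counts.items()}


# Five annotators judge the same comment; two of them find it toxic.
votes = ["toxic", "toxic", "ok", "ok", "ok"]

hard_label = majority_vote(votes)        # the disagreement is discarded
soft_label = label_distribution(votes)   # the disagreement is preserved
```

A model evaluated only against `hard_label` is told the comment is simply "ok", even though a sizable minority of annotators saw it differently.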

Alexey Umnov is a Machine Learning Consultant at Toloka and holds a PhD.