5 Essential Papers on AI Training Data

Data preprocessing is not only the largest time sink for most data scientists, but also the most crucial aspect of the work. Learn more about training data and data processing tasks from 5 leading academic papers.

Many data scientists claim that around 80% of their time is spent on data preprocessing, and for good reason: collecting, annotating, and formatting data are crucial tasks in machine learning. This article will help you understand the importance of these tasks and learn methods and tips from other researchers.

Below, we will highlight academic papers from reputable universities and research teams on various training data topics. The topics include the importance of human annotators, how to create large datasets in a relatively short time, ways to securely handle training data that may include private information, and more.


1. How Important are Human Annotators?


This paper presents a firsthand account of how annotator quality can greatly affect your training data and, in turn, the accuracy of your model. In this sentiment classification project, researchers from the Jožef Stefan Institute analyze a large dataset of sentiment-annotated tweets in multiple languages. Interestingly, the project found no statistically significant difference between the performance of the top classification models. Instead, the quality of the human annotators was the larger factor in determining the accuracy of the model.

To evaluate their annotators, the team used both inter-annotator agreement and self-agreement measures. They found that self-agreement is a good way to weed out poor-performing annotators, while inter-annotator agreement can be used to measure the objective difficulty of the task.
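As a rough illustration of how such agreement scores can be computed, the sketch below uses Cohen's kappa, a common chance-corrected agreement statistic. This is a swapped-in metric chosen for brevity and is not necessarily the measure used in the paper; the function names and the toy labels are hypothetical. Self-agreement can be estimated the same way by comparing two passes by the same annotator over the same tweets.

```python
from collections import Counter


def percent_agreement(labels_a, labels_b):
    """Fraction of items on which two label sequences match."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal in length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(labels_a)
    observed = percent_agreement(labels_a, labels_b)
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Chance agreement: probability both pick the same class independently.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both sequences use a single identical class
    return (observed - expected) / (1 - expected)


# Hypothetical toy data: two annotators labeling the same six tweets.
annotator_1 = ["pos", "neg", "neu", "pos", "neg", "pos"]
annotator_2 = ["pos", "neg", "pos", "pos", "neu", "pos"]

# Inter-annotator agreement: compare two different annotators.
print("inter-annotator kappa:", cohen_kappa(annotator_1, annotator_2))

# Self-agreement: compare two passes by the same annotator (identical here).
print("self-agreement kappa:", cohen_kappa(annotator_1, annotator_1))
```

A low self-agreement score points at an unreliable annotator, whereas low agreement between otherwise consistent annotators suggests the task itself is hard.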

Research Paper: Multilingual Twitter Sentiment Classification: The Role of Human Annotators

Authors / Contributors: Igor Mozetič, Miha Grčar, Jasmina Smailović (all authors from the Jožef Stefan Institute)

Date Published / Last Updated: May 5, 2016


2. A Survey On Data Collection for Machine Learning


From a research team at the Korea Advanced Institute of Science and Technology (KAIST), this paper is perfect for beginners looking to get a better understanding of the data collection, management, and annotation landscape. Furthermore, the paper introduces and explains the processes of data acquisition, data augmentation, and data generation.

For those new to machine learning, this paper is a great resource to help you learn about many of the common techniques to create high-quality datasets used in the field today.

Research Paper: A Survey on Data Collection for Machine Learning

Authors / Contributors: Yuji Roh, Geon Heo, Steven Euijong Whang (all authors from KAIST)

Date Published / Last Updated: August 12, 2019


3. Using Weak Supervision to Label Large Volumes of Data


For many machine learning projects, sourcing and annotating large datasets takes up substantial amounts of time. In this paper, researchers from Stanford University propose a system for the automatic creation of datasets through a process called “data programming”.

A table in the paper reports precision, recall, and F1 scores for data programming (DP) in comparison to the distant supervision ITR approach.

The proposed system employs weak supervision strategies to label subsets of the data. The resulting labels will likely contain a certain level of noise. The team then reduces this noise by representing the labeling process as a generative model, and presents ways to modify a loss function so that it is “noise-aware.”
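To make the idea of weak supervision more concrete, here is a minimal sketch of heuristic labeling functions that each vote positive, negative, or abstain on an example. The votes are combined with a simple majority vote, which is a deliberate simplification: the paper instead fits a generative model over the labeling functions to estimate how reliable each one is. All function names, keywords, and example texts below are hypothetical.

```python
# Each labeling function returns one of these votes.
ABSTAIN, NEGATIVE, POSITIVE = 0, -1, 1


def lf_contains_great(text):
    """Hypothetical heuristic: the word 'great' suggests a positive label."""
    return POSITIVE if "great" in text.lower() else ABSTAIN


def lf_contains_terrible(text):
    """Hypothetical heuristic: the word 'terrible' suggests a negative label."""
    return NEGATIVE if "terrible" in text.lower() else ABSTAIN


def lf_ends_with_exclamation(text):
    """Hypothetical, noisy heuristic: exclamation marks lean positive."""
    return POSITIVE if text.endswith("!") else ABSTAIN


LABELING_FUNCTIONS = [
    lf_contains_great,
    lf_contains_terrible,
    lf_ends_with_exclamation,
]


def majority_vote(text):
    """Combine labeling-function votes; ties and all-abstain yield ABSTAIN."""
    total = sum(lf(text) for lf in LABELING_FUNCTIONS)
    if total > 0:
        return POSITIVE
    if total < 0:
        return NEGATIVE
    return ABSTAIN


unlabeled = [
    "This movie was great!",
    "A terrible plot.",
    "It was fine.",
    "Terrible!",  # conflicting votes cancel out
]
for text in unlabeled:
    print(repr(text), "->", majority_vote(text))
```

Majority voting treats every heuristic as equally trustworthy, which is exactly the limitation that modeling the labeling process is meant to address.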

Research Paper: Data Programming: Creating Large Training Sets, Quickly

Authors / Contributors: Alexander Ratner, Christopher De Sa, Sen Wu, Daniel Selsam, Christopher Ré (all authors from Stanford University)

Date Published / Last Updated: January 8, 2017


4. How to Use Semi-supervised Knowledge Transfer to Handle Personally Identifiable Information (PII)


From researchers at Google and Pennsylvania State University, this paper introduces an approach to dealing with sensitive data such as medical histories and private user information. This approach, known as Private Aggregation of Teacher Ensembles (PATE), can be applied to any model and was able to achieve state-of-the-art privacy/utility trade-offs on the MNIST and SVHN datasets.
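The aggregation step at the heart of PATE can be sketched roughly as follows: several teacher models, each trained on a disjoint slice of the sensitive data, vote on a label, and Laplace noise is added to the vote counts before the winning class is passed to the student. The snippet below is a simplified illustration under stated assumptions, not the authors' implementation; the function names, the `gamma` value, and the vote counts are hypothetical, and the privacy accounting is omitted entirely.

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def noisy_aggregate(teacher_votes, num_classes, gamma, rng):
    """Return the class with the highest noisy vote count.

    teacher_votes: one predicted class index per teacher model.
    gamma: privacy parameter; noise scale is 1 / gamma, so a smaller
           gamma means more noise (assumed value, chosen for illustration).
    """
    counts = Counter(teacher_votes)
    noisy_counts = [
        counts.get(c, 0) + laplace_noise(1.0 / gamma, rng)
        for c in range(num_classes)
    ]
    return max(range(num_classes), key=lambda c: noisy_counts[c])


# Hypothetical votes from 250 teachers on one unlabeled student query.
rng = random.Random(0)
votes = [3] * 240 + [1] * 10
student_label = noisy_aggregate(votes, num_classes=10, gamma=1.0, rng=rng)
print("label released to the student:", student_label)
```

When the teachers agree strongly, the noise rarely changes the outcome; when the vote is close, the noise hides how any individual teacher (and therefore any individual training record) influenced the result.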

However, as Data Scientist Alejandro Aristizabal states in his article, one major issue with PATE is that the framework requires the student model to share its data with the teacher models. In this process, privacy is not guaranteed. Therefore, Aristizabal proposes an additional step that adds encryption to the student model’s dataset. You can read about this process in his article, Making PATE Bidirectionally Private, but please make sure you read the original research paper first.

Research Paper: Semi-Supervised Knowledge Transfer for Deep Learning From Private Training Data

Authors / Contributors: Nicolas Papernot (Pennsylvania State University), Martín Abadi (Google Brain), Úlfar Erlingsson (Google), Ian Goodfellow (Google Brain), Kunal Talwar (Google Brain)

Date Published / Last Updated: March 3, 2017


5. Advanced Data Augmentation for Semi-supervised Learning and Transfer Learning


Getting access to labeled training data is one of the largest problems facing data scientists today, because most deep learning models need large amounts of labeled data to reach a high degree of accuracy. To help address this, researchers from Google and Carnegie Mellon University have developed a framework for training models on substantially less labeled data.

The team proposes using advanced data augmentation methods to add noise to the unlabeled samples used in semi-supervised learning, training the model to make consistent predictions on the original and augmented versions of each sample. The team reports that on the IMDb text classification dataset, their method outperformed the state-of-the-art model trained on 25,000 labeled examples while using only 20 labeled examples. Furthermore, on the CIFAR-10 benchmark, their method outperformed all previous approaches.
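The consistency idea can be sketched in a few lines: the model's prediction on an unlabeled sample is compared with its prediction on an augmented copy, and the gap between the two distributions is penalized alongside the usual supervised loss. The code below is a minimal, framework-free illustration; the function names, logits, and weighting value are hypothetical, and a real setup would compute these quantities inside a deep learning library with the original prediction held fixed during backpropagation.

```python
import math


def softmax(logits):
    """Convert raw scores to a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def consistency_loss(logits_original, logits_augmented):
    """Penalize disagreement between predictions on x and on augmented x."""
    target = softmax(logits_original)    # treated as a fixed target
    predicted = softmax(logits_augmented)
    return kl_divergence(target, predicted)


def combined_objective(supervised_loss, consistency_losses, weight=1.0):
    """Supervised loss plus the weighted mean consistency loss."""
    mean_consistency = sum(consistency_losses) / len(consistency_losses)
    return supervised_loss + weight * mean_consistency


# Hypothetical logits for one unlabeled sample and its augmented copy.
loss_same = consistency_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
loss_diff = consistency_loss([2.0, 0.0], [0.0, 2.0])
print("identical predictions:", loss_same)
print("conflicting predictions:", loss_diff)
print("total:", combined_objective(0.5, [loss_same, loss_diff], weight=1.0))
```

Because the consistency term needs no labels, it can be computed on as much unlabeled data as is available, which is what lets the labeled set stay small.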

Research Paper: Unsupervised Data Augmentation for Consistency Training

Authors / Contributors: Qizhe Xie (1,2), Zihang Dai (1,2), Eduard Hovy (2), Minh-Thang Luong (1), Quoc V. Le (1) (1 - Google Research, Brain Team, 2 - Carnegie Mellon University)

Date Published / Last Updated: September 30, 2019