7 Ways to Get High-Quality Labeled Training Data at Low Cost
Labeled training data is essential for supervised machine learning, but acquiring it is neither simple nor cheap. We review 7 approaches, including repurposing existing data and labels, harvesting free sources, retraining models on progressively higher-quality data, and more.
Data scientists know that an untrained statistical model is next to useless. Without high-quality labeled training data, supervised learning falls apart and there is no way to ensure that models can predict, classify, or otherwise analyze the phenomenon of interest with any accuracy.
When you’re doing supervised learning, it’s best not to develop a model if there’s no possibility of finding the right training data. Even if you’ve found a suitable training dataset, it’s not good for much if its entries haven’t been labeled, tagged, or annotated to train your machine-learning algorithm effectively.
However, labeling is a thankless job that few data scientists will do for any reason other than brute necessity. In the prestige pecking order of data science jobs, labeling training data is near the bottom. Labeling has acquired the (perhaps unfair) reputation as the low-skilled “blue collar” job in the data science ecosystem. Or, as depicted in this hilarious episode from the latest season of HBO’s series “Silicon Valley,” the labeling of training data is a chore that an unscrupulous data scientist might try to bamboozle unwitting young college students into doing for no compensation.
All of this suggests, unfairly, that data scientists can’t acquire acceptable training data unless they outsource the labeling function to the high-tech equivalent of a sweatshop. This is an unfortunate perception because, as I noted last year in this KDnuggets column "Pattern Curators of the Cognitive Era", labeling may rely on the judgments of highly skilled subject matter experts (e.g., oncologists assessing whether biopsies indicate cancerous tissue) just as often as it leans on mundane assessments that any of us could perform (e.g., the fictional “hot dog/not hot dog” example alluded to above).
Sweatshopping is far from the only approach for acquiring and labeling training data, as noted in this recent Medium post. As author Rasmus Rothe notes, there are other approaches that will produce labeled training data at a cost that won’t necessarily bust your data-science budget. What follows is my summary of these approaches:
- Repurposing existing training data and labels: This may be the cheapest, easiest, and fastest approach for training, if we assume that the new learning task’s domain is sufficiently similar to the domain of the original task. When taking this approach, “transfer learning” tools and techniques may help you determine which elements of the source training dataset are repurposable to the new modeling domain.
- Harvest your own training data and labels from free sources: The Web, social media, and other online sources are brimming with data that can be harvested if you have the right tools. In this era of cognitive computing, you can in fact acquire rich streams of natural language, social sentiment, and other training data from the various sources that I highlighted in this Dataversity column from late last year. If you have access to a data crawler, this might be a good option for acquiring training datasets--as well as the associated labels--from source content and metadata. Clearly, you’ll need to grapple with a wide range of issues related to data ownership, data quality, semantics, sampling, and so forth when trying to assess the suitability of crawled data for model training.
- Explore pre-labeled public datasets: There is a wealth of free data available in open-source communities and even from various commercial providers. Data scientists should identify which if any of this data might be suitable at least for the initial training of their models. Ideally, the free dataset should have been pre-labeled in a way that is useful for your learning task. If it hasn’t been pre-labeled, you will need to figure out the most cost-effective way of doing so.
- Retrain models on progressively higher-quality labeled datasets: Your own data resources may be insufficient for training your models. To bootstrap training, you might pretrain with free public data that is roughly related to your domain. If the free datasets include acceptable labels, all the better. You might then retrain the model on smaller, higher-quality labeled datasets that are directly related to the learning task you’re attempting to address. As you progressively retrain your model on higher-quality datasets, the findings might allow you to fine-tune the feature engineering, classes, and hyperparameters in your model. This iterative process might also suggest other, higher-quality datasets you should acquire and/or higher-quality labeling that should be done in future training rounds in order to refine your model even further. Bear in mind, though, that these iterative refinements might require progressively pricier training datasets and labeling services.
- Leverage crowdsourced labeling services: You might not have enough internal staff to label your training data. Or your staff might be unavailable or too expensive to use for labeling. Or your in-house resources might be insufficient to label a huge amount of training data rapidly enough. Under those circumstances, and budget permitting, you might crowdsource labeling chores to commercial services such as Amazon Mechanical Turk or CrowdFlower. Outsourcing the task of labeling to crowd-oriented environments can be far more scalable than doing it internally, though you give up some control over the quality and consistency of the resultant labels. On the positive side, these services tend to use high-quality labeling tools that make the process faster, more precise, and more efficient than you may be able to manage with in-house processes.
- Embed labeling tasks in online apps: Human cognition is a boundless resource on the Internet if you’re clever enough to leverage it for labeling tasks. For example, embedding training data in CAPTCHA challenges, which are commonly used to verify that a user is human, is a popular approach for training image and text recognition models. In a similar vein, you might consider presenting training data in gamified apps that give users incentives to identify, classify, or otherwise comment on images, text, objects, and other presented entities.
- Rely on third-party models that have been pretrained on labeled data: Many learning tasks have already been addressed by good-enough models trained with good-enough datasets, which, presumably, were adequately labeled before training. Pretrained models are available from various sources, including academic researchers, commercial vendors, and open-source data-science communities. Bear in mind that the utility of these models will decline as your domain, feature set, and learning task drift further from the source over time.
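To make the "harvest your own training data and labels" approach more concrete, here is a minimal sketch of deriving labels from source metadata rather than assigning them by hand. The hashtag-to-label mapping, the posts, and the `weak_label` function are all hypothetical, purely for illustration:

```python
# Sketch: inferring weak labels from metadata (hashtags) harvested alongside the data.
# The mapping and example posts below are hypothetical.
HASHTAG_LABELS = {
    "#happy": "positive",
    "#love": "positive",
    "#angry": "negative",
    "#fail": "negative",
}

def weak_label(post):
    """Return a class label inferred from hashtags, or None if ambiguous/unlabeled."""
    found = {label for tag, label in HASHTAG_LABELS.items() if tag in post.lower()}
    # Keep only posts whose hashtags agree on a single label.
    return found.pop() if len(found) == 1 else None

posts = [
    "What a great day #happy",
    "Flight delayed again #angry #fail",
    "Mixed feelings #happy #angry",   # conflicting tags -> discarded
    "No tags here",                   # no usable metadata -> discarded
]
dataset = [(p, weak_label(p)) for p in posts if weak_label(p)]
```

Exactly as the article warns, such harvested labels raise quality and sampling issues: conflicting or missing metadata forces you to discard data, and the surviving labels are only as trustworthy as the metadata conventions behind them.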
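The progressive-retraining approach can also be sketched in code: pretrain on a large, noisily labeled "free" dataset, then fine-tune the same weights on a small, high-quality labeled set. The tiny pure-Python logistic regression below is illustrative only, not a production training loop:

```python
# Sketch: two-stage training -- pretrain on large noisy data, fine-tune on clean data.
import math
import random

def train(X, y, w, b, lr=0.1, epochs=200):
    """Gradient-descent training of a logistic regression; returns updated (w, b)."""
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # sigmoid
            err = p - yi
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def predict(x, w, b):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else 0

random.seed(0)

# Stage 1: a big "free" dataset whose labels are only roughly right (~10% flipped).
big_X = [[random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(400)]
big_y = [int(x[0] + x[1] > 0) for x in big_X]
big_y = [1 - y if random.random() < 0.1 else y for y in big_y]
w, b = train(big_X, big_y, w=[0.0, 0.0], b=0.0)

# Stage 2: a small, carefully labeled dataset refines the pretrained weights.
small_X = [[0.9, 0.8], [-0.9, -0.7], [0.6, 0.9], [-0.8, -0.6]]
small_y = [1, 0, 1, 0]
w, b = train(small_X, small_y, w, b, lr=0.05, epochs=100)
```

The key point is that stage 2 starts from the stage-1 weights rather than from scratch, which is why a handful of high-quality labels can refine a model bootstrapped on cheap, noisy data.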
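Finally, the quality-control concern raised under crowdsourced labeling is usually handled by collecting several judgments per item and taking a consensus. A minimal majority-vote sketch, with hypothetical worker judgments and an assumed agreement threshold:

```python
# Sketch: aggregating redundant crowd labels by majority vote (illustrative only).
from collections import Counter

def majority_label(judgments, min_agreement=0.6):
    """Return (label, agreement) if consensus meets the threshold, else (None, agreement)."""
    counts = Counter(judgments)
    label, votes = counts.most_common(1)[0]
    agreement = votes / len(judgments)
    return (label, agreement) if agreement >= min_agreement else (None, agreement)

# Three hypothetical workers labeled each image.
crowd = {
    "img_001": ["hot dog", "hot dog", "not hot dog"],
    "img_002": ["not hot dog", "not hot dog", "not hot dog"],
    "img_003": ["hot dog", "not hot dog", "burger"],  # no consensus -> send for relabeling
}
consensus = {item: majority_label(votes) for item, votes in crowd.items()}
```

Items that fail to reach consensus (like `img_003` above) are the ones worth routing to more workers or to an in-house expert, which is where the cost/quality trade-off the article describes actually gets made.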
Keeping models fit for purpose depends intimately on the availability of training data, the need for frequent retraining, the availability of labeling resources, and so on. Clearly, there is no one approach that fits all requirements for acquiring and labeling training datasets.
The complex decisions that data scientists must make in this regard introduce risks and fragility into the lifecycle of a supervised learning application. As I noted in this recent Wikibon blog, how you choose to train your algorithms introduces an ongoing maintenance burden to whatever downstream applications consume your analytical model’s outputs.