Crowdsourcing vs. Managed Teams: A Study in Data Labeling Quality

You need data labeled for ML. You can do it in-house, crowdsource it, or hire a managed service. If data quality matters, read this.

Sponsored Post.
By CloudFactory

Download the full report for the methodology, details about cost, and study results.

If you’re like most dev teams, you’re doing data labeling work in-house, and it’s the bulk of the work. Cognilytica found 80% of AI project time is spent on aggregating, cleaning, labeling, and augmenting data to be used in ML models. That leaves just 20% for the activities that drive strategic value: algorithm development, model training and tuning, and ML operationalization. It’s hard to innovate and accelerate deployments when you’re spending so much time on tasks that can be effectively offloaded.

You can flip that dynamic by deploying people strategically in a virtual data production line, but as with any well-executed strategy, there are important tradeoffs to consider. Depending on the question you want your data to answer, you could use crowdsourcing or a managed service. Each workforce option comes with advantages and disadvantages. Data science platform developer Hivemind designed a study to understand these dynamics in more detail.

Anonymous Crowdsourcing vs. Managed Team
Hivemind hired a managed workforce and a leading crowdsourcing platform’s anonymous workers to complete the same series of data labeling tasks, ranging from basic to more complicated, to determine which team delivered the higher-quality structured datasets and at what relative cost.


Key Takeaways from the Study

Task A: Easy Transcription
Workers were asked to open a PDF, locate three trade numbers, and transcribe them. The crowdsourced workforce transcribed at least one of the numbers incorrectly in 7% of cases. When compensation was doubled, this error rate fell to just under 5%. The managed workers made a mistake in only 0.4% of cases.
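To put those Task A percentages in concrete terms, here is a minimal Python sketch of the arithmetic. The error rates come from the study as summarized above. The 10,000-document batch size is a hypothetical figure, and "just under 5%" is rounded up to 0.05, so treat the output as a rough illustration rather than a study result.

```python
# Task A error rates as reported above: share of cases in which at least
# one of the three trade numbers was transcribed incorrectly.
# Assumption: "just under 5%" is rounded up to 0.05.
ERROR_RATES = {
    "crowdsourced": 0.07,
    "crowdsourced_double_pay": 0.05,
    "managed": 0.004,
}


def expected_errors(rate: float, documents: int) -> float:
    """Expected number of documents with at least one transcription error."""
    return rate * documents


def relative_to_managed(rate: float) -> float:
    """How many times higher a rate is than the managed team's rate."""
    return rate / ERROR_RATES["managed"]


if __name__ == "__main__":
    batch = 10_000  # hypothetical batch size, not a figure from the study
    for name, rate in ERROR_RATES.items():
        print(
            f"{name}: ~{expected_errors(rate, batch):.0f} flawed documents, "
            f"{relative_to_managed(rate):.1f}x the managed rate"
        )
```

Under these assumptions, the crowdsourced rate works out to 17.5 times the managed rate at standard pay and about 12.5 times at double pay.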

Task B: Sentiment Analysis
Workers were presented with the text of a company review from a review website and asked to rate the sentiment of the review from one to five. Actual ratings (ground truth) were removed. Managed workers had consistent accuracy, getting the rating correct in about 50% of cases. The crowdsourced workers struggled, particularly with poor reviews. For 1- and 2-star reviews, their accuracy was almost 20%, essentially the same as guessing at random among the five possible ratings. For 4- and 5-star reviews, there was little difference between the workforce types.
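A per-rating accuracy breakdown like the one described for Task B can be sketched in a few lines of Python. This is an illustrative sketch, not Hivemind's actual analysis: the function name `accuracy_by_rating` and the toy data are hypothetical. The only figure taken from the text is the five-point scale, which gives a random-guess baseline of 1/5, or 20%.

```python
from collections import defaultdict

# Uniform random guessing over a five-point scale is right 1 time in 5.
CHANCE_BASELINE = 1 / 5


def accuracy_by_rating(truth, labels):
    """Return {true_rating: share of worker labels that match it exactly}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for true_rating, worker_label in zip(truth, labels):
        total[true_rating] += 1
        correct[true_rating] += int(true_rating == worker_label)
    return {rating: correct[rating] / total[rating] for rating in sorted(total)}


if __name__ == "__main__":
    # Hypothetical toy data for illustration only, not data from the study.
    truth = [1, 1, 2, 2, 4, 5, 5, 5]
    labels = [2, 1, 3, 3, 4, 5, 4, 5]
    for rating, acc in accuracy_by_rating(truth, labels).items():
        flag = "at or below chance" if acc <= CHANCE_BASELINE else "above chance"
        print(f"{rating}-star reviews: {acc:.0%} correct ({flag})")
```

Splitting accuracy by the true rating, rather than reporting one overall number, is what exposes a weakness concentrated in one part of the scale, such as the poor reviews mentioned above.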

Task C: Extracting Information from Unstructured Text
Workers were given a title and description of a product recall and asked to classify the recall by hazard type using a drop-down menu of 11 options, including “other” and “not enough information provided.” The crowdsourced workers achieved accuracy of 50% to 60%, regardless of the recall’s word count. The managed workers achieved higher accuracy, 75% to 85%.
