Pattern Curators of the Cognitive Era

Machine learning has a critical dependency on human learning: not just on data scientists, but on the legions of individuals who prepare training data to guide algorithms.

Machine learning has a critical dependency on human learning.

I'm not referring to data scientists, a class of learned humans who play an undeniably pivotal role in this new era. What I'm referring to are the legions of individuals who prepare training data to guide algorithms in their search for patterns of interest. Once the target patterns have been tagged and flagged by humans in the know, machine learning and other artificial intelligence (AI) algorithms can work their magic.

Does it make sense to demean this job category that’s essential to the cognitive era? That’s how I interpreted the thrust of this recent article, “Is 'data labeling' the new blue-collar job of the AI era?” As the title indicates, author Hope Reese builds a case for this category of workers (she calls them “data labelers,” but I prefer to think of them as “pattern curators”) as low-skilled drones locked into a 24x7 industrial process that feeds labeled training data into the algorithmic maw of the new economy. As she notes, the fact that more of this function is being outsourced to crowdsourcing environments would seem to indicate that it’s a dead-end minimum-wage job (if that). And that would imply that it can be safely offshored to sweatshops across the developing world.

But is that a fair characterization of this trend? After all, we’re talking about one of many processes in the discovery, refinement, preparation, modeling, and exploration lifecycle of data-driven AI projects. The preponderance of unstructured data sources in these data-science projects places a premium on a combination of automated and manual resources for discovering, preparing, tagging, and contextualizing it all. Data can acquire structure at many points in the pipeline, from the moment it’s created all the way through data acquisition, integration, transformation, preparation, modeling, analysis, query, visualization, and usage. These functions involve people of many skill levels exercising judgment, engaging in creative collaborations, and doing other high-value work that, even as it’s automated, can’t neatly be reduced to brainless industrial-scale drudgery.

In that regard, Reese cites Guru Banavar, a member of IBM’s Watson team, as stating that this new order involves people of diverse skills. He refers to the labeling of data sets as a type of curation in which human specialists play an indispensable role in organizing data for machines to ingest.

Data-driven pattern curation is essential to the dominant machine-learning approach called "supervised learning," in which humans label data examples so that adaptive classifier algorithms can process fresh data more accurately and efficiently. This approach enables machine-learning models to sense patterns without being explicitly programmed, but it grinds to a dead halt without training data in which the patterns of interest have been curated. In other words, it requires that the training data be manually reviewed and tagged by people whose judgments may range from the seemingly imbecilic (e.g., “does that image show a cat?”) to the exquisitely sensitive (e.g., “does that image show a brain tumor?”).
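To make that dependency concrete, here is a minimal sketch in Python of the supervised-learning loop described above. Everything in it is an assumption made for illustration: the two-number feature vectors, the "cat" and "not_cat" labels, and the nearest-centroid classifier, which simply stands in for whatever adaptive algorithm a real project would use. The point is only that the human-assigned labels are the sole source of what the model learns.

```python
from collections import defaultdict
from math import dist

# Hypothetical training data: each example is (features, label), where the
# label is the tag a human curator assigned. The features are made-up
# two-number measurements chosen only for illustration.
labeled_examples = [
    ((1.0, 1.2), "cat"),
    ((0.8, 1.0), "cat"),
    ((1.1, 0.9), "cat"),
    ((3.0, 3.2), "not_cat"),
    ((3.3, 2.9), "not_cat"),
    ((2.9, 3.1), "not_cat"),
]

def train(examples):
    """Compute one centroid (mean feature vector) per human-assigned label."""
    grouped = defaultdict(list)
    for features, label in examples:
        grouped[label].append(features)
    return {
        label: tuple(sum(column) / len(rows) for column in zip(*rows))
        for label, rows in grouped.items()
    }

def predict(centroids, features):
    """Assign fresh data the label whose centroid is nearest."""
    return min(centroids, key=lambda label: dist(centroids[label], features))

centroids = train(labeled_examples)
prediction_near_cats = predict(centroids, (0.9, 1.1))
prediction_far_from_cats = predict(centroids, (3.1, 3.0))
```

If the curated examples are removed, `train([])` returns an empty dictionary and `predict` has no label to choose from, which is the "dead halt" in miniature.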

Yes, of course, a fair amount of higher-order pattern curation is being automated through algorithmic tools, as I discussed in this recent post, in reference to its role in unsupervised learning. But it takes skilled personnel to model, build, tune, and maintain machine-learning algorithms of this or any other sort. And you best believe that these functions aren’t in danger of being offshored to sweatshops in the foreseeable future. In fact, anywhere in the world that people engage in high-volume data-pattern curation, you’re likely to find the entire range of data-science jobs, including those that demand highly educated talent.

In this blog from December 2014, I discussed this in the broader context of functions that depend on subject matter experts adept in the discovery, review, refinement, analysis, categorization, tagging, contextualization, and recommendation of data that might be relevant to downstream uses. One such use might be for training algorithms that can detect patterns of customer sentiment or fraud as well as most humans.

For these reasons, I think it’s unfair to characterize data labeling (or, rather, data-driven pattern curation) as a low-skilled dead-end job. In fact, my sense is that, in the majority of data science projects, it is performed, along with the bulk of the other data engineering tasks, by the statistical modelers or domain experts who are driving those initiatives.

In high-value data-science initiatives, pattern curation is too reliant on human judgment to farm out to just anyone.