
How Data Labeling Facilitates AI Models


AI-based models are highly dependent on accurate, clean, well-labeled, and well-prepared data in order to produce the desired output and cognition. These models are fed with bulky datasets covering an array of probabilities and computations to make their functioning as smart and capable as human intelligence.



By Nandhini TS, X-tract.io

It was 5 minutes past 7:00 in the morning. I had been working on this very piece that you're reading for close to two hours, comfortably in my sweatpants. I was so absorbed in the thought process that it worked up my appetite. I headed straight to the kitchen to make some coffee and breakfast.

 

I poured some hot coffee and scooped the batter onto the griddle. As I stacked the pancakes one on top of the other on my plate, I was all set to gobble them up.

*Spoiler Alert* Plot twist ahead.

The first bite made my taste buds cringe and silenced my growling stomach. The salt overdose completely ruined my desire for those yummy pancakes.

I had generously spooned salt into the batter instead of sugar.

A simple recoverable human error, right?

I just had to do the process again with sugar. Nice.

But had I labeled my look-alike glass jars "sugar" and "salt", this could have been avoided. My failure to tell one ingredient from another, or rather my overlooking the difference, ended up becoming the premise of this intro.

We humans have powerful senses. We have the ability to see, comprehend, analyze, react, interpret, and judge. The mistakes we make are either due to negligence or lack of awareness.

But, not the case for machines or rather, AI-based models.

Why Data Labeling is the trigger in the gun for AI-based models


To substantiate the power of data labeling, I’m going to now walk you through some interesting scenarios.

Let's say an AI model has to identify a person's age group or age range by scanning a digital photo. A lot of information has to be fed into the model, i.e., multi-layered deep learning and neural-network approaches supported by accurate, annotated data, to achieve reasonable results.

Now, the model might need images of millions of young, middle-aged, and old people, in addition to data about what wrinkled skin, youthful and glowing skin, and coarse skin look like. The model will be able to distinguish between skin types, textures, face structures, and tons of other factors and map them to an age range. This is possible only if these data points are labeled/annotated appropriately.
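To make the idea concrete, here is a minimal sketch of what the annotation step might look like before any training happens. The bucket boundaries, helper name, and filenames are hypothetical, chosen only for illustration:

```python
# Hypothetical age-range classes used as labels for face images.
AGE_RANGES = ["0-18", "19-40", "41-60", "61+"]

def age_to_label(age: int) -> str:
    """Bucket a numeric age into one of the coarse age-range classes."""
    if age <= 18:
        return "0-18"
    if age <= 40:
        return "19-40"
    if age <= 60:
        return "41-60"
    return "61+"

# Each annotated example pairs an image file with its age-range label;
# the model only ever sees the label, never the raw age.
labeled_dataset = [
    ("face_001.jpg", age_to_label(25)),
    ("face_002.jpg", age_to_label(67)),
]
```

The point of the sketch is that the label, not the photo alone, is what teaches the model which visual features go with which age range.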


Data labeling basically tells the AI model how to classify a dataset and what result to assign to it. It is considered the core of data preparation, and it is what gives life to your AI models.

The recent sensation FaceApp generates highly realistic transformations of faces in photographs using neural networks based on artificial intelligence. The app can make a face look younger or older, change its gender, or even add a smile.

I recently came across an even more interesting app that can detect a person's age from their image (the example discussed above). Check My Age claims to detect your age and create awareness about the way you look and manage your appearance. Imagine the amount of data and image-labeling work that has gone into building an app capable of earning such reviews (the negative ones make it seem less creepy).


The twin trouble: Two categories with the same name

When you're training a model to identify, say, breeds of birds, your AI data labeling process begins with semantically labeling the bird families by name and origin. Errors can creep into the labeling process: something as simple but impactful as corrupted data (files whose PNG or JPEG extensions don't match their contents) being labeled incorrectly, or an unbalanced number of examples across categories.
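Both of those failure modes can be caught with a quick audit pass before training. Below is a minimal sketch (stdlib only, hypothetical helper names): the first function checks a file's real format against its magic bytes, so a PNG masquerading as a .jpg gets flagged, and the second counts examples per category to surface imbalance:

```python
import collections

# Magic bytes for two common image formats. A file whose extension says
# .jpg but whose header says PNG is exactly the kind of silent corruption
# that slips into a labeled dataset.
SIGNATURES = {b"\x89PNG\r\n\x1a\n": ".png", b"\xff\xd8\xff": ".jpg"}

def detect_format(header: bytes):
    """Return the real extension implied by the file's first bytes,
    or None if the format is unrecognized."""
    for magic, ext in SIGNATURES.items():
        if header.startswith(magic):
            return ext
    return None

def class_counts(labels):
    """Count examples per category to spot imbalance across classes."""
    return collections.Counter(labels)
```

Running `detect_format` over the first few bytes of each file and comparing the result to the declared extension is a cheap way to quarantine corrupted examples before they ever reach the labeling queue.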

To explain the twin trouble scenario, let's take the example of a kite (the bird) and a kite (the flying toy).


Visualizing clusters helps in understanding how networks interpret training data. Image classification networks usually have a penultimate layer, just before the softmax unit, whose activations can be used as an embedding. These embeddings have spatial properties, and clustering the embedding vectors reveals how the network groups the examples.

Now, an image classification model trained to identify birds (kites, in our case) would fail terribly and show high error rates. To investigate the errors, clustering visualization is used. When the clusters contain results for kite (the toy) instead of kite (the bird), it becomes obvious that those examples were labeled incorrectly.

To make the training model produce accurate results, the labeling has to be revisited by removing all the toy-kite images from the kite (bird) category.
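One simple way to automate that revisit is to measure how far each example's embedding sits from the centroid of its labeled class: toy kites filed under the bird category tend to land far from the bird cluster. The sketch below (hypothetical function name and threshold, assuming NumPy) flags such outliers for human review rather than deleting them automatically:

```python
import numpy as np

def flag_suspect_labels(embeddings, labels, threshold=2.0):
    """Return indices of examples whose embedding lies unusually far
    from the centroid of their labeled class, i.e. likely mislabels
    (such as a toy kite filed under the bird 'kite')."""
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    suspects = []
    for cls in np.unique(labels):
        idx = np.where(labels == cls)[0]
        centroid = embeddings[idx].mean(axis=0)
        dists = np.linalg.norm(embeddings[idx] - centroid, axis=1)
        # Anything beyond mean + threshold * std is flagged for review.
        cutoff = dists.mean() + threshold * dists.std()
        suspects.extend(idx[dists > cutoff].tolist())
    return suspects
```

With real data, the embeddings would come from the network's penultimate layer; the flagged indices then go back to annotators, who decide whether each example was mislabeled.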

The White space Conundrum

There are a lot of issues associated with images (foggy, blurred, or pixelated inputs, among others) that interrupt the robustness of your training model, and annotated data should adhere to standard image or text classification benchmarks. For text, the equivalent problems are whitespace and punctuation.

A simple example is the use of quotes while labeling datasets. Say one annotator or text-labeling tool labels “pancakes”, another ”pancakes”, and another “pancakes“.

These three examples seem to be the same except for the quotes. If you observe closely, the first one has an opening and a closing quote, while the second has two closing quotes and the last has two opening quotes.

This looks like a negligible error, but it can have a huge impact on your training models during interpretation. So not just data annotation, but accurate AI data annotation bundled with quality checks, preparation, and enrichment ensures the robustness of your model.

What’s the ingredient of your model: Salt? Sugar?

Sometimes we're so engulfed in the thoughts of our subconscious mind that we overlook the facts in front of us. I failed to notice that my ingredient (salt) had smaller grains than sugar. I needed help telling them apart, and a neat white label with bold black text stuck on each glass jar now assures me of yummy sweet pancakes. This was a small human error that ruined breakfast, and the remedy was simply to redo the batch with the right ingredient.

But if your AI models fail due to missing or erroneous labeling, the impact is catastrophic and almost irreversible, owing to the volume and complexity of the datasets used to train the model. This brings us to how to do data labeling for machine learning the right way, so that it doesn't sabotage your inventions.

Data Labeling is a labor-intensive task

Organizations need to deal with the human-dominated labor involved in data labeling, which Cognilytica has identified as taking up to 25% of total machine learning project time and cost.

Human-in-loop automation to the rescue

As much as human intervention is required in the data preparation and labeling processes, automated data labeling helps speed things up. Several data preparation and labeling tools are available in the market, such as X-tract.io, CloudFactory, and Figure Eight, that combine a skilled workforce with automation to lend you a helping hand in data annotation and labeling.

 
Bio: Nandhini TS is a product marketing associate at X-tract.io – a data solutions company. She enjoys writing about the power and influence of data for successful business operations. In her time off, she has her nose buried in growing her side hustles and binge-watching dinosaur documentaries.

Original. Reposted with permission.
