The Rise of User-Generated Data Labeling
Let’s say your project is humongous and needs data labeling done continuously, even while you’re on the go, sleeping, or eating. That is when you’d appreciate user-generated data labeling. Here are 6 interesting examples to help you understand it.
By Nandhini TS, X-tract.io
A cheetah uses supervised learning techniques to catch its prey. That may sound like a bizarre, out-of-the-blue statement, but think about it. A cheetah develops a very refined approach to hunting by honing its skills through practice, observation, experience, and computation.
That is much like training an AI model on datasets: models are trained and taught continuously until they can operate on their own. The cheetah goes through a similar process until it can anticipate the escape tactics of various prey and modulate its speed for rapid turns, rather than relying on agility and speed alone. Cognition is achieved through immense training, and at the core of this process is data labeling.
Data labeling is an essential prerequisite that lets your machine learning algorithms “learn” from labeled input. There are several ways to do it: self-managed in-house labor, outsourcing to individuals or companies, third-party managed labeling providers, and more.
But let’s say your project is humongous and needs data labeling done continuously, while you’re on the go, sleeping, or eating. That’s when you need it done for free. Of course, labeling can be outsourced, but once you consider the cost, the range of cases covered, and the accuracy achieved, I’m sure you’d appreciate user-generated data labeling.
I’ve got 6 interesting examples to help you understand this. Let’s dive right in!
1. Netflix annotates thumbnail images, did you know?
A simple application of data science on platforms like Netflix is, of course, the way their recommendation engines work with implicit data. Say user “A” binge-watched a show such as “Jane the Virgin” (all seasons in 4 days). The implicit data is that the user liked the show, because they obviously sacrificed a lot of sleep to watch it. Behavioral data like this, combined with thousands of other data points, is the basis on which Netflix’s machine learning algorithms work.
Todd Yellin, Netflix’s vice president of product innovation, describes the data they consider: “What we see from those profiles is the following kinds of data – what people watch, what they watch after, what they watch before, what they watched a year ago, what they’ve watched recently and what time of day.”
So, if you watched Jane the Virgin, Netflix’s ML algorithm is likely to consider what other people who watched the show watched next, and to analyze trends in community preferences: whether viewers like strong female leads, or appreciate comedy, corrupt cops, mysterious murders, and more.
Having classified and recommended shows and movies based on similar interests, Netflix goes a step further (to improve click-through rates) with a concept called thumbnail personalization. The thumbnails are essentially images that Netflix selects and annotates from the video frames of a movie or show.
So where is the user-generated data labeling here? Exactly where Netflix collects different thumbnail images and annotates them based on the user’s past behavior, preferred genres, filters and lighting, favorite stars, and more. The resulting recommendation is unique to every user, draws on thousands of similar interests, and has helped improve click-through rates. It’s brilliant how Netflix quietly collects data from users and uses it effectively to improve the experience.
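To make the idea of implicit data concrete, here is a minimal Python sketch of how viewing behavior could be turned into labels. It is only an illustration: the ViewingSession fields, the implicit_label function, and the 0.8 completion threshold are assumptions of mine, not anything Netflix has published.

```python
from dataclasses import dataclass


@dataclass
class ViewingSession:
    # Hypothetical record of one user watching one title.
    user_id: str
    title: str
    minutes_watched: float
    runtime_minutes: float


def implicit_label(session, liked_threshold=0.8):
    """Return 1 ("liked") if the user watched most of the title, else 0.

    The threshold is an assumed value chosen only for illustration.
    """
    if session.runtime_minutes <= 0:
        raise ValueError("runtime_minutes must be positive")
    completion = min(session.minutes_watched / session.runtime_minutes, 1.0)
    return 1 if completion >= liked_threshold else 0


sessions = [
    ViewingSession("A", "Jane the Virgin", 42.0, 42.0),   # watched to the end
    ViewingSession("A", "Some Documentary", 5.0, 90.0),   # abandoned early
]

# Each tuple is a labeled training example that the user created just by watching.
labeled = [(s.user_id, s.title, implicit_label(s)) for s in sessions]
```

Nobody asked the user to rate anything here; the label falls out of the behavior itself.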
2. Time to play a game, heard of Quickdraw?
Quickdraw by Google is the largest doodling dataset, with a collection of 50 million drawings across 345 categories. It is a game where you draw and a neural network tries to guess what you are drawing. But how people picture an object is very subjective, which is why the neural network may not always be right.
Here’s the catch. To make the predictions accurate, the model learns as you play with it.
To help you better understand: I was asked to draw a moon. As embarrassed as I am about my drawing skills, I’m going to show you what I drew just to help you with this concept!
Well, the image clearly shows the neural network didn’t recognize it (I’m no Picasso, right?). Anyway, the game gets interesting when the user feels victorious because the neural network recognizes how they picture and draw the prompt. That is precisely why the concept succeeds. But to cope with millions of different possibilities, and with the skills of people like me who are only fit to play Pictionary by drawing in the air, the model relies on user-generated data.
Now my poor moon drawing gets fed in as training data labeled “moon” to help improve accuracy. Here’s a shot of what the neural network has recognized as the moon and how it learned to recognize it.
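Here is a rough Python sketch of why a drawing game produces labeled data for free: the prompt the player was given is the label, whether or not the model guessed correctly. The DoodleDataset class, its field names, and the stroke format are hypothetical and only illustrate the idea; this is not Google's actual pipeline.

```python
from dataclasses import dataclass, field


@dataclass
class DoodleDataset:
    # Hypothetical in-memory store of labeled drawings.
    samples: list = field(default_factory=list)

    def add_round(self, prompt, strokes, model_guess):
        """Store one game round. The prompt shown to the player is the label."""
        self.samples.append({
            "label": prompt,
            "strokes": strokes,
            "recognized": model_guess == prompt,
        })

    def hard_examples(self):
        """Drawings the model got wrong, which are the most useful for retraining."""
        return [s for s in self.samples if not s["recognized"]]


dataset = DoodleDataset()
# Strokes are assumed to be lists of (x, y) points; the values are made up.
dataset.add_round("moon", [[(0, 0), (3, 4), (6, 0)]], model_guess="banana")
dataset.add_round("moon", [[(0, 0), (2, 5), (0, 10)]], model_guess="moon")
```

In this sketch, my unrecognized moon would land in hard_examples(), still correctly labeled as "moon".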
3. Grammarly needs your help to let you know if you’re right
All writers are familiar with Grammarly. It is a tool that helps keep your writing free of errors in grammar, spelling, punctuation, and more. Grammarly is not only about grammatical errors; it also helps with plagiarism detection, poor word choices, slang, tone detection, etc. Grammarly uses a sophisticated artificial intelligence system, with a team of computational linguists and deep learning engineers behind the scenes. The algorithms learn the rules of writing and make suggestions by analyzing research corpora, which are a huge collection of labeled text for research and development.
So, we all know that the foundation of any AI model is continuous learning. Just as humans get smarter by enriching their knowledge, so do machines! But machines work on rules: if something violates a rule, they report an error. Likewise, Grammarly gets smarter with help from you, the user. Take the previous line you just read as an example. I thought I had made a mistake when I saw a red underline under “is” (see below).
Logically and grammatically, it might well be right to use “are”. But beyond grammar, writing needs to connect; most importantly, your sentences need to read well. In this example, I used “is” because the process goes on, hence the present continuous tense. However, Grammarly had other ideas. So, as a user, I ignored the suggestion, which helped Grammarly label this case to better fit the context and made it a little smarter.
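As a sketch of how an ignored suggestion can act as a label, the Python snippet below counts how often users accept or dismiss each kind of suggestion. The rule name, the two action strings, and both functions are hypothetical; Grammarly's real feedback mechanism is not described in this article.

```python
from collections import Counter, defaultdict

# Hypothetical feedback log: rule id -> counts of user actions.
feedback_log = defaultdict(Counter)


def record_feedback(log, rule_id, action):
    """Record one user reaction to a suggestion produced by a rule."""
    if action not in ("accepted", "dismissed"):
        raise ValueError(f"unknown action: {action!r}")
    log[rule_id][action] += 1


def acceptance_rate(log, rule_id):
    """Share of suggestions users accepted, or None if there is no feedback."""
    counts = log.get(rule_id, Counter())
    total = counts["accepted"] + counts["dismissed"]
    return counts["accepted"] / total if total else None


# Three users ignore the suggestion and one accepts it.
for action in ("dismissed", "dismissed", "accepted", "dismissed"):
    record_feedback(feedback_log, "subject_verb_agreement", action)

rate = acceptance_rate(feedback_log, "subject_verb_agreement")
```

A low acceptance rate for a rule in a given context is a hint that the suggestion is wrong there, which is exactly the label the user supplied by clicking "ignore".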
4. Some products from the Google Suite, they’re the best, right?
a. How Gmail helps you auto-complete
Google’s Smart Compose, of course, runs on artificial intelligence. More precisely, it combines a bag-of-words model with a recurrent neural network language model. It uses the subject line and previous emails as context; these are encoded as word embeddings, that is, converted into vectors. Here’s an example, with Grammarly detecting the tone as well!
b. Google Maps is getting smarter
Let’s say a non-native English speaker uses voice search in Google Maps to find a location, and the app returns incorrect results. When the user quickly presses the back button, the model learns that there has been an error, and that error becomes a training signal. The reason, as we know, is that pronunciation and accents vary across regions. These data help the ML model convert voice to text more accurately.
c. The least favorite image captcha
We have all faced endless sets of images in which Google asks us to pick those that match a requirement, in order to tell humans and robots apart. Because bots increasingly bypass this captcha, the challenges have gotten a little more complex. The user’s answers are fed to the machine learning model as training data, improving the accuracy of its image classification over time.
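One common way to turn many users' answers into a single trustworthy label is majority voting. The Python sketch below shows that general technique; the function name and both thresholds are assumptions for illustration, and I am not claiming this is how Google's captcha works internally.

```python
from collections import Counter


def aggregate_label(votes, min_votes=3, min_agreement=0.7):
    """Return the majority label if enough users agree, else None.

    votes: labels given by different users for the same image tile.
    min_votes and min_agreement are assumed, illustrative thresholds.
    """
    if len(votes) < min_votes:
        return None  # not enough answers yet to trust any label
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(votes) >= min_agreement else None


clear_case = aggregate_label(
    ["traffic light", "traffic light", "traffic light",
     "traffic light", "no traffic light"]
)
split_case = aggregate_label(["bus", "truck", "bus", "truck"])
too_few = aggregate_label(["bus", "bus"])
```

Only the clear case yields a label ("traffic light"); the split and sparse cases return None and would simply wait for more answers.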
5. Instagram can auto-detect and remove abusive comments
NLP helps a machine understand language the way a human does. On Instagram, a sentence made up of only neutral words could still be offensive, while a sentence filled with swear words could be the lyrics of a popular song. It’s complicated. But not for Instagram, which uses DeepText to automatically identify and remove offensive comments.
In a survey conducted by Ditch the Label, 42% of more than 10,000 UK youth between ages 12 and 25 reported Instagram was the platform where they were most bullied.
DeepText was trained by humans who identified and tagged what is offensive and what is not in different contexts, so that it could learn to understand offensive language. But there is still the risk of misclassifying something as offensive when it’s not.
Kevin Systrom (CEO of Instagram) says, “The whole idea of machine learning is that it’s far better about understanding those nuances than any algorithm has in the past, or than any single human being could.”
“And I think what we have to do is figure out how to get into those gray areas and judge the performance of this algorithm over time to see if it actually improves things. Because, by the way, if it causes trouble and it doesn’t work, we’ll scrap it and start over with something new.”
This element of risk is why Instagram recently introduced an AI-powered feature that asks users whether they are sure they want to post a comment. In tests, the prompt actually encouraged users to back off.
For an AI model to work this smartly (classifying what’s offensive and what’s not), it surely needs continuous training based on user contributions, don’t you think?
6. Smartbasket - The smartest launch by Bigbasket, indeed
Bigbasket, an online grocery store in India, leverages AI-based technologies to improve customer experience.
Subramanai, Head of Analytics at Bigbasket, said in an interview: “Given this context, our AI journey so far has focused primarily on ML. We are now at a stage where we are going deeper into AI and exploring related technologies such as IoT, Computer Vision and Advanced Analytics, to grow the business.”
AI is used in multiple ways at Bigbasket. One, for instance, is analyzing current traffic data and mapping it against time-to-deliver commitments. Another is Smartbasket, which is super interesting. Smartbasket was introduced to create a personalized shopping basket for each customer.
“Often 90% of the customer’s final order will be already part of Smartbasket. Our data shows that these customers now spend only half the time they normally would to place the order,” the Data Analytics head says.
Their ML algorithm is also getting smarter: customers now add 40% of the recommended products to their cart.
So, if we dig deeper into recommendation engines and the Smartbasket concept, the data collected from the customer (behavior analytics, preferences, purchase history, community behavior towards the same product, and thousands of other signals) goes in as training data for the recommendation engine. Click-through improves as the model gets smarter, with each round of training based on how users respond to each type of recommendation.
There are three types of recommender systems: collaborative filtering, content-based, and hybrid. Companies like Bigbasket and Netflix leverage hybrid recommendation, i.e., they consider a user’s unique behavior as well as the patterns of similar users. In a nutshell, here again, users play a huge role in making recommendation engines smarter!
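A hybrid recommender can be as simple as blending two scores per item. The Python sketch below uses a weighted average of a collaborative score (what similar users bought) and a content-based score (how well the item matches this user's own history). The product names, the scores, and the alpha weight are made up for illustration and do not describe Bigbasket's or Netflix's actual systems.

```python
def hybrid_scores(collab, content, alpha=0.5):
    """Blend two score dictionaries: alpha * collaborative + (1 - alpha) * content.

    Items missing from one source get a score of 0.0 from that source.
    """
    items = set(collab) | set(content)
    return {
        item: alpha * collab.get(item, 0.0) + (1 - alpha) * content.get(item, 0.0)
        for item in items
    }


def recommend(collab, content, alpha=0.5, k=2):
    """Return the top-k items by blended score (ties broken alphabetically)."""
    scores = hybrid_scores(collab, content, alpha)
    return sorted(scores, key=lambda item: (-scores[item], item))[:k]


# Hypothetical scores in [0, 1] for one shopper.
collab = {"milk": 0.9, "bread": 0.4, "eggs": 0.1}       # from similar shoppers
content = {"milk": 0.2, "bread": 0.8, "butter": 0.6}    # from the shopper's own history

top_items = recommend(collab, content, alpha=0.5, k=2)
```

With equal weights, bread (about 0.6) edges out milk (0.55), so top_items is ["bread", "milk"]. Every click or purchase the user makes afterwards becomes new training data for both score sources.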
Companies with large user bases use their customers to help label their data, and it fits the context so well that we as users don’t even realize it. For models like the recommendation engines of online stores such as Bigbasket and media services such as Netflix, the users’ contribution to labeling data and making the models smarter is not so apparent, because a lot of additional analytical work goes into the process.
But for something like Quickdraw, the labeling is outright simple: if the neural network fails, the data immediately gets fed back as training data. Such setups are common at huge companies that deal with enormous volumes of information in real time. For other AI projects where data labeling is still important, you can turn to good data annotation tools available in the market, such as X-tract.io, Cloudfactory, and Figure Eight.
Bio: Nandhini TS is a product marketing associate at X-tract.io – a data solutions company. She enjoys writing about the power and influence of data for successful business operations. In her time off, she has her nose buried in growing her side hustles and binge-watching dinosaur documentaries.