How to Organize Data Labeling for Machine Learning: Approaches and Tools

The main challenge for a data science team is to decide who will be responsible for labeling, estimate how much time it will take, and choose which tools to use.

Synthetic labeling

This approach entails generating data that imitates real data in terms of essential parameters set by a user. Synthetic data is produced by a generative model that is trained and validated on an original dataset.

Three common types of generative models are Generative Adversarial Networks (GANs), Autoregressive models (ARs), and Variational Autoencoders (VAEs).

Generative Adversarial Networks. GAN models use generative and discriminative networks in a zero-sum game framework. In this competition, a generative network produces data samples, and a discriminative network (trained on real data) tries to determine whether they are real (came from the true data distribution) or generated (came from the model distribution). The game continues until the generative model gets enough feedback to reproduce samples, such as images, that are indistinguishable from real ones.
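The zero-sum game is commonly written as a minimax objective. The article does not give a formula. The form below is the standard textbook formulation, in which D(x) is the discriminative network's estimate that sample x is real, G(z) is a sample the generative network produces from random noise z, and the two expectations are taken over the true data distribution and the noise distribution, respectively.

```latex
\min_G \max_D \; V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right]
```

The discriminative network tries to maximize V by scoring real samples high and generated samples low, while the generative network tries to minimize it by producing samples that the discriminator scores as real.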

Autoregressive models. AR models generate variables based on a linear combination of previous values of those variables. In the case of generating images, ARs create individual pixels based on the previously generated pixels placed above and to the left of them.
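As a rough illustration of the "linear combination of previous values" idea, here is a minimal sketch of a classic AR(p) process in Python. The function name generate_ar, its parameters, and the coefficient values are assumptions made for this example. In practice the coefficients would be fitted to real data, and image models replace the linear combination with a neural network.

```python
import random


def generate_ar(coeffs, initial, n_steps, noise_std=0.0, seed=0):
    """Extend `initial` with n_steps values from an AR(p) process.

    coeffs[0] weights the most recent value, coeffs[1] the one before it,
    and so on. `initial` must hold at least len(coeffs) starting values.
    """
    if not coeffs or len(initial) < len(coeffs):
        raise ValueError("need coefficients and at least len(coeffs) initial values")
    rng = random.Random(seed)
    series = list(initial)
    for _ in range(n_steps):
        # Most recent values first, so they line up with coeffs.
        recent = series[-len(coeffs):][::-1]
        value = sum(c * x for c, x in zip(coeffs, recent))
        if noise_std > 0:
            # Optional Gaussian noise makes each generated series different.
            value += rng.gauss(0.0, noise_std)
        series.append(value)
    return series


# Noise-free example: each new value is 0.5 * previous + 0.25 * the one before.
print(generate_ar([0.5, 0.25], [1.0, 2.0], n_steps=2))
```

With noise_std left at zero the output is fully determined by the starting values. Setting it above zero turns the function into a simple generator of new, similar-looking sequences.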

Variational Autoencoders. VAEs encode input data into a compressed latent representation and then decode samples drawn from that latent space to produce new data.

Synthetic data has multiple applications. It can be used for training neural networks, models often used for tasks such as object recognition. Such projects require specialists to prepare large datasets consisting of text, image, audio, or video files. The more complex the task, the larger the network and training dataset. When a huge amount of work must be completed in a short time, generating a labeled dataset is a reasonable decision.

For instance, data scientists working in fintech use a synthetic transactional dataset to test the efficiency of existing fraud detection systems and develop better ones. Also, generated healthcare datasets allow specialists to conduct research without compromising patient privacy.
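To make the fintech example more concrete, here is a toy sketch of a generator that emits transactions together with their labels. It is deliberately not a trained generative model. It is a hand-written sampler, and every field name, distribution, and parameter (fraud_rate, the amount and hour ranges) is an assumption chosen for illustration only.

```python
import random


def generate_transactions(n, fraud_rate=0.05, seed=42):
    """Return n synthetic transactions, each carrying its own label."""
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        is_fraud = rng.random() < fraud_rate
        if is_fraud:
            # Assumption: fraudulent transfers are larger and happen at night.
            amount = round(rng.lognormvariate(6.0, 1.0), 2)
            hour = rng.choice([0, 1, 2, 3, 4])
        else:
            amount = round(rng.lognormvariate(3.5, 1.0), 2)
            hour = rng.randint(7, 22)
        rows.append({"id": i, "amount": amount, "hour": hour, "label": int(is_fraud)})
    return rows


sample = generate_transactions(1000)
print(sum(row["label"] for row in sample), "of", len(sample), "rows are labeled as fraud")
```

Because the label is assigned before the features are sampled, every row is labeled for free. The same property carries over to real synthetic-data pipelines, where the generator knows what it produced.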

Advantages

Time and cost savings. This technique makes labeling faster and cheaper. Synthetic data can be quickly generated, customized for a specific task, and modified to improve a model and training itself.

The use of non-sensitive data. Data scientists don’t need to ask for permission to use such data.

Disadvantages

The need for high-performance computing. This approach requires high computational power for rendering and further model training. One option is to rent cloud servers on Amazon Web Services (AWS), Google Cloud Platform, Microsoft Azure, IBM Cloud, Oracle, or other platforms. Another is to get additional computational resources on decentralized platforms like SONM.

Data quality issues. Synthetic data may not fully resemble real historical data. So, a model trained with this data may require further improvement through training with real data as soon as it’s available.

Data programming

The approaches and tools described above require human participation. However, data scientists from the Snorkel project have developed a new approach to training data creation and management that eliminates the need for manual labeling.

Known as data programming, it entails writing labeling functions, which are scripts that programmatically label data. The developers acknowledge that the resulting labels can be less accurate than those created by manual labeling. However, a program-generated noisy dataset can be used for weak supervision of high-quality final models (such as those built in TensorFlow or other libraries).

A dataset obtained with labeling functions is used to train a generative model, which estimates how accurate each labeling function is and combines their outputs into probabilistic labels. These labels are then used to train a discriminative model.

So, a noisy dataset can be cleaned up with a generative model and used to train a discriminative model.
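The idea of labeling functions can be sketched in plain Python. The example below does not use the Snorkel library or its API. The function names, the spam-detection task, and the keyword rules are all hypothetical, and a simple majority vote stands in for the generative model that would normally weigh the functions by their estimated accuracy.

```python
from collections import Counter

SPAM, HAM, ABSTAIN = 1, 0, -1


def lf_contains_link(text):
    # Heuristic: messages with links are often spam.
    return SPAM if "http" in text.lower() else ABSTAIN


def lf_mentions_prize(text):
    lowered = text.lower()
    return SPAM if "prize" in lowered or "winner" in lowered else ABSTAIN


def lf_short_reply(text):
    # Heuristic: very short messages are usually ordinary replies.
    return HAM if len(text.split()) <= 4 else ABSTAIN


LABELING_FUNCTIONS = [lf_contains_link, lf_mentions_prize, lf_short_reply]


def weak_label(text):
    """Combine labeling-function votes by majority; ties and no votes abstain."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


for message in ["You are a winner! Claim at http://example.com now please",
                "See you tomorrow"]:
    print(weak_label(message), message)
```

Each function is allowed to abstain, so individual rules can be narrow and imprecise. The noisy labels that come out of the vote are what a downstream discriminative model would be trained on.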

Advantages

Reduced need for manual labeling. The use of scripts and a data analysis engine allows for automation of labeling.

Disadvantages

Lower accuracy of labels. The quality of a program-labeled dataset may suffer.

Dataset labeling tools

A variety of browser- and desktop-based labeling tools are available off the shelf. If the functionality they offer fits your needs, you can skip costly and time-consuming software development and choose the one that’s best for you.

Some of the tools include both free and paid packages. A free solution usually offers basic annotation instruments and a certain level of customization of labeling interfaces, but limits the number of export formats and the number of images you can process during a fixed period. In a premium package, developers may include additional features like APIs, a higher level of customization, etc.

Image and video labeling

Let’s start with some of the most commonly used tools aimed at the faster, simpler completion of machine vision tasks.

Annotorious. Annotorious is a free, MIT-licensed web image annotation and labeling tool. It allows for adding text comments and drawings to images on a website. The tool can be easily integrated with only two lines of additional code. Users can learn about the tool’s features and complete various annotation tasks in the Demos section.

Demo where a user can make a rectangular selection by dragging a box and saving it on an image

The Just the Basics demo shows its key functionality: image annotation with bounding boxes. OpenLayers Annotation explains how to process maps and high-resolution zoomable images. With the beta OpenSeadragon feature, users can also label such images by using Annotorious with the OpenSeadragon web-based viewer.

Developers are working on the Annotorious Selector Pack plugin. It will include image selection tools like polygon selection (custom shape labels), freehand, point, and Fancy Box selection. The latter tool allows users to darken the rest of the image while they drag the box.

Annotorious can be modified and extended through a number of plugins to make it suitable for a project’s needs.

Developers encourage users to evaluate and improve Annotorious, then share their findings with the community.

LabelMe. LabelMe is another open online tool. Its developers note that the software is meant to assist users in building image databases for computer vision research.

When we talk about an online tool, we usually mean working with it on a desktop. However, LabelMe developers also wanted to reach mobile users and created an app of the same name. It’s available on the App Store and requires registration.

Two galleries — the Labels and the Detectors — represent the tool’s functionality. The former is used for image collection, storage, and labeling. The latter allows for training object detectors able to work in real time.

Users can also download the MATLAB toolbox that is designed for working with images in the LabelMe public dataset. Developers encourage users to contribute to a dataset. Even a small input counts, they say.

The tool’s desktop version with labeled image from the dataset

Sloth. Sloth is a free tool with a high level of flexibility. It allows users to label image and video files for computer vision research. Face recognition is one of Sloth’s common use cases. So, if you need to develop software that can track and accurately identify a person in surveillance videos, or determine whether he or she has appeared in recordings before, you can label the training data for it with Sloth.

Users can add an unlimited number of labels per image or video frame where every label is a set of key-value pairs. The possibility of adding more key-value pairs allows for more detailed file processing. For example, users can add a key “type” that differentiates point labels from the labels for the left or right eye.
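A label built from key-value pairs might look like the sketch below. This only illustrates the idea and is not Sloth's documented file format: the file name, the coordinate values, and every key other than "type" (which the example above mentions) are hypothetical.

```python
import json

# One video frame with three labels; each label is a set of key-value pairs.
frame_labels = {
    "filename": "frame_0001.png",  # hypothetical file name
    "annotations": [
        {"class": "point", "type": "left_eye", "x": 112, "y": 80},
        {"class": "point", "type": "right_eye", "x": 148, "y": 81},
        {"class": "rect", "type": "face", "x": 90, "y": 40, "width": 85, "height": 110},
    ],
}


def labels_of_type(frame, wanted_type):
    """Return the labels whose "type" key matches wanted_type."""
    return [a for a in frame["annotations"] if a.get("type") == wanted_type]


# The extra "type" key is what separates the two otherwise similar point labels.
print(labels_of_type(frame_labels, "left_eye"))

# Key-value labels serialize naturally to JSON for storage or export.
serialized = json.dumps(frame_labels, indent=2)
```

Adding another key, for example a confidence score or a person identifier, would not require changing the structure, which is why this representation scales to more detailed annotation.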

Sloth supports various image selection tools, such as points, rectangles, and polygons. Developers consider the software a framework and a set of standard components, so users can customize these components to create a labeling tool that meets their specific needs.

VoTT. Visual Object Tagging Tool (VoTT) by Microsoft allows for processing images and videos. Labeling is one of the model development stages that VoTT supports. This tool also allows data scientists to train and validate object detection models.

Users configure how annotation works: for example, they can add several labels per file (as in Sloth) and choose between square and rectangular bounding boxes. Besides that, the software saves tags each time a video frame or image is changed.

Other tools worth checking out include Labelbox, Alp’s Labeling Tool, Comma Coloring, imglab, Pixorize, VGG Image Annotator (VIA), Demon image annotation plugin, FastAnnotationTool, RectLabel, and ViPER-GT.

Text labeling

These tools streamline the labeling workflow for NLP-related tasks, such as sentiment analysis, entity linking, text categorization, syntactic parsing, and part-of-speech tagging.

Labelbox mentioned above can also be used for text labeling. Besides providing basic labeling options, the tool allows for development, installation, and maintenance of custom labeling interfaces.

Stanford CoreNLP. Data scientists share their developments and knowledge voluntarily and for free in many cases. The Stanford Natural Language Processing Group offers a free integrated NLP toolkit, Stanford CoreNLP, that allows for completing various text data preprocessing and analysis tasks.

Bella. Worth trying out, bella is another open tool aimed at simplifying and speeding up text data labeling. Usually, if a dataset was labeled in a CSV file or Google Sheets, specialists need to convert it to an appropriate format before model training. Bella’s features and simple interface make it a good substitute for spreadsheets and CSV files.

A graphical user interface (GUI) and a database backend for managing labeled data are bella’s main features.

A user creates and configures a project for every dataset he or she wants to label. Project settings include item visualization, types of labels (e.g., positive, neutral, and negative), and tags to be supported by the tool (e.g., tweets, Facebook reviews).

Tagtog. Tagtog is a startup that provides a web tool of the same name for automated text annotation and categorization. Customers can choose among three approaches: annotate text manually, hire a team that will label data for them, or use machine learning models for automated annotation.

Editor for manual text annotation with automatically adaptive interface

Both data science beginners and professionals can use Tagtog because it doesn’t require knowledge of coding and data engineering.

Dataturks. Dataturks is also a startup that provides training data preparation tools. Using its products, teams can perform such tasks as part-of-speech tagging, named-entity recognition tagging, text classification, moderation, and summarization. Dataturks offers an “upload data, invite collaborators, and start tagging” workflow and allows clients to forget about working with Google and Excel spreadsheets, as well as CSV files.

Three business plans are available for users. The first package is free but provides limited features. Two others are designed for small and large teams.

Besides text data, tools by Dataturks allow for labeling image, audio, and video data.

Specialists also recommend checking such services and tools as brat and SUTDAnnotator.

Audio labeling

You’ll need effective and easy-to-use labeling tools to train high-performance neural networks for sound recognition and music classification tasks. Here are some of them.

Praat. Praat is popular free software for labeling audio files. Using Praat, you can mark timepoints of events in the audio file and annotate these events with text labels in a lightweight and portable TextGrid file. This tool allows for working with both sound and text files at the same time, as text annotations are linked to the audio file. Data scientist Kristine M. Yu notes that a text file can be easily processed with any scripts for efficient batch processing and modified separately from an audio file.
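Because a TextGrid is plain text, a short script can pull the labeled intervals out of it. The sketch below assumes the file uses Praat's long text format with interval tiers, as in the embedded sample, and it ignores point tiers and labels containing escaped quotes. The sample content and the function name read_intervals are made up for this illustration.

```python
import re

# A small hand-written sample in the long TextGrid text format (assumed layout).
SAMPLE_TEXTGRID = '''File type = "ooTextFile"
Object class = "TextGrid"

xmin = 0
xmax = 2.3
tiers? <exists>
size = 1
item []:
    item [1]:
        class = "IntervalTier"
        name = "words"
        xmin = 0
        xmax = 2.3
        intervals: size = 2
        intervals [1]:
            xmin = 0
            xmax = 1.1
            text = "hello"
        intervals [2]:
            xmin = 1.1
            xmax = 2.3
            text = "world"
'''

# Matches one interval entry: its start time, end time, and text label.
INTERVAL_RE = re.compile(
    r'intervals \[\d+\]:\s*'
    r'xmin = ([0-9.]+)\s*'
    r'xmax = ([0-9.]+)\s*'
    r'text = "([^"]*)"'
)


def read_intervals(textgrid_text):
    """Return (start_seconds, end_seconds, label) tuples for every interval."""
    return [(float(start), float(end), label)
            for start, end, label in INTERVAL_RE.findall(textgrid_text)]


for start, end, label in read_intervals(SAMPLE_TEXTGRID):
    print(f"{start:.2f}-{end:.2f}s: {label}")
```

For batch processing, the same function could be applied to the contents of each TextGrid file in a folder, leaving the audio files untouched.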

Speechalyzer. This tool’s name, Speechalyzer, speaks for itself. The software is designed for manual processing of large speech datasets. As an example of its performance, the developers note that they have labeled several thousand audio files in almost real time.

EchoML. EchoML is another tool for audio file annotation. It allows users to visualize their data.

With so many tools available for labeling all types of data, choosing the one that fits your project best is not a simple task. Data science practitioners suggest considering such factors as setup complexity, labeling speed, and accuracy when making a choice.

Conclusion

Obtaining high-quality labeled data is a development barrier that becomes more significant when complex models must be built.

With a variety of annotation tools available online, the main challenge for a data science team is to estimate which software will work best for a specific project in terms of functionality and cost.

In addition to manual labeling approaches, data scientists have found new methods that partly automate the process and reduce the need for human involvement. We believe the development of these approaches will be the main trend in the near future.

Original. Reposted with permission.
