Building a Computer Vision Model: Approaches and datasets

How can we build a computer vision model using CNNs? What are existing datasets? And what are approaches to train the model? This article provides an answer to these essential questions when trying to understand the most important concepts of computer vision.

By Javier Couto, Tryolabs.

Computer vision is one of the hottest subfields of machine learning, given its wide variety of applications and tremendous potential. Its goal: to replicate the powerful capacities of human vision. But how is this achieved with algorithms?

Let's have a loot at the most important datasets and approaches.

Existing datasets

Computer vision algorithms are no magic. They need data to work, and they can only be as good as the data you feed in. These are different sources to collect the right data, depending on the task:

One of the most voluminous and well known dataset is ImageNet, a readily-available dataset of 14 million images manually annotated using WordNet concepts. Within the global dataset, 1 million images contain bounding box annotations.

ImageNet images with bounding boxes. Image source

ImageNet images with object attributes annotations. Image source

Another well-known one is the Microsoft Common Objects in Context (COCO), dataset, loaded with 328,000 images including 91 object types that would be easily recognizable by a 4 year old, with a total of 2.5 million labeled instances.

Examples of annotated images from the COCO dataset. Image source

While there isn’t a plethora of available datasets, there are several suitable for different tasks, such as the CelebFaces Attributes Dataset (CelebA, a face attributes dataset with more than 200K celebrity images); the Indoor Scene Recognition dataset (15,620 images of indoor scenes); and the Plant Image Analysis dataset (1 million images of plants from 11 different species).

A general strategy

Deep learning methods and techniques have profoundly transformed computer vision, along with other areas of artificial intelligence, to such an extent that for many tasks its use is considered standard. In particular, Convolutional Neural Networks (CNN) have achieved beyond state-of-the-art results utilizing traditional computer vision techniques.

These four steps outline a general approach to building a computer vision model using CNNs:

  1. Create a dataset comprised of annotated images or use an existing one. Annotations can be the image category (for a classification problem); pairs of bounding boxes and classes (for an object detection problem); or a pixel-wise segmentation of each object of interest present in an image (for an instance segmentation problem).
  2. Extract, from each image, features pertinent to the task at hand. This is a key point in modeling the problem. For example, the features used to recognize faces, features based on facial criteria, are obviously not the same as those used to recognize tourist attractions or human organs.
  3. Train a deep learning model based on the features isolated. Training means feeding the machine learning model many images and it will learn, based on those features, how to solve the task at hand.
  4. Evaluate the model using images that weren’t used in the training phase. By doing so, the accuracy of the training model can be tested.

This strategy is very basic but it serves the purpose well. Such an approach, known as supervised machine learning, requires a dataset that encompasses the phenomenon the model has to learn.

Training an object detection model

Viola and Jones approach

There are many ways to address object detection challenges. For years, the prevalent approach was one proposed by Paul Viola and Michael Jones in the paper, Robust Real-time Object Detection.

Although it can be trained to detect a diverse range of object classes, the approach was first motivated by the objective of face detection. It is so fast and straightforward that it was the algorithm implemented in point-and-shoot cameras, which allows for real-time face detection with little processing power.

The central feature of the approach is to train with a potentially large set of binary classifiers based on Haar features. These features represent edges and lines, and are extremely simple to compute when scanning an image.

Haar features. Image source

Although quite basic, in the specific case of faces these features allow for the capturing of important elements such as the nose, mouth, or the distance between the eyebrows. It is a supervised method that requires many positive and negative examples of the type of object to be discerned.

Detecting the face of the Mona Lisa

CNN-based approaches

Deep learning has been a real game changer in machine learning, especially in computer vision, where deep-learning-based approaches are now cutting edge for many of the usual tasks.

Among the different deep learning approaches proposed for accomplishing object detection, R-CNN (Regions with CNN features) is particularly simple to understand. The authors of this work propose a three stage process:

  1. Extract possible objects using a region proposal method.
  2. Identify features in each region using a CNN.
  3. Classify each region utilizing SVMs.

R-CNN Architecture. Image source

The region proposal method opted for in the original work was Selective Search, although the R-CNN algorithm is agnostic regarding the particular region proposal method adopted. Step 3 is very important as it decreases the number of object candidates, which makes the method less computationally expensive.

The features extracted here are less intuitive than the Haar features previously mentioned. To summarize, a CNN is used to extract a 4096-dimensional feature vector from each region proposal. Given the nature of the CNN, it is necessary that the input always have the same dimension. This is usually one of the CNN’s weak points and the various approaches address this in different ways. With respect to the R-CNN approach, the trained CNN architecture requires inputs of a fixed area of 227 × 227 pixels. Since the proposed regions have sizes that differ from this, the authors’ approach simply warps the images so that they fit the required dimension.

Examples of warped images matching the input dimension required by the CNN. Image source

While it achieved great results, the training encountered several obstacles, and the approach was eventually outperformed by others. Some of those are reviewed in depth in the article, Object Detection with Deep Learning: The Definitive Guide.

This article is an extract of the Introductory Guide to Computer Vision by Tryolabs, originally published here.

Bio: Javier Couto is a freelance machine learning consultant and data scientist at Tryolabs, specialized in applying machine learning to solve business problems.