Object Detection: An Overview in the Age of Deep Learning

As with many other computer vision problems, there still isn’t an obvious or even “best” way to approach object recognition, which means there’s still plenty of room for improvement.

Problems and challenges with object detection

Let’s take a closer look at the main challenges of object detection.

Variable number of objects

We already mentioned the variable number of objects, but we omitted why it’s a problem at all. When training machine learning models, you usually need to represent data as fixed-size vectors. Since the number of objects in the image is not known beforehand, we don’t know the correct number of outputs. Because of this, some post-processing is required, which adds complexity to the model.

Historically, the variable number of outputs has been tackled with a sliding-window approach: a window is moved across the image, and fixed-size features are computed for the window at each position. After getting all the predictions, some are discarded and some are merged to get the final result.
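
To make the idea concrete, here is a minimal sketch of a sliding window over a grayscale image stored as a list of rows. The function name, the window size, and the stride are assumptions chosen for illustration; a real detector would pass each window to a feature extractor and classifier.

```python
def sliding_windows(image, window_h, window_w, stride):
    """Yield (top, left, window) for every window position.

    `image` is a list of rows (a 2-D grayscale image). Every yielded window
    has the same fixed size, which is what lets a fixed-input model score it.
    """
    height, width = len(image), len(image[0])
    for top in range(0, height - window_h + 1, stride):
        for left in range(0, width - window_w + 1, stride):
            window = [row[left:left + window_w]
                      for row in image[top:top + window_h]]
            yield top, left, window


# Toy 4x6 image whose pixel value encodes its (row, column) position.
image = [[10 * r + c for c in range(6)] for r in range(4)]
windows = list(sliding_windows(image, window_h=2, window_w=2, stride=2))
positions = [(top, left) for top, left, _ in windows]
```

However many objects the image contains, the model only ever sees fixed-size inputs; the variable-length result comes from collecting the windows that score as positive.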

Sliding window example.
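
The article does not name a specific method for discarding overlapping predictions. One common choice is non-maximum suppression (NMS), sketched below under the assumption that boxes are `(x1, y1, x2, y2)` tuples and that each one has a confidence score. The 0.5 overlap threshold is an illustrative value, and this sketch only discards boxes; it does not merge them.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes to keep, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a better one.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Two near-duplicate boxes around one object, plus one separate box.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)
```

Here the second box overlaps heavily with the first and has a lower score, so only the first and third boxes survive.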



Object sizes

Another big challenge is the wide range of possible object sizes. When doing simple classification, you expect and want to classify objects that cover most of the image. On the other hand, some of the objects you may want to find could be as small as a dozen pixels (or a small percentage of the original image). Traditionally, this has been solved by using sliding windows of different sizes, which is simple but very inefficient.
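
A quick count shows where the inefficiency comes from. The image size, the window sizes, and the rule of using a stride of one quarter of the window are all assumptions picked for illustration.

```python
def count_windows(image_h, image_w, window_size, stride):
    """Number of square windows of `window_size` that fit at this stride."""
    if window_size > image_h or window_size > image_w:
        return 0
    rows = (image_h - window_size) // stride + 1
    cols = (image_w - window_size) // stride + 1
    return rows * cols


image_h, image_w = 480, 640
window_sizes = [32, 64, 128, 256]

# One classifier evaluation is needed per window, at every size.
counts = {size: count_windows(image_h, image_w, size, stride=size // 4)
          for size in window_sizes}
total = sum(counts.values())
```

With these settings a single 640x480 image already requires about 5,600 classifier evaluations, and most of them come from the smallest window size.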


Combining localization and classification

A third challenge is solving two problems at the same time. How do we combine the two different types of requirements, localization and classification, into ideally a single model?

Before diving into deep learning and how to tackle these challenges, let’s do a quick rundown of the classical methods.

Classical approach

Although there have been many different types of methods throughout the years, we want to focus on the two most popular ones (which are still widely used).

The first one is the Viola-Jones framework, proposed in 2001 by Paul Viola and Michael Jones in the paper Robust Real-time Object Detection. The approach is fast and relatively simple, so much so that it’s the algorithm implemented in point-and-shoot cameras, where it allows real-time face detection with little processing power.

We won’t go into details on how it works or how to train it, but at a high level, it works by generating many (possibly thousands of) simple binary classifiers using Haar features. These classifiers are evaluated in a cascade over a multi-scale sliding window, and a window is dropped early as soon as one of them classifies it as negative.
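
The two ingredients mentioned above, Haar features and a cascade with early rejection, can be sketched as follows. This is a simplified illustration, not the trained Viola-Jones detector: each stage here is a single thresholded two-rectangle feature, whereas real stages combine many weak classifiers, and the thresholds and toy window are made up.

```python
def integral_image(image):
    """Summed-area table with an extra zero row and column."""
    height, width = len(image), len(image[0])
    table = [[0] * (width + 1) for _ in range(height + 1)]
    for r in range(height):
        row_sum = 0
        for c in range(width):
            row_sum += image[r][c]
            table[r + 1][c + 1] = table[r][c + 1] + row_sum
    return table


def rect_sum(table, top, left, height, width):
    """Sum of the pixels in a rectangle, using four table lookups."""
    bottom, right = top + height, left + width
    return (table[bottom][right] - table[top][right]
            - table[bottom][left] + table[top][left])


def two_rect_feature(table, top, left, height, width):
    """Haar-like feature: left half minus right half of a rectangle."""
    half = width // 2
    return (rect_sum(table, top, left, height, half)
            - rect_sum(table, top, left + half, height, half))


def cascade_accepts(table, stages):
    """Run the stages in order and stop at the first one that rejects."""
    for feature_args, threshold in stages:
        if two_rect_feature(table, *feature_args) < threshold:
            return False  # early rejection: later stages are never evaluated
    return True


# Toy 4x4 window: bright left half, dark right half.
window = [[9, 9, 1, 1] for _ in range(4)]
table = integral_image(window)

# Hypothetical stages: ((top, left, height, width), threshold).
stages = [((0, 0, 4, 4), 20), ((0, 0, 2, 4), 10)]
accepted = cascade_accepts(table, stages)
```

Because most windows in an image contain no object, rejecting them after one or two cheap features is what keeps the overall detector fast.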

Another traditional and similar method uses Histogram of Oriented Gradients (HOG) features with a Support Vector Machine (SVM) for classification. It still requires a multi-scale sliding window, and even though it’s superior to Viola-Jones, it’s much slower.
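
As a rough illustration of what a HOG feature captures, the sketch below builds the orientation histogram for a single cell. It is deliberately simplified: a full HOG descriptor also interpolates votes between neighboring bins and normalizes histograms over blocks of cells before they are passed to the SVM. The 9 bins and the toy cell are assumptions for the example.

```python
import math


def cell_histogram(cell, num_bins=9):
    """Orientation histogram of one cell, weighted by gradient magnitude.

    Uses unsigned orientations in [0, 180) degrees and skips border pixels
    so that central differences are always defined.
    """
    bin_width = 180.0 / num_bins
    histogram = [0.0] * num_bins
    for r in range(1, len(cell) - 1):
        for c in range(1, len(cell[0]) - 1):
            gx = cell[r][c + 1] - cell[r][c - 1]
            gy = cell[r + 1][c] - cell[r - 1][c]
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            histogram[int(angle // bin_width) % num_bins] += magnitude
    return histogram


# Intensity increases left to right, so every gradient points along x.
cell = [[10 * c for c in range(6)] for _ in range(6)]
hist = cell_histogram(cell)
```

For this cell all of the gradient energy lands in the first bin (orientations near 0 degrees); a cell with a top-to-bottom intensity ramp would instead fill the bin around 90 degrees.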

Deep learning approach

It’s not news that deep learning has been a real game changer in machine learning, especially in computer vision. Just as deep learning models have crushed classical models on the task of image classification, they are now state of the art in object detection as well.

Now that you probably have a better intuition for what the challenges are and how to tackle them, we will give an overview of how the deep learning approach has evolved in the last couple of years.


OverFeat

One of the first advances in using deep learning for object detection was OverFeat from NYU, published in 2013. Its authors proposed a multi-scale sliding window algorithm using Convolutional Neural Networks (CNNs).


R-CNN

Shortly after OverFeat, Regions with CNN features, or R-CNN, from Ross Girshick et al. at UC Berkeley was published, boasting an almost 50% improvement on the object detection challenge. What they proposed was a three-stage approach:

  • Extract possible objects using a region proposal method (the most popular one being Selective Search).
  • Extract features from each region using a CNN.
  • Classify each region with SVMs.

R-CNN Architecture. Girshick, Ross, et al. "Rich feature hierarchies for accurate object detection and semantic segmentation." 2014.
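
The three stages can be sketched as a small pipeline. Everything concrete here is a stand-in: `propose_regions` and `extract_features` are hypothetical callables playing the roles of Selective Search and the CNN, and the per-class scorer is a plain linear decision function of the kind a linear SVM would learn.

```python
def rcnn_detect(image, propose_regions, extract_features, class_weights):
    """Three-stage sketch: propose regions, featurize, score per class.

    `class_weights` maps a class name to the (weights, bias) of a linear
    scorer. Regions whose best score is not positive are treated as
    background and dropped.
    """
    detections = []
    for box in propose_regions(image):               # stage 1: proposals
        features = extract_features(image, box)      # stage 2: features
        scores = {                                   # stage 3: classifiers
            name: sum(w * f for w, f in zip(weights, features)) + bias
            for name, (weights, bias) in class_weights.items()
        }
        best = max(scores, key=scores.get)
        if scores[best] > 0:
            detections.append((box, best, scores[best]))
    return detections


# Toy stand-ins so the sketch runs end to end.
def toy_proposals(image):
    return [(0, 0, 2, 2), (2, 2, 4, 4)]  # (x1, y1, x2, y2)


def toy_features(image, box):
    x1, y1, x2, y2 = box
    pixels = [image[y][x] for y in range(y1, y2) for x in range(x1, x2)]
    return [sum(pixels) / len(pixels)]  # mean intensity as a 1-D "feature"


image = [[8, 8, 0, 0],
         [8, 8, 0, 0],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]
class_weights = {"bright": ([1.0], -4.0)}
detections = rcnn_detect(image, toy_proposals, toy_features, class_weights)
```

Note that the feature extractor runs once per proposed region, which is why the number of proposals has such a large effect on the cost of this design.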

While it achieved great results, training was cumbersome. To train it, you first had to generate proposals for the training dataset, apply the CNN feature extraction to every single one (the extracted features usually take over 200GB for the Pascal 2012 train dataset), and then finally train the SVM classifiers.

Fast R-CNN

This approach quickly evolved into a purer deep learning one when, a year later, Ross Girshick (now at Microsoft Research) published Fast R-CNN. Similar to R-CNN, it used Selective Search to generate object proposals, but instead of extracting features from each of them independently and using SVM classifiers, it applied the CNN to the complete image and then used Region of Interest (RoI) Pooling on the feature map, followed by a final feed-forward network for classification and regression. Not only was this approach faster, but having the RoI Pooling layer and the fully connected layers allowed the model to be end-to-end differentiable and easier to train. The biggest downside was that the model still relied on Selective Search (or any other region proposal algorithm), which became the bottleneck when using it for inference.

Fast R-CNN Architecture. Girshick, Ross. "Fast R-CNN" 2015.
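
A minimal sketch of RoI max pooling on a single-channel feature map is shown below. The key property is that regions of any size are reduced to the same fixed output grid, which is what the fully connected layers need. The bin-splitting rule (floor for the start, ceiling for the end) is one reasonable choice, and the function assumes a non-empty region given as `(top, left, bottom, right)` with exclusive bottom and right.

```python
def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool the region `roi` of a 2-D feature map into out_h x out_w."""
    top, left, bottom, right = roi
    roi_h, roi_w = bottom - top, right - left
    pooled = []
    for i in range(out_h):
        # Bin edges in feature-map rows: floor for start, ceiling for end.
        r0 = top + (i * roi_h) // out_h
        r1 = top + ((i + 1) * roi_h + out_h - 1) // out_h
        row = []
        for j in range(out_w):
            c0 = left + (j * roi_w) // out_w
            c1 = left + ((j + 1) * roi_w + out_w - 1) // out_w
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled


feature_map = [[1, 2, 3, 4, 5, 6],
               [7, 8, 9, 10, 11, 12],
               [13, 14, 15, 16, 17, 18],
               [19, 20, 21, 22, 23, 24]]

# Two regions of different sizes both become 2x2 outputs.
large = roi_max_pool(feature_map, (0, 1, 4, 6), out_h=2, out_w=2)
small = roi_max_pool(feature_map, (1, 0, 3, 2), out_h=2, out_w=2)
```

Since the CNN runs once on the whole image and every proposal only indexes into the shared feature map, the per-region cost drops to this cheap pooling step plus the final layers.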



YOLO

Shortly after that, the You Only Look Once: Unified, Real-Time Object Detection (YOLO) paper was published by Joseph Redmon (with Girshick appearing as one of the co-authors). YOLO proposed a simple convolutional neural network approach that achieved both great results and high speed, allowing real-time object detection for the first time.

YOLO Architecture. Redmon, Joseph, et al. "You only look once: Unified, real-time object detection." 2016.


Faster R-CNN

Subsequently came Faster R-CNN, authored by Shaoqing Ren (and co-authored by Girshick, now at Facebook Research), the third iteration of the R-CNN series. Faster R-CNN added what the authors called a Region Proposal Network (RPN), in an attempt to get rid of the Selective Search algorithm and make the model completely trainable end-to-end. We won’t go into details on what the RPN does, but in short, its task is to output candidate objects based on an “objectness” score. These candidates are then used by the RoI Pooling and fully connected layers for classification. We will discuss the architecture in much more detail in a subsequent post.

Faster R-CNN Architecture. Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." 2015.
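
The last step of that idea, turning scored candidates into a short list of proposals, can be sketched as follows. This only illustrates ranking by objectness; how an RPN generates candidate boxes and predicts their scores is omitted, and the function name, the limit of two proposals, and the 0.05 cutoff are assumptions for the example.

```python
def top_proposals(boxes, objectness, max_proposals, min_score=0.0):
    """Keep the highest-objectness boxes, best first."""
    ranked = sorted(zip(objectness, boxes),
                    key=lambda pair: pair[0], reverse=True)
    kept = [box for score, box in ranked if score >= min_score]
    return kept[:max_proposals]


# Hypothetical candidate boxes with the scores a network might assign.
candidate_boxes = [(0, 0, 10, 10), (5, 5, 40, 40),
                   (20, 20, 30, 60), (1, 1, 3, 3)]
objectness = [0.10, 0.95, 0.60, 0.02]

proposals = top_proposals(candidate_boxes, objectness,
                          max_proposals=2, min_score=0.05)
```

Only the selected proposals are passed on to RoI Pooling and the classification layers, so this ranking directly controls how much work the second half of the network has to do.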



SSD and R-FCN

Finally, there are two notable papers: Single Shot Detector (SSD), which takes on YOLO by using convolutional feature maps of multiple sizes to achieve better results and speed, and Region-based Fully Convolutional Networks (R-FCN), which takes the architecture of Faster R-CNN but uses only convolutional networks.

Importance of datasets

Datasets play a very important (and sometimes underrated) role in research. Every time a new dataset is released, papers are released, and new models are compared and often improved upon, pushing the limits of what’s possible.

Unfortunately, there aren’t enough datasets for object detection. Data is harder (and more expensive) to generate, companies probably don’t feel like freely giving away their investment, and universities do not have that many resources.

There are still some great ones; below is a list of the main available datasets.




In conclusion, there are many opportunities in object detection, both in unseen applications and in new methods for pushing state-of-the-art results. Even though this was just a general overview of object detection, we hope it gives you a basic understanding and a baseline for getting deeper knowledge (no pun intended).

In the following weeks, we’ll do a series where we’ll go into details regarding implementation, evaluation metrics, and training these big models. We’ll also venture past object detection into other types of problems that sprout from it.


PS: one more thing... at Tryolabs we are working on an open-source computer vision toolkit using some of the models mentioned. So if you want a pre-release look at it, you should definitely stay tuned.

Bio: Javier Rey is a Research Engineer at Tryolabs.

Original. Reposted with permission.