A 2019 Guide to Semantic Segmentation

Semantic segmentation refers to the process of linking each pixel in an image to a class label. These labels could include a person, car, flower, piece of furniture, etc., just to mention a few. We’ll now look at a number of research papers covering state-of-the-art approaches to building semantic segmentation models.



Multi-Scale Context Aggregation by Dilated Convolutions (ICLR, 2016)

 
In this paper, the authors develop a convolutional network module that aggregates multi-scale contextual information without losing resolution. The module can be plugged into existing architectures at any resolution and is based on dilated convolutions.
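The core idea can be sketched in a few lines of PyTorch. The layer count, channel width, and dilation schedule below are illustrative assumptions rather than the paper’s exact configuration; the point is that stacking 3×3 convolutions with growing dilation rates expands the receptive field while the feature map size stays fixed.

```python
import torch
import torch.nn as nn

class ContextModule(nn.Module):
    """Illustrative multi-scale context module: a stack of 3x3 convolutions
    whose dilation rates grow exponentially, so the receptive field expands
    without any pooling, striding, or loss of resolution."""
    def __init__(self, channels, dilations=(1, 1, 2, 4, 8, 16, 1)):
        super().__init__()
        layers = []
        for d in dilations:
            # padding == dilation keeps the spatial size unchanged for 3x3 kernels
            layers += [nn.Conv2d(channels, channels, kernel_size=3,
                                 padding=d, dilation=d),
                       nn.ReLU(inplace=True)]
        # final 1x1 convolution maps back to per-class score maps
        layers.append(nn.Conv2d(channels, channels, kernel_size=1))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

# The module preserves resolution, so it can sit on top of the coarse
# score maps produced by an existing segmentation front end.
scores = torch.randn(1, 21, 64, 64)      # e.g. 21 Pascal VOC classes
refined = ContextModule(21)(scores)
print(refined.shape)                      # torch.Size([1, 21, 64, 64])
```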


The module was tested on the Pascal VOC 2012 dataset, where the results show that adding the context module to existing semantic segmentation architectures improves their accuracy.

 

The front-end module trained in the experiments achieves 69.8% mean IoU on the VOC-2012 validation set and 71.3% mean IoU on the test set; the paper also reports its prediction accuracy on individual object classes.

 

 

DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs (TPAMI, 2017)

 
In this paper the authors make the following contributions to the task of semantic segmentation with deep learning:

  • Convolutions with upsampled filters (‘atrous convolution’) for dense prediction tasks.
  • Atrous spatial pyramid pooling (ASPP) for segmenting objects at multiple scales (see the sketch after this list).
  • Improved localization of object boundaries by combining DCNN responses with fully connected CRFs.
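As a rough illustration of ASPP, the PyTorch sketch below runs parallel 3×3 atrous convolutions with different rates over the same feature map and fuses the branches with a 1×1 convolution. The rates, channel sizes, and fusion-by-concatenation are assumptions made here for clarity; the published DeepLab variants differ in details such as branch fusion, batch normalization, and an image-level pooling branch.

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    """Simplified atrous spatial pyramid pooling: parallel 3x3 atrous
    convolutions with different rates look at the same feature map with
    different effective fields of view, and a 1x1 convolution fuses them."""
    def __init__(self, in_ch, out_ch, rates=(6, 12, 18, 24)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r)
            for r in rates
        ])
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, kernel_size=1)

    def forward(self, x):
        # each branch responds to objects at a different scale;
        # concatenation lets the projection mix the multi-scale evidence
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))

features = torch.randn(1, 2048, 33, 33)   # e.g. a ResNet backbone output
out = ASPP(2048, 256)(features)
print(out.shape)                           # torch.Size([1, 256, 33, 33])
```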


The paper’s proposed DeepLab system achieves 79.7% mIoU on the PASCAL VOC-2012 semantic image segmentation task.

 

The paper tackles the main challenges of using deep CNNs in semantic segmentation, which include:

  • Reduced feature resolution caused by a repeated combination of max-pooling and downsampling.
  • Existence of objects at multiple scales.
  • Reduced localization accuracy caused by DCNN invariance: an object-centric classifier requires invariance to spatial transformations, which limits spatial precision for dense prediction.

 

Atrous convolution can be implemented in one of two ways: by upsampling the filters, inserting zeros between the filter taps, or by sparsely sampling the input feature maps. The second approach subsamples the input feature map by a factor equal to the atrous rate r, deinterlacing it into r² reduced-resolution maps, one for each of the r×r possible shifts. A standard convolution is then applied to these intermediate feature maps, which are re-interlaced to the image’s original resolution.
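The “filter upsampling” view is easy to verify numerically: a 3×3 convolution with dilation rate r is equivalent to a dense convolution with a kernel that has r−1 zeros inserted between its taps. The snippet below is a minimal check of that equivalence, not an excerpt from the paper’s implementation.

```python
import torch
import torch.nn.functional as F

# Numerical check of the "filter upsampling" view of atrous convolution:
# a 3x3 convolution with dilation rate r equals a dense convolution with a
# kernel whose taps have r - 1 zeros inserted between them.
torch.manual_seed(0)
x = torch.randn(1, 1, 16, 16)
w = torch.randn(1, 1, 3, 3)
r = 2

# (a) native atrous/dilated convolution
y_atrous = F.conv2d(x, w, dilation=r)

# (b) explicitly upsample the kernel with zeros, then convolve densely
w_up = torch.zeros(1, 1, 2 * r + 1, 2 * r + 1)   # 5x5 kernel for r = 2
w_up[:, :, ::r, ::r] = w
y_dense = F.conv2d(x, w_up)

print(torch.allclose(y_atrous, y_dense))          # True
```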

 

 

Rethinking Atrous Convolution for Semantic Image Segmentation (2017)

 
This paper addresses two challenges (mentioned previously) in using DCNNs for semantic segmentation: the reduced feature resolution that occurs when consecutive pooling operations are applied, and the existence of objects at multiple scales.


To address the first problem, the paper uses atrous convolution, also known as dilated convolution, to extract denser feature maps. To address the second, it applies atrous convolution at multiple rates to enlarge the field of view and thereby capture multi-scale context.
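The field-of-view argument comes down to a simple formula: a k×k filter with dilation rate r covers k + (k−1)(r−1) pixels, so larger rates see more context with no extra parameters and no loss of resolution. A quick illustrative calculation:

```python
# Effective size of a k x k filter applied with dilation rate r:
# its taps span k + (k - 1) * (r - 1) pixels, so larger rates enlarge the
# field of view with no extra parameters and no downsampling.
def effective_kernel_size(k, r):
    return k + (k - 1) * (r - 1)

for rate in (1, 2, 4, 8):
    print(rate, effective_kernel_size(3, rate))   # 3, 5, 9, 17
```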

 

The paper’s ‘DeepLabv3’ achieves a performance of 85.7% on the PASCAL VOC 2012 test set without DenseCRF post-processing.

 

 

Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation (ECCV, 2018)

 
This paper’s approach, ‘DeepLabv3+’, achieves test-set performances of 89.0% on PASCAL VOC 2012 and 82.1% on Cityscapes without any post-processing. The model extends DeepLabv3 by adding a simple decoder module to refine the segmentation results.


 

The paper contrasts two families of networks for semantic segmentation: those built around a spatial pyramid pooling module, which capture contextual information by pooling features at different resolutions, and encoder-decoder structures, which recover sharp object boundaries. DeepLabv3+ combines the strengths of both.
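The building block that makes this affordable is the atrous separable convolution: a depthwise convolution with a dilation rate followed by a pointwise 1×1 convolution. The PyTorch sketch below shows the idea; the kernel size, rate, and channel counts are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AtrousSeparableConv(nn.Module):
    """Sketch of an atrous separable convolution: a depthwise 3x3 convolution
    with a dilation rate (one filter per channel), followed by a pointwise
    1x1 convolution that mixes channels. It keeps the enlarged field of view
    of atrous convolution at a fraction of the computational cost."""
    def __init__(self, in_ch, out_ch, rate):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3,
                                   padding=rate, dilation=rate,
                                   groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

x = torch.randn(1, 256, 33, 33)
y = AtrousSeparableConv(256, 256, rate=6)(x)
print(y.shape)   # torch.Size([1, 256, 33, 33])
```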

 

 

FastFCN: Rethinking Dilated Convolution in the Backbone for Semantic Segmentation (2019)

 
This paper proposes a joint upsampling module named Joint Pyramid Upsampling (JPU) to replace the dilated convolutions in the backbone, which consume a lot of time and memory. It works by formulating the extraction of high-resolution feature maps as a joint upsampling problem.


This method achieves an mIoU of 53.13% on the Pascal Context dataset while running about three times faster than comparable approaches that rely on dilated convolutions.

 

The method uses a fully convolutional network (FCN) as the backbone while applying the JPU to upsample the low-resolution final feature maps into high-resolution feature maps. Replacing the dilated convolutions with the JPU does not result in any loss of performance.

 

Joint upsampling takes a low-resolution target image and a high-resolution guidance image, and generates a high-resolution target image by transferring the structure and details of the guidance image.
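A very rough sketch of that idea in PyTorch is shown below: low-resolution deep features are upsampled to the resolution of an earlier ‘guidance’ feature map, concatenated with it, and processed by parallel dilated convolutions. The channel sizes, dilation rates, and two-level fusion are simplifying assumptions; the actual JPU fuses three backbone stages and uses separable convolutions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleJPU(nn.Module):
    """Rough two-level sketch of joint upsampling: low-resolution deep
    features are upsampled to the resolution of an earlier guidance feature
    map, concatenated with it, and processed by parallel dilated convolutions
    to approximate the high-resolution map a dilated backbone would produce."""
    def __init__(self, low_ch, guide_ch, out_ch, rates=(1, 2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(low_ch + guide_ch, out_ch, kernel_size=3,
                      padding=r, dilation=r)
            for r in rates
        ])

    def forward(self, low, guide):
        # bring the deep, low-resolution features up to the guidance resolution
        low_up = F.interpolate(low, size=guide.shape[-2:],
                               mode='bilinear', align_corners=False)
        fused = torch.cat([low_up, guide], dim=1)
        return torch.cat([b(fused) for b in self.branches], dim=1)

low = torch.randn(1, 2048, 16, 16)      # stride-32 backbone output
guide = torch.randn(1, 1024, 32, 32)    # stride-16 feature map
out = SimpleJPU(2048, 1024, 128)(low, guide)
print(out.shape)                         # torch.Size([1, 512, 32, 32])
```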

 

Improving Semantic Segmentation via Video Propagation and Label Relaxation (CVPR, 2019)

 
This paper proposes a video-based method to scale up the training set by synthesizing new training samples, with the aim of improving the accuracy of semantic segmentation networks. It exploits the ability of video prediction models to predict future frames in order to also predict future labels.


 

The paper demonstrates that training segmentation networks on datasets augmented with the synthesized samples leads to improved prediction accuracy. The methods proposed in this paper achieve mIoUs of 83.5% on Cityscapes and 82.9% on CamVid.

 

The paper proposes two ways of creating new training samples from propagated labels:

  • Label Propagation (LP), which pairs a propagated label with the original future frame.
  • Joint image-label Propagation (JP), which pairs a propagated label with the corresponding propagated image.

The paper makes three main propositions: utilizing video prediction models to propagate labels to immediate neighbor frames, introducing joint image-label propagation to deal with the misalignment problem, and relaxing one-hot label training by maximizing the likelihood of the union of class probabilities along object boundaries.
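The label relaxation idea can be written down compactly: along boundary pixels, rather than penalizing everything except a single one-hot class, the loss maximizes the probability mass assigned to the union of the classes found in the pixel’s neighborhood. The sketch below is a simplified rendering of that loss under that description; the candidate mask (which classes count as neighbors of each pixel) is a hypothetical input left to the caller.

```python
import torch

def relaxed_boundary_loss(logits, candidate_mask):
    """Simplified boundary label relaxation: for each boundary pixel,
    maximise the likelihood of the union of the classes present in its
    neighbourhood, i.e. minimise -log(sum of their softmax probabilities),
    instead of forcing a single one-hot target.

    logits:         (N, C, H, W) raw network outputs
    candidate_mask: (N, C, H, W) with 1 for every class that appears in the
                    pixel's local neighbourhood (prepared by the caller)
    """
    probs = torch.softmax(logits, dim=1)
    union_prob = (probs * candidate_mask).sum(dim=1).clamp_min(1e-8)
    return -union_prob.log().mean()

logits = torch.randn(2, 19, 8, 8, requires_grad=True)   # e.g. 19 Cityscapes classes
mask = torch.zeros(2, 19, 8, 8)
mask[:, 0] = 1   # e.g. "road" and "sidewalk" both plausible along their shared edge
mask[:, 1] = 1
loss = relaxed_boundary_loss(logits, mask)
loss.backward()
print(float(loss))
```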

 

 

Gated-SCNN: Gated Shape CNNs for Semantic Segmentation (2019)

 
This paper is the newest kid on the semantic segmentation block. The authors propose a two-stream CNN architecture in which shape information is processed as a separate branch. This shape stream handles only boundary-related information, which is enforced by the model’s Gated Convolution Layer (GCL) and local supervision.
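A gated convolutional layer can be sketched roughly as follows: features from the regular stream are concatenated with the shape-stream features, a 1×1 convolution plus sigmoid produces an attention map, and that map gates the shape stream before a residual update. The sketch assumes both streams are already at the same spatial resolution and uses illustrative channel counts.

```python
import torch
import torch.nn as nn

class GatedConvLayer(nn.Module):
    """Loose sketch of a gated convolutional layer: features from the regular
    stream produce a sigmoid attention map that gates the shape stream, so
    only boundary-relevant activations are kept."""
    def __init__(self, shape_ch, regular_ch):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Conv2d(shape_ch + regular_ch, 1, kernel_size=1),
            nn.Sigmoid(),
        )
        self.update = nn.Conv2d(shape_ch, shape_ch, kernel_size=1)

    def forward(self, shape_feat, regular_feat):
        alpha = self.attention(torch.cat([shape_feat, regular_feat], dim=1))
        # gate the shape stream and keep a residual connection
        return self.update(shape_feat * alpha) + shape_feat

shape_feat = torch.randn(1, 32, 64, 64)     # shape-stream features
regular_feat = torch.randn(1, 256, 64, 64)  # regular-stream features (same resolution assumed)
out = GatedConvLayer(32, 256)(shape_feat, regular_feat)
print(out.shape)   # torch.Size([1, 32, 64, 64])
```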


 

Evaluated on the Cityscapes benchmark, this model outperforms DeepLabv3+ by 1.5% in mIoU and 4% in boundary F-score, and achieves an improvement of 7% in IoU on smaller and thinner objects.

The paper includes a table comparing the performance of Gated-SCNN to other models.

 

 

Conclusion

 
We should now be up to speed on some of the most common — and a couple of very recent — techniques for performing semantic segmentation in a variety of contexts.

The papers/abstracts mentioned and linked to above also contain links to their code implementations. We’d be happy to see the results you obtain after testing them.

 
Bio: Derrick Mwiti is a data analyst, a writer, and a mentor. He is driven by delivering great results in every task, and is a mentor at Lapid Leaders Africa.

Original. Reposted with permission.
