Inside Deep Learning: Computer Vision With Convolutional Neural Networks

Deep Learning-powered image recognition is now performing better than human vision on many tasks. We examine how human and computer vision extract features from raw pixels, and explain why deep convolutional neural networks work so well.



The actual detection process works by taking the convolution of the filter with the original image. The figure above shows the results of this convolution (on the right). The outputs of the convolutions, which locate the positions of the features in the original image, are our feature maps.

To turn these primitives into concrete structures in a neural network, we can use the scheme described in the figure below. In this scheme, layers of neurons in a feed-forward neural net represent either the original image or a feature map. Filters represent combinations of connections (one such combination is highlighted) that get replicated across the entirety of the input. Connections of the same color are constrained to always have the same weight. We can achieve this by initializing all the connections in a group with identical weights and by always averaging the weight updates of a group before applying them at the end of each iteration. The output layer (colored orange) is the feature map generated by this filter. A neuron in the feature map is activated if the filter contributing to its activity detected an appropriate feature at the corresponding position in the previous layer.

[Figure: Computer vision feature map]

But we can get even fancier to perform more meaningful tasks. Let's say that two eyes, a nose, and a mouth characterize a face. If we want a feature map at layer m that detects a face, we'll need information from three different feature maps represented at the (m – 1)st layer of neurons. This means we must be able to construct filters that depend on entire volumes of information and can traverse different feature maps within a layer. These sorts of relationships can be captured by a full-fledged convolutional layer that does just that:

[Figure: Computer vision layers]

Now, as we move further into our convolutional net, we may want to sharpen the information in our feature maps. This is because when a feature is present in the previous layer, the resulting feature map tends to contain a central hotspot surrounded by a halo where the feature is weakly but not completely detected.
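To make the weight-sharing and volume ideas concrete, here is a minimal NumPy sketch (not code from the article itself) of a single filter sliding over an image to produce a feature map, and of a convolutional layer whose filters span the full depth of the previous layer's feature maps. The function names, shapes, and the use of valid cross-correlation (the operation most CNN implementations actually compute) are illustrative assumptions.

```python
import numpy as np

def feature_map(image, filt):
    """Slide one 2-D filter over a single-channel image (valid
    cross-correlation) and return the resulting feature map.
    The same weights are applied at every position: this is the
    weight sharing described above."""
    H, W = image.shape
    fh, fw = filt.shape
    out = np.zeros((H - fh + 1, W - fw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + fh, j:j + fw] * filt)
    return out

def conv_layer(volume, filters):
    """A full convolutional layer. `volume` stacks the feature maps of
    the previous layer, shape (depth, H, W); each filter spans that
    entire depth, so one output map can combine evidence from several
    input maps (e.g. eyes + nose + mouth -> face)."""
    maps = []
    for filt in filters:                      # filt shape: (depth, fh, fw)
        summed = sum(feature_map(volume[d], filt[d])
                     for d in range(volume.shape[0]))
        maps.append(np.maximum(summed, 0))    # ReLU nonlinearity
    return np.stack(maps)

# Hypothetical usage: one 28x28 input channel, eight 3x3 filters.
image = np.random.rand(1, 28, 28)
filters = np.random.rand(8, 1, 3, 3)
maps = conv_layer(image, filters)
print(maps.shape)                             # (8, 26, 26)
```

The nested loop makes the replication of a single set of weights across every position explicit; production libraries compute the same operation with heavily optimized kernels rather than Python loops.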