# An Intuitive Explanation of Convolutional Neural Networks

This article provides an easy-to-understand introduction to what convolutional neural networks are and how they work.

Another good way to understand the Convolution operation is by looking at the animation in **Figure 6** below:

###### Figure 6: The Convolution Operation. Source [9]

A filter (with the red outline) slides over the input image (the convolution operation) to produce a feature map. The convolution of another filter (with the green outline) over the same image gives a different feature map, as shown. It is important to note that the Convolution operation captures the local dependencies in the original image. Also notice how these two different filters generate different feature maps from the same original image. Remember that the image and the two filters above are just numeric matrices, as we discussed above.

In practice, a CNN *learns* the values of these filters on its own during the training process (although we still need to specify parameters such as the number of filters, the filter size, and the architecture of the network before training starts). The more filters we have, the more image features get extracted and the better our network becomes at recognizing patterns in unseen images.

The size of the Feature Map (Convolved Feature) is controlled by three parameters [4] that we need to decide before the convolution step is performed:

**Depth:** Depth corresponds to the number of filters we use for the convolution operation. In the network shown in **Figure 7**, we are performing convolution of the original boat image using three distinct filters, thus producing three different feature maps as shown. You can think of these three feature maps as stacked 2D matrices, so the ‘depth’ of the feature map would be three.

###### Figure 7

**Stride:** Stride is the number of pixels by which we slide our filter matrix over the input matrix. When the stride is 1, we move the filters one pixel at a time. When the stride is 2, the filters jump 2 pixels at a time as we slide them around. A larger stride produces smaller feature maps.

**Zero-padding:** Sometimes it is convenient to pad the input matrix with zeros around the border, so that we can apply the filter to the bordering elements of our input image matrix. A nice feature of zero-padding is that it allows us to control the size of the feature maps. Convolution with zero-padding is called *wide convolution*, and convolution without zero-padding is called *narrow convolution*. This has been explained clearly in [14].
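As a concrete sketch, the sliding-window arithmetic and the effect of stride and zero-padding can be written in a few lines of NumPy. The 5×5 image and 3×3 filter values below are made up for illustration; a real network would learn the filter values during training:

```python
import numpy as np

def conv2d(image, kernel, stride=1, padding=0):
    """Slide `kernel` over `image` to produce a feature map.

    Output size per dimension: (W - F + 2P) / S + 1, where
    W = input size, F = filter size, P = padding, S = stride.
    """
    if padding:
        image = np.pad(image, padding)  # zero-padding around the border
    H, W = image.shape
    F = kernel.shape[0]
    out_h = (H - F) // stride + 1
    out_w = (W - F) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i*stride:i*stride+F, j*stride:j*stride+F]
            out[i, j] = np.sum(patch * kernel)  # element-wise multiply, then sum
    return out

# Illustrative binary image and filter (both just numeric matrices)
image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]], dtype=float)
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]], dtype=float)

print(conv2d(image, kernel).shape)             # narrow convolution: (3, 3)
print(conv2d(image, kernel, padding=1).shape)  # wide convolution:   (5, 5)
print(conv2d(image, kernel, stride=2).shape)   # larger stride:      (2, 2)
```

Note how padding keeps the feature map the same size as the input, while a larger stride shrinks it, matching the formula in the docstring.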

#### Introducing Non-Linearity (ReLU)

An additional operation called ReLU has been used after every Convolution operation in **Figure 3** above. ReLU stands for Rectified Linear Unit and is a non-linear operation. Its output is given by:

###### Figure 8: The ReLU Operation

ReLU is an element-wise operation (applied per pixel) that replaces all negative pixel values in the feature map with zero. The purpose of ReLU is to introduce non-linearity into our ConvNet, since most of the real-world data we would want our ConvNet to learn is non-linear (Convolution is a linear operation – element-wise matrix multiplication and addition – so we account for non-linearity by introducing a non-linear function like ReLU).
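A minimal sketch of this element-wise operation, using made-up feature-map values:

```python
import numpy as np

# Illustrative feature map with both positive and negative values
feature_map = np.array([[ 15, -20,  -3],
                        [ -5,  18,  11],
                        [  6,  -1,  25]], dtype=float)

# ReLU: max(0, x) applied per pixel – negatives become 0, positives pass through
rectified = np.maximum(feature_map, 0)
print(rectified)
```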

The ReLU operation can be understood clearly from **Figure 9** below. It shows the ReLU operation applied to one of the feature maps obtained in **Figure 6** above. The output feature map here is also referred to as the ‘Rectified’ feature map.

###### Figure 9: ReLU operation. Source [10]

Other non-linear functions such as **tanh** or **sigmoid** can also be used instead of ReLU, but ReLU has been found to perform better in most situations.

#### The Pooling Step

Spatial Pooling (also called subsampling or downsampling) reduces the dimensionality of each feature map while retaining the most important information. Spatial Pooling can be of different types: Max, Average, Sum, etc.

In the case of Max Pooling, we define a spatial neighborhood (for example, a 2×2 window) and take the largest element from the rectified feature map within that window. Instead of taking the largest element, we could also take the average (Average Pooling) or the sum of all elements in that window. In practice, Max Pooling has been shown to work better.

**Figure 10** shows an example of the Max Pooling operation on a Rectified Feature map (obtained after the convolution + ReLU operation) using a 2×2 window.

###### Figure 10: Max Pooling. Source [4]

We slide our 2×2 window by 2 cells (also called the ‘stride’) and take the maximum value in each region. As shown in **Figure 10**, this reduces the dimensionality of our feature map.
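The 2×2, stride-2 Max Pooling just described can be sketched directly; the 4×4 rectified feature map below uses illustrative values:

```python
import numpy as np

def max_pool(fmap, size=2, stride=2):
    """Take the max over each size×size window, sliding by `stride` cells."""
    H, W = fmap.shape
    out_h = (H - size) // stride + 1
    out_w = (W - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = fmap[i*stride:i*stride+size, j*stride:j*stride+size]
            out[i, j] = window.max()  # keep only the largest element
    return out

# Illustrative 4×4 rectified feature map (all values non-negative, post-ReLU)
rectified = np.array([[1, 1, 2, 4],
                      [5, 6, 7, 8],
                      [3, 2, 1, 0],
                      [1, 2, 3, 4]], dtype=float)

print(max_pool(rectified))
# [[6. 8.]
#  [3. 4.]]
```

The 4×4 map shrinks to 2×2: each output cell keeps only the strongest activation in its window, which is what makes the result robust to small shifts of the input.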

In the network shown in **Figure 11**, the pooling operation is applied separately to each feature map (notice that, because of this, we get three output maps from three input maps).

###### Figure 11: Pooling applied to Rectified Feature Maps

**Figure 12** shows the effect of Pooling on the Rectified Feature Map we obtained after the ReLU operation in **Figure 9** above.

###### Figure 12: Pooling. Source [10]

The function of Pooling is to progressively reduce the spatial size of the input representation [4]. In particular, pooling

- makes the input representations (feature dimension) smaller and more manageable
- reduces the number of parameters and computations in the network, therefore controlling overfitting [4]
- makes the network invariant to small transformations, distortions and translations in the input image (a small distortion in the input will not change the output of Pooling – since we take the maximum / average value in a local neighborhood)
- helps us arrive at an almost scale-invariant representation of our image (the exact term is “equivariant”). This is very powerful since we can detect objects in an image no matter where they are located (read [18] and [19] for details).

#### Story so far

###### Figure 13

So far we have seen how Convolution, ReLU and Pooling work. It is important to understand that these layers are the basic building blocks of any CNN. As shown in **Figure 13**, we have two sets of Convolution, ReLU & Pooling layers – the 2nd Convolution layer performs convolution on the output of the first Pooling Layer using six filters to produce a total of six feature maps. ReLU is then applied individually to all of these six feature maps. We then perform the Max Pooling operation separately on each of the six rectified feature maps.
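This stacking can be sketched end-to-end. One detail the pictures gloss over: each filter in the second layer extends through the full depth of its input stack (here, depth 3), summing over all input maps to produce one output map. The input size, filter counts and random values below are purely illustrative (three filters in the first layer, six in the second, mirroring Figure 13):

```python
import numpy as np

rng = np.random.default_rng(0)

def conv_layer(stack, filters):
    """stack: (D, H, W); filters: (N, D, F, F) -> (N, H-F+1, W-F+1).

    Each of the N filters spans the full input depth D.
    """
    N, D, F, _ = filters.shape
    _, H, W = stack.shape
    out = np.zeros((N, H - F + 1, W - F + 1))
    for n in range(N):
        for i in range(H - F + 1):
            for j in range(W - F + 1):
                out[n, i, j] = np.sum(stack[:, i:i+F, j:j+F] * filters[n])
    return out

def relu(x):
    return np.maximum(x, 0)  # element-wise rectification

def max_pool(stack, s=2):
    # 2×2, stride-2 max pooling applied separately to each map in the stack
    D, H, W = stack.shape
    return stack.reshape(D, H // s, s, W // s, s).max(axis=(2, 4))

image = rng.standard_normal((1, 28, 28))  # one-channel (greyscale) input
f1 = rng.standard_normal((3, 1, 5, 5))    # 3 filters in the 1st conv layer
f2 = rng.standard_normal((6, 3, 5, 5))    # 6 filters in the 2nd conv layer

x = max_pool(relu(conv_layer(image, f1)))  # (3, 12, 12)
x = max_pool(relu(conv_layer(x, f2)))      # (6, 4, 4)
print(x.shape)
```

Following the shapes: 28 → 24 after the first 5×5 convolution, 12 after pooling, 8 after the second convolution, and finally a 6-deep stack of 4×4 maps – exactly the progressive reduction described above.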

Together, these layers extract the useful features from the images, introduce non-linearity in our network and reduce feature dimension while aiming to make the features somewhat equivariant to scale and translation [18].

The output of the 2nd Pooling Layer acts as an input to the Fully Connected Layer, which we will discuss in the next section.

#### Fully Connected Layer

The Fully Connected layer is a traditional Multi Layer Perceptron that uses a softmax activation function in the output layer (other classifiers like SVM can also be used, but we will stick to softmax in this post). The term “Fully Connected” implies that every neuron in the previous layer is connected to every neuron in the next layer. I recommend reading this post if you are unfamiliar with Multi Layer Perceptrons.

The outputs from the convolutional and pooling layers represent high-level features of the input image. The purpose of the Fully Connected layer is to use these features to classify the input image into various classes based on the training dataset. For example, the image classification task we set out to perform has four possible outputs, as shown in **Figure 14** below (note that Figure 14 does not show the connections between the nodes in the fully connected layer).

###### Figure 14: Fully Connected Layer – each node is connected to every other node in the adjacent layer

Apart from classification, adding a fully-connected layer is also a (usually) cheap way of learning non-linear combinations of these features. Most ofÂ the features from convolutional and pooling layers may be good for the classification task, but combinations of those features might be even better [11].

The sum of output probabilities from the Fully Connected Layer is 1. This is ensured by using Softmax as the activation function in the output layer of the Fully Connected Layer. The Softmax function takes a vector of arbitrary real-valued scores and squashes it to a vector of values between zero and one that sum to one.
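The squashing behavior is easy to sketch; the raw scores below are made up, standing in for the four class outputs of our example network:

```python
import numpy as np

def softmax(scores):
    """Squash arbitrary real-valued scores into probabilities that sum to 1."""
    exps = np.exp(scores - scores.max())  # subtract the max for numerical stability
    return exps / exps.sum()

# Illustrative raw scores for the four classes (dog, cat, boat, bird)
scores = np.array([2.0, 1.0, 0.1, -1.0])
probs = softmax(scores)

print(probs)        # every value lies strictly between 0 and 1
print(probs.sum())  # sums to 1 (up to floating-point rounding)
```

Notice that the ordering is preserved: the class with the largest raw score also gets the largest probability, so taking the argmax gives the predicted class.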