In-Depth Review of Convolutional Neural Networks (CNNs)

Umair Ayub
8 min readJan 29, 2022

In the coming sections, I shall explain the origins of CNNs, their structure, and their applications.

Deep learning has given us systems such as AlphaGo that beat the best humans at board games, yet for a long time computers could not work effectively with unstructured data like images and text. It is only relatively recently that computers have become capable of complex perceptual tasks like image and text classification.

To us humans, the contents of a picture seem obvious when we look at it. Suppose there is an image of a dog licking its owner: we cannot miss the fact that there is a dog in the frame, because noticing such things is second nature to us. Until recently, computers could not perceive images this way.

Then convolutional neural networks (CNNs) emerged. The study of the brain's visual cortex led us to the discovery of CNNs, and since then they have been used extensively for image recognition. Nowadays CNNs power all sorts of applications, such as automated driving and image and video classification systems. They are not limited to pictorial data either: they have also made huge strides in processing text through natural language processing (NLP).

History of CNNs

Since the development of CNNs is tied to our understanding of the brain's visual cortex, we first need to understand how the brain perceives images. In 1959, David Hubel and Torsten Wiesel discovered how simple and complex cells work in the visual cortex. Based on their study, two different types of cells are involved in recognizing visual patterns. A simple cell can only recognize edges and bars of a particular orientation at a particular part of the visual field. In contrast, complex cells can do everything simple cells do, but are also capable of recognizing those edges and bars at any location in the image.

Suppose we have an image in front of us. While a simple cell will only recognize a horizontal bar in, say, the upper right-hand corner, complex cells will recognize the horizontal bar at any location in the image. A complex cell can be regarded as a combination of several simple cells. The visual system is built on this interplay between simple and complex cells.

Next is the concept of the grandmother, or gnostic, cell. It refers to a hypothetical neuron that activates only at the sight of a specific complex image, such as one's grandmother. The idea is that several complex neurons are connected together and fire when they detect that specific complex image, the grandmother.

The Neocognitron

In 1980, Dr. Kunihiko Fukushima developed an artificial neural network, the Neocognitron, inspired by the work of David Hubel and Torsten Wiesel. The network mimicked the working of the simple and complex cells in the visual cortex. It did not model biological neurons directly but instead captured the algorithmic structure of the simple and complex cells. Fukushima's main idea was that the Neocognitron should be able to identify complex images or patterns: the whole pattern would be identified by the complex cells based on the identification of each individual part by the simple cells.

Suppose we want to identify the image of a cat. The complex cells should detect the presence of the cat as a whole, while the simple cells detect individual parts like the paws, tail, and whiskers.

The LeNet

The work done by Fukushima paved the way for many researchers to explore this field. The first major breakthrough came in the 1990s, when Yann LeCun implemented a modern application based on convolutional neural networks. His 1998 paper "Gradient-Based Learning Applied to Document Recognition" became one of the most influential papers in the field, and it is still used as a guide by anyone starting out with CNNs.

In the paper, he described how he trained CNNs on the MNIST dataset of handwritten digits. He built upon Fukushima's Neocognitron, assembling many complex features out of simpler ones using artificial cells.

This was all about the brief history of CNNs and how they came into being. In the next sections, I shall explain some of the important terms associated with CNNs.

Convolution

Essentially, the main component of a CNN is the convolutional layer. In mathematical terms, convolution is an operation that multiplies a set of weights with the input. The first layer of a CNN is a convolutional layer. Unlike in fully connected ANNs, where every neuron is connected to every input, each neuron in the first convolutional layer is connected only to the pixels in its receptive field. Similarly, each neuron in the second convolutional layer is connected only to the corresponding region of the first layer. Hence we assemble low-level features in the first layer and then combine them into high-level features. Real-world images possess a similar hierarchical structure, which is one of the main reasons CNNs work so well on real-world image data.

The convolutional layer mainly deals with two-dimensional data, so the multiplication between weights and input is usually done between a 2D array of weights and a 2D array of inputs. This 2D array of weights is known as a filter or kernel. The filter must be smaller than the input. At each position, the filter is applied to a patch of the input of the same size as the filter, and the multiplication between them is a dot product: element-wise multiplication whose results are summed to give a single scalar output.

We keep the filter smaller than the input because it lets the same set of weights be multiplied with the input at many different positions. This technique becomes much more powerful when it is applied systematically to every overlapping patch of the input. Since the multiplication between the weights and the input occurs many times, the output is a 2D array containing the filtered values of the input, known as a feature map. Each value in the feature map is then passed through a non-linear activation function like ReLU, similar to what happens in a fully connected layer. The distance between two consecutive receptive fields is known as the stride.
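The sliding dot product described above can be sketched in a few lines of pure Python. The vertical-edge kernel below is just an illustrative choice, not anything from the text:

```python
# Minimal sketch of a 2D convolution (no padding, stride 1) in pure Python.

def conv2d(image, kernel):
    """Slide `kernel` over `image`, taking the dot product at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Element-wise multiply the receptive field by the kernel, then sum.
            s = sum(image[i + di][j + dj] * kernel[di][dj]
                    for di in range(kh) for dj in range(kw))
            row.append(s)
        feature_map.append(row)
    return feature_map

def relu(feature_map):
    """Apply the ReLU non-linearity element-wise."""
    return [[max(0.0, v) for v in row] for row in feature_map]

# A simple vertical-edge detector applied to a 4x4 image whose right half is bright.
image = [[0, 0, 1, 1]] * 4
kernel = [[-1, 1],
          [-1, 1]]
fmap = relu(conv2d(image, kernel))
print(fmap)  # strongest response in the middle column, where the edge sits
```

Note how the same four weights are reused at every position; that weight sharing is what makes the convolutional layer so much cheaper than a fully connected one.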

Fig: Convolution process

Pooling Layer

Now that we have seen the working of the convolutional layer, we move on to the second kind of layer in the CNN architecture: the pooling layer. The main goal of pooling layers is to reduce the risk of overfitting. This is achieved by reducing the size of the convolved feature map, which reduces the computational load, saves memory, and lowers the overall number of parameters. Pooling layers therefore downsample the feature maps produced by the convolutional layer.

Downsampling is the process of producing a lower-resolution version of the input that still contains the important features but discards the finer details. Pooling layers summarize the features in the feature map by combining them in patches.

This layer performs the pooling operation, whose window is always smaller than the feature map. Typically we use a pooling window with a stride of 2 pixels, which halves each dimension of the feature map. Suppose we have a feature map of size 8x8 (64 values); after pooling it reduces to 4x4 (16 values). The two pooling methods we use most frequently are average pooling and max pooling.

1. Average pooling: In this method, we compute the average value of each patch of the feature map.

2. Max pooling: In this method, we compute the maximum value of each patch of the feature map.
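Both methods can be captured by one small helper. This is a sketch with a 2x2 window and stride 2, matching the halving described above; the sample feature map is made up for illustration:

```python
# Sketch of 2x2 max and average pooling with stride 2, in pure Python.

def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Downsample `feature_map` by summarizing each size x size patch."""
    out = []
    for i in range(0, len(feature_map) - size + 1, stride):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, stride):
            patch = [feature_map[i + di][j + dj]
                     for di in range(size) for dj in range(size)]
            # Keep the strongest activation, or the average, per patch.
            row.append(max(patch) if mode == "max" else sum(patch) / len(patch))
        out.append(row)
    return out

fmap = [[1, 3, 2, 0],
        [4, 6, 1, 1],
        [0, 2, 8, 5],
        [1, 1, 3, 4]]
print(pool2d(fmap, mode="max"))  # [[6, 2], [2, 8]]: each 2x2 patch reduced to its max
```

The 4x4 map shrinks to 2x2, so a small shift in where a feature appears no longer changes the output much; that translation tolerance is a side benefit of pooling.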

Architecture of CNN

Now that we have seen the different layers and their functions, let’s take a broader look at the overall architecture of convolutional neural networks. Essentially a CNN is made up of two parts:

1. The input, convolutional layer, and the pooling layer make up the Feature Extraction part of the CNN.

2. A fully connected layer that works on the output generated by the convolution process to predict the classes of different images.
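As a quick sanity check on the feature-extraction part, we can trace the feature-map size through a hypothetical stack of layers. The layer sizes below are illustrative, not from any specific published network:

```python
# Tracing feature-map side lengths through a small, made-up CNN.

def conv_out(size, kernel, stride=1, padding=0):
    """Output side length of a square convolution (standard size formula)."""
    return (size - kernel + 2 * padding) // stride + 1

def pool_out(size, window=2, stride=2):
    """Output side length after pooling."""
    return (size - window) // stride + 1

side = 28                        # e.g. an MNIST-sized 28x28 input
side = conv_out(side, kernel=5)  # 28 -> 24
side = pool_out(side)            # 24 -> 12
side = conv_out(side, kernel=5)  # 12 -> 8
side = pool_out(side)            # 8  -> 4
print(side)  # 4: the final 4x4 maps are flattened and fed to the fully connected layer
```

Running this kind of arithmetic before training is a cheap way to confirm the layers fit together.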

Fig: Architecture of CNN

We have already explained the working of both convolutional and pooling layers. Now let us take a look at the Classification part of the CNN architecture.

Fully Connected Layer

The fully connected layer is typically the last layer of a convolutional neural network. It receives the output of the convolutional and pooling stages as its input. It then transforms that input by applying one or more linear transformations and finally an activation function to predict the class of the image.

The output of the fully connected layer is a vector of size N, where N is the number of classes in our problem. Each element of this output vector is the probability that the image belongs to a particular class.

The probability is calculated by taking the product of each input with its corresponding weight, summing these products, and applying an activation function on top. In this layer every input is connected to every output, hence the term "fully connected". The weights multiplied with the inputs are learned by the CNN during the training phase using backpropagation. This layer can use different activation functions, such as ReLU and tanh, with softmax typically applied at the output to produce class probabilities.
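The weighted sum plus softmax described above fits in a few lines. The weights and inputs below are made up for illustration; in a real network they would come from training:

```python
# Sketch of a fully connected layer followed by softmax, in pure Python.
import math

def fully_connected(x, weights, biases):
    """Each output is the weighted sum of ALL inputs plus a bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]

def softmax(z):
    """Turn raw scores into class probabilities that sum to 1."""
    exps = [math.exp(v - max(z)) for v in z]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

x = [0.5, -1.0, 2.0]          # flattened features from the conv/pool stages
weights = [[0.2, 0.1, 0.4],   # one row of weights per output class
           [-0.3, 0.8, 0.1]]
biases = [0.0, 0.1]
probs = softmax(fully_connected(x, weights, biases))
print(probs)  # two probabilities summing to 1
```

With N classes there would be N weight rows, giving the N-element probability vector described above.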

Dropout

By connecting all the features of the dataset to our fully connected layer, we are increasing the chances of overfitting. This can lead to the degradation of our model’s performance on test data.

To prevent overfitting, we use something known as a dropout layer. With dropout, a fraction of the neurons (for example, 30%) is randomly dropped from the network at each training step. This happens only during the training phase and effectively trains a smaller network each time.


With that, you should now understand what CNNs are and how they work. I hope you liked it!

This article is taken from my book Machine Learning - A Comprehensive Approach.

To learn more about CNN’s and Machine Learning in general, check out my book now available on Amazon: https://amzn.to/3KVCD6g

Feel free to reach out to me on LinkedIn.

Thanks for reading and I’ll see you guys in the next one.


Umair Ayub

Author of “Machine Learning - A Comprehensive Approach”. Interested in Data Science, Machine Learning, and Blockchains.