Deep Learning & CNNs Overview for AI&DS
Module 2: Basics of Supervised Deep Learning, Introduction, Convolution Neural Network, Evolution
of Convolution Neural Network, Architecture of CNN, Continuation of Architecture of CNN,
Convolution Operation.
Since the 1950s, the early days of AI, researchers have struggled to build systems that can understand
visual data. In the following years, this field came to be known as Computer Vision. In 2012, computer
vision took a major leap when a group of researchers from the University of Toronto developed an
AI model that surpassed the best image recognition algorithms by a large margin.
The AI system, which became known as AlexNet (named after its main creator, Alex Krizhevsky), won
the 2012 ImageNet computer vision contest with roughly 85 percent (top-5) accuracy. The runner-up
scored about 74 percent on the test.
At the heart of AlexNet was a Convolutional Neural Network (CNN), a special type of neural network
that roughly imitates human vision.
In 2012, Alex Krizhevsky realized that it was time to bring back the branch of deep learning that uses
multi-layered neural networks. The availability of large datasets, specifically the ImageNet
dataset with millions of labeled images, and an abundance of computing resources enabled
researchers to revive CNNs.
A. BACKGROUND OF CNNs
CNNs were first developed and used around the 1980s. The most that a CNN could do at that time was
recognize handwritten digits. It was mostly used in the postal sector to read zip codes, pin codes, etc.
The important thing to remember about any deep learning model is that it requires a large amount of
data to train and also requires a lot of computing resources. This was a major drawback for CNNs in
that period, and hence CNNs were limited to the postal sector and failed to enter the wider world of
machine learning.
In the past few decades, Deep Learning has proved to be a very powerful tool because of its ability to
handle large amounts of data. Networks with hidden layers have surpassed traditional techniques,
especially in pattern recognition. One of the most popular deep neural networks is the Convolutional
Neural Network (also known as CNN or ConvNet), especially when it comes to
Computer Vision applications.
Over the years CNNs have become a very important part of many Computer Vision applications. So let’s
take a look at the workings of CNNs in deep learning.
What Is a CNN?
How does it work?
What Is a Pooling Layer?
Limitations of CNNs
B. WHAT IS A CNN
In deep learning, a Convolutional Neural Network (CNN/ConvNet) is a class of deep neural networks,
most commonly applied to analyze visual imagery. When we think of a neural network we usually think
of matrix multiplications, but a ConvNet relies on a special technique called
convolution. In mathematics, convolution is an operation on two functions that
produces a third function expressing how the shape of one is modified by the other.
The bottom line is that the role of the ConvNet is to reduce images into a form that is easier to
process, without losing features that are crucial for a good prediction.
C. HOW DOES CNN WORK
Before we go to the working of CNNs, let’s cover the basics, such as what an image is and how it is
represented. An RGB image is nothing but a matrix of pixel values having three planes, whereas a
grayscale image is the same but has a single plane.
For simplicity, consider grayscale images to understand how CNNs work.
In a convolution, we take a filter/kernel (for example a 3×3 matrix) and apply it to the
input image to get the convolved feature. This convolved feature is passed on to the next layer.
In the case of an RGB color image, the kernel has one slice per channel, and the per-channel products
are summed to give a single convolved feature.
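The RGB case can be sketched in plain Python as below. This is only an illustration under stated assumptions: the function name `conv2d_rgb`, the stride of 1, the absence of padding, and the toy pixel and kernel values are all made up for the example.

```python
def conv2d_rgb(image, kernel):
    """Convolve an H x W x C image with a k x k x C kernel (stride 1, no padding).

    The products over all channels are summed, so the result is a single
    (H-k+1) x (W-k+1) convolved feature.
    """
    h, w, c = len(image), len(image[0]), len(image[0][0])
    k = len(kernel)
    out = []
    for i in range(h - k + 1):
        row = []
        for j in range(w - k + 1):
            total = 0
            for di in range(k):
                for dj in range(k):
                    for ch in range(c):
                        total += image[i + di][j + dj][ch] * kernel[di][dj][ch]
            row.append(total)
        out.append(row)
    return out

# Hypothetical 3x3 RGB image in which every pixel is (R, G, B) = (1, 2, 3).
image = [[[1, 2, 3] for _ in range(3)] for _ in range(3)]
# Hypothetical 2x2 kernel that weights R by +1, G by 0 and B by -1.
kernel = [[[1, 0, -1] for _ in range(2)] for _ in range(2)]

feature = conv2d_rgb(image, kernel)
print(feature)  # [[-8, -8], [-8, -8]]
```

Each output value is 4 × (1·1 + 2·0 + 3·(−1)) = −8, because the 2×2 window covers four identical pixels.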
Convolutional neural networks are composed of multiple layers of artificial neurons. Artificial neurons,
a rough imitation of their biological counterparts, are mathematical functions that calculate the
weighted sum of multiple inputs and output an activation value. When you input an image into a
ConvNet, each layer generates several activation maps that are passed on to the next layer.
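A single artificial neuron can be sketched as follows. The bias term and the choice of ReLU as the activation are assumptions added for illustration, as are the input and weight values.

```python
def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through a ReLU activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return max(0.0, z)

# Hypothetical inputs and weights.
activation = neuron([1.0, 2.0, 3.0], [0.5, 0.25, -0.25], bias=0.5)
print(activation)  # 0.75
```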
The first layer usually extracts basic features such as horizontal or diagonal edges. This output is passed
on to the next layer, which detects more complex features such as corners or combinations of edges. As
we move deeper into the network, it can identify even more complex features such as objects, faces,
etc.
Based on the activation map of the final convolution layer, the classification layer outputs a set of
confidence scores (values between 0 and 1) that specify how likely the image is to belong to a “class.”
For instance, if you have a ConvNet that detects cats, dogs, and horses, the output of the final layer is
the probability that the input image contains each of those animals.
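One common way to turn raw class scores into confidence values between 0 and 1 is the softmax function. The sketch below assumes softmax is used in the classification layer; the raw scores are hypothetical.

```python
import math

def softmax(scores):
    """Map raw scores to positive values that sum to 1."""
    m = max(scores)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

classes = ["cat", "dog", "horse"]
raw_scores = [2.0, 1.0, 0.1]  # hypothetical outputs of the last layer
probs = softmax(raw_scores)
for name, p in zip(classes, probs):
    print(f"{name}: {p:.3f}")
```

The class with the largest raw score receives the highest confidence, and the three values add up to 1.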
D. WHAT IS A POOLING LAYER
Similar to the Convolutional Layer, the Pooling layer is responsible for reducing the spatial size of the
Convolved Features. This decreases the computational power required to process the data by
reducing the dimensions. There are two types of pooling: average pooling and max pooling.
Max Pooling returns the maximum value from the portion of the image covered by the kernel.
Max Pooling also acts as a Noise Suppressant: it discards the noisy activations
altogether and so performs de-noising along with dimensionality reduction.
On the other hand, Average Pooling returns the average of all the values from the portion of the image
covered by the Kernel. Average Pooling performs dimensionality reduction and suppresses noise only
by averaging it. Hence, Max Pooling often performs better than Average
Pooling, although the better choice depends on the task.
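The two pooling types can be compared on a small example. This is a minimal sketch; the function name `pool2d`, the 2×2 window, the stride of 2, and the feature map values are assumptions chosen for illustration.

```python
def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Apply max or average pooling over size x size windows."""
    h, w = len(feature_map), len(feature_map[0])
    out = []
    for i in range(0, h - size + 1, stride):
        row = []
        for j in range(0, w - size + 1, stride):
            window = [feature_map[i + di][j + dj]
                      for di in range(size) for dj in range(size)]
            if mode == "max":
                row.append(max(window))
            elif mode == "average":
                row.append(sum(window) / len(window))
            else:
                raise ValueError("mode must be 'max' or 'average'")
        out.append(row)
    return out

fmap = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 0],
    [1, 8, 3, 4],
]
print(pool2d(fmap, mode="max"))      # [[6, 4], [8, 9]]
print(pool2d(fmap, mode="average"))  # [[3.75, 2.25], [4.5, 4.0]]
```

Both variants shrink the 4×4 map to 2×2; max pooling keeps the strongest activation in each window, while average pooling blends all four values.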
Deep learning is a form of machine learning that uses a neural network with at least three
layers. Networks with multiple layers are generally more accurate than single-layer networks. Deep
learning applications often use CNNs or RNNs (recurrent neural networks).
The CNN architecture is especially useful for image recognition and image classification, as well as
other computer vision tasks, because it can process large amounts of data and produce highly
accurate predictions. CNNs can learn the features of an object through multiple iterations, eliminating
the need for manual feature engineering tasks like feature extraction.
It is possible to retrain a CNN for a new recognition task or build a new model based on an existing
network with trained weights. This is known as transfer learning. This enables ML model developers to
apply CNNs to different use cases without starting from scratch.
A Convolutional Neural Network (CNN) is a type of deep learning algorithm specifically designed for
image processing and recognition tasks. Compared to alternative classification models, CNNs require
less preprocessing as they can automatically learn hierarchical feature representations from raw input
images. They excel at assigning importance to various objects and features within the images through
convolutional layers, which apply filters to detect local patterns.
The connectivity pattern in CNNs is inspired by the visual cortex in the human brain, where neurons
respond to specific regions or receptive fields in the visual space. This architecture enables CNNs to
effectively capture spatial relationships and patterns in images. By stacking multiple convolutional and
pooling layers, CNNs can learn increasingly complex features, leading to high accuracy in tasks like
image classification, object detection, and segmentation.
Convolutional neural networks are often favored over other artificial neural networks,
given their ability to process visual, textual, and audio data. The CNN architecture comprises three
main types of layers: convolutional layers, pooling layers, and fully connected (FC) layers.
There can be multiple convolutional and pooling layers. The more layers in the network, the greater
the complexity and (theoretically) the accuracy of the machine learning model. Each additional layer
that processes the input data can increase the model’s ability to recognize objects and patterns in the
data.
Convolutional layers are the key building block of the network, where most of the computations are
carried out. A convolutional layer works by applying a filter to the input data to identify features.
This filter, known as a feature detector, checks the image input’s receptive fields for a given
feature. This operation is referred to as convolution.
The filter is a two-dimensional array of weights that represents part of a 2-dimensional image. A filter
is typically a 3×3 matrix, although other sizes are possible. The filter is applied to a region within
the input image and calculates a dot product between its weights and the pixels, which is written to
an output array. The filter then shifts and repeats the process until it has covered the whole image.
The final output of all the filter processes is called the feature map.
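The sliding dot product described above can be sketched in plain Python. The sketch assumes a stride of 1 and no padding, and the image and the vertical-edge kernel are illustrative values only. As in most deep learning libraries, the kernel is not flipped, so strictly speaking this is a cross-correlation.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image and take the dot product at each position."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    feature_map = []
    for i in range(h - kh + 1):
        row = []
        for j in range(w - kw + 1):
            row.append(sum(image[i + di][j + dj] * kernel[di][dj]
                           for di in range(kh) for dj in range(kw)))
        feature_map.append(row)
    return feature_map

# 4x5 grayscale image: dark (0) on the left, bright (1) on the right.
image = [
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
]
# A simple 3x3 vertical-edge detector (hypothetical weights).
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]
feature_map = conv2d(image, kernel)
print(feature_map)  # [[0, 3, 3], [0, 3, 3]]
```

The feature map is zero over the flat dark region and responds only where the window overlaps the dark-to-bright edge.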
The CNN typically applies the ReLU (Rectified Linear Unit) transformation to each feature map after
every convolution to introduce nonlinearity to the ML model. A convolutional layer is typically followed
by a pooling layer. Together, the convolutional and pooling layers make up a convolutional block.
Additional convolutional blocks will follow the first block, creating a hierarchical structure with later
layers learning from the earlier layers. For example, a CNN model might be trained to detect cars in
images. Cars can be viewed as the sum of their parts, including the wheels, boot, and windscreen. Each
feature of a car equates to a low-level pattern identified by the neural network, which then combines
these parts to create a high-level pattern.
A pooling or downsampling layer reduces the dimensionality of the input. Like a convolutional
operation, a pooling operation uses a filter to sweep the whole input image, but the filter has no
weights. The filter instead uses an aggregation function to populate the output array based on the
receptive field’s values.
Pooling layers are important despite causing some information to be lost, because they help reduce
the complexity and increase the efficiency of the CNN. They also reduce the risk of overfitting.
A. A convolution tool that separates and identifies the various features of the image for analysis
in a process called Feature Extraction.
B. The network of feature extraction consists of many pairs of convolutional and pooling layers.
C. A fully connected layer that utilizes the output from the convolution process and predicts the
class of the image based on the features extracted in previous stages.
D. This CNN model of feature extraction aims to reduce the number of features present in a
dataset. It creates new features that summarize the existing features contained in the
original set of features. There are many CNN layers, as shown in the CNN architecture
diagram.
[Figure: CNN architecture diagram. Preprocessing labels: User Input Image Size (480 * 320), Histogram
Equalization (480 * 320), Gabor Enhancement Image reshape size (350 * 223), Histogram Equalization
(480 * 320). Network labels: Conv 5x5, Max. Pool 2x2, Conv 3x3, Flatten, FC1, FC2, Output 128 x 1.]
A. CONVOLUTION LAYERS - There are three types of layers that make up the CNN: the
convolutional layers, pooling layers, and fully-connected (FC) layers. When these layers are stacked, a
CNN architecture is formed. In addition to these three layers, there are two more important
components, the dropout layer and the activation function, which are defined below.
The output is termed the Feature map, which gives us information about the image such as the
corners and edges. Later, this feature map is fed to other layers to learn several other features of the
input image. The convolution layer in a CNN passes the result to the next layer after applying the
convolution operation to the input. Convolutional layers are beneficial because they keep the spatial
relationship between the pixels intact.
In Max Pooling, the largest element is taken from each section of the feature map. Average Pooling
calculates the average of the elements in a predefined-size image section. Sum Pooling computes the
total sum of the elements in the predefined section. The Pooling Layer usually serves as a bridge
between the Convolutional Layer and the FC Layer. Pooling generalises the features extracted by the
convolution layer and helps the network recognise the features independently. It also
reduces the computations in the network.
2.3 FULLY CONNECTED LAYER
The Fully Connected (FC) layer consists of the weights and biases along with the neurons and is used to
connect the neurons between two different layers. These layers are usually placed before the output
layer and form the last few layers of a CNN architecture.
Here, the feature maps from the previous layers are flattened and fed to the FC layer. The flattened
vector then passes through a few more FC layers, where the mathematical operations usually take
place. In this stage, the classification process begins. Two fully connected layers are often used
because they tend to perform better than a single fully connected layer. These layers in a
CNN reduce the need for human supervision.
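The flatten-then-FC step can be sketched as below. The helper names `flatten` and `dense`, and all feature map, weight, and bias values, are hypothetical and chosen only so the arithmetic is easy to follow.

```python
def flatten(feature_maps):
    """Flatten a list of 2D feature maps into one 1D vector."""
    return [v for fmap in feature_maps for row in fmap for v in row]

def dense(x, weights, biases):
    """Fully connected layer: one weighted sum plus bias per output neuron."""
    return [sum(xi * wi for xi, wi in zip(x, w_row)) + b
            for w_row, b in zip(weights, biases)]

feature_maps = [[[1, 2], [3, 4]], [[0, 1], [1, 0]]]  # two 2x2 feature maps
x = flatten(feature_maps)                            # vector of length 8
weights = [
    [1, 0, 0, 0, 0, 0, 0, 1],  # weights of output neuron 0
    [0, 1, 1, 0, 0, 0, 0, 0],  # weights of output neuron 1
]
biases = [0, -1]
scores = dense(x, weights, biases)
print(x)       # [1, 2, 3, 4, 0, 1, 1, 0]
print(scores)  # [1, 4]
```

In a real network an activation function would follow each FC layer, and the weights would be learned rather than set by hand.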
2.4 DROPOUT
Usually, when all the features are connected to the FC layer, it can cause overfitting on the training
dataset. Overfitting occurs when a model fits the training data so closely that its
performance suffers when it is used on new data.
To overcome this problem, a dropout layer is used, in which a few neurons are dropped from the
neural network during training, effectively reducing the size of the model. With a dropout rate of
0.3, 30% of the nodes are dropped out randomly from the neural network.
Dropout improves the performance of a machine learning model because it prevents overfitting
by making the network simpler during training.
2.5 ACTIVATION FUNCTIONS
Finally, one of the most important components of the CNN model is the activation function. Activation
functions are used to learn and approximate any kind of continuous and complex relationship between
variables of the network. In simple words, an activation function decides which information should be
passed in the forward direction through the network and which should not.
It adds non-linearity to the network. There are several commonly used activation functions such as
the ReLU, Softmax, tanh and Sigmoid functions. Each of these functions has a specific usage. For a
binary classification CNN model, sigmoid and softmax functions are preferred, and for multi-class
classification, softmax is generally used. In simple terms, activation functions in a CNN model
determine whether a neuron should be activated or not. They decide, using mathematical operations,
whether the input to the network is important for the prediction.
The binary step function depends on a threshold value that decides whether a neuron should be activated
or not. The input fed to the activation function is compared to a certain threshold; if the input is
greater than it, then the neuron is activated, else it is deactivated, meaning that its output is not
passed on to the next hidden layer.
Mathematically, it can be represented as: f(x) = 1 if x > threshold, and f(x) = 0 otherwise.
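A minimal sketch of the binary step function follows. The default threshold of 0 is an assumption made for the example.

```python
def binary_step(x, threshold=0.0):
    """Return 1 (activated) if the input exceeds the threshold, else 0."""
    return 1 if x > threshold else 0

outputs = [binary_step(v) for v in (-2.0, 0.0, 0.5, 3.0)]
print(outputs)  # [0, 0, 1, 1]
```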
Non-linear activation functions solve the following limitations of linear activation functions:
They allow backpropagation because now the derivative function would be related to the
input, and it’s possible to go back and understand which weights in the input neurons can
provide a better prediction.
They allow the stacking of multiple layers of neurons as the output would now be a non-linear
combination of input passed through multiple layers.
Any output can be represented as a functional computation in a neural network.
Below are different non-linear neural network activation functions and their characteristics.
The sigmoid (logistic) function, f(x) = 1 / (1 + e^(-x)), takes any real value as input and outputs
values in the range of 0 to 1. The larger the input (more positive), the closer the output value will
be to 1.0, whereas the smaller the input (more negative), the closer the output will be to 0.0.
It is commonly used for models where we have to predict the probability as an output. Since the
probability of anything exists only in the range of 0 to 1, sigmoid is the right choice
because of its range.
The function is differentiable and provides a smooth gradient, i.e., it prevents jumps in
output values. This is represented by the S-shape of the sigmoid activation function.
The gradient values are only significant in the range -3 to 3, and the graph gets
much flatter in other regions. This implies that for values greater than 3 or less than -3, the function
will have very small gradients. As the gradient value approaches zero, the network ceases to learn and
suffers from the vanishing gradient problem.
The output of the logistic function is not symmetric around zero, so the outputs of all the
neurons will have the same sign. This makes the training of the neural network more difficult and
unstable.
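The shrinking gradient can be checked numerically. The sketch below uses the standard logistic function and its derivative s(x)·(1 − s(x)); the sample inputs are arbitrary.

```python
import math

def sigmoid(x):
    """Logistic function: maps any real input into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid, s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

for x in (-6, -3, 0, 3, 6):
    print(f"x={x:+d}  sigmoid={sigmoid(x):.4f}  gradient={sigmoid_grad(x):.4f}")
```

The gradient peaks at 0.25 for x = 0 and is already close to zero at x = ±6, which is the flattening behind the vanishing gradient problem. Every output is positive, illustrating that sigmoid is not zero-centered.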
The output of the tanh activation function is zero-centered; hence we can easily map the
output values as strongly negative, neutral, or strongly positive.
It is usually used in hidden layers of a neural network because its values lie between -1 and 1;
therefore, the mean for the hidden layer comes out to be 0 or very close to it. This helps in centering
the data and makes learning for the next layer much easier.
It also faces the problem of vanishing gradients, similar to the sigmoid activation function. In
addition, the gradient of the tanh function is much steeper than that of the sigmoid function.
Note: Although both sigmoid and tanh face the vanishing gradient issue, tanh is zero-centered, and the
gradients are not restricted to move in a certain direction. Therefore, in practice, the tanh
non-linearity is usually preferred to the sigmoid non-linearity.
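The zero-centering and the steeper gradient can both be illustrated with a few sample points. The inputs below are arbitrary, and the derivative formulas are the standard ones (1 − tanh²(x) for tanh, s·(1 − s) for sigmoid).

```python
import math

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

def sigmoid_grad(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x))."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]        # inputs symmetric around zero
tanh_outputs = [math.tanh(x) for x in xs]
mean_output = sum(tanh_outputs) / len(tanh_outputs)

print(mean_output)                        # approximately 0: zero-centered
print(tanh_grad(0.0), sigmoid_grad(0.0))  # 1.0 0.25
```

At x = 0 the tanh gradient is four times the sigmoid gradient, but both shrink toward zero for large positive or negative inputs.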
The main catch here is that the ReLU function does not activate all the neurons at the same time. A
neuron is deactivated only if the output of the linear transformation is less than 0.
Mathematically, it can be represented as: f(x) = max(0, x).
The negative side of the graph makes the gradient value zero. For this reason, during the
backpropagation process, the weights and biases for some neurons are not updated. This can create dead
neurons which never get activated.
All the negative input values become zero immediately, which decreases the model’s ability to
fit or train from the data properly.
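ReLU and its gradient can be sketched in a few lines. Taking the gradient at exactly x = 0 to be 0 is a convention assumed here; the inputs are arbitrary.

```python
def relu(x):
    """ReLU: pass positive inputs through, clamp the rest to zero."""
    return max(0.0, x)

def relu_grad(x):
    """Gradient of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

inputs = [-2.0, -0.5, 0.0, 1.5]
print([relu(x) for x in inputs])       # [0.0, 0.0, 0.0, 1.5]
print([relu_grad(x) for x in inputs])  # [0.0, 0.0, 0.0, 1.0]
```

A neuron whose input stays negative receives a zero gradient on every update, which is how dead neurons arise.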
Note: For building the most reliable ML models, split your data into train, validation, and test sets.
C4) Leaky ReLU Function
Leaky ReLU is an improved version of the ReLU function that solves the Dying ReLU problem, as it has a
small positive slope in the negative area.
The advantages of Leaky ReLU are the same as those of ReLU, in addition to the fact that it does enable
backpropagation, even for negative input values. With this minor modification for negative input
values, the gradient of the left side of the graph comes out to be a non-zero value. Hence, we would
no longer encounter dead neurons in that region.
In the Parametric ReLU function, f(x) = max(ax, x), where "a" is the slope parameter for negative
values. The parameterized ReLU function is used when
the leaky ReLU function still fails at solving the problem of dead neurons, and the relevant information
is not successfully passed to the next layer. This function’s limitation is that it may perform
differently for different problems depending upon the value of the slope parameter a.
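Leaky and parametric ReLU share the same formula and differ only in how the slope is chosen. In the sketch below the default slope of 0.01 is an assumed, commonly used value; in parametric ReLU the slope would be learned during training rather than fixed.

```python
def leaky_relu(x, a=0.01):
    """Identity for positive inputs, slope a for negative inputs."""
    return x if x > 0 else a * x

def leaky_relu_grad(x, a=0.01):
    """Gradient: 1 for positive inputs, a (non-zero) for negative inputs."""
    return 1.0 if x > 0 else a

print(leaky_relu(-2.0))          # -0.02
print(leaky_relu(-2.0, a=0.25))  # -0.5, a larger hypothetical slope
print(leaky_relu_grad(-2.0))     # 0.01, so the neuron can still update
```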
In all cases, pooling helps to make the representation approximately invariant to small
translations of the input. Invariance to translation means that if we translate the input by a small
amount, the values of most of the pooled outputs do not change.
The pooling layers move across the input matrix and compute the
maximum or the average of each window, for max pooling and average pooling respectively.
3.5 Global Pooling Layers
Global Pooling Layers often replace the classifier’s fully connected or Flatten layer. The model instead
ends with a convolutional layer that produces as many feature maps as there are target classes and
performs global average pooling on each of the feature maps to reduce each feature map to a
single value.
Whatever the spatial shape of the input array, a Global Pooling layer
reduces each feature map to one value.
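Global average pooling can be sketched with nested lists standing in for arrays. The function name `global_average_pool` and the feature map values are hypothetical; the two inputs differ in spatial shape to show that each map still collapses to a single value.

```python
def global_average_pool(feature_maps):
    """Reduce each 2D feature map to its mean, giving one value per map."""
    return [sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
            for fmap in feature_maps]

# Two 2x2 feature maps.
small = [[[1, 2], [3, 4]], [[0, 0], [0, 8]]]
# Two 3x3 feature maps.
large = [
    [[1, 1, 1], [1, 1, 1], [1, 1, 1]],
    [[0, 9, 0], [0, 0, 0], [0, 9, 0]],
]
print(global_average_pool(small))  # [2.5, 2.0]
print(global_average_pool(large))  # [1.0, 2.0]
```

With one feature map per target class, the resulting vector can be passed directly to a softmax classifier.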
3.1 LeNet
LeNet is a pioneering CNN designed for recognizing handwritten characters. It was proposed by Yann
LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner in the late 1990s. LeNet consists of a series of
convolutional and pooling layers, as well as a fully connected layer and softmax classifier. It was among
the first successful applications of deep learning for computer vision. It has been used by banks to
identify numbers written on cheques in grayscale input images.
3.2 VGG
VGG (Visual Geometry Group) is a research group within the Department of Engineering Science at the
University of Oxford. The VGG group is well-known for its work in computer vision, particularly in the
area of convolutional neural networks (CNNs).
One of the most famous contributions from the VGG group is the VGG model, also known as VGGNet.
The VGG model is a deep neural network that achieved state-of-the-art performance on the ImageNet
Large Scale Visual Recognition Challenge in 2014, and has been widely used as a benchmark for image
classification and object detection tasks.
The VGG model is characterized by its use of small convolutional filters (3×3) and deep architecture (up
to 19 layers), which enables it to learn increasingly complex features from input images. The VGG
model also uses max pooling layers to reduce the spatial resolution of the feature maps and increase
the receptive field, which can improve its ability to recognize objects of varying scales and orientations.
The VGG model has inspired many subsequent research efforts in deep learning, including the
development of even deeper neural networks and the use of residual connections to improve gradient
flow and training stability.
3.3 ResNet
ResNet (short for “Residual Neural Network”) is a family of deep convolutional neural networks
designed to overcome the problem of vanishing gradients that is common in very deep networks. The
idea behind ResNet is to use “residual blocks” that allow for the direct propagation of gradients
through the network, enabling the training of very deep networks. A residual block consists of two or
more convolutional layers with activation functions, combined with a shortcut connection
that bypasses the convolutional layers and adds the original input directly to the output of the
convolutional layers. In the original ResNet design, the final activation is applied after this addition.
This allows the network to learn residual functions that represent the difference between the
convolutional layers’ input and output, rather than trying to learn the entire mapping directly. The use
of residual blocks enables the training of very deep networks, with hundreds or even thousands of
layers, significantly alleviating the issue of vanishing gradients.
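The shortcut connection can be illustrated with a toy block. To keep the sketch short it uses small dense (matrix) layers in place of convolutional layers, which is a simplification; the function names and weight values are hypothetical.

```python
def relu_vec(v):
    """Apply ReLU element-wise."""
    return [max(0.0, x) for x in v]

def linear(v, weights):
    """Multiply vector v by a weight matrix given as a list of rows."""
    return [sum(x * w for x, w in zip(v, row)) for row in weights]

def residual_block(x, w1, w2):
    """Compute ReLU(F(x) + x), where F(x) = W2 * ReLU(W1 * x)."""
    fx = linear(relu_vec(linear(x, w1)), w2)
    return relu_vec([f + xi for f, xi in zip(fx, x)])

x = [1.0, 2.0]
zeros = [[0.0, 0.0], [0.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
w2 = [[0.5, 0.0], [0.0, -1.0]]

print(residual_block(x, zeros, zeros))  # [1.0, 2.0]
print(residual_block(x, identity, w2))  # [1.5, 0.0]
```

With all-zero weights the block simply passes its input through, which shows why adding more residual blocks need not make the mapping harder to learn: each block only has to model a correction F(x) on top of the identity.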
3.4 GoogLeNet
GoogLeNet is notable for its use of the Inception module, which consists of multiple parallel
convolutional layers with different filter sizes, together with a parallel pooling branch, whose
outputs are concatenated. This design allows the network to learn features at multiple scales and
resolutions, while keeping the computational cost manageable. The network also includes auxiliary
classifiers at intermediate layers, which encourage the network to learn more discriminative features
and help prevent overfitting. GoogLeNet builds upon the ideas of previous convolutional neural
networks, including LeNet, which was one of the first successful applications of deep learning in
computer vision. However, GoogLeNet is much deeper and more complex than LeNet.
4. DROPOUT:
The term “dropout” refers to dropping out the nodes (input and hidden layer) in a neural network. All
the forward and backward connections with a dropped node are temporarily removed, thus creating a
new, thinned network. Nodes are dropped with a dropout probability of p.
Consider a given input x: {1, 2, 3, 4, 5} to the fully connected layer. We have a dropout layer with
probability p = 0.2 (or keep probability = 0.8). During the forward propagation (training) from the input
x, 20% of the nodes would be dropped on average, i.e. x could become {1, 0, 3, 4, 5} or
{1, 2, 0, 4, 5} and so on. Similarly, dropout is applied to the hidden layers.
For instance, if a hidden layer has 1000 neurons (nodes) and dropout is applied with drop
probability = 0.5, then about 500 neurons would be randomly dropped in every iteration (batch).
Generally, for the input layers, the keep probability, i.e. 1 - drop probability, is closer to 1, with
0.8 being the best as suggested by the authors of dropout. For the hidden layers, the greater the drop
probability, the sparser the model, where 0.5 is the most optimised keep probability, which means
dropping 50% of the nodes.
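The example with x = {1, 2, 3, 4, 5} and p = 0.2 can be sketched as below. Note one assumption that goes beyond the text: the sketch uses "inverted dropout", a common implementation choice in which the surviving values are scaled by 1 / (1 - p) during training so that no rescaling is needed at inference time. The function name and the seed are also made up for the example.

```python
import random

def dropout(x, p_drop, training=True, rng=random):
    """Inverted dropout: zero each value with probability p_drop during training
    and scale the survivors by 1 / (1 - p_drop); do nothing at inference."""
    if not training or p_drop == 0.0:
        return list(x)
    keep = 1.0 - p_drop
    return [xi / keep if keep > rng.random() else 0.0 for xi in x]

rng = random.Random(0)  # fixed seed so repeated runs give the same mask
x = [1, 2, 3, 4, 5]
print(dropout(x, p_drop=0.2, rng=rng))         # some entries zeroed, rest scaled by 1.25
print(dropout(x, p_drop=0.2, training=False))  # [1, 2, 3, 4, 5]
```

Because each node is dropped independently, the number of dropped nodes varies from batch to batch and equals 20% only on average.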
How does Dropout solve the Overfitting problem?
In the overfitting problem, the model learns the statistical noise. To be precise, the main motive of
training is to decrease the loss function, given all the units (neurons). So in overfitting, a unit may
change in a way that fixes up the mistakes of the other units. This leads to complex co-adaptations,
which in turn lead to the overfitting problem because these complex co-adaptations fail to generalise
to the unseen dataset.
Now, if we use dropout, it prevents these units from fixing up the mistakes of other units, thus
preventing co-adaptation, as in every iteration the presence of a unit is highly unreliable. So, by
randomly dropping a few units (nodes), it forces the layers to take more or less responsibility for the
input by taking a probabilistic approach.
This ensures that the model generalises better and hence reduces the overfitting problem.
Figure 2: (a) Hidden layer features without dropout; (b) Hidden layer features with dropout
From Figure 2, we can easily make out that the hidden layer with dropout is learning more
generalised features than the co-adapted features in the layer without dropout. It is quite apparent
that dropout breaks such inter-unit relations and focuses more on generalization.
Comparative Analysis
| Model | Depth | Params | Merits | Demerits | Applications |
|---|---|---|---|---|---|
| LeNet-5 | 7 | ~60 K | Simple, efficient for low-res digits, lightweight | Limited to small inputs; poor scalability | Digit/OCR recognition |
| AlexNet | 8 | ~60 M | ReLU, dropout, GPU training; large-scale ImageNet breakthrough | High memory and compute; moderate overfitting | General image classification |
| ZFNet | 8 | ~62 M | Improved early-layer resolution; activation visualization | Still heavy; few structural changes | Image classification |
| VGG-16/19 | 16/19 | 138 M | Uniform design; excels in transfer learning | Very large; slow inference and training | Feature extraction; transfer learning |
| GoogLeNet | 22 | 6.8 M | Inception modules; parameter-efficient; multi-scale feature capture | Complex design; harder to implement | Classification; detection |
| ResNet-50/152 | 50/152 | 25 M/60 M | Residual connections; easy training of ultra-deep nets | Still sizable; higher inference latency | Classification; segmentation; detection |
| Inception-ResNet | ~164 | ~55 M | Combines inception efficiency with residual ease of training | Very complex; integration challenges | High-accuracy vision tasks |
| Xception | 36 | 23 M | Depthwise separable convs; improved efficiency | Requires careful tuning; marginal gains over ResNet | Mobile/embedded classification |
| SqueezeNet | 18 | 1.2 M | AlexNet accuracy with 50× fewer params; ideal for embedded systems | Lower absolute accuracy; limited feature learning | IoT devices; mobile inference |
| ShuffleNet | 50 | 1.0 M | Group conv + channel shuffle; very low compute cost | Slight accuracy drop; complex grouping | Smartphones; edge-AI; on-device vision |