
Deep Learning & CNNs Overview for AI&DS

This document provides an overview of Convolutional Neural Networks (CNNs) and their applications in computer vision, detailing their architecture, functioning, and benefits in deep learning. It discusses the evolution of CNNs, their components such as convolutional layers, pooling layers, and fully connected layers, as well as techniques like dropout and activation functions to enhance model performance. The document emphasizes the significance of CNNs in image recognition and classification tasks, highlighting their ability to learn features automatically from raw data.

Uploaded by

sunilsandy87

DEPARTMENT OF ARTIFICIAL INTELLIGENCE & DATA SCIENCE

DEEP LEARNING & RE-INFORCEMENT LEARNING – VII SEM AI&DS (BAI701)


Academic Year: 2025-‘26, Odd Semester (Aug-Dec)

MODULE -2: NOTES


Name of the Faculty : Dr. Pradeep N.R
Designation : Associate Professor
Subject name : Deep Learning & Reinforcement Learning
Subject code : BAI701
Department : AI&DS
Semester : V
CIE Marks : 50
SEE Marks : 50
Teaching Hrs/Week (L:T:P:S) : ([Link])
Total Marks : 100
Total Hours of Pedagogy : 40
Credits : 04

Module 2: Basics of Supervised Deep Learning, Introduction, Convolution Neural Network, Evolution
of Convolution Neural Network, Architecture of CNN, Continuation of Architecture of CNN,
Convolution Operation.

1.1 INTRODUCTION TO CNNS AND THEIR APPLICATIONS IN COMPUTER VISION:


Deep Learning has proved to be a very powerful tool because of its ability to handle large amounts of
data. Networks with many hidden layers have overtaken traditional techniques, especially in pattern
recognition. One of the most popular deep neural networks is the Convolutional Neural Network (also
known as CNN or ConvNet), especially when it comes to Computer Vision applications.

Since the 1950s, the early days of AI, researchers have struggled to build a system that can understand
visual data. Over the following years, this field came to be known as Computer Vision. In 2012, computer
vision took a major leap when a group of researchers from the University of Toronto developed an
AI model that surpassed the best image recognition algorithms by a large margin.

The AI system, which became known as AlexNet (named after its main creator, Alex Krizhevsky), won
the 2012 ImageNet computer vision contest with about 85 percent top-5 accuracy. The runner-up
scored about 74 percent on the test.

At the heart of AlexNet was a Convolutional Neural Network, a special type of neural network that
roughly imitates human vision.

In 2012, Alex Krizhevsky realized that it was time to bring back the branch of deep learning that uses
multi-layered neural networks. The availability of large datasets, specifically the ImageNet
dataset with millions of labeled images, and an abundance of computing resources enabled
researchers to revive CNNs.

A. BACKGROUND OF CNNs

CNNs were first developed and used around the 1980s. The most that a CNN could do at that time was
recognize handwritten digits. They were mostly used in the postal sector to read zip codes, pin codes, etc.
The important thing to remember about any deep learning model is that it requires a large amount of
data to train and a lot of computing resources. This was a major drawback for CNNs at
that time, so CNNs remained limited to the postal sector and failed to enter the wider world of
machine learning.

Over the years, CNNs have become a very important part of many Computer Vision applications.
So let’s take a look at how CNNs work in deep learning.

• What Is a CNN?
• How Does It Work?
• What Is a Pooling Layer?
• Limitations of CNNs

B. WHAT IS A CNN

In deep learning, a Convolutional Neural Network (CNN/ConvNet) is a class of deep neural networks
most commonly applied to analyze visual imagery. When we think of a neural network, we usually think
of general matrix multiplications, but a ConvNet replaces these in its convolutional layers with a
special operation called convolution. In mathematics, convolution is an operation on two functions that
produces a third function expressing how the shape of one is modified by the other.

The bottom line is that the role of a ConvNet is to reduce images into a form that is easier to process,
without losing features that are crucial for a good prediction.

C. HOW DOES A CNN WORK

Before we look at how CNNs work, let’s cover the basics: what an image is and how it is
represented. An RGB image is a matrix of pixel values with three planes (one per color channel),
whereas a grayscale image is the same but has a single plane.

For simplicity, consider grayscale images to understand how CNNs work.

In a convolution, we take a filter/kernel (for example, a 3×3 matrix) and apply it to the
input image to get the convolved feature. This convolved feature is passed on to the next layer.
In the case of an RGB image, the same operation is applied across the three color channels.
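
The sliding-filter operation described above can be sketched in plain Python. This is a minimal illustration rather than a library implementation: the function name conv2d_valid, the 5×5 binary image, and the 3×3 kernel are assumptions chosen for the example. As in most deep learning libraries, the kernel is not flipped, so strictly speaking this is a cross-correlation.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # Dot product between the kernel and the image patch under it.
            acc = 0
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(acc)
        feature_map.append(row)
    return feature_map


# Hypothetical 5x5 grayscale image and 3x3 kernel (illustrative values only).
image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

result = conv2d_valid(image, kernel)
print(result)  # [[4, 3, 4], [2, 4, 3], [2, 3, 4]]
```

A 5×5 input and a 3×3 kernel give a 3×3 convolved feature, because the kernel fits in three positions along each axis.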

Convolutional neural networks are composed of multiple layers of artificial neurons. Artificial neurons,
a rough imitation of their biological counterparts, are mathematical functions that calculate the
weighted sum of multiple inputs and output an activation value. When you input an image into a
ConvNet, each layer generates several activation maps that are passed on to the next layer.
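
As a rough sketch of a single artificial neuron, the weighted sum and activation can be written as follows. The bias term, the choice of ReLU as the activation, and all numeric values are assumptions made for illustration; the text above only specifies a weighted sum followed by an activation value.

```python
def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through a ReLU activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return max(0.0, z)


# Hypothetical inputs, weights and bias.
activation = neuron(inputs=[1.0, 2.0, 3.0], weights=[0.5, -0.25, 0.25], bias=0.5)
print(activation)  # 1.25
```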

The first layer usually extracts basic features such as horizontal or diagonal edges. This output is passed
on to the next layer, which detects more complex features such as corners or combinations of edges. As
we move deeper into the network, it can identify even more complex features such as objects, faces,
etc.

Based on the activation map of the final convolution layer, the classification layer outputs a set of
confidence scores (values between 0 and 1) that specify how likely the image is to belong to a “class.”
For instance, if you have a ConvNet that detects cats, dogs, and horses, the output of the final layer is
the probability that the input image contains each of those animals.

D. WHAT IS A POOLING LAYER

Like the Convolutional layer, the Pooling layer is responsible for reducing the spatial size of the
convolved features. This decreases the computational power required to process the data by
reducing its dimensions. There are two types of pooling: average pooling and max pooling.

In Max Pooling, the maximum value from the portion of the image covered by the kernel is
returned. Max Pooling also acts as a noise suppressant: it discards noisy activations
altogether and so performs de-noising along with dimensionality reduction.

On the other hand, Average Pooling returns the average of all the values from the portion of the image
covered by the kernel. Average Pooling simply performs dimensionality reduction as its noise
suppressing mechanism. Hence, Max Pooling often performs better than Average
Pooling.

E. BENEFITS OF USING CNNS FOR MACHINE AND DEEP LEARNING

Deep learning is a form of machine learning that requires a neural network with a minimum of three
layers. Networks with multiple layers are generally more accurate than single-layer networks. Deep learning
applications often use CNNs or RNNs (recurrent neural networks).

The CNN architecture is especially useful for image recognition and image classification, as well as
other computer vision tasks, because it can process large amounts of data and produce highly
accurate predictions. CNNs can learn the features of an object through multiple iterations, eliminating
the need for manual feature engineering tasks like feature extraction.

It is possible to retrain a CNN for a new recognition task or build a new model based on an existing
network with trained weights. This is known as transfer learning. This enables ML model developers to
apply CNNs to different use cases without starting from scratch.

F. WHAT ARE CONVOLUTIONAL NEURAL NETWORKS (CNNs)

A Convolutional Neural Network (CNN) is a type of deep learning algorithm specifically designed for
image processing and recognition tasks. Compared to alternative classification models, CNNs require
less preprocessing as they can automatically learn hierarchical feature representations from raw input
images. They excel at assigning importance to various objects and features within the images through
convolutional layers, which apply filters to detect local patterns.

The connectivity pattern in CNNs is inspired by the visual cortex in the human brain, where neurons
respond to specific regions or receptive fields in the visual space. This architecture enables CNNs to
effectively capture spatial relationships and patterns in images. By stacking multiple convolutional and
pooling layers, CNNs can learn increasingly complex features, leading to high accuracy in tasks like
image classification, object detection, and segmentation.

G. CONVOLUTIONAL NEURAL NETWORK ARCHITECTURE MODEL

Convolutional neural networks are known for their superiority over other artificial neural networks,
given their ability to process visual, textual, and audio data. The CNN architecture comprises three
main layers: Convolutional layers, Pooling layers, and a Fully connected (FC) layer.

There can be multiple convolutional and pooling layers. The more layers in the network, the greater
the complexity and (theoretically) the accuracy of the machine learning model. Each additional layer
that processes the input data increases the model’s ability to recognize objects and patterns in the
data.

H. THE CONVOLUTIONAL LAYER

Convolutional layers are the key building block of the network, where most of the computations are
carried out. A convolutional layer works by applying a filter to the input data to identify features. This
filter, known as a feature detector, checks the image input’s receptive fields for a given feature. This
operation is referred to as convolution.

The filter is a two-dimensional array of weights that represents part of a 2-dimensional image. A filter
is typically a 3×3 matrix, although other sizes are possible. The filter is applied to a region within
the input image and calculates a dot product between the filter weights and the pixels, which is fed to
an output array. The filter then shifts and repeats the process until it has covered the whole image.
The final output of all the filter processes is called the feature map.

The CNN typically applies the ReLU (Rectified Linear Unit) transformation to each feature map after
every convolution to introduce nonlinearity to the ML model. A convolutional layer is typically followed
by a pooling layer. Together, the convolutional and pooling layers make up a convolutional block.

Additional convolution blocks will follow the first block, creating a hierarchical structure with later
layers learning from the earlier layers. For example, a CNN model might train to detect cars in images.
Cars can be viewed as the sum of their parts, including the wheels, boot, and windscreen. Each feature
of a car equates to a low-level pattern identified by the neural network, which then combines these
parts to create a high-level pattern.

I. THE POOLING LAYERS

A pooling or downsampling layer reduces the dimensionality of the input. Like a convolutional
operation, a pooling operation uses a filter to sweep the whole input image, but it doesn’t use weights.
The filter instead uses an aggregation function to populate the output array based on the receptive
field’s values.

(i). THERE ARE TWO KEY TYPES OF POOLING:


Average pooling: The filter calculates the receptive field’s average value when it scans the input.
Max pooling: The filter sends the pixel with the maximum value to populate the output array. This
approach is more common than average pooling.

Pooling layers are important despite causing some information to be lost, because they help reduce
the complexity and increase the efficiency of the CNN. They also reduce the risk of overfitting.

J. THE FULLY CONNECTED (FC) LAYER


The final layer of a CNN is a fully connected layer. The FC layer performs classification tasks using the
features that the previous layers and filters extracted. Instead of ReLU functions, the FC layer typically
uses a softmax function, which is better suited to classification and produces a probability score
between 0 and 1 for each class.
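
A minimal sketch of the softmax function mentioned above is given below. The three raw scores are hypothetical, and subtracting the maximum score before exponentiating is a common numerical-stability trick that is not part of the notes.

```python
import math


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for three classes (e.g. cat, dog, horse).
probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]
```

Each output lies between 0 and 1, and the largest raw score receives the largest probability.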

2 BASIC ARCHITECTURE OF CNN:


There are two main parts to a CNN architecture:

A. A convolution tool that separates and identifies the various features of the image for analysis,
in a process called Feature Extraction. The feature extraction network consists of many pairs
of convolutional and pooling layers.
B. A fully connected layer that uses the output from the convolution process and predicts the
class of the image based on the features extracted in previous stages.

This CNN model of feature extraction aims to reduce the number of features present in a
dataset. It creates new features that summarize the existing features contained in the
original set of features. There are many CNN layers, as shown in the CNN architecture
diagram.

[Figure: Basic Architecture of CNN (input image and preprocessing stages, Conv 5x5, Max Pool 2x2,
Conv 3x3, Flatten, FC1, FC2, Output)]

A. CONVOLUTION LAYERS - There are three types of layers that make up the CNN: the
convolutional layers, pooling layers, and fully-connected (FC) layers. When these layers are stacked, a
CNN architecture is formed. In addition to these three layers, there are two more important
components, the dropout layer and the activation function, which are defined below.

2.1 CONVOLUTIONAL LAYER


This layer is the first layer that is used to extract the various features from the input images. In this
layer, the mathematical operation of convolution is performed between the input image and a filter of
a particular size MxM. By sliding the filter over the input image, the dot product is taken between the
filter and the parts of the input image with respect to the size of the filter (MxM).

The output is termed the feature map, which gives us information about the image such as the
corners and edges. Later, this feature map is fed to other layers to learn several other features of the
input image. The convolution layer in a CNN passes the result to the next layer after applying the
convolution operation to the input. Convolutional layers are beneficial because they keep the spatial
relationship between the pixels intact.
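
The size of the resulting feature map follows from how many positions the filter can occupy. As a worked formula, assume an N×N input and an M×M filter as above, and additionally a stride S and zero-padding P (S and P are symbols introduced here for illustration, not taken from the notes). The output side length O is then:

$$ O = \left\lfloor \frac{N - M + 2P}{S} \right\rfloor + 1 $$

For example, with N = 5, M = 3, P = 0 and S = 1:

$$ O = \left\lfloor \frac{5 - 3 + 0}{1} \right\rfloor + 1 = 3 $$

so a 3×3 filter sliding over a 5×5 image with no padding produces a 3×3 feature map.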

2.2 POOLING LAYER


In most cases, a Convolutional Layer is followed by a Pooling Layer. The primary aim of this layer is to
decrease the size of the convolved feature map to reduce the computational costs. This is done
by decreasing the connections between layers, and the layer operates independently on each feature map.
Depending upon the method used, there are several types of Pooling operations. Pooling basically
summarises the features generated by a convolution layer.

In Max Pooling, the largest element is taken from each section of the feature map. Average Pooling
calculates the average of the elements in a predefined-size image section. In Sum Pooling, the total
sum of the elements in the predefined section is computed. The Pooling Layer usually serves as a bridge
between the Convolutional Layer and the FC Layer. Pooling generalises the features extracted by the
convolution layer and helps the network to recognise the features independently. With the help of
this, the computations in the network are also reduced.

2.3 FULLY CONNECTED LAYER

The Fully Connected (FC) layer consists of weights and biases along with the neurons and is used to
connect the neurons between two different layers. These layers are usually placed before the output
layer and form the last few layers of a CNN architecture.

Here, the feature maps from the previous layers are flattened and fed to the FC layer. The flattened
vector then passes through a few more FC layers, where the mathematical operations usually take
place. In this stage, the classification process begins. Two fully connected layers are often used
because they tend to perform better than a single one. These layers in a
CNN reduce the need for human supervision.

2.4 DROPOUT

Usually, when all the features are connected to the FC layer, it can cause overfitting on the training
dataset. Overfitting occurs when a model fits the training data so closely that its
performance suffers when it is used on new data.

To overcome this problem, a dropout layer is used, in which a few neurons are dropped from the
neural network during the training process, effectively reducing the size of the model while it trains.
With a dropout of 0.3, 30% of the nodes are dropped out randomly from the neural network.

Dropout improves the performance of a machine learning model because it prevents overfitting
by making the network simpler. It drops neurons from the neural network only during training.
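
The dropout step can be sketched as below. This is an illustrative version, not the code of any particular framework: the function name and the seeded generator are assumptions, and the rescaling of the surviving activations by 1 / (1 - rate) (often called inverted dropout) goes beyond what the notes state. Each node is dropped independently with probability `rate`, so a rate of 0.3 drops about 30% of the nodes on average.

```python
import random


def dropout(activations, rate=0.3, training=True, rng=None):
    """Zero each activation with probability `rate` during training only."""
    if not training or rate == 0.0:
        return list(activations)  # dropout is switched off at inference time
    rng = rng or random.Random()
    keep = 1.0 - rate
    # Survivors are scaled by 1/keep so the expected activation is unchanged
    # (an assumption of this sketch, not something stated in the notes).
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


# Hypothetical layer of 10,000 activations, all equal to 1.0.
rng = random.Random(0)
dropped = dropout([1.0] * 10000, rate=0.3, rng=rng)
fraction_zero = dropped.count(0.0) / len(dropped)
print(round(fraction_zero, 2))  # close to 0.3
```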

2.5 ACTIVATION FUNCTIONS

Finally, one of the most important components of the CNN model is the activation function. Activation
functions are used to learn and approximate any kind of continuous and complex relationship between
variables of the network. In simple words, an activation function decides which information should be
passed forward through the network and which should not.

It adds non-linearity to the network. There are several commonly used activation functions, such as
the ReLU, Softmax, tanh, and Sigmoid functions. Each of these functions has a specific usage. For a
binary classification CNN model, the sigmoid and softmax functions are preferred, and for multi-class
classification, Softmax is generally used. In simple terms, activation functions in a CNN model
determine whether a neuron should be activated or not: they decide, using mathematical operations,
whether the input to the network is important for the prediction.

A. Types of Neural Networks Activation Functions

a) Binary Step Function

The binary step function depends on a threshold value that decides whether a neuron should be activated
or not. The input fed to the activation function is compared to a certain threshold; if the input is
greater than it, the neuron is activated, otherwise it is deactivated, meaning that its output is not
passed on to the next hidden layer.

Mathematically, with threshold θ, it can be represented as:

$$ f(x) = \begin{cases} 1 & \text{if } x > \theta \\ 0 & \text{otherwise} \end{cases} $$

The limitations of the binary step function are as follows:

• It cannot provide multi-value outputs. For example, it cannot be used for multi-class
classification problems.
• The gradient of the step function is zero, which hinders the backpropagation process.
b) Linear Activation Function:


The linear activation function, also known as "no activation" or the "identity function" (multiplied by 1.0),
is where the activation is proportional to the input. The function doesn't do anything to the weighted sum
of the input; it simply returns the value it was given.

Mathematically, it can be represented as:

$$ f(x) = x $$

However, a linear activation function has two major problems:

• Backpropagation cannot learn anything useful from the gradient, because the derivative of the
function is a constant and has no relation to the input x.
• All layers of the neural network will collapse into one if a linear activation function is used. No
matter the number of layers in the neural network, the last layer will still be a linear function of the
first layer. So, essentially, a linear activation function turns the neural network into just one layer.
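
The collapse of stacked linear layers can be checked numerically. In the sketch below, two layers with identity activation are applied one after the other and compared against a single layer whose weight matrix is the product of the two. The 2×2 weight matrices and the input vector are hypothetical, and biases are left out to keep the example short.

```python
def matvec(W, x):
    """Multiply matrix W by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def matmul(A, B):
    """Multiply matrix A by matrix B."""
    return [
        [sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
        for i in range(len(A))
    ]


# Hypothetical weights of two layers that both use the linear activation f(x) = x.
W1 = [[1.0, 2.0], [0.5, -1.0]]
W2 = [[2.0, 0.0], [1.0, 3.0]]
x = [3.0, -2.0]

two_layers = matvec(W2, matvec(W1, x))  # layer 1 followed by layer 2
one_layer = matvec(matmul(W2, W1), x)   # a single layer with weights W2 * W1

print(two_layers, one_layer)  # [-2.0, 9.5] [-2.0, 9.5]
```

Both paths give the same output, so the second layer adds no expressive power.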

c) Non-Linear Activation Functions


The linear activation function shown above is simply a linear regression model. Because of its limited
power, it does not allow the model to create complex mappings between the network’s inputs and
outputs.

Non-linear activation functions solve the following limitations of linear activation functions:

• They allow backpropagation because the derivative is now related to the
input, and it’s possible to go back and understand which weights in the input neurons can
provide a better prediction.
• They allow the stacking of multiple layers of neurons, as the output is now a non-linear
combination of the input passed through multiple layers.
• Any output can be represented as a functional computation in a neural network.

Below are different non-linear neural networks activation functions and their characteristics.

C1) Sigmoid/Logistic Activation Function

This function takes any real value as input and outputs values in the range of 0 to 1. The larger the
input (more positive), the closer the output value will be to 1.0, whereas the smaller the input (more
negative), the closer the output will be to 0.0.

Mathematically, it can be represented as:

$$ f(x) = \frac{1}{1 + e^{-x}} $$

Here’s why the sigmoid/logistic activation function is one of the most widely used functions:

• It is commonly used for models where we have to predict a probability as an output. Since
a probability exists only in the range of 0 to 1, sigmoid is the right choice
because of its range.

• The function is differentiable and provides a smooth gradient, i.e., it prevents jumps in
output values. This is represented by the S-shape of the sigmoid activation function.

The limitations of the sigmoid function are discussed below:

• The derivative of the function is f'(x) = sigmoid(x) * (1 - sigmoid(x)). The gradient values are only
significant in the range -3 to 3, and the curve gets much flatter in other regions. This implies that
for values greater than 3 or less than -3, the function will have very small gradients. As the gradient
value approaches zero, the network ceases to learn and suffers from the vanishing gradient problem.
• The output of the logistic function is not symmetric around zero, so the outputs of all the
neurons will be of the same sign. This makes the training of the neural network more difficult and
unstable.
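
To make the vanishing gradient concrete, the sketch below evaluates the sigmoid derivative at a few inputs. The sample points and the ten-layer product are illustrative assumptions, not figures from the notes.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of the sigmoid: sigmoid(x) * (1 - sigmoid(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


grad_at_0 = sigmoid_grad(0.0)    # the largest value the derivative can take
grad_at_3 = sigmoid_grad(3.0)    # already small
grad_at_10 = sigmoid_grad(10.0)  # practically zero

# Backpropagation multiplies one such factor per layer, so even in the best
# case the gradient shrinks geometrically with depth (ten layers assumed here).
best_case_ten_layers = grad_at_0 ** 10

print(grad_at_0)  # 0.25
print(grad_at_3 < 0.05, grad_at_10 < 1e-4, best_case_ten_layers < 1e-6)  # True True True
```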

C2) Tanh Function (Hyperbolic Tangent)


The tanh function is very similar to the sigmoid/logistic activation function, and even has the same
S-shape, with the difference being an output range of -1 to 1. In tanh, the larger the input (more
positive), the closer the output value will be to 1.0, whereas the smaller the input (more negative),
the closer the output will be to -1.0.

Mathematically, it can be represented as:

$$ f(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} $$

Advantages of using this activation function are:

• The output of the tanh activation function is zero-centered; hence we can easily map the
output values as strongly negative, neutral, or strongly positive.
• It is usually used in hidden layers of a neural network, as its values lie between -1 and 1; therefore,
the mean for the hidden layer comes out to be 0 or very close to it. It helps in centering the
data and makes learning for the next layer much easier.

It also faces the problem of vanishing gradients, similar to the sigmoid activation function. In addition,
the gradient of the tanh function is much steeper than that of the sigmoid function.

Note: Although both sigmoid and tanh face the vanishing gradient issue, tanh is zero-centered, and the
gradients are not restricted to move in a certain direction. Therefore, in practice, the tanh
non-linearity is usually preferred to the sigmoid non-linearity.
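
The two properties compared above, zero-centered outputs and a steeper gradient, can be checked with a short sketch. The symmetric sample inputs are an assumption chosen for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2


def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)


# Hypothetical pre-activations, symmetric around zero.
inputs = [-2.0, -1.0, 0.0, 1.0, 2.0]

mean_tanh = sum(math.tanh(x) for x in inputs) / len(inputs)
mean_sigmoid = sum(sigmoid(x) for x in inputs) / len(inputs)

print(abs(mean_tanh) < 1e-12)           # True: tanh outputs are centered on 0
print(abs(mean_sigmoid - 0.5) < 1e-12)  # True: sigmoid outputs are centered on 0.5
print(tanh_grad(0.0), sigmoid_grad(0.0))  # 1.0 0.25
```

At the origin the tanh gradient is four times the sigmoid gradient, which is what "steeper" refers to.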

C3) ReLU Function


ReLU stands for Rectified Linear Unit. Although it gives the impression of a linear function, ReLU has a
derivative function and allows for backpropagation while remaining computationally
efficient.

The main catch here is that the ReLU function does not activate all the neurons at the same time. A
neuron is deactivated only if the output of the linear transformation is less than 0.

Mathematically, it can be represented as:

$$ f(x) = \max(0, x) $$

The advantages of using ReLU as an activation function are as follows:


• Since only a certain number of neurons are activated, the ReLU function is far more
computationally efficient when compared to the sigmoid and tanh functions.
• ReLU accelerates the convergence of gradient descent towards a minimum of the
loss function due to its linear, non-saturating property.

The limitations faced by this function are:

• The Dying ReLU problem: the negative side of the graph makes the gradient value zero. For this
reason, during the backpropagation process, the weights and biases of some neurons are not updated.
This can create dead neurons which never get activated.
• All the negative input values become zero immediately, which decreases the model’s ability to
fit or train from the data properly.
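
A small sketch of ReLU and of a dead neuron follows. The weight, bias and inputs are hypothetical values picked so that the pre-activation is negative for every input, and the upstream gradient is taken to be 1 for simplicity.

```python
def relu(z):
    return max(0.0, z)


def relu_grad(z):
    """Gradient of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0


# Hypothetical neuron whose pre-activation w * x + b is negative for every input.
w, b = -0.5, -1.0
inputs = [0.5, 1.0, 2.0]

outputs = [relu(w * x + b) for x in inputs]
# Gradient of the output with respect to w is relu_grad(z) * x.
weight_grads = [relu_grad(w * x + b) * x for x in inputs]

print(outputs)       # [0.0, 0.0, 0.0]
print(weight_grads)  # [0.0, 0.0, 0.0]
```

Because every gradient is zero, gradient descent never changes w or b, and the neuron stays inactive.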

Note: For building the most reliable ML models, split your data into training, validation, and test sets.

C4) Leaky ReLU Function

Leaky ReLU is an improved version of the ReLU function that solves the Dying ReLU problem, as it has a
small positive slope in the negative area.

Mathematically, it can be represented as:

$$ f(x) = \max(\alpha x, x) $$

where α is a small positive constant (less than 1) that sets the slope for negative inputs.

The advantages of Leaky ReLU are the same as those of ReLU, with the addition that it enables
backpropagation even for negative input values. With this minor modification for negative input
values, the gradient of the left side of the graph comes out to be non-zero. Therefore, we would
no longer encounter dead neurons in that region.

The limitations that this function faces include:


• The predictions may not be consistent for negative input values.
• The gradient for negative values is a small value that makes the learning of model parameters
time-consuming.
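
A minimal sketch of Leaky ReLU and its gradient is shown below. The slope of 0.01 is an assumed, commonly used value; the notes do not specify one.

```python
def leaky_relu(z, alpha=0.01):
    """Pass positive inputs through unchanged; scale negative inputs by alpha."""
    return z if z > 0 else alpha * z


def leaky_relu_grad(z, alpha=0.01):
    """Gradient is 1 for positive inputs and alpha (not zero) otherwise."""
    return 1.0 if z > 0 else alpha


print(leaky_relu(3.0), leaky_relu(-2.0))            # 3.0 -0.02
print(leaky_relu_grad(3.0), leaky_relu_grad(-2.0))  # 1.0 0.01
```

The gradient for negative inputs is small but non-zero, which is why learning continues but slowly.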

C5) Parametric ReLU Function

Parametric ReLU is another variant of ReLU that aims to solve the problem of the gradient becoming zero
for the left half of the graph. This function provides the slope of the negative part of the function as an
argument a. By performing backpropagation, the most appropriate value of a is learnt.

Mathematically, it can be represented as:

$$ f(x) = \max(ax, x) $$

where a is the slope parameter for negative values. The parameterized ReLU function is used when
the leaky ReLU function still fails to solve the problem of dead neurons and the relevant information
is not successfully passed to the next layer. This function’s limitation is that it may perform differently
for different problems, depending upon the value of the slope parameter a.
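
How the slope a can be learnt is sketched below with a single gradient-descent step. The initial slope, the learning rate, the input, and the upstream gradient are all hypothetical values, and the update is a plain gradient-descent step used only for illustration.

```python
def prelu(z, a):
    """Parametric ReLU: the negative-side slope `a` is a learnable parameter."""
    return z if z > 0 else a * z


def prelu_grad_a(z):
    """Derivative of prelu(z, a) with respect to a: z for non-positive z, else 0."""
    return 0.0 if z > 0 else z


a = 0.25             # hypothetical initial slope
learning_rate = 0.1  # hypothetical learning rate
z = -2.0             # a negative pre-activation
upstream_grad = 1.0  # gradient of the loss w.r.t. the neuron output (assumed)

# One gradient-descent update of the slope parameter.
a_new = a - learning_rate * upstream_grad * prelu_grad_a(z)

print(prelu(z, a))      # -0.5
print(round(a_new, 2))  # 0.45
```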

3. TYPES OF POOLING LAYERS:


A Convolutional neural network (CNN) is a special type of Artificial Neural Network that is usually used
for image recognition and processing due to its ability to recognize patterns in images. It eliminates the
need to extract features from visual data manually. It learns from images by sliding a filter of some size
over them, learning the features from the data while also maintaining translation invariance.

The typical structure of a CNN consists of three basic layers:


1. Convolutional layer: These layers generate a feature map by sliding a filter over the input
image and recognizing patterns in images.
2. Pooling layers: These layers downsample the feature map to introduce translation invariance,
which reduces the overfitting of the CNN model.
3. Fully Connected Dense Layer: This layer contains the same number of units as the number of
classes, together with an output activation function such as “softmax” or “sigmoid”.

3.1 What are Pooling layers?

Pooling layers are one of the building blocks of Convolutional Neural Networks. Where Convolutional
layers extract features from images, Pooling layers consolidate the features learned by CNNs. Their
purpose is to gradually shrink the representation’s spatial dimension to minimize the number of
parameters and computations in the network.

3.2 Why are Pooling layers needed?


The feature map produced by the filters of Convolutional layers is location-dependent: it records the
precise positions of features in the input. For example, if an object in an image has shifted a bit, it
might not be recognizable by the Convolutional layer. What pooling layers
provide is “Translational Invariance”, which makes the CNN invariant to translations, i.e., even if the
input of the CNN is translated, the CNN will still be able to recognize the features in the input.

In all cases, pooling helps to make the representation become approximately invariant to small
translations of the input. Invariance to translation means that if we translate the input by a small
amount, the values of most of the pooled outputs do not change.
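
The word "approximately" matters here, and a one-dimensional sketch shows why. In the example below (the feature values, the pool size of 2 and the stride of 2 are assumptions for illustration), a shift that keeps each feature inside its pooling window leaves the pooled output unchanged, while a larger shift that crosses a window boundary does change it.

```python
def max_pool_1d(values, pool_size=2, stride=2):
    """Take the maximum of each window of `pool_size` values."""
    return [
        max(values[i:i + pool_size])
        for i in range(0, len(values) - pool_size + 1, stride)
    ]


def shift_right(values, k):
    """Translate the signal k positions to the right, padding with zeros."""
    return [0] * k + values[:len(values) - k]


# Hypothetical 1-D feature map with two strong activations.
feature_map = [9, 0, 0, 0, 0, 0, 7, 0]

pooled = max_pool_1d(feature_map)
pooled_shift_1 = max_pool_1d(shift_right(feature_map, 1))
pooled_shift_2 = max_pool_1d(shift_right(feature_map, 2))

print(pooled)          # [9, 0, 0, 7]
print(pooled_shift_1)  # [9, 0, 0, 7]  (small shift: output unchanged)
print(pooled_shift_2)  # [0, 9, 0, 0]  (larger shift: output changes)
```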

3.3 How do Pooling layers achieve that?


A Pooling layer is added after the Convolutional layer(s), as seen in the structure of a CNN above. It
downsamples the output of the Convolutional layers by sliding a filter of some size with some stride
size and calculating the maximum or average of the input.

There are two types of pooling that are used:


1. Max pooling: This works by selecting the maximum value from every pool. Max Pooling
retains the most prominent features of the feature map, and the returned image tends to look
sharper than the original image.
2. Average pooling: This pooling layer works by taking the average of the pool. Average pooling
retains the average values of features of the feature map. It smooths the image while
keeping the essence of the feature in an image.

3.4 Max Pooling


Create a MaxPool2D layer with pool_size=2 and strides=2. Apply the MaxPool2D layer to the matrix,
and you will get the max-pooled output in tensor form. When applied to the matrix, the max
pooling layer goes through the matrix computing the max of each 2×2 pool with a stride of 2. Print
the shape of the tensor. Use [Link] to remove dimensions of size 1 from the shape of a tensor.

3.5 Average Pooling


Create an Average Pooling 2D layer with the same pool_size and strides of 2. Apply the Average Pooling
2D layer to the matrix. When applied to the matrix, the average pooling layer goes through the matrix
computing the average of each 2×2 pool with a stride of 2. Print the shape of the matrix and use
[Link] to convert the output into a readable form by removing all dimensions of size 1.

Both pooling layers slide over the input matrix and compute the maximum or the average of each
window, for max pooling and average pooling, respectively.
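
The two steps above can be mirrored in plain Python to see the numbers involved. This sketch is not the MaxPool2D layer itself: the function pool2d and the 4×4 matrix are assumptions for illustration, and only the pool size of 2 and stride of 2 come from the text.

```python
def pool2d(matrix, pool_size=2, stride=2, mode="max"):
    """Apply max or average pooling to a 2-D list of numbers."""
    rows, cols = len(matrix), len(matrix[0])
    pooled = []
    for r in range(0, rows - pool_size + 1, stride):
        pooled_row = []
        for c in range(0, cols - pool_size + 1, stride):
            # Collect the values inside the current pool.
            window = [
                matrix[r + i][c + j]
                for i in range(pool_size)
                for j in range(pool_size)
            ]
            if mode == "max":
                pooled_row.append(max(window))
            else:
                pooled_row.append(sum(window) / len(window))
        pooled.append(pooled_row)
    return pooled


# Hypothetical 4x4 input matrix.
matrix = [
    [1, 3, 2, 9],
    [7, 4, 1, 5],
    [8, 5, 2, 3],
    [4, 3, 1, 4],
]

max_pooled = pool2d(matrix, mode="max")
avg_pooled = pool2d(matrix, mode="avg")

print(max_pooled)  # [[7, 9], [8, 4]]
print(avg_pooled)  # [[3.75, 4.25], [5.0, 2.5]]
```

With a 2×2 pool and a stride of 2, the 4×4 input shrinks to 2×2 in both cases.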

3.5 Global Pooling Layers

Global Pooling Layers often replace the classifier’s fully connected or Flatten layer. The model instead
ends with a convolutional layer that produces as many feature maps as there are target classes and
performs global average pooling on each of the feature maps to combine each feature map into a
single value.

Create the same NumPy array as above but with a different shape; the Global Pooling layers
will then reduce each feature map to one value.

3.6 Global Average Pooling


Considering a tensor of shape h*w*n, the output of the Global Average Pooling layer is a single value
per feature map, computed across h*w, that summarizes the presence of the feature; the n feature maps
therefore yield n values. Instead of downsizing patches of the input feature map, the Global Average
Pooling layer downsizes the whole h*w into 1 value by taking the average.
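
A minimal NumPy sketch of this reduction, assuming a hypothetical tensor with h = 2, w = 2 and n = 3 feature maps stored in (h, w, n) order:

```python
import numpy as np

h, w, n = 2, 2, 3
# Hypothetical feature maps: values 0..11 laid out as (height, width, channels).
feature_maps = np.arange(h * w * n, dtype=np.float64).reshape(h, w, n)

# Average over both spatial axes, leaving one value per feature map.
gap = feature_maps.mean(axis=(0, 1))
print(gap.shape)  # (3,)
print(gap)        # [4.5 5.5 6.5]
```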

3.7 Global Max Pooling


With the tensor of shape h*w*n, the output of the Global Max Pooling layer is a single value per
feature map, computed across h*w, that summarizes the presence of a feature. Instead of downsizing
patches of the input feature map, the Global Max Pooling layer downsizes the whole h*w into 1 value
by taking the maximum.
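
The same idea with the maximum in place of the average can be sketched as follows. The two 2×2 feature maps are made-up values chosen only for illustration.

```python
import numpy as np

# Two hypothetical 2x2 feature maps, stacked along the last (channel) axis.
map_a = np.array([[1, 5],
                  [2, 0]])
map_b = np.array([[7, 3],
                  [4, 8]])
feature_maps = np.stack([map_a, map_b], axis=-1)  # shape (2, 2, 2) = (h, w, n)

# Maximum over both spatial axes, leaving one value per feature map.
gmp = feature_maps.max(axis=(0, 1))
print(gmp)  # [5 8]
```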

4. POPULAR CNN ARCHITECTURES: VGG, GoogLeNet, ResNet


Types of Convolutional Neural Network Algorithms

3.1 LeNet

LeNet is a pioneering CNN designed for recognizing handwritten characters. It was proposed by Yann
LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner in the late 1990s. LeNet consists of a series of
convolutional and pooling layers, as well as a fully connected layer and softmax classifier. It was among
the first successful applications of deep learning for computer vision. It has been used by banks to
identify numbers written on cheques from grayscale input images.

3.2 VGG

VGG (Visual Geometry Group) is a research group within the Department of Engineering Science at the
University of Oxford. The VGG group is well-known for its work in computer vision, particularly in the
area of convolutional neural networks (CNNs).

One of the most famous contributions from the VGG group is the VGG model, also known as VGGNet.
The VGG model is a deep neural network that achieved state-of-the-art performance on the ImageNet
Large Scale Visual Recognition Challenge in 2014, and has been widely used as a benchmark for image
classification and object detection tasks.

The VGG model is characterized by its use of small convolutional filters (3×3) and deep architecture (up
to 19 layers), which enables it to learn increasingly complex features from input images. The VGG
model also uses max pooling layers to reduce the spatial resolution of the feature maps and increase
the receptive field, which can improve its ability to recognize objects of varying scales and orientations.
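
One way to see why small 3×3 filters combined with depth work well is to compare a stack of 3×3 convolutions with a single larger filter covering the same receptive field. The sketch below is a back-of-the-envelope calculation, assuming stride-1 convolutions, equal input and output channel counts, and no biases; the channel count of 64 is an arbitrary illustrative choice.

```python
def stacked_receptive_field(num_layers, kernel=3):
    """Receptive field of `num_layers` stacked stride-1 convolutions of equal kernel size."""
    return 1 + num_layers * (kernel - 1)

def conv_weights(kernel, channels):
    """Weights in one convolution with `channels` inputs and outputs (biases ignored)."""
    return kernel * kernel * channels * channels

channels = 64  # illustrative channel count, not taken from the text

rf_three_3x3 = stacked_receptive_field(3)       # same coverage as one 7x7 filter
weights_three_3x3 = 3 * conv_weights(3, channels)
weights_one_7x7 = conv_weights(7, channels)

print(rf_three_3x3)       # 7
print(weights_three_3x3)  # 110592
print(weights_one_7x7)    # 200704
```

Under these assumptions, three stacked 3×3 layers cover a 7×7 region with roughly half the weights of a single 7×7 layer, and they apply three non-linearities instead of one.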

The VGG model has inspired many subsequent research efforts in deep learning, including the
development of even deeper neural networks and the use of residual connections to improve gradient
flow and training stability.
3.3 ResNet

ResNet (short for “Residual Neural Network”) is a family of deep convolutional neural networks
designed to overcome the problem of vanishing gradients, which is common in very deep networks. The
idea behind ResNet is to use “residual blocks” that allow for the direct propagation of gradients
through the network, enabling the training of very deep networks. A residual block consists of two or
more convolutional layers with activation functions between them, combined with a shortcut connection
that bypasses the convolutional layers and adds the original input directly to their output. In the
original design, the final activation is applied after this addition.

This allows the network to learn residual functions that represent the difference between the
convolutional layers’ input and output, rather than trying to learn the entire mapping directly. The use
of residual blocks enables the training of very deep networks, with hundreds or thousands of layers,
significantly alleviating the issue of vanishing gradients.
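
The residual idea can be sketched in a few lines of NumPy. To keep the example short, plain matrix multiplications stand in for the convolutional layers, and the function names, shapes and weights are all assumptions made for illustration. The block computes relu(F(x) + x), so when the residual function F outputs zero the block simply passes its (non-negative) input through.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x, w1, w2):
    """Compute relu(F(x) + x), where F is two linear maps with a ReLU in between."""
    residual = relu(x @ w1) @ w2   # F(x): the part the block has to learn
    return relu(residual + x)      # shortcut adds the input before the final activation

rng = np.random.default_rng(0)
x = rng.random((4, 8))             # 4 samples, 8 features, all non-negative

# With zero weights F(x) = 0, so the block reduces to the identity mapping.
w_zero = np.zeros((8, 8))
y_identity = residual_block(x, w_zero, w_zero)
print(np.array_equal(y_identity, x))  # True

# With small random weights the output is the input plus a learned correction.
w1 = rng.normal(scale=0.1, size=(8, 8))
w2 = rng.normal(scale=0.1, size=(8, 8))
y = residual_block(x, w1, w2)
print(y.shape)  # (4, 8)
```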

3.4 GoogLeNet

GoogLeNet is a deep convolutional neural network developed by researchers at Google. It was
introduced in 2014 and won the ILSVRC (ImageNet Large-Scale Visual Recognition Challenge) that year,
with a top-five error rate of 6.67%.

GoogLeNet is notable for its use of the Inception module, which consists of multiple parallel
convolutional layers with different filter sizes alongside a parallel pooling branch, with the outputs
of all branches concatenated. This design allows the network to learn features at multiple scales and
resolutions, while keeping the computational cost manageable. The network also includes auxiliary
classifiers at intermediate layers, which encourage the network to learn more discriminative features
and help prevent overfitting. GoogLeNet builds upon the ideas of previous convolutional neural
networks, including LeNet, which was one of the first successful applications of deep learning in
computer vision. However, GoogLeNet is much deeper and more complex than LeNet.
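
The concatenation step of an Inception module can be illustrated with placeholder arrays. The sketch below does not compute any convolutions; it only shows how branch outputs of equal spatial size are joined along the channel axis. The spatial size and per-branch channel counts are illustrative assumptions.

```python
import numpy as np

height, width = 28, 28

# Assumed number of output channels produced by each parallel branch.
branch_channels = {"1x1 conv": 64, "3x3 conv": 128, "5x5 conv": 32, "pooling": 32}

# Placeholder branch outputs: every branch keeps the same height and width.
branches = [np.zeros((height, width, c)) for c in branch_channels.values()]

# The module output stacks all branch outputs along the channel axis.
module_output = np.concatenate(branches, axis=-1)
print(module_output.shape)  # (28, 28, 256)
```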

4. DROPOUT:
The term “dropout” refers to dropping out the nodes (input and hidden layer) in a neural network. All
the forward and backwards connections with a dropped node are temporarily removed, thus creating a
[Link] a dropout probability of p.

Consider an input x = {1, 2, 3, 4, 5} to the fully connected layer, and a dropout layer with
probability p = 0.2 (or keep probability = 0.8). During forward propagation (training), each node of x
is dropped with probability 0.2, so on average 20% of the nodes are dropped; x could become
{1, 0, 3, 4, 5} or {1, 2, 0, 4, 5} and so on. The same applies to the hidden layers.

For instance, if a hidden layer has 1000 neurons (nodes) and dropout is applied with drop
probability = 0.5, then about 500 neurons would be randomly dropped in every iteration (batch).
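
The example above can be reproduced with a small NumPy sketch. It uses the "inverted dropout" variant, in which surviving activations are scaled by 1 / keep probability during training so that nothing needs rescaling at test time; that scaling is an addition beyond what the text describes, and the function name and seed are arbitrary choices for illustration.

```python
import numpy as np

def dropout(x, drop_prob, rng, training=True):
    """Zero each element with probability `drop_prob` and rescale the survivors."""
    if not training or drop_prob == 0.0:
        return x                             # dropout is disabled outside training
    keep_prob = 1.0 - drop_prob
    mask = rng.random(x.shape) < keep_prob   # True where the node is kept
    return x * mask / keep_prob              # inverted-dropout scaling (assumption)

rng = np.random.default_rng(seed=0)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
out = dropout(x, drop_prob=0.2, rng=rng)
print(out)  # each entry is either 0 or the original value divided by 0.8

# Hidden layer with 1000 units and drop probability 0.5: about 500 are dropped.
hidden = np.ones(1000)
hidden_out = dropout(hidden, drop_prob=0.5, rng=rng)
num_dropped = int((hidden_out == 0).sum())
print(num_dropped)
```

Because every node is dropped independently, the number of dropped nodes varies from batch to batch around the expected value rather than being exactly 20% or 50%.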

Generally, for the input layer, the keep probability (1 − drop probability) is closer to 1, with 0.8
being the value suggested by the authors. For the hidden layers, the greater the drop probability, the
sparser the model; a keep probability of 0.5, which means dropping 50% of the nodes, is the commonly
recommended value.
How does Dropout solve the Overfitting problem?
In the overfitting problem, the model learns the statistical noise. To be precise, the main motive of
training is to decrease the loss function, given all the units (neurons). So in overfitting, a unit may
change in a way that fixes up the mistakes of the other units. This leads to complex co-adaptations,
which in turn lead to overfitting because these co-adaptations fail to generalise to unseen data.

Now, if we use dropout, it prevents these units from fixing up the mistakes of other units, thus
preventing co-adaptation, as in every iteration the presence of a unit is highly unreliable. So, by
randomly dropping a few units (nodes), it forces the layers to take more or less responsibility for the
input by taking a probabilistic approach.

This helps the model generalise and hence reduces the overfitting problem.

Figure 2: (a) Hidden layer features without dropout; (b) Hidden layer features with dropout

From Figure 2, we can see that the hidden layer with dropout learns more generalised features than
the layer without dropout, which shows co-adaptations. Dropout breaks such inter-unit relations and
focuses more on generalisation.

Evolution of Convolutional Neural Network Models


Convolutional Neural Networks (CNNs) have progressed from early proof-of-concepts for digit
recognition to highly optimized architectures for large-scale vision tasks. Each generation introduced
innovations in depth, connectivity, and efficiency.

LeNet-5 (1998) pioneered deep learning for vision by stacking two convolutional and two pooling
layers before fully connected layers to recognize 32×32-pixel handwritten digits. It demonstrated
weight sharing and local receptive fields but was limited to simple, low-resolution tasks.

AlexNet (2012) reignited CNN research by using eight layers with ReLU activations, dropout, and
GPU-accelerated training to win ILSVRC 2012. It reduced top-5 error from 26.2% to 15.3% on ImageNet,
showcasing the power of deep nets and large datasets.

ZFNet (2013) refined AlexNet by replacing the first 11×11 filters with 7×7 kernels and visualizing
internal activations to guide architectural tweaks. This yielded improved accuracy without increasing
depth, highlighting the importance of spatial resolution in early layers.

VGGNet (2014) demonstrated that depth via uniform 3×3 convolutions, stacked in 16–19 layers,
could outperform larger filters. Its modular simplicity aided transfer learning but incurred 138 M
parameters and heavy compute requirements.

GoogLeNet (2014) introduced the Inception module, executing parallel 1×1, 3×3, and 5×5 convolutions
plus pooling in 22 layers. By using 1×1 convolutions to reduce dimensionality and global average
pooling in place of large fully connected layers, it achieved strong accuracy with only 6.8 M
parameters.

ResNet (2015) mitigated vanishing gradients in ultra-deep nets by adding identity “shortcut”
connections, enabling architectures of up to 152 layers. Residual blocks allowed stable training and
set new benchmarks across vision tasks.

Inception-ResNet (2016) hybridized Inception modules with residual connections, combining
multi-scale feature extraction and accelerated convergence. It matched state-of-the-art accuracy with
faster training than pure Inception or deep ResNets.

Xception (2017) generalized Inception by replacing Inception blocks with depthwise separable
convolutions, which separate spatial and cross-channel filtering, to boost parameter efficiency and
marginally improve accuracy on ImageNet.

SqueezeNet (2016) delivered AlexNet accuracy with 50× fewer parameters by using “fire” modules:
1×1 squeeze layers followed by parallel 1×1 and 3×3 expand layers, targeting embedded deployment.

ShuffleNet (2017) further optimized mobile inference by combining pointwise group convolutions with
channel shuffling to enable cross-group information flow, achieving competitive accuracy at very low
computational cost.
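
To make the parameter-efficiency argument behind depthwise separable convolutions (as used in Xception) concrete, the sketch below compares weight counts for a standard convolution and its depthwise separable counterpart. The kernel size and channel counts are illustrative assumptions, and biases are ignored.

```python
def standard_conv_weights(kernel, in_channels, out_channels):
    """Every output channel has its own kernel x kernel filter over all input channels."""
    return kernel * kernel * in_channels * out_channels

def separable_conv_weights(kernel, in_channels, out_channels):
    """Depthwise step (one spatial filter per input channel) plus a 1x1 pointwise step."""
    depthwise = kernel * kernel * in_channels
    pointwise = in_channels * out_channels
    return depthwise + pointwise

# Illustrative layer: 3x3 kernel, 128 input channels, 256 output channels.
standard = standard_conv_weights(3, 128, 256)
separable = separable_conv_weights(3, 128, 256)

print(standard)                        # 294912
print(separable)                       # 33920
print(round(standard / separable, 1))  # 8.7
```

For this layer the separable form needs roughly one ninth of the weights, which is the kind of saving that efficiency-oriented architectures rely on.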

Comparative Analysis
| Model | Depth | Params | Merits | Demerits | Applications |
|---|---|---|---|---|---|
| LeNet-5 | 7 | ~60 K | Simple, efficient for low-res digits, lightweight | Limited to small inputs; poor scalability | Digit/OCR recognition |
| AlexNet | 8 | ~60 M | ReLU, dropout, GPU training; large-scale ImageNet breakthrough | High memory and compute; moderate overfitting | General image classification |
| ZFNet | 8 | ~62 M | Improved early-layer resolution; activation visualization | Still heavy; few structural changes | Image classification |
| VGG-16/19 | 16/19 | 138 M | Uniform design; excels in transfer learning | Very large; slow inference and training | Feature extraction; transfer learning |
| GoogLeNet | 22 | 6.8 M | Inception modules; parameter-efficient; multi-scale feature capture | Complex design; harder to implement | Classification; detection |
| ResNet-50/152 | 50/152 | 25 M/60 M | Residual connections; easy training of ultra-deep nets | Still sizable; higher inference latency | Classification; segmentation; detection |
| Inception-ResNet | ~164 | ~55 M | Combines inception efficiency with residual ease of training | Very complex; integration challenges | High-accuracy vision tasks |
| Xception | 36 | 23 M | Depthwise separable convs; improved efficiency | Requires careful tuning; marginal gains over ResNet | Mobile/embedded classification |
| SqueezeNet | 18 | 1.2 M | AlexNet accuracy with 50× fewer params; ideal for embedded systems | Lower absolute accuracy; limited feature learning | IoT devices; mobile inference |
| ShuffleNet | 50 | 1.0 M | Group conv + channel shuffle; very low compute cost | Slight accuracy drop; complex grouping | Smartphones; edge-AI; on-device vision |
