
LectureNote CNN

The document discusses the limitations of flattening image matrices for neural networks, highlighting issues like excessive parameters and loss of spatial relationships. It introduces convolutional layers and filters as a solution, allowing networks to learn features more efficiently while maintaining local pixel relationships. Additionally, it covers pooling layers, CNN architecture, and advanced techniques like residual connections and batch normalization to improve model performance.

Uploaded by

armagangulal.561
Copyright
© © All Rights Reserved

Last lecture, we flattened the image matrix into a long vector and fed it to a
dense layer. This approach has several shortcomings:

• We need to learn “too many” parameters
• Flattening a 3024 × 3024-pixel color image (from your phone) and
connecting it to a single 100-neuron dense layer generates approximately
2.7 billion parameters (3024 × 3024 × 3 inputs × 100 neurons, plus 100 biases).
• This is computationally demanding, very data-hungry, and increases the risk of
overfitting
• We lose the local spatial adjacency relationships between pixels that define
features of the image.
• We don’t learn once and reuse repeatedly
• If a feature of the image (e.g., a vertical line or a circle) appears in
different places in the image, the network should “learn it once and use it
again and again” rather than learn it separately each time.

Convolutional layers were developed to address these shortcomings.
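The parameter count for the flattened phone photo can be checked with a few lines of arithmetic. This is a minimal sketch; the function name `dense_parameter_count` is made up for illustration, and it assumes 3 color channels and one bias per neuron.

```python
def dense_parameter_count(height, width, channels, neurons):
    """Weights plus biases for a dense layer fed a flattened image."""
    inputs = height * width * channels  # length of the flattened vector
    weights = inputs * neurons          # one weight per (input, neuron) pair
    biases = neurons                    # one bias per neuron
    return weights + biases


# 3024 x 3024 color image connected to a 100-neuron dense layer
print(f"{dense_parameter_count(3024, 3024, 3, 100):,}")  # 2,743,372,900
```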
Convolutional Filters
A convolutional filter is a small square matrix of numbers

By choosing the numbers in a filter carefully and “applying” the filter to an
image, different features of the image can be detected
Convolution operation

Input (5x5):
1 2 3 0 1
0 1 2 3 2
1 2 1 0 0
0 1 3 1 2
2 0 1 3 1

Filter (3x3):
1 0 -1
1 0 -1
1 0 -1

Output (3x3):
-4  2  3
-5  0  2
-2 -1  2

• Take a 3x3 region of the input
• Multiply element-wise
• Sum values
• Move kernel (stride=1)

Each output value is computed from the 3x3 region under the filter:

Output row 1, column 1:
(1×1 + 2×0 + 3×(-1)) + (0×1 + 1×0 + 2×(-1)) + (1×1 + 2×0 + 1×(-1)) = -4

Output row 1, column 2:
(2×1 + 3×0 + 0×(-1)) + (1×1 + 2×0 + 3×(-1)) + (2×1 + 1×0 + 0×(-1)) = 2

Output row 1, column 3:
(3×1 + 0×0 + 1×(-1)) + (2×1 + 3×0 + 2×(-1)) + (1×1 + 0×0 + 0×(-1)) = 3

Output row 2, column 1:
(0×1 + 1×0 + 2×(-1)) + (1×1 + 2×0 + 1×(-1)) + (0×1 + 1×0 + 3×(-1)) = -5

Output row 2, column 2:
(1×1 + 2×0 + 3×(-1)) + (2×1 + 1×0 + 0×(-1)) + (1×1 + 3×0 + 1×(-1)) = 0

Output row 2, column 3:
(2×1 + 3×0 + 2×(-1)) + (1×1 + 0×0 + 0×(-1)) + (3×1 + 1×0 + 2×(-1)) = 2

Output row 3, column 1:
(1×1 + 2×0 + 1×(-1)) + (0×1 + 1×0 + 3×(-1)) + (2×1 + 0×0 + 1×(-1)) = -2

Output row 3, column 2:
(2×1 + 1×0 + 0×(-1)) + (1×1 + 3×0 + 1×(-1)) + (0×1 + 1×0 + 3×(-1)) = -1

Output row 3, column 3:
(1×1 + 0×0 + 0×(-1)) + (3×1 + 1×0 + 2×(-1)) + (1×1 + 3×0 + 1×(-1)) = 2
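The four steps (take a region, multiply element-wise, sum, move the kernel) can be sketched in plain Python. This is an illustrative implementation, not a library routine: the name `convolve2d` is made up, it assumes a square image and a square filter with no padding, and like deep learning frameworks it does not flip the filter (strictly speaking this is cross-correlation).

```python
def convolve2d(image, kernel, stride=1):
    """Slide `kernel` over `image` (no padding) and return the feature map."""
    k = len(kernel)
    out_size = (len(image) - k) // stride + 1
    output = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            top, left = i * stride, j * stride  # corner of the current region
            total = 0
            for di in range(k):
                for dj in range(k):
                    # multiply element-wise and accumulate the sum
                    total += image[top + di][left + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output


image = [
    [1, 2, 3, 0, 1],
    [0, 1, 2, 3, 2],
    [1, 2, 1, 0, 0],
    [0, 1, 3, 1, 2],
    [2, 0, 1, 3, 1],
]
vertical_edge_filter = [
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
]

for row in convolve2d(image, vertical_edge_filter):
    print(row)
# [-4, 2, 3]
# [-5, 0, 2]
# [-2, -1, 2]
```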
Output Size
• (N - F + 2P)/S + 1
• Example: (5 - 3 + 0)/1 + 1 = 3
• N: Input Size
• F: Filter Size
• P: Padding
• S: Stride (Step Size)
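The output-size formula is easy to wrap in a helper for experimenting with padding and stride. This is a small sketch with a made-up function name; rounding down when the stride does not divide evenly is an assumption here (it is the usual convention, but the formula above does not state it).

```python
def conv_output_size(n, f, p=0, s=1):
    """Output width/height for input size n, filter size f, padding p, stride s."""
    return (n - f + 2 * p) // s + 1  # floor division: partial windows are dropped


print(conv_output_size(5, 3))        # 3  (the example above)
print(conv_output_size(5, 3, p=1))   # 5  (padding of 1 preserves the size)
print(conv_output_size(7, 3, s=2))   # 3  (stride 2 roughly halves the size)
```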
Convolutional Layers
A convolutional layer is composed of one or more convolutional filters

 1  1  1
 0  0  0
-1 -1 -1

Each filter can be thought of as a specialist for detecting a particular
feature (e.g., a horizontal line, an arc, a vertical line)
Applying a Convolutional Layer to a color image

If we had instead applied f filters, the output would be a tensor with
shape 4 x 4 x f

Image source: [Link]
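To make the 4 x 4 x f shape concrete, the sketch below computes the output shape and the number of learnable parameters of a convolutional layer. The 6 x 6 x 3 input and 3 x 3 filter size are assumptions chosen so that the output is 4 x 4 as stated above; the function name is made up, and it assumes no padding, stride 1, and one bias per filter.

```python
def conv_layer_summary(height, width, channels, filter_size, n_filters):
    """Return (output shape, parameter count) for a conv layer, no padding, stride 1."""
    out_h = height - filter_size + 1
    out_w = width - filter_size + 1
    # each filter spans every input channel, so it has filter_size^2 * channels weights
    weights_per_filter = filter_size * filter_size * channels
    params = n_filters * (weights_per_filter + 1)  # +1 for each filter's bias
    return (out_h, out_w, n_filters), params


# assumed 6 x 6 color image, 3 x 3 filters
print(conv_layer_summary(6, 6, 3, 3, 1))  # ((4, 4, 1), 28)
print(conv_layer_summary(6, 6, 3, 3, 8))  # ((4, 4, 8), 224)
```

Note that the parameter count depends only on the filter size, the number of channels, and the number of filters, not on the image size.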


• These filters seem excellent but how are we supposed to
come up with the numbers in each filter?
• In fact, convolutional filters used to be designed by hand.
Computer Vision researchers invested a lot of effort in
devising filters that could detect various types of image
features
• As we figured out how to train deep networks with lots of
weights, a big idea emerged: think of the numbers in the
filter as weights and simply learn them from the data,
just like we learn all the other weights
• This is possible because a convolutional filter is just
a neuron
• Therefore, our entire machinery (neurons, layers,
loss functions, gradient descent) is perfectly applicable
• As a result, a network with many convolutional layers
can learn increasingly complex features
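One way to see that a filter is "just a neuron": flatten a 3x3 region and the 3x3 filter, and the filter's output is an ordinary weighted sum. The sketch below uses the top-left region and the vertical-line filter from the worked example; the helper names are made up, and the bias of 0 is an assumption (the worked example has no bias).

```python
def flatten(matrix):
    """Turn a 2-D list into a flat list, row by row."""
    return [value for row in matrix for value in row]


def neuron(inputs, weights, bias=0):
    """A plain neuron before its activation: weighted sum plus bias."""
    return sum(x * w for x, w in zip(inputs, weights)) + bias


patch = [[1, 2, 3], [0, 1, 2], [1, 2, 1]]                 # top-left 3x3 region
filter_weights = [[1, 0, -1], [1, 0, -1], [1, 0, -1]]     # the filter acts as the weights

print(neuron(flatten(patch), flatten(filter_weights)))    # -4
```

The same nine weights are reused at every position of the image, which is what lets the network learn a feature once and apply it everywhere.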
Demos

• [Link]

• [Link]
s/demo/[Link]
Pooling Layers
• Pooling layers (also called down-sampling or
subsampling layers) reduce the size of the tensor
coming out of a convolutional layer
• In average pooling, we take the average of each 2x2 box
[Link]
• In max pooling, we take the maximum of each 2x2 box instead
• Max pooling acts like an “OR” condition: if a
feature exists anywhere in its input, max-pooling
will pick it up. In other words, max-pooling acts like a
feature detector
• Since successive convolutional layers can “see”
more and more of the original input image, the
max-pooling layers that follow them can detect if
a feature exists in more and more of the original
input image as well
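Both kinds of pooling can be sketched with one small function. The function name and the 4x4 feature map below are made up for illustration; the sketch assumes non-overlapping boxes (stride equal to the box size).

```python
def pool2d(feature_map, size=2, mode="max"):
    """Reduce each non-overlapping size x size box to its max or its average."""
    if mode == "max":
        reduce = max
    else:
        reduce = lambda values: sum(values) / len(values)
    pooled = []
    for i in range(0, len(feature_map) - size + 1, size):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            box = [feature_map[i + di][j + dj]
                   for di in range(size) for dj in range(size)]
            row.append(reduce(box))
        pooled.append(row)
    return pooled


feature_map = [
    [1, 3, 2, 0],
    [0, 2, 1, 1],
    [5, 1, 0, 0],
    [2, 2, 0, 4],
]
print(pool2d(feature_map, mode="max"))      # [[3, 2], [5, 4]]
print(pool2d(feature_map, mode="average"))  # [[1.5, 1.0], [2.5, 1.0]]
```

In the bottom-right box only one value is large; max pooling keeps it (4) while average pooling dilutes it (1.0), which is the "OR" behaviour described above.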
The architecture of a basic CNN

[Link]
The architecture of a CNN

Each convolutional block typically has 1-2 convolutional layers followed by a
pooling layer
The final tensor gets flattened into a long vector and sent through 0 or more
hidden layers to the output layer
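The block structure can be followed by tracing tensor shapes through the network. This is only a sketch: the 32 x 32 x 3 input, the filter counts, and the function name are hypothetical, and each block is assumed to be one unpadded convolution followed by one pooling layer.

```python
def trace_shapes(input_shape, blocks):
    """Track (height, width, channels) through conv + pool blocks, then flatten.

    Each block is (n_filters, filter_size, pool_size).
    """
    h, w, c = input_shape
    shapes = [(h, w, c)]
    for n_filters, filter_size, pool_size in blocks:
        # convolution, no padding, stride 1: spatial size shrinks, channels = filters
        h, w, c = h - filter_size + 1, w - filter_size + 1, n_filters
        shapes.append((h, w, c))
        # pooling: spatial size divided by the pool size (rounded down)
        h, w = h // pool_size, w // pool_size
        shapes.append((h, w, c))
    return shapes, h * w * c  # the last value is the flattened vector length


shapes, flat_length = trace_shapes((32, 32, 3), [(32, 3, 2), (64, 3, 2)])
for shape in shapes:
    print(shape)
# (32, 32, 3)
# (30, 30, 32)
# (15, 15, 32)
# (13, 13, 64)
# (6, 6, 64)
print(flat_length)  # 2304
```

The spatial size shrinks from block to block while the number of channels grows, and only the small final tensor is flattened and passed to the dense layers.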

[Link]
ConvNet architecture patterns
• The modularity-hierarchy-reuse formula for
model architecture
• An overview of standard best practices for
building ConvNets: residual connections,
batch normalization, and depthwise
separable convolutions
• Ongoing design trends for computer vision
models
• Deep learning model architecture is
primarily about making clever use of
modularity, hierarchy, and reuse.
• You’ll notice that all popular ConvNet
architectures are not only structured into
layers, they’re structured into repeated
groups of layers (called blocks or modules).
• Deeper hierarchies are intrinsically good
because they encourage feature reuse and,
therefore, abstraction.
• In general, a deep stack of narrow layers
performs better than a shallow stack of large
layers. However, there’s a limit to how deep
you can stack layers: the problem
of vanishing gradients.
• This leads us to our first essential model
architecture pattern: residual connections.
Residual connections
• Consider the game of telephone, where an initial message is
whispered in the ear of a player, who then whispers it in the
ear of the next player, and so on. The final message ends up
bearing little resemblance to its original version.
• As it happens, backpropagation in a
sequential deep learning model is pretty
similar to the game of telephone. You’ve got
a chain of functions, like this one:
y = f4(f3(f2(f1(x))))

The name of the game is to adjust the parameters of each function in the chain
based on the error recorded on the output of f4 (the loss of the model).
To adjust f1, you’ll need to percolate error information through f2, f3, and f4.

However, each successive function in the chain introduces some amount of
noise in the process. If your function chain is too deep, this noise starts
overwhelming gradient information, and backpropagation stops working.
Your model won’t train at all. This is called the vanishing gradients problem.
• The fix is simple: just force each function in the
chain to be nondestructive — to retain a
noiseless version of the information contained in
the previous input.
• The easiest way to implement this is called
a residual connection. It’s dead easy: just add the
input of a layer or block of layers back to its
output (see figure 9.3).
• The residual connection acts as an information
shortcut around destructive or noisy blocks (such
as blocks that contain ReLU activations or
dropout layers), enabling error gradient
information from early layers to propagate
noiselessly through a deep network.
• This technique was introduced in 2015 with the
ResNet family of models (developed by He et al.
at Microsoft).[1]
• Note that adding the input back to the output of a block
implies that the output should have the same shape as
the input. This is not the case if your block includes
convolutional layers with an increased number of filters
or a max pooling layer. In such cases, use a 1 ×
1 Conv2D layer with no activation to linearly project the
residual to the desired output shape.
• You’d typically use padding="same" in the convolution
layers in your target block to avoid spatial
downsampling due to padding, and you’d use strides
in the residual projection to match any
downsampling caused by a max pooling layer.
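The idea of adding the input back to the output, with a 1 × 1 projection when the shapes differ, can be sketched on the channel values of a single pixel. This is a simplified illustration in plain Python rather than Keras: the function names, the example blocks, and the projection weights are all made up, and a real 1 × 1 Conv2D would apply the same linear map at every pixel.

```python
def project_1x1(x, weights):
    """1 x 1 convolution at one pixel: a linear map over the channel axis."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def residual_block(x, block, projection_weights=None):
    """Return block(x) + residual, projecting the residual if shapes differ."""
    out = block(x)
    residual = x if projection_weights is None else project_1x1(x, projection_weights)
    if len(residual) != len(out):
        raise ValueError("residual and block output must have the same shape")
    return [o + r for o, r in zip(out, residual)]


def destructive_block(x):
    """A block that wipes out its input entirely."""
    return [0.0 for _ in x]


def widening_block(x):
    """A block that increases the channel count from 2 to 3."""
    return [x[0], x[1], x[0] + x[1]]


# Same shape: the shortcut carries the input through untouched.
print(residual_block([1.0, -2.0, 3.0], destructive_block))  # [1.0, -2.0, 3.0]

# Different shape: project 2 channels to 3 before adding.
projection = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
print(residual_block([2.0, 4.0], widening_block, projection))  # [4.0, 8.0, 9.0]
```

Even when the block itself destroys all information, the input survives through the shortcut, which is why the connection is described as nondestructive.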
Batch normalization
• It’s a type of layer (BatchNormalization in Keras)
introduced in 2015 by Ioffe and Szegedy;[2] it can
adaptively normalize data even as the mean and
variance change over time during training.
• During training, it uses the mean and variance of
the current batch of data to normalize samples,
and during inference (when a big enough batch
of representative data may not be available),
it uses an exponential moving average of the
batchwise mean and variance of the data seen
during training.
• In practice, the main effect of batch
normalization appears to be that it helps with
gradient propagation — much like residual
connections — and thus allows for deeper
networks.
• Some very deep networks can only be trained if
they include multiple BatchNormalization layers.
• For instance, batch normalization is used
liberally in many of the advanced ConvNet
architectures that come packaged with Keras,
such as ResNet50, EfficientNet, and Xception.
• The BatchNormalization layer can be used
after any layer (Dense, Conv2D, and so on).
• Both Dense and Conv2D involve a “bias vector,”
a learned variable whose purpose is to make the
layer affine rather than purely linear. For
instance, Conv2D returns, schematically, y =
conv(x, kernel) + bias,
and Dense returns y = dot(x, kernel) +
bias. Because the normalization step will take
care of centering the layer’s output on zero, the
bias vector is no longer needed when
using BatchNormalization, and the layer can
be created without it via the
option use_bias=False. This makes the layer
slightly leaner.
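The training and inference behaviour described above, and the reason the bias becomes redundant, can be sketched for a single feature in plain Python. This is not the Keras implementation: the class name is made up, the learned scale and offset are omitted for brevity, and the momentum and epsilon values are assumptions chosen for illustration.

```python
class BatchNorm1D:
    """Batch normalization for one feature (no learned scale or offset)."""

    def __init__(self, momentum=0.99, epsilon=1e-3):
        self.momentum = momentum  # assumed value: how slowly the averages move
        self.epsilon = epsilon    # assumed value: avoids division by zero
        self.moving_mean = 0.0
        self.moving_var = 1.0

    def __call__(self, batch, training):
        if training:
            # use the statistics of the current batch
            mean = sum(batch) / len(batch)
            var = sum((v - mean) ** 2 for v in batch) / len(batch)
            # update the exponential moving averages for use at inference
            m = self.momentum
            self.moving_mean = m * self.moving_mean + (1 - m) * mean
            self.moving_var = m * self.moving_var + (1 - m) * var
        else:
            # use the statistics accumulated during training
            mean, var = self.moving_mean, self.moving_var
        return [(v - mean) / (var + self.epsilon) ** 0.5 for v in batch]


batch = [1.0, 2.0, 3.0, 4.0]
with_bias = [v + 10.0 for v in batch]  # same layer output plus a bias of 10

plain = BatchNorm1D()(batch, training=True)
shifted = BatchNorm1D()(with_bias, training=True)

# Centering removes the bias: both normalized batches are the same.
print(all(abs(a - b) < 1e-9 for a, b in zip(plain, shifted)))  # True
```

Because any constant added before normalization is subtracted out again with the batch mean, a bias in the preceding layer has no effect, which is why `use_bias=False` loses nothing.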
Depthwise separable convolutions

What if we told you that there’s a layer you can use as a drop-in
replacement for Conv2D that will make your model smaller (fewer
trainable weight parameters), leaner (fewer floating-point
operations), and cause it to perform a few percentage points
better on its task? That is precisely what the depthwise separable
convolution layer does (SeparableConv2D in Keras). This layer
performs a spatial convolution on each channel of its input,
independently, before mixing output channels via a pointwise
convolution (a 1 × 1 convolution), as shown in figure 9.4.
• Consider a regular convolution operation with a
3 x 3 window, 64 input channels, and 64 output
channels. It uses 3 × 3 × 64 × 64 = 36,864
trainable parameters, and when you apply it to
an image, it runs a number of floating-point
operations that is proportional to this parameter
count.
• Meanwhile, consider an equivalent depthwise
separable convolution: it only involves 3 × 3 × 64
+ 64 × 64 = 4,672 trainable parameters and
proportionally fewer floating-point operations.
This efficiency improvement only increases as
the number of filters or the size of the
convolution windows gets larger.
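The two parameter counts, and how the gap widens with more filters or larger windows, can be reproduced with a short sketch. The function names are made up; like the figures above, the counts ignore bias terms.

```python
def conv_params(kernel_size, in_channels, out_channels):
    """Regular convolution: every output channel has a full k x k x in_channels filter."""
    return kernel_size * kernel_size * in_channels * out_channels


def separable_conv_params(kernel_size, in_channels, out_channels):
    """Depthwise separable: one k x k filter per input channel, then a 1 x 1 mix."""
    depthwise = kernel_size * kernel_size * in_channels
    pointwise = in_channels * out_channels
    return depthwise + pointwise


print(conv_params(3, 64, 64))            # 36864
print(separable_conv_params(3, 64, 64))  # 4672

# More filters or a larger window widen the gap.
print(conv_params(3, 128, 128), separable_conv_params(3, 128, 128))  # 147456 17536
print(conv_params(5, 64, 64), separable_conv_params(5, 64, 64))      # 102400 5696
```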
