Classic CNN Architectures Overview
Slide taken from Forward And Backpropagation in Convolutional Neural Network. - Medium
To calculate the gradients of the error ‘E’ with respect to the
filter ‘F’, the following equations need to be solved.
Which evaluates to:
If we look closely, the previous equation can be written in
the form of our convolution operation.
Similarly, we can find the gradients of the error ‘E’ with
respect to the input matrix ‘X’.
The previous computation can be obtained by a different
type of convolution operation known as full convolution.
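The two gradient rules above can be checked on a small example. The following is a minimal NumPy sketch, assuming a single-channel input, stride 1, no padding, and a forward pass computed as a "valid" cross-correlation; the helper names `correlate2d_valid` and `conv_backward` are illustrative and do not come from the cited article.

```python
import numpy as np

def correlate2d_valid(x, k):
    """Slide k over x without padding (the forward 'convolution' of a CNN)."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def conv_backward(x, f, d_out):
    """Return (dE/dF, dE/dX) given dE/dO for O = correlate2d_valid(x, f)."""
    # Gradient w.r.t. the filter: slide dE/dO over the input X.
    d_f = correlate2d_valid(x, d_out)
    # Gradient w.r.t. the input: full convolution of dE/dO with F,
    # implemented as zero-padding plus a 180-degree rotated filter.
    ph, pw = f.shape[0] - 1, f.shape[1] - 1
    d_out_padded = np.pad(d_out, ((ph, ph), (pw, pw)))
    d_x = correlate2d_valid(d_out_padded, np.rot90(f, 2))
    return d_f, d_x

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 3))      # input matrix X
f = rng.normal(size=(2, 2))      # filter F
d_out = rng.normal(size=(2, 2))  # assumed upstream gradient dE/dO
d_f, d_x = conv_backward(x, f, d_out)
print(d_f.shape, d_x.shape)
```

The gradient with respect to the filter has the filter's shape and the gradient with respect to the input has the input's shape, which is why the second case needs the padded ("full") variant.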
Backpropagation of max pooling
Suppose you have a matrix M of four elements:
a b
c d
and maxpool(M) returns d.
Then, the maxpool function really only depends on d.
So, the derivative of maxpool relative to d is 1, and its
derivative relative to a,b,c is zero. So you
backpropagate 1 to the unit corresponding to d, and
you backpropagate zero for the other units.
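The routing rule described above can be sketched in a few lines of NumPy. This is an illustrative example only, assuming non-overlapping 2x2 windows and an input whose height and width are even; the function name `maxpool2x2_backward` is hypothetical. In a real network the value passed back is the upstream gradient, which equals 1 in the example on the slide.

```python
import numpy as np

def maxpool2x2_backward(m, d_out):
    """Send each upstream gradient to the max element of its 2x2 window."""
    d_m = np.zeros(m.shape, dtype=float)
    for i in range(0, m.shape[0], 2):
        for j in range(0, m.shape[1], 2):
            window = m[i:i + 2, j:j + 2]
            r, c = np.unravel_index(np.argmax(window), window.shape)
            d_m[i + r, j + c] = d_out[i // 2, j // 2]
    return d_m

m = np.array([[1.0, 2.0],
              [3.0, 4.0]])  # a, b, c, d with d the maximum
grad = maxpool2x2_backward(m, np.array([[1.0]]))
print(grad)  # only the position of d receives the gradient
```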
Objectives
• We will examine classic CNN architectures
with the goal of:
- Gaining intuition for building CNNs
- Reusing CNN architectures
LeNet-5
• Gradient-Based Learning Applied to Document
Recognition - Y. LeCun, L. Bottou, Y. Bengio, P. Haffner;
1998
• Helped establish how we use CNNs today
• Replaced manual feature extraction
[Figure: LeNet-5 architecture. Recoverable labels: 5x5x16 volume, FC 120, FC 84, 10-way output ŷ]
Reminder:
Output size = (N+2P-F)/stride + 1
This slide is taken from Andrew Ng [LeCun et al., 1998]
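The output-size formula in the reminder can be applied layer by layer. The sketch below assumes a 32x32 input and the commonly presented LeNet-5 layer sequence (5x5 convolutions without padding, 2x2 pooling with stride 2); these layer settings are assumptions for illustration and are not spelled out in the extracted slide text.

```python
def conv_output_size(n, f, p=0, stride=1):
    """Spatial output size of a conv or pooling layer: (N + 2P - F) / stride + 1."""
    return (n + 2 * p - f) // stride + 1

# Assumed LeNet-5 style layer sequence: (name, filter size, padding, stride)
layers = [("conv1", 5, 0, 1), ("pool1", 2, 0, 2),
          ("conv2", 5, 0, 1), ("pool2", 2, 0, 2)]

size = 32  # assumed 32x32 input
sizes = []
for name, f, p, s in layers:
    size = conv_output_size(size, f, p, s)
    sizes.append(size)
    print(f"{name}: {size}x{size}")
```

Under these assumptions the spatial size shrinks from 32 to 28, 14, 10 and finally 5, which matches the 5x5x16 volume feeding the fully connected layers.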
LeNet-5
• Only 60K parameters
• As we go deeper in the network: height and width decrease, number of channels increases
• General structure:
conv->pool->conv->pool->FC->FC->output
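The "60K parameters" figure can be reproduced with a short count. This is a sketch under assumptions: a 32x32x1 input, 6 and 16 filters of size 5x5 in the two conv layers, and one bias per filter or output unit. Only the FC sizes (120, 84, 10) appear in the slide text; the other values follow the commonly presented LeNet-5 variant.

```python
def conv_params(f, c_in, c_out):
    """Weights plus one bias per filter for an f x f convolution."""
    return (f * f * c_in + 1) * c_out

def fc_params(n_in, n_out):
    """Weights plus one bias per output unit for a fully connected layer."""
    return (n_in + 1) * n_out

lenet5 = {
    "conv1": conv_params(5, 1, 6),       # assumed: 6 filters on a 1-channel input
    "conv2": conv_params(5, 6, 16),      # assumed: 16 filters
    "fc1": fc_params(5 * 5 * 16, 120),   # flattened 5x5x16 volume -> 120
    "fc2": fc_params(120, 84),
    "output": fc_params(84, 10),
}
total = sum(lenet5.values())
for name, count in lenet5.items():
    print(f"{name}: {count:,}")
print(f"total: {total:,}")
```

Most of the parameters sit in the first fully connected layer, a pattern that recurs in AlexNet and VGGNet.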
• Teams from across the world compete to see who has the
best computer vision model for tasks such as classification,
localization, detection, and more.
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
AlexNet Architecture
CONV1->MAX POOL1->NORM1->CONV2->MAX POOL2->NORM2->CONV3->CONV4->CONV5->MAX POOL3->FC6->FC7->FC8
• Input: 227x227x3 images (224x224 before padding)
• First layer: 96 11x11 filters applied at stride 4
• Output volume size?
(N-F)/s+1 = (227-11)/4+1 = 55 -> [55x55x96]
• Number of parameters in this layer?
(11*11*3)*96 = 35K
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [Krizhevsky et al., 2012]
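The two CONV1 calculations above can be reproduced directly. The helper below is a hypothetical illustration (its name is not from the lecture); it counts weights only, as the slide does, and ignores biases.

```python
def conv_layer_shape_and_weights(n, c_in, f, num_filters, stride=1, pad=0):
    """Output volume and weight count (biases excluded) of a square conv layer."""
    out = (n + 2 * pad - f) // stride + 1
    weights = f * f * c_in * num_filters
    return (out, out, num_filters), weights

# AlexNet CONV1: 227x227x3 input, 96 filters of size 11x11, stride 4
shape, weights = conv_layer_shape_and_weights(227, 3, 11, 96, stride=4)
print(shape, weights)
```

Adding one bias per filter would raise the count by 96, which does not change the 35K rounding.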
AlexNet
[Figure: AlexNet architecture, ending in two FC layers of 4096 units and a 1000-way softmax]
AlexNet
• Trained on GTX 580 GPU with only 3 GB of memory.
AlexNet
AlexNet was the coming-out party for CNNs in the computer
vision community. This was the first time a model performed
so well on the historically difficult ImageNet dataset. The
paper illustrated the benefits of CNNs and backed them up
with record-breaking performance in the competition.
ImageNet Large Scale Visual Recognition
Challenge (ILSVRC) winners
VGGNet
• Very Deep Convolutional Networks For Large Scale
Image Recognition - Karen Simonyan and Andrew
Zisserman; 2015
• The runner-up at the ILSVRC 2014 competition
• Significantly deeper than AlexNet
• About 138 million parameters (VGG16)
VGGNet
• Smaller filters
Only 3x3 CONV filters, stride 1, pad 1
and 2x2 MAX POOL, stride 2
• Deeper network
AlexNet: 8 layers
VGGNet: 16 - 19 layers
• ZFNet: 11.7% top 5 error in ILSVRC’13
• VGGNet: 7.3% top 5 error in ILSVRC’14
[Figure: VGG layer stack, from 3x3 conv, 64 through five Pool 1/2 stages up to 3x3 conv, 512, followed by FC 4096, FC 4096, FC 1000 and Softmax]
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [Simonyan and Zisserman, 2014]
VGGNet
• Why use smaller filters? (3x3 conv)
A stack of three 3x3 conv (stride 1) layers has the same effective
receptive field as one 7x7 conv layer.
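The receptive-field claim can be checked with a one-line recurrence: each extra stride-1 layer with a k x k kernel widens the receptive field by k - 1. The sketch below also compares weight counts, which is an added observation that assumes the same number of channels C going in and out of every layer; C = 64 is an arbitrary illustrative value.

```python
def stacked_receptive_field(kernel, num_layers):
    """Receptive field of num_layers stacked stride-1 convs with the same kernel."""
    return 1 + num_layers * (kernel - 1)

def conv_weights(kernel, channels):
    """Weights of one conv layer with `channels` inputs and outputs (no biases)."""
    return kernel * kernel * channels * channels

c = 64  # hypothetical channel count, for illustration only
rf_stack = stacked_receptive_field(3, 3)   # three 3x3 layers
rf_single = stacked_receptive_field(7, 1)  # one 7x7 layer
w_stack = 3 * conv_weights(3, c)           # 27 * C^2
w_single = conv_weights(7, c)              # 49 * C^2
print(rf_stack, rf_single, w_stack, w_single)
```

Under that assumption the stack reaches the same 7x7 receptive field with 27 C^2 weights instead of 49 C^2, and it applies three non-linearities instead of one.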
Reminder: Receptive Field
VGGNet
VGG16:
TOTAL memory: 24M * 4 bytes ~= 96MB / image
TOTAL params: 138M parameters
[Figure: VGG16 layer stack from Input to Softmax]
Layer           Memory (activations)      Params (weights)
Input           224*224*3 = 150K          0
3x3 conv, 64    224*224*64 = 3.2M         (3*3*3)*64 = 1,728
3x3 conv, 64    224*224*64 = 3.2M         (3*3*64)*64 = 36,864
Pool            112*112*64 = 800K         0
3x3 conv, 128   112*112*128 = 1.6M        (3*3*64)*128 = 73,728
3x3 conv, 128   112*112*128 = 1.6M        (3*3*128)*128 = 147,456
Pool            56*56*128 = 400K          0
3x3 conv, 256   56*56*256 = 800K          (3*3*128)*256 = 294,912
3x3 conv, 256   56*56*256 = 800K          (3*3*256)*256 = 589,824
3x3 conv, 256   56*56*256 = 800K          (3*3*256)*256 = 589,824
Pool            28*28*256 = 200K          0
3x3 conv, 512   28*28*512 = 400K          (3*3*256)*512 = 1,179,648
3x3 conv, 512   28*28*512 = 400K          (3*3*512)*512 = 2,359,296
3x3 conv, 512   28*28*512 = 400K          (3*3*512)*512 = 2,359,296
Pool            14*14*512 = 100K          0
3x3 conv, 512   14*14*512 = 100K          (3*3*512)*512 = 2,359,296
3x3 conv, 512   14*14*512 = 100K          (3*3*512)*512 = 2,359,296
3x3 conv, 512   14*14*512 = 100K          (3*3*512)*512 = 2,359,296
Pool            7*7*512 = 25K             0
FC 4096         4096                      7*7*512*4096 = 102,760,448
FC 4096         4096                      4096*4096 = 16,777,216
FC 1000         1000                      4096*1000 = 4,096,000
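The per-layer parameter counts above can be summed programmatically. This is a minimal sketch that follows the same convention as the table (weights only, no biases); the configuration list and the function name `vgg16_weight_count` are illustrative, with "M" standing for a 2x2 max pool of stride 2.

```python
# Conv output channels in order; "M" marks a 2x2 max pool with stride 2.
VGG16_CFG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
             512, 512, 512, "M", 512, 512, 512, "M"]

def vgg16_weight_count(cfg, in_channels=3, in_size=224, fc=(4096, 4096, 1000)):
    """Total number of weights (biases excluded) for a VGG-style network."""
    total, channels, size = 0, in_channels, in_size
    for v in cfg:
        if v == "M":
            size //= 2                      # pooling halves height and width
        else:
            total += 3 * 3 * channels * v   # 3x3 conv weights
            channels = v
    n_in = size * size * channels           # flattened 7x7x512 volume
    for n_out in fc:
        total += n_in * n_out
        n_in = n_out
    return total

total_weights = vgg16_weight_count(VGG16_CFG)
print(f"{total_weights:,}")
```

The first fully connected layer alone accounts for roughly 103M of the 138M weights.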
VGGNet
Details/Retrospectives :
• ILSVRC’14 2nd in classification, 1st in localization
• Similar training procedure as AlexNet
• No Local Response Normalisation (LRN)
• Use VGG16 or VGG19 (VGG19 only slightly better, more
memory)
• Use ensembles for best results
• FC7 features generalize well to other tasks
• Trained on 4 Nvidia Titan Black GPUs for two to three weeks.
GoogLeNet
• Going Deeper with Convolutions - Christian Szegedy et
al.; 2015
• ILSVRC 2014 competition winner
• Also significantly deeper than AlexNet
• 12x fewer parameters than AlexNet
• Focused on computational efficiency
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [Szegedy et al., 2014]
GoogLeNet
Naïve Inception Model
• Apply parallel filter operations on the input:
- Multiple receptive field sizes for convolution (1x1, 3x3, 5x5)
- Pooling operation (3x3)
• Concatenate all filter outputs together depth-wise
[Figure: naive Inception module. Previous layer -> 1x1 convolution, 3x3 convolution, 5x5 convolution, 3x3 max pooling (in parallel) -> filter concatenation]
GoogLeNet
• What’s the problem with this?
High computational complexity
GoogLeNet
Example: previous layer output (module input) is 28x28x256
• Output volume sizes:
1x1 conv, 128: 28x28x128
3x3 conv, 192: 28x28x192
5x5 conv, 96: 28x28x96
3x3 pool: 28x28x256
• What is output size after filter concatenation?
28x28x(128+192+96+256) = 28x28x672
GoogLeNet
• Number of convolution operations (module input: 28x28x256):
1x1 conv, 128: 28x28x128x1x1x256
3x3 conv, 192: 28x28x192x3x3x256
5x5 conv, 96: 28x28x96x5x5x256
Total: 854M ops
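The operation count for the naive module can be reproduced with a small helper. This is an illustrative sketch: `conv_ops` is a hypothetical name, it counts one multiplication per output value per filter weight, and it assumes padding keeps the 28x28 spatial size in every branch.

```python
def conv_ops(h, w, c_in, f, c_out):
    """Multiplications for an f x f conv producing an h x w x c_out volume."""
    return h * w * c_out * f * f * c_in

h = w = 28
c_in = 256  # depth of the previous layer's output

# (filter size, number of filters) for the three conv branches
branches = [(1, 128), (3, 192), (5, 96)]

naive_ops = sum(conv_ops(h, w, c_in, f, c_out) for f, c_out in branches)
# The 3x3 max-pool branch keeps all 256 input channels.
naive_depth = sum(c_out for _, c_out in branches) + c_in

print(f"{naive_ops:,} ops, output depth {naive_depth}")
```

The 5x5 branch is the most expensive even though it has the fewest filters, because each filter touches 25 x 256 input values.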
GoogLeNet
• Very expensive compute!
• Pooling layer also preserves feature
depth, which means total depth after
concatenation can only grow at every layer.
GoogLeNet
• Solution: “bottleneck” layers that use 1x1 convolutions to
reduce feature depth (from previous hour).
• Number of convolution operations:
1x1 conv, 64: 28x28x64x1x1x256
1x1 conv, 64: 28x28x64x1x1x256
1x1 conv, 128: 28x28x128x1x1x256
3x3 conv, 192: 28x28x192x3x3x64
5x5 conv, 96: 28x28x96x5x5x64
1x1 conv, 64: 28x28x64x1x1x256
Total: 271M ops
[Figure: Inception module with dimension reduction. Previous layer (28x28x256) -> 1x1 conv 128 | 1x1 conv 64 -> 3x3 conv 192 | 1x1 conv 64 -> 5x5 conv 96 | 3x3 max pooling -> 1x1 conv 64 -> filter concatenation]
• Compared to 854M ops for naive version
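The effect of the 1x1 bottleneck layers can be estimated by summing the six terms one by one. The sketch below is an illustration under stated assumptions: `conv_ops` is a hypothetical helper counting one multiplication per output value per filter weight, and both the 3x3 and the 5x5 branch are assumed to read the 64-channel output of their 1x1 reduction layer.

```python
def conv_ops(h, w, c_in, f, c_out):
    """Multiplications for an f x f conv producing an h x w x c_out volume."""
    return h * w * c_out * f * f * c_in

h = w = 28

# (filter size, input depth, number of filters)
bottleneck_layers = [
    (1, 256, 64),   # 1x1 reduction in front of the 3x3 branch
    (1, 256, 64),   # 1x1 reduction in front of the 5x5 branch
    (1, 256, 128),  # plain 1x1 branch
    (3, 64, 192),   # 3x3 conv on the reduced volume
    (5, 64, 96),    # 5x5 conv on the reduced volume (assumed 64 input channels)
    (1, 256, 64),   # 1x1 conv after the 3x3 max pool
]

bottleneck_ops = sum(conv_ops(h, w, c_in, f, c_out)
                     for f, c_in, c_out in bottleneck_layers)
bottleneck_depth = 128 + 192 + 96 + 64  # depth after filter concatenation

print(f"{bottleneck_ops:,} ops, output depth {bottleneck_depth}")
```

Under these assumptions the sum comes to about 271M multiplications, roughly a third of the naive module's cost, and the concatenated depth drops from 672 to 480 because the pooling branch is also projected down to 64 channels.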
GoogLeNet
Details/Retrospectives:
• Deeper networks, with computational efficiency
• 22 layers
• Efficient “Inception” module
• No FC layers
• 12x fewer params than AlexNet
• ILSVRC’14 classification winner (6.7% top 5 error)
ResNet
• Deep Residual Learning for Image Recognition -
Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun;
2015
• Extremely deep network – 152 layers
• Deeper neural networks are more difficult to train.
• Deep networks suffer from vanishing and
exploding gradients.
• Presents a residual learning framework to ease the
training of networks that are substantially deeper
than those used previously.
[He et al., 2015]
ResNet
• ILSVRC’15 classification winner (3.57% top 5
error; humans generally hover around a 5-10%
error rate)
• Swept all classification and detection
competitions in ILSVRC’15 and COCO’15!
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [He et al., 2015]
ResNet
• What happens when we continue stacking deeper layers on a
convolutional neural network?
ResNet
• Hypothesis: The problem is an optimization problem. Very
deep networks are harder to optimize.
• Solution: Use network layers to fit residual mapping instead
of directly trying to fit a desired underlying mapping.
ResNet
Residual Block
Input x goes through conv-relu-conv series and gives us F(x).
That result is then added to the original input x. Let’s call that
H(x) = F(x) + x.
In traditional CNNs, H(x) would just be equal to F(x). So, instead
of just computing that transformation (straight from x to F(x)),
we’re computing the term that we have to add, F(x), to the
input, x.
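$a^{[l]}$ -> Linear -> ReLU -> $a^{[l+1]}$ -> Linear -> ReLU -> $a^{[l+2]}$
$z^{[l+1]} = W^{[l+1]} a^{[l]} + b^{[l+1]}$
$a^{[l+1]} = g(z^{[l+1]})$
$z^{[l+2]} = W^{[l+2]} a^{[l+1]} + b^{[l+2]}$
$a^{[l+2]} = g(z^{[l+2]})$
With the skip connection, $a^{[l]}$ is added before the final ReLU: $a^{[l+2]} = g(z^{[l+2]} + a^{[l]})$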
[He et al., 2015]
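The residual block can be sketched with fully connected layers in NumPy. This is a simplified illustration rather than the convolutional block used in the paper: it assumes the input and output have the same dimension so that the skip connection is a plain addition, and the function names are hypothetical.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(a_l, w1, b1, w2, b2):
    """Two linear layers with a skip connection added before the last ReLU."""
    z1 = w1 @ a_l + b1
    a1 = relu(z1)
    z2 = w2 @ a1 + b2          # F(x): the residual the layers have to learn
    return relu(z2 + a_l)      # H(x) = F(x) + x, followed by the activation

n = 3
a_l = np.array([0.5, 1.0, 2.0])  # non-negative, e.g. the output of a ReLU

# With all weights and biases at zero, F(x) = 0 and the block is an identity map.
zeros_w, zeros_b = np.zeros((n, n)), np.zeros(n)
identity_out = residual_block(a_l, zeros_w, zeros_b, zeros_w, zeros_b)
print(identity_out)
```

The zero-weight case shows why extra residual blocks are easy to optimize: the block starts close to an identity mapping and only has to learn a correction to it.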
ResNet
Full ResNet architecture:
• Stack residual blocks
• Every residual block has two 3x3 conv layers
• Periodically, double # of filters and
downsample spatially using stride 2 (in each
dimension)
• Additional conv layer at the beginning
• No FC layers at the end (only FC 1000 to
output classes)
ResNet
• Total depths of 34, 50, 101, or 152 layers for
ImageNet
• For deeper networks (ResNet-50+), use
“bottleneck” layer to improve efficiency
(similar to GoogLeNet)
ResNet
Experimental Results:
• Able to train very deep networks without degrading
• Deeper networks now achieve lower training errors as
expected
Forward pass time and power consumption
Summary
• LeNet-5
• AlexNet
• VGG
• GoogLeNet – Inception module
• ResNet – Residual block
References
• Gradient-Based Learning Applied to Document Recognition -
Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner; 1998
• ImageNet Classification with Deep Convolutional Neural Networks -
Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton; 2012
• Very Deep Convolutional Networks For Large Scale Image
Recognition - Karen Simonyan and Andrew Zisserman; 2015
• Going Deeper with Convolutions - Christian Szegedy et al.; 2015
• Deep Residual Learning for Image Recognition - Kaiming He,
Xiangyu Zhang, Shaoqing Ren, Jian Sun; 2015
• Stanford CS231n - Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9
• Coursera, Machine Learning course by Andrew Ng.
• The 9 Deep Learning Papers You Need To Know About
(Understanding CNNs Part 3) by Adit Deshpande https://
[Link]/[Link]/The-9-Deep-Learnin
[Link]
• CNNs Architectures: LeNet, AlexNet, VGG, GoogLeNet, ResNet and
more … By Siddharth Das [Link]
siddharthdas_32104/cnns-architectures-lenet-alexnet-vgg-googlene
t-resnet-and-more-666091488df5
• Forward And Backpropagation in Convolutional
Neural Network - Medium, by Sujit Rai
[Link]
n-in-convolutional-neural-network-4dfa96d7b37e
Thank You.