
ImageNet ConvNet Architectures Overview

Uploaded by

mahyaouiyasmin1

ImageNet ConvNet Architectures

Objectives
Learn about existing ConvNet architectures such as LeNet-5, AlexNet, VGG-16, ResNets, Inception and
Xception.
Learn how to use transfer learning.

Introduction
- The ImageNet¹ project is a large visual database designed for use in visual object recognition
research. ImageNet runs an annual contest, the ImageNet Large Scale Visual Recognition Challenge
(ILSVRC)², in which software programs compete to correctly classify and detect objects and scenes.

Figure 1: The evolution of the winning entries on the ImageNet Large Scale Visual Recognition Challenge from 2010 to 2015

- You should rarely have to train or design a CNN from scratch. Instead, look at whichever architecture
currently works best on ImageNet, download a pre-trained model and fine-tune it on your data.

1 [Link]
2 [Link]
Convolutional Neural Networks - Case studies
A neural network architecture that works well on one computer vision task often works well on others, and
many ideas from these case studies are making their way into other disciplines. The sections below discuss
some of the best-known CNN architectures.

LeNet-5
- Introduced by LeCun, Bottou, Bengio, & Haffner (1998), long before the first ILSVRC in 2010.
- LeNet-5 is a pioneering 7-level convolutional neural network that classifies digits. Several banks used
it to recognize hand-written digits on checks digitized as 32x32 pixel grayscale input images.
- The LeNet-5 architecture consists of two sets of convolutional and average pooling layers, followed by
a flattening convolutional layer, then two fully-connected layers and finally a softmax classifier, as
shown in Figure 2.

- The input to LeNet-5 is a 32×32 grayscale image, which passes through the first convolutional layer
with 6 filters of size 5×5 and a stride of 1. The image dimensions change from 32x32x1 to 28x28x6.

- LeNet-5 then applies an average pooling (sub-sampling) layer with a filter size of 2×2 and a stride
of 2. The image dimensions are reduced to 14x14x6.

- Next comes a second convolutional layer with 16 filters of size 5×5 and a stride of 1. The image
dimensions change from 14x14x6 to 10x10x16.

- A second average pooling (sub-sampling) layer with a filter size of 2×2 and a stride of 2 follows.
The image dimensions are reduced to 5x5x16.

- LeNet-5 is small by today's standards: it has around 60K parameters.

- It applies a sigmoid/tanh non-linearity after pooling.
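
The dimension changes listed above can be checked with a short sketch. It assumes the standard output-size formula floor((n + 2p - f) / s) + 1 for both convolution and pooling layers, which these notes do not state explicitly; the function names are illustrative only.

```python
def output_size(n, f, stride=1, padding=0):
    """Spatial size after a conv/pool layer: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * padding - f) // stride + 1


def trace_lenet5(n=32):
    """Return the (height, width, channels) shape after each LeNet-5 conv/pool layer."""
    # (kind, filter size, stride, number of filters); pooling keeps the channel count.
    layers = [
        ("conv", 5, 1, 6),
        ("pool", 2, 2, None),
        ("conv", 5, 1, 16),
        ("pool", 2, 2, None),
    ]
    channels = 1  # grayscale input
    shapes = [(n, n, channels)]
    for kind, f, stride, num_filters in layers:
        n = output_size(n, f, stride)
        if kind == "conv":
            channels = num_filters
        shapes.append((n, n, channels))
    return shapes


lenet_shapes = trace_lenet5()
for shape in lenet_shapes:
    print("x".join(str(d) for d in shape))
```

The same helper applies to any of the architectures in these notes once the filter size, stride and padding of each layer are known.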

Figure 2: LeNet-5 Network architecture

AlexNet

- AlexNet (Krizhevsky, Sutskever, & Hinton, 2012) is an image classification network that won the
ILSVRC 2012 competition.
- Input images are 227x227x3.
- CONV layer with 96 filters that are 11x11, s = 4 → 55x55x96
- Max-pool 3x3, s = 2 → 27x27x96
- CONV layer with 256 filters that are 5x5, “same” padding → 27x27x256
- Max-pool 3x3, s = 2 → 13x13x256
- CONV layer with 384 filters that are 3x3, “same” padding → 13x13x384
- CONV layer with 384 filters that are 3x3, “same” padding → 13x13x384
- CONV layer with 256 filters that are 3x3, “same” padding → 13x13x256
- Max-pool 3x3, s = 2 → 6x6x256
- Flatten 6x6x256 into a vector of 9216 units.
- FC layer with 4096 units.
- Another FC layer with 4096 units.
- Softmax with 1000 units (one per ImageNet class).

Figure 3: AlexNet

- AlexNet has a lot in common with LeNet-5 but is much bigger, with about 60M parameters.
- Used ReLU non-linearities.
- Trained on 2 GPUs.
- It has a lot of hyperparameters.
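
As a rough check on the size of the network, the weights and biases can be counted from the layer list above. This is only a sketch under stated assumptions: every CONV layer is treated as an ordinary (ungrouped) convolution over all input channels, whereas the original two-GPU implementation split some layers and therefore has somewhat fewer weights, and the softmax layer is assumed to have 1000 outputs as in the ImageNet classification task. The helper names are illustrative.

```python
def conv_params(f, in_channels, num_filters):
    """Weights plus biases of a conv layer with f x f filters."""
    return f * f * in_channels * num_filters + num_filters


def fc_params(in_units, out_units):
    """Weights plus biases of a fully-connected layer."""
    return in_units * out_units + out_units


# Layer sizes follow the AlexNet list above (ungrouped convolutions assumed).
alexnet_counts = {
    "conv1": conv_params(11, 3, 96),
    "conv2": conv_params(5, 96, 256),
    "conv3": conv_params(3, 256, 384),
    "conv4": conv_params(3, 384, 384),
    "conv5": conv_params(3, 384, 256),
    "fc1": fc_params(6 * 6 * 256, 4096),
    "fc2": fc_params(4096, 4096),
    "softmax": fc_params(4096, 1000),
}
alexnet_total = sum(alexnet_counts.values())

for name, count in alexnet_counts.items():
    print(f"{name:8s} {count:>12,d}")
print(f"{'total':8s} {alexnet_total:>12,d}")
```

Under these assumptions the total comes out at roughly 62M, and the first fully-connected layer alone accounts for more than half of it.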
VGG-16
o Instead of having many hyperparameters, VGG-16 (Simonyan & Zisserman, 2014) uses a simplified
network where:

Figure 4: VGG-16 network

o CONV layers use 3x3 filters, s=1, same padding.
o Max-pooling layers are 2x2 with s=2, so $n_H$ and $n_W$ halve at each pooling step, while $n_C$
doubles from one group of CONV layers to the next (64, 128, 256, up to 512).
o The 16 refers to the 16 layers that have weights; the network has about 138M parameters (a pretty
large network).
o VGG-19 is a bigger variant, but VGG-16 performs almost the same as VGG-19. The VGG entry was
runner-up in the ILSVRC 2014 classification task.
ResNets
o Once deeper networks are able to start converging, a degradation problem is exposed: as the network
depth increases, accuracy saturates (which might be unsurprising) and then degrades rapidly.
Unexpectedly, this degradation is not caused by overfitting: adding more layers to a suitably deep
model leads to higher training error. Figure 5 shows a typical example.
o A solution exists by construction for the deeper model: the added layers are identity mappings,
and the other layers are copied from the learned shallower model. The existence of this constructed
solution indicates that a deeper model should produce no higher training error than its shallower
counterpart.

Figure 5: Training error (left) and test error (right) on CIFAR-10 with 20-layer and 56-layer “plain” networks. The deeper network
has higher training error, and thus test error.

o ResNets, introduced in “Deep residual learning for image recognition” (He, Zhang, Ren, & Sun, 2015),
are built out of residual blocks. ResNet won ILSVRC 2015.
o In a plain network, the main path from $a^{[l]}$ to $a^{[l+2]}$ is:
$z^{[l+1]} = w^{[l+1]} a^{[l]} + b^{[l+1]}$
$a^{[l+1]} = f(z^{[l+1]})$
$z^{[l+2]} = w^{[l+2]} a^{[l+1]} + b^{[l+2]}$
$a^{[l+2]} = g(z^{[l+2]})$
o In a residual net we change the main path: we take $a^{[l]}$ and fast-forward it, so the last
equation goes away and instead we have:
$a^{[l+2]} = g(z^{[l+2]} + a^{[l]})$
o The shortcut is added ($a^{[l]}$ is injected) before the non-linearity. An example of a residual
block is depicted in Figure 6.

Figure 6: Residual learning building block

o The term “skip connection” is sometimes used instead of “shortcut”.
o The authors found that residual blocks make it possible to train much deeper neural networks.
o You build a ResNet by stacking a number of residual blocks to form a neural network.
o You can turn a “plain network” into a ResNet by adding a shortcut every two layers, which turns each
pair of layers into a residual block.

o If you train a plain network with a standard optimization algorithm, you find empirically that as you
increase the number of layers the training error decreases for a while but then tends to go back up.
With a ResNet, the training error keeps going down even as the number of layers grows.
o Reason: letting these intermediate activations skip much deeper into the network helps with the
vanishing/exploding gradient problems and allows you to train much deeper networks without
appreciable loss in performance, though the gains may plateau at some point (you cannot keep going
deeper indefinitely).
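
The residual block equations above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it uses fully-connected layers on plain Python lists to stay dependency-free, assumes ReLU for both activations, and all function names are hypothetical.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, x) for x in v]


def affine(w, a, b):
    """Compute w a + b for a weight matrix w (list of rows), a vector a and a bias b."""
    return [sum(w_ij * a_j for w_ij, a_j in zip(row, a)) + b_i
            for row, b_i in zip(w, b)]


def plain_block(a_l, w1, b1, w2, b2):
    """Two stacked layers without a shortcut: a[l+2] = g(z[l+2])."""
    a1 = relu(affine(w1, a_l, b1))
    z2 = affine(w2, a1, b2)
    return relu(z2)


def residual_block(a_l, w1, b1, w2, b2):
    """Same two layers with a shortcut: a[l+2] = g(z[l+2] + a[l])."""
    a1 = relu(affine(w1, a_l, b1))
    z2 = affine(w2, a1, b2)
    # The shortcut is added before the final non-linearity.
    return relu([z + a for z, a in zip(z2, a_l)])


a_l = [1.0, 2.0]
w1, b1 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
w2, b2 = [[0.5, 0.0], [0.0, -1.0]], [0.0, 0.0]
plain_out = plain_block(a_l, w1, b1, w2, b2)
residual_out = residual_block(a_l, w1, b1, w2, b2)
print(plain_out, residual_out)
```

The only difference between the two functions is the addition of `a_l` before the last ReLU, which is why a plain network can be converted by adding shortcuts without changing its layers.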

Figure 7: On the bottom, a plain network with 34 parameter layers (3.6 billion FLOPs). On the top, a residual network
with 34 parameter layers

Why do Residual Networks work?


- The identity mapping is sufficient for addressing the degradation problem:
- Assume ReLU activations, so $a^{[l]} \geq 0$.
- Then $a^{[l+2]} = g(z^{[l+2]} + a^{[l]}) = g(w^{[l+2]} a^{[l+1]} + b^{[l+2]} + a^{[l]})$
- If you are using L2-regularization or weight decay, it will tend to shrink the value of $w^{[l+2]}$.
Let’s say for the sake of argument that $w^{[l+2]} \approx 0$ and $b^{[l+2]} \approx 0$ (if you apply
decay to $b$ as well).
- Because we assumed a ReLU and $a^{[l]} \geq 0$, we get:
$a^{[l+2]} = g(a^{[l]}) = a^{[l]}$ (ReLU applied to a non-negative quantity)
- So the identity function is easy for a residual block to learn (in the worst case).
- But of course, you want to improve performance, not just avoid hurting it. If these hidden units
learn something different from the identity function, the optimization algorithm can improve on
the identity solution.
- For the addition $z^{[l+2]} + a^{[l]}$ to be valid, $z^{[l+2]}$ and $a^{[l]}$ must have the same
dimension.

- So what you see in ResNets is a lot of “same” convolutions: a same convolution preserves the
dimension of its input, so $z^{[l+2]}$ matches $a^{[l]}$ and the addition can be carried out.
- If $a^{[l]}$ and $z^{[l+2]}$ (and hence $a^{[l+2]}$) have different dimensions, for example
$a^{[l]} \in \mathbb{R}^{128}$ but $z^{[l+2]} \in \mathbb{R}^{256}$, add an extra matrix
$w_s \in \mathbb{R}^{256 \times 128}$ and compute:
$a^{[l+2]} = g(z^{[l+2]} + w_s a^{[l]})$
- $w_s$ could be a matrix of learned parameters, or a fixed matrix that just implements zero
padding (taking $a^{[l]}$ and zero-padding it to 256 dimensions).
- With a 3x3 same convolution, the dimension is preserved.
- Whenever CONV layers are followed by pooling layers, you need to make an adjustment to the
dimension.
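
Both points of this argument, the easy identity mapping and the dimension-matching matrix, can be illustrated with a small sketch. It assumes ReLU as the final activation and uses plain Python lists; `shortcut_output` and `zero_padding_matrix` are hypothetical helper names, and the fixed zero-padding matrix is only one of the two options mentioned above for the shortcut matrix.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, x) for x in v]


def matvec(w, a):
    """Matrix-vector product for a matrix given as a list of rows."""
    return [sum(w_ij * a_j for w_ij, a_j in zip(row, a)) for row in w]


def zero_padding_matrix(out_dim, in_dim):
    """Fixed shortcut matrix that copies a[l] and pads the remaining entries with zeros."""
    return [[1.0 if i == j else 0.0 for j in range(in_dim)] for i in range(out_dim)]


def shortcut_output(z, a_l, w_s=None):
    """Compute ReLU(z + w_s a_l); w_s=None means a plain identity shortcut."""
    skip = a_l if w_s is None else matvec(w_s, a_l)
    if len(skip) != len(z):
        raise ValueError("shortcut and main path must have the same dimension")
    return relu([z_i + s_i for z_i, s_i in zip(z, skip)])


# Case 1: the weights and bias of the last layer have shrunk to zero, so z[l+2] = 0
# and the block reduces to the identity on a non-negative a[l].
a_l = [1.0, 0.0, 2.5]
identity_out = shortcut_output([0.0, 0.0, 0.0], a_l)

# Case 2: a[l] has 2 entries but z[l+2] has 4, so a 4x2 shortcut matrix is needed.
w_s = zero_padding_matrix(4, 2)
projected_out = shortcut_output([0.5, -3.0, 1.0, -1.0], [1.0, 2.0], w_s)
print(identity_out, projected_out)
```

Calling `shortcut_output` with mismatched lengths and no shortcut matrix raises an error, which mirrors the requirement that the two terms of the addition share a dimension.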
Inception Networks

The Inception network was an important milestone in the development of CNN classifiers. Prior to inception
networks, most popular CNNs just stacked convolution layers deeper and deeper, hoping to get better
performance.

There is a simple but powerful way of creating better deep learning models: make the model bigger,
either in depth (the number of layers) or in the number of neurons in each layer. But as you can
imagine, this often creates complications:

• The bigger the model, the more prone it is to overfitting. This is particularly noticeable when
the training data is small.
• More parameters require more computational resources.
• Gradients become difficult to propagate.
• Large variation in the size of the objects makes it difficult to decide which filter sizes to use.

Large variation in size

- Salient parts of an image can vary enormously in size. For instance, an image of a dog can look like
any of the three in Figure 8: the area occupied by the dog is different in each image.

- Because of this huge variation in the location of the information, choosing the right kernel size for the
convolution becomes tough. A larger kernel is preferred for information that is distributed more
globally, and a smaller kernel for information that is distributed more locally.

Figure 8: A dog occupying most of the image, a dog occupying a part of it, and a dog occupying very little space (source)

- An Inception network is a more complex ConvNet architecture in which you do not have to pick a single
filter type (Szegedy et al., 2015): apply them all (several convolutions and max-pooling) in parallel
and concatenate the results, as shown in Figure 9.

Figure 9: Inception module
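
The concatenation step of an Inception module can be sketched at the level of tensor shapes. This only models the bookkeeping, not the convolutions themselves; the per-branch channel counts below are hypothetical values chosen for illustration and are not taken from these notes.

```python
def concat_depthwise(branch_shapes):
    """Shape of the depth-wise concatenation of (height, width, channels) branch outputs."""
    spatial = {(h, w) for h, w, _ in branch_shapes}
    if len(spatial) != 1:
        # Branches can only be stacked along the channel axis if height and width agree.
        raise ValueError("all branches must share the same height and width")
    height, width = spatial.pop()
    return (height, width, sum(c for _, _, c in branch_shapes))


# Hypothetical outputs of the 1x1, 3x3, 5x5 and pooling branches on a 28x28 input.
branches = [(28, 28, 64), (28, 28, 128), (28, 28, 32), (28, 28, 32)]
module_output = concat_depthwise(branches)
print(module_output)
```

Because every branch keeps the 28x28 grid, the module output is simply the input grid with the channel counts added up.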

- Problem: computational cost. Using the example from Figure 10:

To output 28x28x32, we convolve the 28x28x192 input with 32 filters, each of size 5x5x192. Each of the
28x28x32 output values requires 5x5x192 multiplications, so we perform:
28x28x32 ∗ 5x5x192 ≈ 120M multiplications

Figure 10: Example with 120 M computations

- Modern computers can do 120M multiplications, but this is still an expensive operation.

- Solution: use a Network in a Network.

What is Network In Network (NIN)

- Let’s say you have a 28x28x192 tensor. To shrink the spatial dimensions you can use max pooling, but
what if you want to shrink the number of channels (to save computation in some networks)?
o Use 1x1 convolutions. The idea is called “Network In Network” (Lin, Chen, & Yan,
2013). An example is shown in Figure 11.

Figure 11: Network In Network example (equivalent to the network in Figure 10) with 12.4M computations

- Using 1x1 convolutions (NIN), you can reduce the computational cost by a factor of 3 to 10:
- Reduce the volume (the depth) using 16 kernels of size 1x1x192:
(28x28x192) ∗ (1x1x192) using 16 filters → 28x28x16
- On this smaller volume, apply 32 kernels of size 5x5x16 to get 28x28x32.
- The input and output dimensions are unchanged (28x28x192 and 28x28x32), but the
volume shrinks in the intermediate step; this is sometimes called a
“bottleneck layer”.
- Computational cost of the 1st convolutional layer (1x1, NIN):
(28x28x16) ∗ (1x1x192) ≈ 2.4M
- Cost of the 2nd convolutional layer:
(28x28x32) ∗ (5x5x16) ≈ 10M
- The sum is 12.4M ≈ (1/10)(120M), i.e. about 1/10 of the previous cost.

- To wrap up: with the inception layer you don’t have to decide on the type of layer; you can do them all
and concatenate the results. This comes at a high computational cost, but a 1x1 convolution lets you
create a bottleneck layer that reduces the cost significantly.
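
The multiplication counts in this section can be reproduced with a few lines of arithmetic. The sketch counts one multiplication per filter weight per output value and ignores additions and biases; the function name is illustrative.

```python
def conv_multiplications(out_h, out_w, num_filters, f_h, f_w, in_channels):
    """Multiplications needed for a convolution: one per filter weight per output value."""
    return out_h * out_w * num_filters * f_h * f_w * in_channels


# Direct 5x5 convolution: 28x28x192 -> 28x28x32
direct_cost = conv_multiplications(28, 28, 32, 5, 5, 192)

# Bottleneck: 1x1 convolution down to 16 channels, then 5x5 convolution up to 32
reduce_cost = conv_multiplications(28, 28, 16, 1, 1, 192)
expand_cost = conv_multiplications(28, 28, 32, 5, 5, 16)
bottleneck_cost = reduce_cost + expand_cost

print(f"direct:     {direct_cost:,}")
print(f"bottleneck: {bottleneck_cost:,}")
print(f"ratio:      {direct_cost / bottleneck_cost:.1f}x")
```

Changing the number of 1x1 filters in the bottleneck shows how the saving depends on how aggressively the channel dimension is reduced.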

Inception architecture

o The Inception network consists of inception modules stacked upon each other, with occasional
max-pooling layers with stride 2 to halve the resolution of the grid.
o It allows for increasing the number of units at each stage significantly without uncontrolled blow-up in
computational complexity at later stages.

Figure 12: Inception Network filter concatenation using NIN

o The design follows the practical intuition that visual information should be processed at various scales
and then aggregated so that the next stage can abstract features from the different scales simultaneously.
o Applying different filter sizes (e.g. 1×1, 3×3, 5×5, or 3×3 pooling) would normally produce feature
maps of varying spatial dimensions, but Inception solves this by controlling padding and stride so all
outputs share the same size and can be concatenated depth-wise.
o Inception networks allow for a controlled balancing of computational resources, resulting in networks
that are 3-10x faster than similarly performing networks with a non-Inception architecture; however, this
requires careful manual design.
o GoogLeNet has 9 such inception modules stacked linearly. It is 22 layers deep (27, including the pooling
layers). It uses global average pooling at the end of the last inception module.
o Needless to say, it is a pretty deep classifier. As with any very deep network, it is subject to the vanishing
gradient problem. Given relatively large depth of the network, the ability to propagate gradients back
through all layers in an effective manner was a concern. The features produced by the layers in the
middle of the network should be very discriminative.

- Adding auxiliary classifiers connected to these intermediate layers was expected to encourage
discrimination in the lower stages of the classifier. This was thought to combat the vanishing
gradient problem while providing regularization.
- During training, their loss gets added to the total loss of the network with a discount weight
(The losses of the auxiliary classifiers were weighted by 0.3). At inference time these auxiliary
networks are discarded.
- Later control experiments have shown that the effect of auxiliary networks is relatively minor
(around 0.5%) and that it required only one of them to achieve the same effect.
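
The way the auxiliary losses enter training can be written as a one-line formula. The sketch below assumes that each loss is already a scalar computed elsewhere; the function name and the example loss values are hypothetical, and only the 0.3 weight comes from the text above.

```python
def total_training_loss(main_loss, aux_losses, aux_weight=0.3):
    """Main classifier loss plus the discounted losses of the auxiliary classifiers."""
    return main_loss + aux_weight * sum(aux_losses)


# Training: two auxiliary classifiers contribute with weight 0.3 each.
train_loss = total_training_loss(main_loss=1.0, aux_losses=[2.0, 3.0])

# Inference: the auxiliary networks are discarded, so only the main output is used.
inference_loss = total_training_loss(main_loss=1.0, aux_losses=[])
print(train_loss, inference_loss)
```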

Figure 13: Inception network (a layer view)

o The first version of Inception, dubbed “GoogLeNet”, won the ILSVRC 2014 competition.
o Inception V2 and V3 were developed later in a second paper and improved on the original in many ways,
most notably by refactoring larger convolutions into consecutive smaller ones that were easier to learn.
For example, in V3 the 5x5 convolution was replaced with 2 consecutive 3x3 convolutions.
o The latest version of Inception, V4, even added residual blocks within each module, creating a
hybrid Inception-ResNet network.

Transfer learning
- In 90% or more of applications, you should not have to worry about ConvNet architectural decisions.
Instead of rolling your own architecture for a problem, look at whichever architecture currently works
best on ImageNet, download a pre-trained model and fine-tune it on your data. You should rarely have to
train or design a ConvNet from scratch. If you are building a computer vision application, you can
often make much faster progress by downloading weights that someone else has already trained for that
network architecture and transferring them to your new task.
- Examples of computer vision datasets posted online: MS COCO, ImageNet, PASCAL.
- You can download weights that took other researchers weeks or months and many GPUs to train, and
transfer them to your problem. Many programming frameworks, such as Keras, support this mode of operation.
- You might get pretty good performance this way even with a small dataset.
- There are three approaches to transfer learning:
1. When you have a small training set, you might freeze the downloaded parameters and train only
the parameters of your own softmax layer. One trick that can speed up training is to precompute
the activations of the frozen layers and save them to disk (the frozen part is a fixed function).
You then end up training a shallow model, a softmax layer on top of the saved activations,
because the weights in all the other layers are not trainable. You don’t need to recompute those
activations in every epoch over the training set.
2. With a larger training set, you can freeze only the early layers and train the later layers of
the network as well as the softmax layer. There are two ways to do that:
a) Take the last few layers, use the trained weights as initialization and do gradient
descent from there.
b) Throw away these last layers, add your own layers and train their weights from scratch.
3. If you have a lot of data, in the extreme case you could use the downloaded weights as
initialization and retrain the whole network (use gradient descent to update all
parameters in the network).
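
The three regimes above differ only in which pre-trained layers are allowed to change. The sketch below expresses that as a list of trainable flags, one per pre-trained layer; the function, the regime labels and the default of two unfrozen layers are all hypothetical choices made for illustration. In a framework such as Keras, a plan like this would typically be applied by setting each layer's `trainable` attribute before compiling the model.

```python
def trainable_flags(num_pretrained_layers, regime, num_unfrozen=2):
    """Return one boolean per pre-trained layer: True means gradient descent updates it.

    regime: "small"  -> freeze everything, train only the new softmax layer
            "larger" -> freeze the early layers, train the last `num_unfrozen` layers
            "large"  -> use the downloaded weights as initialization and train everything
    """
    n = num_pretrained_layers
    if regime == "small":
        return [False] * n
    if regime == "larger":
        k = min(num_unfrozen, n)
        return [False] * (n - k) + [True] * k
    if regime == "large":
        return [True] * n
    raise ValueError(f"unknown regime: {regime!r}")


for regime in ("small", "larger", "large"):
    print(regime, trainable_flags(5, regime))
```

The new softmax layer is not part of the list because it is always trained, whichever regime is chosen.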

References
Chollet, F. (2016). Xception: Deep learning with depthwise separable convolutions. arXiv preprint arXiv:1610.02357.
He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep residual learning for image recognition. arXiv preprint
arXiv:1512.03385.

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural
networks. Paper presented at the Advances in neural information processing systems.
LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition.
Proceedings of the IEEE, 86(11), 2278-2324.
Lin, M., Chen, Q., & Yan, S. (2013). Network in network. arXiv preprint arXiv:1312.4400.
Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv
preprint arXiv:1409.1556.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., . . . Rabinovich, A. (2015).
Going deeper with convolutions. Paper presented at the 2015 IEEE Conference on Computer Vision and
Pattern Recognition (CVPR).
