DENSELY CONNECTED CONVOLUTIONAL NETWORKS
Presentation by:
Maria Waheed (l1f18bscs0460)
Farrukh Alam Virk (l1f18bscs0424)
WHAT IS COVERED IN THIS PRESENTATION
Dense Block
DenseNet Architecture
Advantages of DenseNet
CIFAR & SVHN Small-scale Dataset Results
ImageNet Large-Scale Dataset Results
Further Analysis on Feature Reuse
STANDARD CONNECTIVITY
Dense Block:
A Dense Block is a module used in convolutional neural networks that connects all
layers (with matching feature-map sizes) directly with each other. To preserve the feed-
forward nature, each layer obtains additional inputs from all preceding layers and passes
on its own feature-maps to all subsequent layers.
In a standard ConvNet, the input image passes through multiple convolutions to obtain high-level
features.
RESNET CONNECTIVITY
Identity mappings promote gradient propagation.
Element-wise addition
In ResNet, identity mapping is proposed to promote gradient propagation. Element-wise addition is used.
ResNet can be viewed as an algorithm with a state that is passed from one ResNet module to the next.
DENSE ARCHITECTURE
DENSE CONNECTIVITY
C : Channel-wise concatenation
In DenseNet, each layer obtains additional inputs from all preceding layers and passes on its own feature maps to all
subsequent layers. Concatenation is used, so each layer receives the “collective knowledge” of all preceding layers.
DENSE AND SLIM
[Figure: dense connectivity with channel-wise concatenation (C); each layer outputs k channels]
k : Growth rate
Since each layer receives feature maps from all preceding layers, the network can
be thinner and more compact, i.e. each layer needs fewer channels.
The growth rate k is the number of channels that each layer adds.
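The effect of the growth rate on layer width can be sketched with a few lines of arithmetic. This is a minimal illustration, assuming a block that receives `k0` channels (a name introduced here, not taken from the slides) and adds k channels per layer; the numbers are illustrative only.

```python
def dense_block_channels(k0, k, num_layers):
    """Return the input width of every layer and the output width of the block.

    k0: channels entering the block (hypothetical name, not from the slides)
    k:  growth rate, i.e. channels added by each layer
    """
    # Layer i (0-based) sees the block input plus the k channels of each earlier layer.
    inputs_per_layer = [k0 + k * i for i in range(num_layers)]
    # After the last layer, all num_layers outputs have been concatenated.
    output_channels = k0 + k * num_layers
    return inputs_per_layer, output_channels


# Illustrative values only: 16 input channels, growth rate 12, 4 layers.
inputs_per_layer, output_channels = dense_block_channels(k0=16, k=12, num_layers=4)
print(inputs_per_layer)   # [16, 28, 40, 52]
print(output_channels)    # 64
```

Each layer contributes only k new channels, while the input it sees grows linearly with depth.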
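FORWARD PROPAGATION
[Figure: forward propagation in a dense block. h1 takes x0 and produces x1; h2 takes x0, x1 and produces x2; h3 takes x0, x1, x2 and produces x3; h4 takes x0, x1, x2, x3 and produces x4.]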
As a result, DenseNet has higher computational and memory efficiency. The figure shows the
concept of concatenation during forward propagation.
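Concatenation during forward propagation can be sketched as a loop that keeps a running list of all feature maps. This is a toy sketch, not the authors' implementation: every channel is reduced to a single number, and `make_toy_layer` is a hypothetical stand-in for a real BN-ReLU-Conv layer.

```python
def make_toy_layer(k, scale):
    """Build a stand-in for h_l: it maps all input channels to k new channels."""
    def h(channels):
        mean = sum(channels) / len(channels)
        return [scale * mean] * k  # k identical toy channels
    return h


def dense_block_forward(x0, layers):
    """Apply each layer to the concatenation of all earlier outputs."""
    features = list(x0)               # running concatenation [x0, x1, ...]
    for h in layers:
        new_channels = h(features)    # x_l = h_l([x0, ..., x_{l-1}])
        features = features + new_channels  # channel-wise concatenation
    return features


# Illustrative input with 2 channels and two toy layers with growth rate k = 2.
x0 = [1.0, 3.0]
layers = [make_toy_layer(k=2, scale=1.0), make_toy_layer(k=2, scale=2.0)]
out = dense_block_forward(x0, layers)
print(out)  # [1.0, 3.0, 2.0, 2.0, 4.0, 4.0]
```

The input channels are still present, unchanged, at the end of the block, which is the property the later slides refer to as feature reuse.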
DenseNet Architecture:
Basic DenseNet Composition Layer:
For each composition layer, pre-activation Batch Norm (BN) and ReLU are applied, followed by a 3×3 Conv that outputs feature maps
of k channels, for example to transform x0, x1, x2, x3 into x4. This idea comes from Pre-Activation ResNet.
[Figure: composition layer, Batch Norm → ReLU → Convolution (3x3), taking x0, x1, x2, x3 and producing x4 with k channels]
x5 = h5([x0, …, x4])
DenseNet-B (Bottleneck Layers):
To reduce the model complexity and size, BN-ReLU-1×1 Conv is done before BN-ReLU-3×3 Conv.
[Figure: bottleneck layer, Batch Norm → ReLU → Convolution (1x1) → Batch Norm → ReLU → Convolution (3x3); l×k input channels, 4×k channels after the 1x1 convolution, k output channels]
Higher parameter and computational efficiency
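The saving from the bottleneck can be checked by counting convolution weights. This is a rough sketch under stated assumptions: biases and Batch Norm parameters are ignored, the 1×1 convolution outputs 4×k channels as in the figure, and the input width of 240 channels is an illustrative value.

```python
def conv_weights(c_in, c_out, kernel):
    """Number of weights in a kernel x kernel convolution (bias ignored)."""
    return kernel * kernel * c_in * c_out


def plain_layer_weights(c_in, k):
    """Single 3x3 convolution from c_in channels to k channels."""
    return conv_weights(c_in, k, 3)


def bottleneck_layer_weights(c_in, k):
    """1x1 convolution to 4*k channels, then 3x3 convolution to k channels."""
    return conv_weights(c_in, 4 * k, 1) + conv_weights(4 * k, k, 3)


# Illustrative values: growth rate 12 and 240 input channels (l x k with l = 20).
k = 12
c_in = 20 * k
print(plain_layer_weights(c_in, k))       # 25920
print(bottleneck_layer_weights(c_in, k))  # 16704
```

Under this count the bottleneck only pays off once the input is wide enough (more than 7.2×k channels); for the first few layers of a block the plain 3×3 layer is cheaper.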
MULTIPLE DENSE BLOCKS WITH TRANSITION LAYERS:
1×1 Conv followed by 2×2 average pooling are used as the transition layers between two
contiguous dense blocks.
Feature map sizes are the same within the dense block so that they can be concatenated together
easily.
At the end of the last dense block, a global average pooling is performed and then a softmax
classifier is attached.
[Figure: Convolution → Dense Block 1 → Convolution → Pooling → Dense Block 2 → Convolution → Pooling → Dense Block 3 → Pooling → Linear → Output. Pooling reduces feature map sizes; feature map sizes match within each block.]
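The pooling steps can be sketched on a single feature map stored as a nested list. This is a minimal sketch that covers only the 2×2 average pooling and the global average pooling; the 1×1 convolution of the transition layer is left out, and the 4×4 input is an illustrative value.

```python
def avg_pool_2x2(feature_map):
    """2x2 average pooling with stride 2; halves height and width."""
    height, width = len(feature_map), len(feature_map[0])
    return [
        [
            (feature_map[i][j] + feature_map[i][j + 1]
             + feature_map[i + 1][j] + feature_map[i + 1][j + 1]) / 4.0
            for j in range(0, width - 1, 2)
        ]
        for i in range(0, height - 1, 2)
    ]


def global_avg_pool(feature_map):
    """Collapse a whole feature map to one number, as done before the classifier."""
    values = [v for row in feature_map for v in row]
    return sum(values) / len(values)


# Illustrative 4x4 feature map.
fm = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
pooled = avg_pool_2x2(fm)
print(pooled)                   # [[3.5, 5.5], [11.5, 13.5]]
print(global_avg_pool(pooled))  # 8.5
```

Because pooling changes the spatial size, it sits between dense blocks rather than inside them, where all feature maps must have matching sizes to be concatenated.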
DENSENETS-B
DenseNets-B are regular DenseNets that use a 1x1 convolution to reduce the number of feature
maps before the 3x3 convolution, which improves computing efficiency. The B stands for the
bottleneck layer, already familiar from the work on ResNets.
DenseNet-BC (Further Compression):
If a dense block contains m feature maps, the transition layer generates θm output feature
maps, where 0<θ≤1 is referred to as the compression factor.
When θ=1, the number of feature maps across transition layers remains unchanged. A DenseNet with
θ<1 is referred to as DenseNet-C, and θ=0.5 is used in the experiments.
When both the bottleneck layers and transition layers with θ<1 are used, the model is referred
to as DenseNet-BC.
Finally, DenseNets with and without B/C, and with different numbers of layers L and growth rates k, are trained.
DenseNets-C are another small incremental step over DenseNets-B, for the
cases where we would like to reduce the number of output feature maps.
The compression factor (theta) determines this reduction. Instead of having
m feature maps at a certain layer, we will have theta*m. Theta lies in the
range (0, 1], so DenseNets remain unchanged when theta=1 and become
DenseNets-C otherwise.
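The compression step can be traced through a whole network by following the channel count. This is a sketch under stated assumptions: non-integer products are rounded down, there is no transition after the last block, and the configuration (24 input channels, k = 12, 16 layers per block, 3 blocks) is an illustrative choice rather than something given on the slides.

```python
import math


def transition_out_channels(m, theta):
    """Channels after a transition layer with compression factor theta."""
    if not 0 < theta <= 1:
        raise ValueError("theta must satisfy 0 < theta <= 1")
    return math.floor(theta * m)  # rounding down is an assumption of this sketch


def trace_channels(k0, k, layers_per_block, num_blocks, theta):
    """List the channel count after every dense block and transition layer."""
    trace = []
    channels = k0
    for block in range(num_blocks):
        channels += k * layers_per_block          # each layer adds k channels
        trace.append(("block", channels))
        if block < num_blocks - 1:                # no transition after the last block
            channels = transition_out_channels(channels, theta)
            trace.append(("transition", channels))
    return trace


# Illustrative configuration with theta = 0.5.
trace = trace_channels(k0=24, k=12, layers_per_block=16, num_blocks=3, theta=0.5)
for stage, channels in trace:
    print(stage, channels)
```

With theta=1 the transition entries simply repeat the preceding block width, which matches the uncompressed case described above.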
ADVANTAGES OF DENSENET
ADVANTAGE 1: STRONG GRADIENT FLOW
[Figure: error signal]
The error signal can be propagated to earlier layers more
directly. This is a kind of implicit deep supervision, as earlier layers
can get direct supervision from the final classification layer.
ADVANTAGE 2: PARAMETER & COMPUTATIONAL EFFICIENCY
For each layer, the number of parameters in ResNet is proportional to C×C, while the number of
parameters in DenseNet is proportional to l×k×k.
ResNet connectivity: correlated features; input C channels, output C channels; #parameters: O(C×C)
DenseNet connectivity: diversified features; input l×k channels, output k channels; #parameters: O(l×k×k), where k is the growth rate and k<<C
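The two scaling rules can be compared numerically. This is a back-of-the-envelope sketch: constant factors such as the kernel size are dropped, and C = 256 and k = 12 are illustrative values, not figures from the slides.

```python
def resnet_layer_cost(c):
    """Per-layer parameter count up to a constant factor: C x C."""
    return c * c


def densenet_layer_cost(l, k):
    """Per-layer parameter count up to a constant factor: l x k x k."""
    return l * k * k


C = 256   # illustrative ResNet width
k = 12    # illustrative growth rate, k << C

print(resnet_layer_cost(C))                                  # 65536
print([densenet_layer_cost(l, k) for l in (1, 10, 100)])     # [144, 1440, 14400]
```

The DenseNet cost grows with the layer index l, but with a small k it stays well below C×C for the depths shown here.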
ADVANTAGE 3: MAINTAINS LOW COMPLEXITY FEATURES
Standard Connectivity:
Classifier uses most complex (high level) features
y = w4h4(x)
[Figure: x → h1(x) → h2(x) → h3(x) → h4(x) → classifier, with increasingly complex features]
In DenseNet, the classifier uses features of all complexity
levels. It tends to give smoother decision
boundaries. It also explains why DenseNet performs
well when training data is insufficient.
Dense Connectivity:
Classifier uses features of all complexity levels
y = w0x + w1h1(x) + w2h2(x) + w3h3(x) + w4h4(x)
[Figure: x, h1(x), h2(x), h3(x), h4(x) joined by channel-wise concatenation (C) and fed to the classifier, with increasingly complex features]
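The difference between the two classifiers can be sketched with scalar features. This is a toy illustration: each feature level x, h1(x), …, h4(x) is reduced to one number, and the weights are made-up values.

```python
def standard_classifier(features, w_last):
    """Standard connectivity: only the last, most complex feature is used."""
    return w_last * features[-1]


def dense_classifier(features, weights):
    """Dense connectivity: a weighted sum over features of all complexity levels."""
    if len(features) != len(weights):
        raise ValueError("need exactly one weight per feature level")
    return sum(w * f for w, f in zip(weights, features))


# Made-up values for x, h1(x), h2(x), h3(x), h4(x) and for w0..w4.
features = [1.0, 2.0, 3.0, 4.0, 5.0]
weights = [0.1, 0.2, 0.3, 0.4, 0.5]

y_standard = standard_classifier(features, w_last=weights[-1])
y_dense = dense_classifier(features, weights)
print(y_standard, y_dense)
```

Setting w0 to w3 to zero reduces the dense classifier to the standard one, so the standard case is a special case of the dense one.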
RESULTS
RESULTS ON CIFAR-10
[Figure: bar charts of test error (%) on CIFAR-10, with and without data augmentation, for ResNet (110 layers, 1.7 M), ResNet (1001 layers, 10.2 M), DenseNet (100 layers, 0.8 M), DenseNet (250 layers, 15.3 M) and the previous SOTA]
With data augmentation (C10+), test error:
•Small-size ResNet-110: 6.41%
•Large-size ResNet-1001 (10.2M parameters): 4.62%
•State-of-the-art (SOTA): 4.2%
•Small-size DenseNet-BC (L=100, k=12) (only 0.8M parameters): 4.5%
•Large-size DenseNet (L=250, k=24): 3.6%
Without data augmentation (C10), test error:
•Small-size ResNet-110: 11.26%
•Large-size ResNet-1001 (10.2M parameters): 10.56%
•State-of-the-art (SOTA): 7.3%
•Small-size DenseNet-BC (L=100, k=12) (only 0.8M parameters): 5.9%
•Large-size DenseNet (L=250, k=24): 5.2%
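RESULTS ON CIFAR-100
[Figure: bar charts of test error (%) on CIFAR-100, with and without data augmentation]
With data augmentation, test error:
•ResNet (110 layers, 1.7M parameters): 27.22%
•ResNet (1001 layers, 10.2M parameters): 22.71%
•Previous SOTA: 20.5%
•DenseNet (100 layers, 0.8M parameters): 22.3%
•DenseNet (250 layers, 15.3M parameters): 17.6%
Without data augmentation, test error:
•ResNet (110 layers, 1.7M parameters): 35.58%
•ResNet (1001 layers, 10.2M parameters): 33.47%
•Previous SOTA: 28.2%
•DenseNet (100 layers, 0.8M parameters): 24.2%
•DenseNet (250 layers, 15.3M parameters): 19.6%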
DETAILED RESULTS:
SVHN is the Street View House Numbers dataset. In the results table, blue
marks the best result. On SVHN, DenseNet-BC does not achieve a
better result than the basic DenseNet. The authors argue that
SVHN is a relatively easy task, and that extremely deep models
may overfit the training set.
RESULTS ON IMAGENET
[Figure: top-1 error (%) on ImageNet versus # parameters (M) and versus GFLOPs, comparing DenseNet-121, DenseNet-169, DenseNet-201, DenseNet-264 and DenseNet-264 (k=48) with ResNet-34, ResNet-50, ResNet-101 and ResNet-152]
Top-1: 20.27%
Top-5: 5.17%
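MULTI-SCALE DENSENET (Preview)
[Figure: a cascade of classifiers (Classifier 1, Classifier 2, Classifier 3, Classifier 4, …). Classifier 1 outputs cat: 0.2 and Classifier 2 outputs cat: 0.4, neither of which reaches the threshold; Classifier 3 outputs cat: 0.6 > threshold.]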
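The threshold test in this example can be sketched as an early-exit loop. This is a minimal sketch, not the Multi-Scale DenseNet implementation: the threshold of 0.5 is a made-up value (the slide does not state one), and falling back to the last classifier when no confidence exceeds the threshold is an assumption.

```python
def early_exit(confidences, threshold):
    """Return the 1-based index of the first classifier whose confidence exceeds the threshold."""
    for index, confidence in enumerate(confidences, start=1):
        if confidence > threshold:
            return index
    # Assumption of this sketch: use the last classifier if none is confident enough.
    return len(confidences)


# Confidences for "cat" from the slide; the threshold is hypothetical.
cat_confidences = [0.2, 0.4, 0.6]
exit_at = early_exit(cat_confidences, threshold=0.5)
print(exit_at)  # 3
```

An input that is already confidently classified by Classifier 1 would exit there and skip the remaining computation.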
[Figure: test inputs pass through Classifier 1, Classifier 2, Classifier 3, Classifier 4, …, with “easy” examples toward the early classifiers and “hard” examples toward the later ones]
Inference Speed:
~ 2.6x faster than ResNets
~ 1.3x faster than DenseNets
CONVOLUTIONAL NETWORKS
LeNet, AlexNet, VGG, Inception, ResNet