Deep Learning
Prof. FATIMA-EZZAHRAA BEN-BOUAZZA
Lecture 2: Training neural networks
ESM6ISS, UM6SS
Prepared by:
PhD. EMSSAAD Ilyass
PhD. CHAKOUR EL MEZALI Manal
Plan for Today
How to optimize parameters efficiently?
• Optimizers
• Initialization
• Normalization
Optimizers
Empirical risk minimization
$$\theta^{*}_{\mathbf{d}} = \arg\min_{\theta} L(\theta) = \frac{1}{N} \sum_{n=1}^{N} \ell(y_n, f(x_n; \theta)).$$
A practical recommendation
Training a massive deep neural network is long, complex and sometimes confusing.
A first step towards understanding, debugging and optimizing neural networks is to make use of visualization tools for
• plotting losses and metrics,
• visualizing computational graphs,
• or showing additional data as the network is being trained.
Weights & Biases ([Link])
Let me say this once again: plot your
losses.
Gradient descent
To minimize L(θ), standard batch gradient descent (GD) consists in applying the update rule
$$g_t = \frac{1}{N} \sum_{n=1}^{N} \nabla_\theta \ell(y_n, f(x_n; \theta_t))$$
$$\theta_{t+1} = \theta_t - \gamma g_t,$$
where γ is the learning rate.
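The update rule can be sketched on a toy problem. The following is a minimal illustration, assuming a one-parameter model f(x; θ) = θx with squared loss; the data, the function names `grad_loss` and `gradient_descent`, and the values of `lr` and `steps` are illustrative choices, not part of the lecture.

```python
def grad_loss(theta, x, y):
    """Gradient of the squared loss (y - theta * x)^2 with respect to theta."""
    return -2.0 * x * (y - theta * x)


def gradient_descent(xs, ys, theta0=0.0, lr=0.05, steps=200):
    """Batch gradient descent: average all per-sample gradients, then step."""
    theta = theta0
    n = len(xs)
    for _ in range(steps):
        g = sum(grad_loss(theta, x, y) for x, y in zip(xs, ys)) / n
        theta = theta - lr * g
    return theta


# Toy data generated by y = 3x, so the minimizer is theta = 3.
xs = [0.5, 1.0, 1.5, 2.0]
ys = [3.0 * x for x in xs]
theta_hat = gradient_descent(xs, ys)
```

Every step touches all N samples, which is the cost that the stochastic variants try to avoid.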
Stochastic gradient descent
While it makes sense to compute the gradient exactly,
• it takes time to compute and becomes inefficient for large N,
• it is an empirical estimate of a hidden quantity (the expected risk), and
• any partial sum is also an unbiased estimate, although of greater variance.
To reduce the computational complexity, stochastic gradient descent (SGD) consists in updating the parameters after every sample
$$g_t = \nabla_\theta \ell(y_{n(t)}, f(x_{n(t)}; \theta_t))$$
$$\theta_{t+1} = \theta_t - \gamma g_t.$$
While being computationally faster than batch gradient descent,
• gradient estimates used by SGD can be very noisy, which may help escape from local minima;
• but SGD does not benefit from the speed-up of batch-processing.
Mini-batching
Instead, mini-batch SGD consists in visiting the samples in mini-batches and
updating the parameters each time
$$g_t = \frac{1}{B} \sum_{b=1}^{B} \nabla_\theta \ell(y_{n(t,b)}, f(x_{n(t,b)}; \theta_t))$$
$$\theta_{t+1} = \theta_t - \gamma g_t,$$
where the order n(t, b) to visit the samples can be either sequential or random.
• Increasing the batch size B reduces the variance of the gradient
estimates and enables the speed-up of batch processing.
• The interplay between B and γ is still unclear.
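A minimal sketch of mini-batch SGD, again assuming the toy model f(x; θ) = θx with squared loss and a random visiting order. The function name `minibatch_sgd`, the data, and all default values are illustrative assumptions.

```python
import random


def minibatch_sgd(xs, ys, batch_size=2, lr=0.05, epochs=200, seed=0):
    """Mini-batch SGD for f(x; theta) = theta * x with squared loss."""
    rng = random.Random(seed)
    theta = 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # random visiting order n(t, b)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Average gradient of (y - theta * x)^2 over the mini-batch.
            g = sum(-2.0 * xs[i] * (ys[i] - theta * xs[i]) for i in batch)
            g /= len(batch)
            theta -= lr * g
    return theta


# Toy data generated by y = 3x, so the minimizer is theta = 3.
xs = [0.5, 1.0, 1.5, 2.0]
ys = [3.0 * x for x in xs]
theta_sgd = minibatch_sgd(xs, ys)
```

Setting `batch_size=1` recovers plain SGD, and `batch_size=len(xs)` recovers batch gradient descent.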
Limitations
Gradient descent makes strong assumptions about
• the magnitude of the local curvature to set the step size,
• the isotropy of the curvature, so that the same step size γ makes sense in all directions.
[Animations: gradient descent with γ = 0.01 (two animations), γ = 0.1, and γ = 0.4.]
Wolfe conditions could be used to design line search algorithms that automatically determine a step size γ_t, hence ensuring convergence towards a local minimum.
However, in deep learning,
• these algorithms are impractical because of the size of the parameter space and the overhead they would induce,
• they might lead to overfitting when the empirical risk is minimized too well.
The tradeoffs of large-scale learning
A fundamental result due to Bottou and Bousquet (2011) states that stochastic
optimization algorithms (e.g., SGD) yield the best generalization performance (in
terms of excess error) despite being the worst optimization algorithms for
minimizing the empirical risk.
That is, for a fixed computational budget, stochastic optimization algorithms reach a lower test error than more sophisticated algorithms (2nd order methods, line search algorithms, etc.) that would fit the training error too well or would consume too large a part of the computational budget at every step.
Momentum
In the situation of small but consistent gradients, as through valley floors,
gradient descent moves very slowly.
An improvement to gradient descent is to use momentum to add inertia in the choice of the step direction, that is
$$u_t = \alpha u_{t-1} - \gamma g_t$$
$$\theta_{t+1} = \theta_t + u_t.$$
The new variable u_t is the velocity. It corresponds to the direction and speed by which the parameters move as the learning dynamics progresses, modeled as an exponentially decaying moving average of negative gradients.
[Figure: the velocity u_t as the vector sum of αu_{t−1} and −γg_t.]
Gradient descent with momentum has
three nice properties:
- it can go through local barriers,
- it accelerates if the gradient does not change much,
- it dampens oscillations in narrow valleys.
The hyper-parameter α controls how recent gradients affect the current update.
Usually, α = 0.9, with α > γ .
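If the same gradient g were observed at every update, the velocity would eventually converge to the fixed point of u = αu − γg, that is
$$u = -\frac{\gamma}{1 - \alpha} g.$$
Therefore, for α = 0.9, it is like multiplying the maximum speed by 10 relative to plain gradient descent, whose step is −γg.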
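The terminal velocity can be checked numerically. This is a minimal sketch assuming a scalar parameter and a constant gradient g = 1; the function name `momentum_step` and the values lr = 0.1 and alpha = 0.9 are illustrative choices.

```python
def momentum_step(theta, u, g, lr=0.1, alpha=0.9):
    """One update: u_t = alpha * u_{t-1} - lr * g_t, theta_{t+1} = theta_t + u_t."""
    u = alpha * u - lr * g
    return theta + u, u


# With a constant gradient g, the velocity approaches -lr / (1 - alpha) * g.
theta, u, g = 0.0, 0.0, 1.0
for _ in range(500):
    theta, u = momentum_step(theta, u, g)
terminal_velocity = u
expected_velocity = -0.1 / (1.0 - 0.9) * g
```

Here the plain gradient step would be −0.1, while the velocity settles near −1, which is the factor 10 for α = 0.9.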
Nesterov momentum
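An alternative consists in simulating a step in the direction of the velocity, then calculating the gradient and making a correction.
$$g_t = \frac{1}{N} \sum_{n=1}^{N} \nabla_\theta \ell(y_n, f(x_n; \theta_t + \alpha u_{t-1}))$$
$$u_t = \alpha u_{t-1} - \gamma g_t$$
$$\theta_{t+1} = \theta_t + u_t$$
[Figure: the velocity u_t as the vector sum of αu_{t−1} and the correction −γg_t.]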
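A minimal sketch of the Nesterov update, assuming a scalar parameter and the toy objective 0.5 θ² (whose gradient is θ). The names `nesterov_step` and `grad_quadratic`, the starting point, and the hyper-parameter values are illustrative assumptions.

```python
def nesterov_step(theta, u, grad_fn, lr=0.1, alpha=0.9):
    """Evaluate the gradient at the look-ahead point theta + alpha * u, then update."""
    g = grad_fn(theta + alpha * u)
    u = alpha * u - lr * g
    return theta + u, u


def grad_quadratic(theta):
    """Gradient of the toy objective 0.5 * theta^2 (chosen for illustration)."""
    return theta


theta, u = 5.0, 0.0
for _ in range(300):
    theta, u = nesterov_step(theta, u, grad_quadratic)
```

The only difference from classical momentum is the point at which the gradient is evaluated.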
Adaptive learning rate
Vanilla gradient descent assumes the isotropy of the curvature, so that the same
step size γ applies to all parameters.
Isotropic vs. anisotropic
AdaGrad
Per-parameter downscaling of the step by the square root of the sum of squares of all historical gradient values.
$$r_t = r_{t-1} + g_t \odot g_t$$
$$\theta_{t+1} = \theta_t - \frac{\gamma}{\delta + \sqrt{r_t}} \odot g_t.$$
• AdaGrad eliminates the need to manually tune the learning rate. Most implementations use γ = 0.01 as default.
• It is good when the objective is convex.
• r_t grows unboundedly during training, which may cause the step size to shrink and eventually become infinitesimally small.
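A minimal sketch of one AdaGrad step on plain Python lists, assuming the square root is applied element-wise to the accumulator. The function name `adagrad_step`, the value of `delta`, and the example gradients are illustrative assumptions.

```python
import math


def adagrad_step(theta, r, g, lr=0.01, delta=1e-8):
    """Per-parameter AdaGrad update on lists of floats."""
    r = [ri + gi * gi for ri, gi in zip(r, g)]
    theta = [ti - lr / (delta + math.sqrt(ri)) * gi
             for ti, ri, gi in zip(theta, r, g)]
    return theta, r


# Two parameters whose gradients differ by a factor of 100.
theta, r = [0.0, 0.0], [0.0, 0.0]
theta, r = adagrad_step(theta, r, g=[100.0, 1.0])
```

After this first step both parameters have moved by about the same amount (close to `lr`), despite the very different gradient magnitudes.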
RMSProp
Same as AdaGrad, but accumulates an exponentially decaying average of the squared gradients.
$$r_t = \rho r_{t-1} + (1 - \rho) g_t \odot g_t$$
$$\theta_{t+1} = \theta_t - \frac{\gamma}{\delta + \sqrt{r_t}} \odot g_t.$$
• It performs better in non-convex settings.
• r_t does not grow unboundedly.
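The bounded accumulator can be illustrated with a constant gradient. This is a minimal sketch; the function name `rmsprop_step` and the values rho = 0.9, lr = 0.01 and delta = 1e-8 are illustrative assumptions rather than values given in the lecture.

```python
import math


def rmsprop_step(theta, r, g, lr=0.01, rho=0.9, delta=1e-8):
    """Per-parameter RMSProp update on lists of floats."""
    r = [rho * ri + (1.0 - rho) * gi * gi for ri, gi in zip(r, g)]
    theta = [ti - lr / (delta + math.sqrt(ri)) * gi
             for ti, ri, gi in zip(theta, r, g)]
    return theta, r


# With a constant gradient g = 2, the accumulator approaches g^2 = 4
# instead of growing without bound as in AdaGrad.
theta, r = [0.0], [0.0]
for _ in range(200):
    theta, r = rmsprop_step(theta, r, g=[2.0])
```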
Adam
Similar to RMSProp with momentum, but with bias correction terms for the first
and second moments.
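$$s_t = \rho_1 s_{t-1} + (1 - \rho_1) g_t$$
$$\hat{s}_t = \frac{s_t}{1 - \rho_1^t}$$
$$r_t = \rho_2 r_{t-1} + (1 - \rho_2) g_t \odot g_t$$
$$\hat{r}_t = \frac{r_t}{1 - \rho_2^t}$$
$$\theta_{t+1} = \theta_t - \gamma \frac{\hat{s}_t}{\delta + \sqrt{\hat{r}_t}}$$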
Good defaults are ρ1 = 0.9 and ρ2 = 0.999.
Adam is one of the default optimizers in deep learning, along with SGD
with momentum.
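A minimal sketch of one Adam step for a scalar parameter, following the equations above. The function name `adam_step` and the values lr = 0.001 and delta = 1e-8 are assumptions for illustration; only ρ1 = 0.9 and ρ2 = 0.999 come from the slide.

```python
import math


def adam_step(theta, s, r, g, t, lr=0.001, rho1=0.9, rho2=0.999, delta=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    s = rho1 * s + (1.0 - rho1) * g        # first moment estimate
    r = rho2 * r + (1.0 - rho2) * g * g    # second moment estimate
    s_hat = s / (1.0 - rho1 ** t)          # bias-corrected first moment
    r_hat = r / (1.0 - rho2 ** t)          # bias-corrected second moment
    theta = theta - lr * s_hat / (delta + math.sqrt(r_hat))
    return theta, s, r


theta, s, r = adam_step(theta=1.0, s=0.0, r=0.0, g=4.0, t=1)
first_step = 1.0 - theta
```

Thanks to the bias correction, the very first step has magnitude close to `lr` regardless of the gradient scale; without it, the zero-initialized moments would distort the early updates.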
Weight decay
Weight decay is a regularization technique that penalizes large weights.
For vanilla SGD, it is equivalent to adding a penalty term to the loss function
$$\ell(\theta) + \frac{\lambda}{2} \|\theta\|^2.$$
For more complex optimizers, it amounts to adding a penalty term to the update rule
$$\theta_{t+1} = \theta_t - \gamma (g_t + \lambda \theta_t).$$
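For vanilla SGD, the equivalence can be checked numerically: the gradient of (λ/2)θ² is λθ, so a step on the penalized loss coincides with shrinking θ by the factor (1 − γλ) and then taking the usual gradient step. The following is a minimal sketch for a scalar parameter; both function names and the numeric values are illustrative assumptions.

```python
def sgd_step_l2_penalty(theta, g, lr=0.1, lam=0.01):
    """SGD step on the penalized loss: the penalty contributes lam * theta."""
    return theta - lr * (g + lam * theta)


def sgd_step_weight_decay(theta, g, lr=0.1, lam=0.01):
    """Shrink theta by (1 - lr * lam), then apply the plain gradient step."""
    return (1.0 - lr * lam) * theta - lr * g


theta, g = 2.0, 0.5
penalized = sgd_step_l2_penalty(theta, g)
decayed = sgd_step_weight_decay(theta, g)
```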
Training without (left) and with (right) weight decay.
Learning rate
Scheduling
Despite per-parameter adaptive learning rate methods, it is usually helpful to
anneal the learning rate γ over time.
• Step decay: reduce the learning rate by some factor every few epochs (e.g., by half every 10 epochs).
• Exponential decay: γ_t = γ_0 exp(−kt), where γ_0 and k are hyper-parameters.
• 1/t decay: γ_t = γ_0 / (1 + kt), where γ_0 and k are hyper-parameters.
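The three schedules can be written as small functions. This is a minimal sketch; the function names and the default `factor` and `every` values (taken from the "by half every 10 epochs" example) are illustrative.

```python
import math


def step_decay(lr0, epoch, factor=0.5, every=10):
    """Multiply the learning rate by `factor` once every `every` epochs."""
    return lr0 * factor ** (epoch // every)


def exponential_decay(lr0, t, k):
    """gamma_t = gamma_0 * exp(-k * t)."""
    return lr0 * math.exp(-k * t)


def inverse_time_decay(lr0, t, k):
    """gamma_t = gamma_0 / (1 + k * t)."""
    return lr0 / (1.0 + k * t)


# Learning rates at epoch 25, starting from 0.1.
lr_step = step_decay(0.1, 25)
lr_exp = exponential_decay(0.1, 25, k=0.05)
lr_inv = inverse_time_decay(0.1, 25, k=0.05)
```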
Step decay scheduling for training ResNets.
Warmup and cyclical schedules
Initialization
In convex problems, provided a good learning rate γ, convergence is guaranteed
regardless of the initial parameter values.
In the non-convex regime, initialization is much more important! Little is known about the mathematics of initialization strategies for neural networks.
• What is known: initialization should break symmetry.
• What is known: the scale of weights is important.
Controlling for the variance in the forward pass
A first strategy is to initialize the network parameters such that activations
preserve the same variance across layers.
Intuitively, this ensures that the information keeps flowing during the forward
pass, without reducing or magnifying the magnitude of input signals
exponentially.
Let us assume that
• we are in a linear regime at initialization (e.g., the positive part of a ReLU or the middle of a sigmoid),
• weights $w_{ij}^l$ are initialized i.i.d. with zero mean,
• biases $b_l$ are initialized to be 0,
• input features are i.i.d., with zero mean and a variance denoted as V[x].
Then, the variance of the activation $h_i^l$ of unit i in layer l is
$$V[h_i^l] = V\left[\sum_{j=0}^{q_{l-1}-1} w_{ij}^l h_j^{l-1}\right] = \sum_{j=0}^{q_{l-1}-1} V[w_{ij}^l] \, V[h_j^{l-1}],$$
where $q_l$ is the width of layer l and $h_j^0 = x_j$ for all j = 0, ..., p − 1. The second equality uses the independence of weights and activations together with the zero-mean assumption.
Since the weights $w_{ij}^l$ at layer l share the same variance $V[w^l]$ and the variances of the activations in the previous layer are the same, we can drop the indices and write
$$V[h^l] = q_{l-1} V[w^l] V[h^{l-1}].$$
Therefore, the variance of the activations is preserved across layers when
$$V[w^l] = \frac{1}{q_{l-1}} \quad \forall l.$$
This condition is enforced in LeCun's uniform initialization, which is defined as
$$w_{ij}^l \sim U\left[-\sqrt{\frac{3}{q_{l-1}}}, \sqrt{\frac{3}{q_{l-1}}}\right].$$
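Since a uniform distribution on [−a, a] has variance a²/3, the bound a = √(3 / q_{l−1}) gives exactly the variance 1 / q_{l−1}. The following minimal sketch samples such weights and estimates their variance empirically; the function name `lecun_uniform`, the layer sizes, and the seed are illustrative assumptions.

```python
import math
import random


def lecun_uniform(fan_in, fan_out, rng):
    """Sample a fan_out x fan_in matrix from U[-sqrt(3/fan_in), sqrt(3/fan_in)]."""
    bound = math.sqrt(3.0 / fan_in)
    return [[rng.uniform(-bound, bound) for _ in range(fan_in)]
            for _ in range(fan_out)]


rng = random.Random(0)
fan_in = 100  # plays the role of q_{l-1}
weights = lecun_uniform(fan_in, 200, rng)
flat = [w for row in weights for w in row]
mean = sum(flat) / len(flat)
# The empirical variance should be close to 1 / fan_in = 0.01.
empirical_var = sum((w - mean) ** 2 for w in flat) / len(flat)
```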
Controlling for the variance in the backward pass
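A similar idea can be applied to ensure that the gradients flow in the backward pass (without vanishing or exploding), by keeping the variance of the gradient with respect to the activations fixed across layers.
Under the same assumptions as before,
$$\begin{aligned}
V\left[\frac{d\hat{y}}{dh_i^l}\right] &= V\left[\sum_{j=0}^{q_{l+1}-1} \frac{d\hat{y}}{dh_j^{l+1}} \frac{\partial h_j^{l+1}}{\partial h_i^l}\right] \\
&= V\left[\sum_{j=0}^{q_{l+1}-1} \frac{d\hat{y}}{dh_j^{l+1}} w_{ji}^{l+1}\right] \\
&= \sum_{j=0}^{q_{l+1}-1} V\left[\frac{d\hat{y}}{dh_j^{l+1}}\right] V\left[w_{ji}^{l+1}\right].
\end{aligned}$$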
If we further assume that
• the gradients with respect to the activations share the same variance within each layer,
• the weights at layer l + 1 share the same variance $V[w^{l+1}]$,
then we can drop the indices and write
$$V\left[\frac{d\hat{y}}{dh^l}\right] = q_{l+1} V\left[\frac{d\hat{y}}{dh^{l+1}}\right] V[w^{l+1}].$$
Therefore, the variance of the gradients with respect to the activations is preserved across layers when
$$V[w^l] = \frac{1}{q_l} \quad \forall l.$$
Xavier initialization
We have derived two different conditions on the variance of $w^l$,
$$V[w^l] = \frac{1}{q_{l-1}}$$
$$V[w^l] = \frac{1}{q_l}.$$
A compromise is the Xavier initialization, which initializes $w^l$ randomly from a distribution with variance
$$V[w^l] = \frac{1}{\frac{q_{l-1} + q_l}{2}} = \frac{2}{q_{l-1} + q_l}.$$
For example, normalized initialization is defined as
$$w_{ij}^l \sim U\left[-\sqrt{\frac{6}{q_{l-1} + q_l}}, \sqrt{\frac{6}{q_{l-1} + q_l}}\right].$$
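The bound √(6 / (q_{l−1} + q_l)) follows from the same a²/3 rule for a uniform distribution on [−a, a]. A minimal sketch, where `fan_in` and `fan_out` stand for q_{l−1} and q_l, and the function name `xavier_uniform`, the layer sizes, and the seed are illustrative assumptions:

```python
import math
import random


def xavier_uniform(fan_in, fan_out, rng):
    """Sample a fan_out x fan_in matrix with variance 2 / (fan_in + fan_out)."""
    bound = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-bound, bound) for _ in range(fan_in)]
            for _ in range(fan_out)]


fan_in, fan_out = 300, 100
bound = math.sqrt(6.0 / (fan_in + fan_out))
# Variance of U[-bound, bound]; it lies between the forward condition
# 1 / fan_in and the backward condition 1 / fan_out.
implied_variance = bound ** 2 / 3.0
weights = xavier_uniform(fan_in, fan_out, random.Random(0))
```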
He initialization
Because ReLU(x) = max(0, x), the mean of the activations at layer l is typically not 0. Therefore, our zero-mean assumption is wrong. Accounting for this shift, He et al. (2015) derive a forward initialization scheme that initializes $w^l$ from a distribution with variance
$$V[w^l] = \frac{2}{q_{l-1}}.$$
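One way to build intuition for the factor 2 is that, for a zero-mean symmetric pre-activation z, the ReLU keeps only half of its second moment: E[ReLU(z)²] = E[z²] / 2. The following Monte Carlo sketch checks this for a Gaussian z; it is an illustration of that single fact under a Gaussian assumption, not a reproduction of the derivation in He et al. The names `relu` and `he_std`, the sample size, and the seed are illustrative.

```python
import math
import random


def relu(z):
    return max(0.0, z)


def he_std(fan_in):
    """Standard deviation matching the variance 2 / fan_in."""
    return math.sqrt(2.0 / fan_in)


rng = random.Random(0)
n = 100_000
# Monte Carlo estimate of E[ReLU(z)^2] for z ~ N(0, 1); E[z^2] is 1.
second_moment = sum(relu(rng.gauss(0.0, 1.0)) ** 2 for _ in range(n)) / n
```

The estimate is close to 0.5, so doubling the weight variance from 1 / q_{l−1} to 2 / q_{l−1} compensates for the half that the ReLU removes.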
Normalization
Data normalization
Previous weight initialization strategies rely on keeping the activation variance constant across layers, under the assumption that the input feature variances are the same. That is,
$$V[x_i] = V[x_j] \triangleq V[x]$$
for all pairs of features i, j.
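In general, this constraint is not satisfied but can be enforced by standardizing the input data feature-wise,
$$x' = (x - \hat{\mu}) \odot \frac{1}{\hat{\sigma}},$$
where
$$\hat{\mu} = \frac{1}{N} \sum_{x \in \mathbf{d}} x \qquad \hat{\sigma}^2 = \frac{1}{N} \sum_{x \in \mathbf{d}} (x - \hat{\mu})^2.$$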
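A minimal sketch of feature-wise standardization on a small dataset stored as a list of feature vectors. The function name `standardize` and the data are illustrative, and the sketch assumes that no feature is constant (otherwise the standard deviation would be zero).

```python
import math


def standardize(data):
    """Feature-wise standardization of a list of equal-length feature vectors."""
    n = len(data)
    p = len(data[0])
    mu = [sum(row[j] for row in data) / n for j in range(p)]
    var = [sum((row[j] - mu[j]) ** 2 for row in data) / n for j in range(p)]
    sigma = [math.sqrt(v) for v in var]  # assumes every sigma[j] > 0
    return [[(row[j] - mu[j]) / sigma[j] for j in range(p)] for row in data]


# Two features on very different scales.
data = [[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]]
standardized = standardize(data)
```

After standardization, every feature has mean 0 and variance 1 over the dataset, which matches the equal-variance assumption used for initialization.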
Batch normalization
Maintaining proper statistics of the activations and derivatives is critical for
training neural networks.
This constraint can be enforced explicitly during the forward pass by re-
normalizing them. Batch normalization was the first method introducing this
idea.
Let us consider a minibatch of samples at training, for which $u_b \in \mathbb{R}^q$, b = 1, ..., B, are intermediate values computed at some location in the computational graph.
In batch normalization following the node u, the per-component mean and variance are first computed on the batch
$$\hat{\mu}_{\text{batch}} = \frac{1}{B} \sum_{b=1}^{B} u_b \qquad \hat{\sigma}^2_{\text{batch}} = \frac{1}{B} \sum_{b=1}^{B} (u_b - \hat{\mu}_{\text{batch}})^2,$$
from which the standardized $u'_b \in \mathbb{R}^q$ are computed such that
$$u'_b = \gamma \odot (u_b - \hat{\mu}_{\text{batch}}) \odot \frac{1}{\hat{\sigma}_{\text{batch}} + \epsilon} + \beta,$$
where $\gamma, \beta \in \mathbb{R}^q$ are parameters to optimize.
During testing, the mean and variance are computed on the entire training set and used to standardize the activations.
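A minimal sketch of the training-mode computation on a batch stored as a list of B vectors. It follows the slide's convention of adding ε to the standard deviation; the function name `batch_norm`, the value of `eps`, and the example batch are illustrative assumptions.

```python
import math


def batch_norm(batch, gamma, beta, eps=1e-5):
    """Training-mode batch normalization of a list of B vectors in R^q."""
    B = len(batch)
    q = len(batch[0])
    # Per-component statistics, computed across the samples of the batch.
    mu = [sum(u[j] for u in batch) / B for j in range(q)]
    var = [sum((u[j] - mu[j]) ** 2 for u in batch) / B for j in range(q)]
    return [[gamma[j] * (u[j] - mu[j]) / (math.sqrt(var[j]) + eps) + beta[j]
             for j in range(q)] for u in batch]


batch = [[1.0, 10.0], [3.0, 30.0]]
normalized = batch_norm(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])
```

With γ = 1 and β = 0, both components end up on the same scale (approximately −1 and 1) even though the inputs differ by a factor of 10.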
Layer normalization
Layer normalization is a variant of batch normalization that normalizes the
activations across the features of each sample, rather than across the samples
of each feature:
$$u' = \gamma \odot (u - \hat{\mu}_{\text{layer}}) \odot \frac{1}{\hat{\sigma}_{\text{layer}} + \epsilon} + \beta.$$
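A minimal sketch of layer normalization for a single sample, where the mean and standard deviation are scalars computed across the q features of that sample. The function name `layer_norm`, the value of `eps`, and the example vector are illustrative assumptions.

```python
import math


def layer_norm(u, gamma, beta, eps=1e-5):
    """Layer normalization of one sample u in R^q."""
    q = len(u)
    # Scalar statistics, computed across the features of this sample.
    mu = sum(u) / q
    sigma = math.sqrt(sum((x - mu) ** 2 for x in u) / q)
    return [gamma[j] * (u[j] - mu) / (sigma + eps) + beta[j] for j in range(q)]


u = [1.0, 2.0, 3.0]
normalized_u = layer_norm(u, gamma=[1.0, 1.0, 1.0], beta=[0.0, 0.0, 0.0])
```

Because the statistics depend only on the sample itself, the result does not change with the batch size, and the same computation applies at training and test time.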
The end.