Optimizing Neural Network Training Techniques

The lecture discusses optimizing parameters in neural networks, focusing on techniques such as optimizers, initialization, and normalization. Key topics include gradient descent methods, the importance of visualization tools, and various optimization algorithms like SGD, Adam, and weight decay. Additionally, it emphasizes the significance of proper initialization strategies and data normalization to ensure effective training of deep learning models.

Deep Learning

Prof. FATIMA-EZZAHRAA BEN-BOUAZZA

Lecture 2: Training neural networks

ESM6ISS , UM6SS

Prepared by:
PhD. EMSSAAD Ilyass
PhD. CHAKOUR EL MEZALI Manal

Plan for Today
How to optimize parameters efficiently?

• Optimizers
• Initialization
• Normalization

Optimizers

Empirical risk minimization

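$$\theta_*^{\mathbf{d}} = \arg\min_\theta \mathcal{L}(\theta), \qquad \mathcal{L}(\theta) = \frac{1}{N} \sum_{n=1}^{N} \ell(y_n, f(\mathbf{x}_n; \theta)).$$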

A practical recommendation
Training a massive deep neural network is long, complex and sometimes confusing.

A first step towards understanding, debugging and optimizing neural networks is to make use of visualization tools for

• plotting losses and metrics,
• visualizing computational graphs,
• or showing additional data as the network is being trained.

Weights & Biases ([Link])

Let me say this once again: plot your
losses.

Gradient descent
To minimize L(θ) , standard batch gradient descent (GD) consists in applying the
update rule
$$\begin{aligned}
g_t &= \frac{1}{N} \sum_{n=1}^{N} \nabla_\theta \ell(y_n, f(\mathbf{x}_n; \theta_t)) \\
\theta_{t+1} &= \theta_t - \gamma g_t,
\end{aligned}$$

where γ is the learning rate.
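As a minimal sketch of this update rule, the snippet below runs batch gradient descent on a one-parameter least-squares problem. The model f(x; θ) = θx, the squared loss, the toy data and the names `batch_gradient` and `gradient_descent` are assumptions chosen for illustration; they do not come from the lecture.

```python
def batch_gradient(theta, xs, ys):
    """Average gradient of the loss 0.5 * (y - theta * x)**2 over all N samples."""
    n = len(xs)
    return sum((theta * x - y) * x for x, y in zip(xs, ys)) / n


def gradient_descent(xs, ys, theta0=0.0, lr=0.1, steps=50):
    """Apply theta_{t+1} = theta_t - lr * g_t, with g_t computed on the full dataset."""
    theta = theta0
    for _ in range(steps):
        g = batch_gradient(theta, xs, ys)
        theta = theta - lr * g
    return theta


# Toy data generated by y = 2x, so the empirical risk is minimized at theta = 2.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
theta_hat = gradient_descent(xs, ys)
print(theta_hat)
```

Every step touches all N samples, which is what becomes costly when N is large.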

Stochastic gradient descent

While it makes sense to compute the gradient exactly,

• it takes time to compute and becomes inefficient for large N,
• it is an empirical estimate of a hidden quantity (the expected risk), and any partial sum is also an unbiased estimate, although of greater variance.

To reduce the computational complexity, stochastic gradient descent (SGD)
consists in updating the parameters after every sample

$$\begin{aligned}
g_t &= \nabla_\theta \ell(y_{n(t)}, f(\mathbf{x}_{n(t)}; \theta_t)) \\
\theta_{t+1} &= \theta_t - \gamma g_t.
\end{aligned}$$

While being computationally faster than batch gradient descent,

• gradient estimates used by SGD can be very noisy, which may help escape from local minima;
• but SGD does not benefit from the speed-up of batch processing.

Mini-batching
Instead, mini-batch SGD consists in visiting the samples in mini-batches and
updating the parameters each time

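$$\begin{aligned}
g_t &= \frac{1}{B} \sum_{b=1}^{B} \nabla_\theta \ell(y_{n(t,b)}, f(\mathbf{x}_{n(t,b)}; \theta_t)) \\
\theta_{t+1} &= \theta_t - \gamma g_t,
\end{aligned}$$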

where the order n(t, b) to visit the samples can be either sequential or random.

• Increasing the batch size B reduces the variance of the gradient estimates and enables the speed-up of batch processing.
• The interplay between B and γ is still unclear.
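The mini-batch update can be sketched as follows, again on a one-parameter least-squares toy problem. The model f(x; θ) = θx, the data, the random visiting order and the function name `minibatch_sgd` are illustrative assumptions. Setting `batch_size=1` gives the per-sample SGD update described earlier.

```python
import random


def minibatch_sgd(xs, ys, batch_size=4, lr=0.1, epochs=50, seed=0):
    """Mini-batch SGD for the loss 0.5 * (y - theta * x)**2 with a random visiting order."""
    rng = random.Random(seed)
    theta = 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # random order n(t, b)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # g_t: average of the per-sample gradients over the current mini-batch
            g = sum((theta * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            theta -= lr * g
    return theta


# Noise-free toy data generated by y = 2x.
xs = [i / 10 for i in range(1, 21)]
ys = [2.0 * x for x in xs]

theta_minibatch = minibatch_sgd(xs, ys, batch_size=4)
theta_sgd = minibatch_sgd(xs, ys, batch_size=1)
print(theta_minibatch, theta_sgd)
```

Because the toy data are noise-free, both runs recover θ = 2. With noisy data, the smaller batch would show visibly noisier iterates.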

Limitations

Gradient descent makes strong assumptions about

• the magnitude of the local curvature to set the step size,
• the isotropy of the curvature, so that the same step size γ makes sense in all directions.

[Animations: gradient descent with step sizes γ = 0.01, γ = 0.1 and γ = 0.4.]

Wolfe conditions could be used to design line search algorithms that automatically determine a step size γ_t, hence ensuring convergence towards a local minimum.

However, in deep learning,

• these algorithms are impractical because of the size of the parameter space and the overhead they would induce,
• they might lead to overfitting when the empirical risk is minimized too well.

The tradeoffs of large-scale learning
A fundamental result due to Bottou and Bousquet (2011) states that stochastic
optimization algorithms (e.g., SGD) yield the best generalization performance (in
terms of excess error) despite being the worst optimization algorithms for
minimizing the empirical risk.

That is, for a fixed computational budget, stochastic optimization algorithms reach a lower test error than more sophisticated algorithms (second-order methods, line search algorithms, etc.) that would fit the training error too well or would consume too large a part of the computational budget at every step.

Momentum

In the situation of small but consistent gradients, as through valley floors, gradient descent moves very slowly.

An improvement to gradient descent is to use momentum to add inertia in the choice of the step direction, that is

$$\begin{aligned}
u_t &= \alpha u_{t-1} - \gamma g_t \\
\theta_{t+1} &= \theta_t + u_t.
\end{aligned}$$

The new variable u_t is the velocity. It corresponds to the direction and speed by which the parameters move as the learning dynamics progress, modeled as an exponentially decaying moving average of negative gradients.

[Figure: the velocity u_t as the sum of α u_{t−1} and −γ g_t.]

Gradient descent with momentum has three nice properties:
- it can go through local barriers,
- it accelerates if the gradient does not change much,
- it dampens oscillations in narrow valleys.

The hyper-parameter α controls how recent gradients affect the current update.

Usually, α = 0.9, with α > γ.

If at each update we observed the same gradient g, the step would (eventually) be

$$u = -\frac{\gamma}{1 - \alpha} g.$$

Therefore, for α = 0.9, it is like multiplying the maximum speed by 1/(1 − α) = 10 along the current direction, compared with the plain step −γg.
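The limiting step can be checked numerically with a short sketch. The constant gradient g = 1, the learning rate γ = 0.01 and the helper name `terminal_velocity` are assumptions made for illustration only.

```python
def terminal_velocity(g, lr=0.01, alpha=0.9, steps=200):
    """Iterate u_t = alpha * u_{t-1} - lr * g with a constant gradient g."""
    u = 0.0
    for _ in range(steps):
        u = alpha * u - lr * g
    return u


g = 1.0
u_limit = terminal_velocity(g)
closed_form = -0.01 / (1.0 - 0.9) * g   # u = -gamma / (1 - alpha) * g
speed_up = u_limit / (-0.01 * g)        # relative to the plain step -gamma * g
print(u_limit, closed_form, speed_up)
```

The iterated velocity approaches the closed-form value, and the ratio to the plain gradient step approaches 10.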

Nesterov momentum
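An alternative consists in simulating a step in the direction of the velocity, then calculating the gradient at that look-ahead point and making a correction:

$$\begin{aligned}
g_t &= \frac{1}{N} \sum_{n=1}^{N} \nabla_\theta \ell(y_n, f(\mathbf{x}_n; \theta_t + \alpha u_{t-1})) \\
u_t &= \alpha u_{t-1} - \gamma g_t \\
\theta_{t+1} &= \theta_t + u_t.
\end{aligned}$$

[Figure: the look-ahead step α u_{t−1} followed by the correction −γ g_t.]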
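A minimal sketch of the Nesterov update for a scalar parameter is given below. The objective L(θ) = 0.5 θ², the hyper-parameter values and the function name `nesterov_momentum` are illustrative assumptions, and the gradient is passed in as a plain function rather than computed from data.

```python
def nesterov_momentum(grad_fn, theta0, lr=0.1, alpha=0.9, steps=200):
    """Nesterov momentum for a scalar parameter; grad_fn returns the gradient at a point."""
    theta, u = theta0, 0.0
    for _ in range(steps):
        g = grad_fn(theta + alpha * u)  # gradient evaluated at the look-ahead point
        u = alpha * u - lr * g          # velocity update including the correction
        theta = theta + u
    return theta


# Toy objective L(theta) = 0.5 * theta**2, whose gradient is theta itself.
theta_final = nesterov_momentum(lambda th: th, theta0=1.0)
print(theta_final)
```

The only difference from plain momentum is the point at which the gradient is evaluated.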

Adaptive learning rate
Vanilla gradient descent assumes the isotropy of the curvature, so that the same
step size γ applies to all parameters.

Isotropic vs. Anisotropic

AdaGrad
Per-parameter downscaling of the step by the square root of the sum of squares of all historical gradient values.

$$\begin{aligned}
r_t &= r_{t-1} + g_t \odot g_t \\
\theta_{t+1} &= \theta_t - \frac{\gamma}{\delta + \sqrt{r_t}} \odot g_t.
\end{aligned}$$

• AdaGrad eliminates the need to manually tune the learning rate. Most implementations use γ = 0.01 as default.
• It is good when the objective is convex.
• r_t grows unboundedly during training, which may cause the step size to shrink and eventually become infinitesimally small.
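The following sketch applies the AdaGrad update to a two-parameter toy objective whose curvature differs by a factor of 100 between coordinates. The objective, the value γ = 0.1 and the names `adagrad_step` and `grad` are assumptions for illustration.

```python
import math


def adagrad_step(theta, r, grad, lr=0.01, delta=1e-8):
    """One AdaGrad update on lists of floats; returns the new (theta, r)."""
    r = [ri + gi * gi for ri, gi in zip(r, grad)]        # r_t = r_{t-1} + g_t * g_t
    theta = [ti - lr / (delta + math.sqrt(ri)) * gi      # per-parameter step size
             for ti, ri, gi in zip(theta, r, grad)]
    return theta, r


def grad(theta):
    """Gradient of the toy objective 0.5 * (100 * theta0**2 + theta1**2)."""
    return [100.0 * theta[0], 1.0 * theta[1]]


theta, r = [1.0, 1.0], [0.0, 0.0]
theta, r = adagrad_step(theta, r, grad(theta), lr=0.1)
first_step = list(theta)  # both coordinates move by about lr despite 100x different gradients

r_history = [list(r)]
for _ in range(50):
    theta, r = adagrad_step(theta, r, grad(theta), lr=0.1)
    r_history.append(list(r))
print(first_step, r_history[-1])
```

After the first update both coordinates have moved by roughly the same amount, which is the per-parameter rescaling at work. The accumulator `r` never decreases, which is the source of the shrinking step size noted above.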

RMSProp
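Same as AdaGrad but accumulates an exponentially decaying average of the squared gradient.

$$\begin{aligned}
r_t &= \rho r_{t-1} + (1 - \rho) g_t \odot g_t \\
\theta_{t+1} &= \theta_t - \frac{\gamma}{\delta + \sqrt{r_t}} \odot g_t.
\end{aligned}$$

• Performs better in non-convex settings.
• r_t does not grow unboundedly.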
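To illustrate the bounded accumulator, the sketch below feeds RMSProp a constant gradient and records the step taken at each update. The scalar setting, the constant gradient g = 1, the values γ = 0.01 and ρ = 0.9, and the name `rmsprop_step` are assumptions for illustration.

```python
import math


def rmsprop_step(theta, r, g, lr=0.01, rho=0.9, delta=1e-8):
    """One RMSProp update for a scalar parameter; returns the new (theta, r)."""
    r = rho * r + (1.0 - rho) * g * g               # decaying average of squared gradients
    theta = theta - lr / (delta + math.sqrt(r)) * g
    return theta, r


theta, r = 0.0, 0.0
step_sizes = []
for _ in range(100):
    previous = theta
    theta, r = rmsprop_step(theta, r, g=1.0)
    step_sizes.append(previous - theta)
print(r, step_sizes[0], step_sizes[-1])
```

With a constant gradient, the accumulator settles near g² and the step settles near γ, whereas the AdaGrad sum would keep growing and the step would keep shrinking.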

Adam
Similar to RMSProp with momentum, but with bias correction terms for the first
and second moments.

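$$\begin{aligned}
s_t &= \rho_1 s_{t-1} + (1 - \rho_1) g_t \\
\hat{s}_t &= \frac{s_t}{1 - \rho_1^t} \\
r_t &= \rho_2 r_{t-1} + (1 - \rho_2) g_t \odot g_t \\
\hat{r}_t &= \frac{r_t}{1 - \rho_2^t} \\
\theta_{t+1} &= \theta_t - \gamma \frac{\hat{s}_t}{\delta + \sqrt{\hat{r}_t}}.
\end{aligned}$$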

Good defaults are ρ1 = 0.9 and ρ2 = 0.999.

Adam is one of the default optimizers in deep learning, along with SGD
with momentum.
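A scalar sketch of the Adam update is shown below. The values γ = 0.001 and δ = 1e-8, the starting point and the name `adam_step` are assumptions for illustration; ρ1 and ρ2 use the defaults quoted above.

```python
import math


def adam_step(theta, s, r, g, t, lr=0.001, rho1=0.9, rho2=0.999, delta=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    s = rho1 * s + (1.0 - rho1) * g          # first moment estimate
    r = rho2 * r + (1.0 - rho2) * g * g      # second moment estimate
    s_hat = s / (1.0 - rho1 ** t)            # bias corrections
    r_hat = r / (1.0 - rho2 ** t)
    theta = theta - lr * s_hat / (delta + math.sqrt(r_hat))
    return theta, s, r


theta, s, r = 1.0, 0.0, 0.0
theta, s, r = adam_step(theta, s, r, g=5.0, t=1)
print(theta)
```

Thanks to the bias correction, the very first step has magnitude close to γ whatever the scale of the gradient: here θ moves from 1 to about 0.999 even though g = 5.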

Weight decay
Weight decay is a regularization technique that penalizes large weights.
For vanilla SGD, it is equivalent to adding a penalty term to the loss function

$$\ell(\theta) + \frac{\lambda}{2} \|\theta\|^2.$$

For more complex optimizers, it amounts to adding a penalty term directly to the update rule

$$\theta_{t+1} = \theta_t - \gamma (g_t + \lambda \theta_t).$$
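The equivalence for vanilla SGD can be checked with a small sketch. The parameter values, the gradient vector and both function names are hypothetical and serve only to compare the two formulations.

```python
def sgd_step_l2_loss(theta, g, lr, lam):
    """SGD on the penalized loss: the gradient of (lam / 2) * ||theta||^2 is lam * theta."""
    return [t - lr * (gi + lam * t) for t, gi in zip(theta, g)]


def sgd_step_weight_decay(theta, g, lr, lam):
    """Shrink the weights by (1 - lr * lam), then take the plain gradient step."""
    return [(1.0 - lr * lam) * t - lr * gi for t, gi in zip(theta, g)]


theta = [1.0, -2.0, 0.5]
g = [0.3, -0.1, 0.0]  # hypothetical gradient of the unpenalized loss
step_a = sgd_step_l2_loss(theta, g, lr=0.1, lam=0.01)
step_b = sgd_step_weight_decay(theta, g, lr=0.1, lam=0.01)
print(step_a, step_b)
```

Both functions return the same parameters up to floating-point rounding. The second form makes the name explicit: each step multiplies the weights by a factor slightly below 1.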

Training without (left) and with (right) weight decay.

Learning rate

Scheduling
Despite per-parameter adaptive learning rate methods, it is usually helpful to anneal the learning rate γ over time.

• Step decay: reduce the learning rate by some factor every few epochs (e.g., by half every 10 epochs).
• Exponential decay: γ_t = γ_0 exp(−kt), where γ_0 and k are hyper-parameters.
• 1/t decay: γ_t = γ_0 / (1 + kt), where γ_0 and k are hyper-parameters.
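The three schedules can be written as small functions. The function names, the initial rate γ_0 = 0.1 and the values of k are assumptions chosen for illustration.

```python
import math


def step_decay(epoch, lr0, factor=0.5, every=10):
    """Multiply the learning rate by `factor` once every `every` epochs."""
    return lr0 * factor ** (epoch // every)


def exponential_decay(t, lr0, k):
    """gamma_t = gamma_0 * exp(-k * t)."""
    return lr0 * math.exp(-k * t)


def inverse_time_decay(t, lr0, k):
    """gamma_t = gamma_0 / (1 + k * t)."""
    return lr0 / (1.0 + k * t)


schedule = [step_decay(epoch, lr0=0.1) for epoch in (0, 9, 10, 25)]
print(schedule)
print(exponential_decay(10, lr0=0.1, k=0.05))
print(inverse_time_decay(10, lr0=0.1, k=0.1))
```

With the step schedule, the rate stays at 0.1 for the first ten epochs, then halves to 0.05, and reaches 0.025 after the second drop.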

Step decay scheduling for training ResNets.

Warmup and cyclical schedules

Initialization

In convex problems, provided a good learning rate γ, convergence is guaranteed
regardless of the initial parameter values.

In the non-convex regime, initialization is much more important! Little is known about the mathematics of initialization strategies for neural networks.
• What is known: initialization should break symmetry.
• What is known: the scale of weights is important.

Controlling for the variance in the forward pass
A first strategy is to initialize the network parameters such that activations
preserve the same variance across layers.

Intuitively, this ensures that the information keeps flowing during the forward
pass, without reducing or magnifying the magnitude of input signals
exponentially.

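Let us assume that

• we are in a linear regime at initialization (e.g., the positive part of a ReLU or the middle of a sigmoid),
• weights w_{ij}^l are initialized i.i.d.,
• biases b^l are initialized to be 0,
• input features are i.i.d., with a variance denoted as V[x].

Then, the variance of the activation h_i^l of unit i in layer l is

$$\begin{aligned}
\mathbb{V}[h_i^l] &= \mathbb{V}\left[ \sum_{j=0}^{q_{l-1}-1} w_{ij}^l h_j^{l-1} \right] \\
&= \sum_{j=0}^{q_{l-1}-1} \mathbb{V}[w_{ij}^l] \, \mathbb{V}[h_j^{l-1}],
\end{aligned}$$

where q_l is the width of layer l and h_j^0 = x_j for all j = 0, ..., p − 1. The second equality relies on the independence of the weights and activations, and on both having zero mean.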

Since the weights w_{ij}^l at layer l share the same variance V[w^l] and the variances of the activations in the previous layer are the same, we can drop the indices and write

$$\mathbb{V}[h^l] = q_{l-1} \mathbb{V}[w^l] \, \mathbb{V}[h^{l-1}].$$

Therefore, the variance of the activations is preserved across layers when

$$\mathbb{V}[w^l] = \frac{1}{q_{l-1}} \quad \forall l.$$

This condition is enforced in LeCun's uniform initialization, which is defined as

$$w_{ij}^l \sim \mathcal{U}\left[ -\sqrt{\frac{3}{q_{l-1}}}, \sqrt{\frac{3}{q_{l-1}}} \right].$$
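The variance-preservation argument can be checked empirically with a small pure-Python experiment on a stack of linear layers. The width (64), depth (4), batch size (100), the Gaussian inputs and all function names are assumptions for illustration; `naive_uniform` is a hypothetical baseline that ignores the layer width.

```python
import math
import random


def lecun_uniform(q_in, q_out, rng):
    """q_out x q_in weights drawn from U[-sqrt(3 / q_in), sqrt(3 / q_in)], i.e. variance 1 / q_in."""
    a = math.sqrt(3.0 / q_in)
    return [[rng.uniform(-a, a) for _ in range(q_in)] for _ in range(q_out)]


def naive_uniform(q_in, q_out, rng):
    """Baseline: weights drawn from U[-1, 1] (variance 1/3), regardless of the width."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(q_in)] for _ in range(q_out)]


def linear(W, h):
    """Bias-free linear layer: returns W h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def variance(batch):
    """Variance over all entries of a list of activation vectors."""
    values = [x for h in batch for x in h]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def variance_ratio(init_fn, q=64, depth=4, batch=100, seed=0):
    """Ratio of output to input activation variance for a stack of linear layers."""
    rng = random.Random(seed)
    inputs = [[rng.gauss(0.0, 1.0) for _ in range(q)] for _ in range(batch)]
    acts = inputs
    for _ in range(depth):
        W = init_fn(q, q, rng)
        acts = [linear(W, h) for h in acts]
    return variance(acts) / variance(inputs)


ratio_lecun = variance_ratio(lecun_uniform)
ratio_naive = variance_ratio(naive_uniform)
print(ratio_lecun, ratio_naive)
```

Under these assumptions the LeCun ratio should stay close to 1, while the baseline grows by a factor of roughly q/3 per layer.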

Controlling for the variance in the backward pass
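A similar idea can be applied to ensure that the gradients flow in the backward pass (without vanishing or exploding), by keeping the variance of the gradient with respect to the activations fixed across layers.

Under the same assumptions as before,

$$\begin{aligned}
\mathbb{V}\left[ \frac{\mathrm{d}\hat{y}}{\mathrm{d}h_i^l} \right] &= \mathbb{V}\left[ \sum_{j=0}^{q_{l+1}-1} \frac{\mathrm{d}\hat{y}}{\mathrm{d}h_j^{l+1}} \frac{\partial h_j^{l+1}}{\partial h_i^l} \right] \\
&= \mathbb{V}\left[ \sum_{j=0}^{q_{l+1}-1} \frac{\mathrm{d}\hat{y}}{\mathrm{d}h_j^{l+1}} w_{ji}^{l+1} \right] \\
&= \sum_{j=0}^{q_{l+1}-1} \mathbb{V}\left[ \frac{\mathrm{d}\hat{y}}{\mathrm{d}h_j^{l+1}} \right] \mathbb{V}\left[ w_{ji}^{l+1} \right].
\end{aligned}$$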

If we further assume that

• the gradients of the activations at layer l share the same variance,
• the weights at layer l + 1 share the same variance V[w^{l+1}],

then we can drop the indices and write

$$\mathbb{V}\left[ \frac{\mathrm{d}\hat{y}}{\mathrm{d}h^l} \right] = q_{l+1} \mathbb{V}\left[ \frac{\mathrm{d}\hat{y}}{\mathrm{d}h^{l+1}} \right] \mathbb{V}[w^{l+1}].$$

Therefore, the variance of the gradients with respect to the activations is preserved across layers when

$$\mathbb{V}[w^l] = \frac{1}{q_l} \quad \forall l.$$

Xavier initialization

We have derived two different conditions on the variance of w^l,

$$\mathbb{V}[w^l] = \frac{1}{q_{l-1}}, \qquad \mathbb{V}[w^l] = \frac{1}{q_l}.$$

A compromise is the Xavier initialization, which initializes w^l randomly from a distribution with variance

$$\mathbb{V}[w^l] = \frac{1}{\frac{q_{l-1} + q_l}{2}} = \frac{2}{q_{l-1} + q_l}.$$

For example, normalized initialization is defined as

$$w_{ij}^l \sim \mathcal{U}\left[ -\sqrt{\frac{6}{q_{l-1} + q_l}}, \sqrt{\frac{6}{q_{l-1} + q_l}} \right].$$
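Since a uniform distribution on [−a, a] has variance a²/3, the bound a = sqrt(6 / (q_{l−1} + q_l)) yields exactly the target variance 2 / (q_{l−1} + q_l). The sketch below samples such a matrix and compares the empirical variance with the target; the layer sizes (100 and 50) and the name `xavier_uniform` are assumptions for illustration.

```python
import math
import random


def xavier_uniform(q_in, q_out, rng):
    """q_out x q_in weights drawn from U[-a, a] with a = sqrt(6 / (q_in + q_out))."""
    a = math.sqrt(6.0 / (q_in + q_out))
    return [[rng.uniform(-a, a) for _ in range(q_in)] for _ in range(q_out)]


rng = random.Random(0)
q_in, q_out = 100, 50
W = xavier_uniform(q_in, q_out, rng)

weights = [w for row in W for w in row]
mean = sum(weights) / len(weights)
empirical_var = sum((w - mean) ** 2 for w in weights) / len(weights)
target_var = 2.0 / (q_in + q_out)
bound = math.sqrt(6.0 / (q_in + q_out))
print(empirical_var, target_var)
```

With 5000 sampled weights, the empirical variance should land within a few percent of the target.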

He initialization

Because ReLU(x) = max(0, x), the mean of the activations at layer l is typically not 0. Therefore, our zero-mean assumption is wrong. Accounting for this shift, He et al. (2015) derive a forward initialization scheme that initializes w^l from a distribution with variance

$$\mathbb{V}[w^l] = \frac{2}{q_{l-1}}.$$
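The effect of the factor 2 can be illustrated with a small ReLU network in pure Python. The sketch tracks the mean squared activation, which is the quantity the factor 2 is meant to keep stable. Gaussian weights, the width (64), depth (4), batch size (100) and all function names are assumptions for illustration.

```python
import math
import random


def gaussian_init(q_in, q_out, var, rng):
    """q_out x q_in weights drawn from a zero-mean Gaussian with the given variance."""
    std = math.sqrt(var)
    return [[rng.gauss(0.0, std) for _ in range(q_in)] for _ in range(q_out)]


def relu_layer(W, h):
    """Bias-free layer followed by ReLU: max(0, W h)."""
    return [max(0.0, sum(w * x for w, x in zip(row, h))) for row in W]


def mean_square(batch):
    """Mean of the squared entries over a list of activation vectors."""
    values = [x for h in batch for x in h]
    return sum(v * v for v in values) / len(values)


def second_moment_ratio(var_fn, q=64, depth=4, batch=100, seed=0):
    """Ratio of output to input mean squared activation for a stack of ReLU layers."""
    rng = random.Random(seed)
    inputs = [[rng.gauss(0.0, 1.0) for _ in range(q)] for _ in range(batch)]
    acts = inputs
    for _ in range(depth):
        W = gaussian_init(q, q, var_fn(q), rng)
        acts = [relu_layer(W, h) for h in acts]
    return mean_square(acts) / mean_square(inputs)


ratio_lecun = second_moment_ratio(lambda q_in: 1.0 / q_in)  # variance 1 / q_{l-1}
ratio_he = second_moment_ratio(lambda q_in: 2.0 / q_in)     # variance 2 / q_{l-1}
print(ratio_lecun, ratio_he)
```

With variance 1/q_{l−1}, each ReLU layer roughly halves the signal, so the ratio decays geometrically with depth. With variance 2/q_{l−1} it should stay on the order of 1.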

Normalization

Data normalization
Previous weight initialization strategies rely on keeping the activation variance constant across layers, under the assumption that the input feature variances are the same. That is,

$$\mathbb{V}[x_i] = \mathbb{V}[x_j] \triangleq \mathbb{V}[x]$$

for all pairs of features i, j.

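In general, this constraint is not satisfied but can be enforced by standardizing the input data feature-wise,

$$\mathbf{x}' = (\mathbf{x} - \hat{\mu}) \odot \frac{1}{\hat{\sigma}},$$

where

$$\hat{\mu} = \frac{1}{N} \sum_{\mathbf{x} \in \mathbf{d}} \mathbf{x}, \qquad \hat{\sigma}^2 = \frac{1}{N} \sum_{\mathbf{x} \in \mathbf{d}} (\mathbf{x} - \hat{\mu})^2.$$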
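Feature-wise standardization can be sketched in a few lines. The toy dataset and the name `standardize` are assumptions for illustration, and the sketch assumes that no feature is constant (otherwise its standard deviation would be zero).

```python
import math


def standardize(data):
    """Feature-wise standardization of a list of samples (each a list of floats)."""
    n, dims = len(data), len(data[0])
    mu = [sum(x[j] for x in data) / n for j in range(dims)]
    sigma = [math.sqrt(sum((x[j] - mu[j]) ** 2 for x in data) / n) for j in range(dims)]
    scaled = [[(x[j] - mu[j]) / sigma[j] for j in range(dims)] for x in data]
    return scaled, mu, sigma


# Two features on very different scales.
data = [[1.0, 100.0], [2.0, 300.0], [3.0, 200.0], [4.0, 400.0]]
scaled, mu, sigma = standardize(data)
print(mu, sigma)
print(scaled)
```

After the transformation each feature has zero mean and unit variance over the dataset. The same `mu` and `sigma` would then be reused on validation and test data.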

Batch normalization
Maintaining proper statistics of the activations and derivatives is critical for
training neural networks.

This constraint can be enforced explicitly during the forward pass by re-normalizing them. Batch normalization was the first method introducing this idea.

Let us consider a minibatch of samples at training, for which u_b ∈ R^q, b = 1, ..., B, are intermediate values computed at some location in the computational graph.

In batch normalization following the node u, the per-component mean and variance are first computed on the batch

$$\hat{\mu}_\text{batch} = \frac{1}{B} \sum_{b=1}^{B} \mathbf{u}_b, \qquad \hat{\sigma}^2_\text{batch} = \frac{1}{B} \sum_{b=1}^{B} (\mathbf{u}_b - \hat{\mu}_\text{batch})^2,$$

from which the standardized u'_b ∈ R^q are computed such that

$$\mathbf{u}'_b = \gamma \odot (\mathbf{u}_b - \hat{\mu}_\text{batch}) \odot \frac{1}{\hat{\sigma}_\text{batch} + \epsilon} + \beta,$$

where γ, β ∈ R^q are parameters to optimize.

During testing, the mean and variance are computed on the entire training set and used to standardize the activations.
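A training-mode forward pass of batch normalization can be sketched as follows. The sketch places ε as in the formula above, added to the standard deviation; libraries such as PyTorch instead add it to the variance inside the square root. The toy batch, the value ε = 1e-5 and the name `batch_norm` are assumptions for illustration.

```python
import math


def batch_norm(batch, gamma, beta, eps=1e-5):
    """Training-mode batch normalization of a list of B samples in R^q."""
    B, q = len(batch), len(batch[0])
    mu = [sum(u[j] for u in batch) / B for j in range(q)]
    sigma = [math.sqrt(sum((u[j] - mu[j]) ** 2 for u in batch) / B) for j in range(q)]
    return [[gamma[j] * (u[j] - mu[j]) / (sigma[j] + eps) + beta[j] for j in range(q)]
            for u in batch]


batch = [[1.0, 10.0, -3.0], [2.0, 30.0, 0.0], [3.0, 20.0, 3.0], [6.0, 40.0, 8.0]]
out = batch_norm(batch, gamma=[1.0, 1.0, 1.0], beta=[0.0, 0.0, 0.0])
print(out)
```

With γ = 1 and β = 0, each component of the output has zero mean and (up to ε) unit variance across the batch. Other values of γ and β rescale and shift each component.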

Layer normalization
Layer normalization is a variant of batch normalization that normalizes the activations across the features of each sample, rather than across the samples of each feature:

$$\mathbf{u}' = \gamma \odot (\mathbf{u} - \hat{\mu}_\text{layer}) \odot \frac{1}{\hat{\sigma}_\text{layer} + \epsilon} + \beta.$$
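A minimal sketch of layer normalization for a single sample is given below. The statistics are computed over the q features of that sample, so the result does not depend on the other samples in the batch. The toy vector, the value ε = 1e-5 and the name `layer_norm` are assumptions for illustration.

```python
import math


def layer_norm(u, gamma, beta, eps=1e-5):
    """Layer normalization of one sample u in R^q, using statistics over its own features."""
    q = len(u)
    mu = sum(u) / q
    sigma = math.sqrt(sum((x - mu) ** 2 for x in u) / q)
    return [gamma[j] * (u[j] - mu) / (sigma + eps) + beta[j] for j in range(q)]


sample = [1.0, 2.0, 3.0, 6.0]
out = layer_norm(sample, gamma=[1.0] * 4, beta=[0.0] * 4)
print(out)
```

Since no batch statistics are involved, the same computation applies unchanged at training and test time.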

The end.
