Truncated BPTT in Deep Learning

Lecture 14 of CS7015 focuses on sequence learning problems and introduces Recurrent Neural Networks (RNNs) as a solution for tasks where inputs are not fixed in size and depend on previous inputs. It discusses the challenges of vanishing and exploding gradients, and introduces Backpropagation Through Time (BPTT) as a method for training RNNs. The lecture emphasizes the importance of parameter sharing across time steps to maintain consistency in the function executed at each step.

Uploaded by

Vishakha Madaan

CS7015 (Deep Learning) : Lecture 14

Sequence Learning Problems, Recurrent Neural Networks, Backpropagation Through Time (BPTT), Vanishing and Exploding Gradients, Truncated BPTT

Mitesh M. Khapra

Department of Computer Science and Engineering

Indian Institute of Technology Madras

Module 14.1: Sequence Learning Problems

In feedforward and convolutional neural networks the size of the input was always fixed
For example, we fed fixed size (32 × 32) images to convolutional neural networks for image classification
Similarly in word2vec, we fed a fixed window (k) of words to the network
Further, each input to the network was independent of the previous or future inputs
For example, the computations, outputs and decisions for two successive images are completely independent of each other
[Figure: a convolutional network classifying a 32 × 32 image (apple, bus, car, ...); a word2vec network with context words "he", "sat" and outputs P(chair|sat, he), P(man|sat, he), P(on|sat, he), P(he|sat, he)]

In many applications the input is not of a fixed size
Further, successive inputs may not be independent of each other
For example, consider the task of auto completion
Given the first character ‘d’ you want to predict the next character ‘e’ and so on
[Figure: one copy of the network per time step, with inputs d, e, e, p and outputs e, e, p, <stop>]

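The auto completion setup described above can be written down as (input, target) pairs. This is only a minimal sketch: the helper name `next_char_pairs` and the `<stop>` end marker are illustrative assumptions, not part of the lecture.

```python
def next_char_pairs(word, stop="<stop>"):
    """Return (input character, character to predict) pairs for one word."""
    # The target at each step is the next character; the last target is the stop marker.
    targets = list(word[1:]) + [stop]
    return list(zip(word, targets))


pairs = next_char_pairs("deep")
print(pairs)  # [('d', 'e'), ('e', 'e'), ('e', 'p'), ('p', '<stop>')]
```

Words of different lengths simply give a different number of pairs, which is the variable-length property the slide points out.
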
Notice a few things
First, successive inputs are no longer independent (while predicting ‘e’ you would want to know what the previous input was in addition to the current input)
Second, the length of the inputs and the number of predictions you need to make is not fixed (for example, “learn”, “deep”, “machine” have different numbers of characters)
Third, each network (orange-blue-green structure) is performing the same task (input: character, output: character)

These are known as sequence learning problems
We need to look at a sequence of (dependent) inputs and produce an output (or outputs)
Each input corresponds to one time step
Let us look at some more examples of such problems

Consider the task of predicting the part of speech tag (noun, adverb, adjective, verb) of each word in a sentence
Once we see an adjective (social) we are almost sure that the next word should be a noun (animal)
Thus the current output (noun) depends on the current input as well as the previous input
Further, the size of the input is not fixed (sentences could have an arbitrary number of words)
Notice that here we are interested in producing an output at each time step
Each network is performing the same task (input: word, output: tag)
[Figure: the sentence "man is a social animal" tagged as noun, verb, article, adjective, noun]

Sometimes we may not be interested in producing an output at every stage
Instead we would look at the full sequence and then produce an output
For example, consider the task of predicting the polarity of a movie review
The prediction clearly does not depend only on the last word but also on some words which appear before
Here again we could think that the network is performing the same task at each step (input: word, output: +/−) but it’s just that we don’t care about intermediate outputs
[Figure: the review "The movie was boring and long"; the outputs at the first five words are marked "don’t care" and the output at the last word is +/−]

Sequences could be composed of anything (not just words)
For example, a video could be treated as a sequence of images
We may want to look at the entire sequence and detect the activity being performed
[Figure: a sequence of video frames labelled "Surya Namaskar"]

Module 14.2: Recurrent Neural Networks

How do we model such tasks involving sequences?

Wishlist
Account for dependence between inputs
Account for variable number of inputs
Make sure that the function executed at each time step is the same
We will focus on each of these to arrive at a model for dealing with sequences

What is the function being executed at each time step?
$$s_i = \sigma(U x_i + b)$$
$$y_i = O(V s_i + c)$$
where i is the timestep
Since we want the same function to be executed at each timestep we should share the same network (i.e., same parameters at each timestep)
[Figure: two copies of the network, x1 → s1 → y1 and x2 → s2 → y2, both using the same U and V]

This parameter sharing also ensures that the network becomes agnostic to the length (size) of the input
Since we are simply going to compute the same function (with same parameters) at each timestep, the number of timesteps doesn’t matter
We just create multiple copies of the network and execute them at each timestep
[Figure: n copies of the network, xi → si → yi for i = 1, ..., n, all sharing U and V]

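The parameter sharing described above can be sketched in a few lines. This is a rough illustration under stated assumptions: the toy weights are made up, the output function O is left as the identity, and U and V are stored as d × n and k × d so that they multiply column vectors directly.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def step(x, U, b, V, c):
    """One timestep: s = sigmoid(U x + b), y = V s + c (O taken as identity)."""
    s = [sigmoid(a + bi) for a, bi in zip(matvec(U, x), b)]
    y = [o + ci for o, ci in zip(matvec(V, s), c)]
    return s, y


# Made-up parameters: n = 2 inputs, d = 3 state units, k = 2 outputs.
U = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]]
b = [0.0, 0.1, -0.1]
V = [[0.3, -0.1, 0.2], [0.0, 0.5, -0.4]]
c = [0.0, 0.0]


def run(sequence):
    """Apply the same step (same U, V, b, c) to every element of the sequence."""
    return [step(x, U, b, V, c)[1] for x in sequence]


short = run([[1.0, 0.0], [0.0, 1.0]])
long_out = run([[1.0, 0.0]] * 5)
print(len(short), len(long_out))  # 2 5
```

The same four parameters handle both sequence lengths. Note that without a recurrent connection the output at a timestep depends only on that timestep's input, so the dependence between inputs is still not modelled here.
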
How do we account for dependence between inputs?
Let us first see an infeasible way of doing this
At each timestep we will feed all the previous inputs to the network
Is this okay?
No, it violates the other two items on our wishlist
How? Let us see
[Figure: four separate networks; y1 is computed from x1, y2 from (x1, x2), y3 from (x1, x2, x3) and y4 from (x1, x2, x3, x4)]

First, the function being computed at each time-step now is different
$$y_1 = f_1(x_1)$$
$$y_2 = f_2(x_1, x_2)$$
$$y_3 = f_3(x_1, x_2, x_3)$$
The network is now sensitive to the length of the sequence
For example, a sequence of length 10 will require $f_1, \dots, f_{10}$ whereas a sequence of length 100 will require $f_1, \dots, f_{100}$

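To get a feel for why this approach is infeasible, one can count parameters. The sketch below assumes, purely for illustration, that each $f_t$ is a single fully connected layer that reads the concatenation of $x_1, \dots, x_t$ into d hidden units; the function name and the sizes n = 4, d = 8 are hypothetical.

```python
def input_weights_needed(t, n, d):
    """Input-to-hidden weights of f_t when it reads x_1..x_t (each in R^n) into d units."""
    return d * n * t


n, d = 4, 8
for length in (10, 100):
    # A sequence of this length needs a separate function f_1, ..., f_length.
    total = sum(input_weights_needed(t, n, d) for t in range(1, length + 1))
    print(length, total)
# prints:
# 10 1760
# 100 161600
```

Under this assumption the number of weights grows quadratically with the sequence length, whereas a shared network has a fixed number of parameters.
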
The solution is to add a recurrent connection in the network,
$$s_i = \sigma(U x_i + W s_{i-1} + b)$$
$$y_i = O(V s_i + c)$$
or
$$y_i = f(x_i, s_{i-1}, W, U, V, b, c)$$
$s_i$ is the state of the network at timestep i
The parameters are W, U, V, c, b which are shared across timesteps
The same network (and parameters) can be used to compute $y_1, y_2, \dots, y_{10}$ or $y_{100}$
[Figure: the unrolled network x1, ..., xn → s1, ..., sn → y1, ..., yn, with W connecting each state to the next and U, V shared across timesteps]

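The recurrence above can be sketched as a short forward pass. This is a minimal illustration, not the lecture's code: it assumes σ is the logistic function, O is softmax, the weights are made up, and U, W, V are stored as d × n, d × d and k × d so they multiply column vectors directly.

```python
import math


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]


def rnn_forward(xs, U, W, V, b, c, s0):
    """Return the states s_1..s_T and outputs y_1..y_T for the inputs xs."""
    states, outputs = [], []
    s_prev = s0
    for x in xs:
        # a_i = U x_i + W s_{i-1} + b
        a = [u + w + bi for u, w, bi in zip(matvec(U, x), matvec(W, s_prev), b)]
        s = [1.0 / (1.0 + math.exp(-ai)) for ai in a]
        y = softmax([o + ci for o, ci in zip(matvec(V, s), c)])
        states.append(s)
        outputs.append(y)
        s_prev = s
    return states, outputs


# Made-up parameters: n = 2, d = 2, k = 3.
U = [[0.5, -0.3], [0.1, 0.8]]
W = [[0.2, 0.4], [-0.6, 0.1]]
V = [[0.3, -0.2], [0.1, 0.5], [-0.4, 0.2]]
b = [0.0, 0.0]
c = [0.0, 0.0, 0.0]
s0 = [0.0, 0.0]

x = [1.0, 0.0]
states, outputs = rnn_forward([x, x, x], U, W, V, b, c, s0)
print(states[0], states[1])
```

Even though the same input is fed at every step, the states differ from step to step because each one also depends on the previous state through W.
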
This can be represented more compactly
[Figure: a single copy of the network with state $s_i$ and a recurrent connection labelled W]

Let us revisit the sequence learning problems that we saw earlier
We now have recurrent connections between time steps which account for dependence between inputs
[Figure: the earlier examples redrawn with recurrent connections: auto completion (d, e, e, p), part of speech tagging ("man is a social animal"), activity recognition (Surya Namaskar) and review polarity ("The movie was boring and long")]

Module 14.3: Backpropagation through time

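Before proceeding let us look at the dimensions of the parameters carefully
$x_i \in \mathbb{R}^n$ (n-dimensional input)
$s_i \in \mathbb{R}^d$ (d-dimensional state)
$y_i \in \mathbb{R}^k$ (say k classes)
$U \in \mathbb{R}^{n \times d}$
$V \in \mathbb{R}^{d \times k}$
$W \in \mathbb{R}^{d \times d}$
Note that with column vectors, the products in $s_i = \sigma(U x_i + W s_{i-1} + b)$ and $y_i = O(V s_i + c)$ match these shapes only when read as $U^\top x_i$ and $V^\top s_i$ (equivalently, when U is stored as d × n and V as k × d)
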
How do we train this network? (Ans: using backpropagation)
Let us understand this with a concrete example

Suppose we consider our task of auto-completion (predicting the next character)
For simplicity we assume that there are only 4 characters in our vocabulary (d, e, p, <stop>)
At each timestep we want to predict one of these 4 characters
What is a suitable output function for this task? (softmax)
What is a suitable loss function for this task? (cross entropy)

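The two answers above (softmax output, cross entropy loss) can be sketched for the 4-character vocabulary. The pre-softmax scores below are made up for illustration and are not taken from the lecture.

```python
import math


def softmax(z):
    """Turn arbitrary scores into probabilities that sum to 1."""
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]


def cross_entropy(probs, true_index):
    """Cross entropy against a one-hot target reduces to -log of the true class probability."""
    return -math.log(probs[true_index])


vocab = ["d", "e", "p", "<stop>"]
scores = [0.5, 2.0, 0.1, -1.0]  # hypothetical scores V s + c at one timestep
probs = softmax(scores)
loss = cross_entropy(probs, vocab.index("e"))
print(probs, loss)
```

Because the true distribution is one-hot, only the predicted probability of the true character enters the loss.
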
Suppose we initialize U, V, W randomly and the network predicts the probabilities as shown
And the true probabilities are as shown

          t = 1          t = 2          t = 3          t = 4
input     d              e              e              p
          Pred.  True    Pred.  True    Pred.  True    Pred.  True
d         0.2    0       0.2    0       0.2    0       0.2    0
e         0.7    1       0.7    1       0.1    0       0.1    0
p         0.1    0       0.1    0       0.7    1       0.7    0
<stop>    0.1    0       0.1    0       0.1    0       0.1    1

We need to answer two questions
What is the total loss made by the model?
How do we backpropagate this loss and update the parameters (θ = {U, V, W, b, c}) of the network?

The total loss is simply the sum of the loss over all time-steps
$$\mathcal{L}(\theta) = \sum_{t=1}^{T} \mathcal{L}_t(\theta)$$
$$\mathcal{L}_t(\theta) = -\log(y_{tc})$$
$y_{tc}$ = predicted probability of true character at time-step t
T = number of timesteps
For backpropagation we need to compute the gradients w.r.t. W, U, V, b, c
Let us see how to do that
[Figure: the unrolled network with losses L1(θ), ..., L4(θ) and the table of predicted and true probabilities]

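As a worked instance of the total loss, the sketch below assumes the true characters are e, e, p, <stop> (the word "deep") and reads off their predicted probabilities from the earlier table as 0.7, 0.7, 0.7 and 0.1. Variable names are illustrative.

```python
import math

# Predicted probability of the true character at t = 1..4 (assumed targets: e, e, p, <stop>).
p_true = [0.7, 0.7, 0.7, 0.1]

per_step_loss = [-math.log(p) for p in p_true]  # L_t = -log(y_tc)
total_loss = sum(per_step_loss)                 # L = sum over t of L_t

print(per_step_loss)
print(total_loss)
```

The last timestep dominates the total, since the model assigns only 0.1 to the true character there.
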
Let us consider $\frac{\partial \mathcal{L}(\theta)}{\partial V}$ (V is a matrix so ideally we should write $\nabla_V \mathcal{L}(\theta)$)
$$\frac{\partial \mathcal{L}(\theta)}{\partial V} = \sum_{t=1}^{T} \frac{\partial \mathcal{L}_t(\theta)}{\partial V}$$
Each term in the summation is simply the derivative of the loss w.r.t. the weights in the output layer
We have already seen how to do this when we studied backpropagation

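For one timestep, the output-layer gradient can be sketched and checked numerically. This assumes a softmax output with cross entropy loss and V stored as k × d (so that V s is defined for a column vector s); the values are made up. Under these assumptions the gradient is the outer product of (y − one-hot target) and s.

```python
import math


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]


def output_probs(V, c, s):
    return softmax([sum(v * x for v, x in zip(row, s)) + ci for row, ci in zip(V, c)])


def loss(V, c, s, target):
    return -math.log(output_probs(V, c, s)[target])


def grad_V(V, c, s, target):
    """Gradient of -log y[target] w.r.t. V: outer product of (y - one_hot) and s."""
    y = output_probs(V, c, s)
    delta = [yi - (1.0 if i == target else 0.0) for i, yi in enumerate(y)]
    return [[di * sj for sj in s] for di in delta]


V = [[0.3, -0.2], [0.1, 0.5], [-0.4, 0.2]]  # k = 3 classes, d = 2 state units
c = [0.0, 0.1, -0.1]
s = [0.6, 0.4]
target = 1

analytic = grad_V(V, c, s, target)

# Central finite differences as an independent check of the formula.
eps = 1e-6
max_err = 0.0
for i in range(len(V)):
    for j in range(len(V[0])):
        plus = [row[:] for row in V]
        minus = [row[:] for row in V]
        plus[i][j] += eps
        minus[i][j] -= eps
        numeric = (loss(plus, c, s, target) - loss(minus, c, s, target)) / (2 * eps)
        max_err = max(max_err, abs(numeric - analytic[i][j]))
print(max_err)
```

The gradient of the total loss is then the sum of such per-timestep terms, as in the formula above.
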
Let us consider the derivative $\frac{\partial \mathcal{L}(\theta)}{\partial W}$
$$\frac{\partial \mathcal{L}(\theta)}{\partial W} = \sum_{t=1}^{T} \frac{\partial \mathcal{L}_t(\theta)}{\partial W}$$
By the chain rule of derivatives we know that $\frac{\partial \mathcal{L}_t(\theta)}{\partial W}$ is obtained by summing gradients along all the paths from $\mathcal{L}_t(\theta)$ to W
What are the paths connecting $\mathcal{L}_t(\theta)$ to W?
Let us see this by considering $\mathcal{L}_4(\theta)$

$\mathcal{L}_4(\theta)$ depends on $s_4$
$s_4$ in turn depends on $s_3$ and W
$s_3$ in turn depends on $s_2$ and W
$s_2$ in turn depends on $s_1$ and W
$s_1$ in turn depends on $s_0$ and W, where $s_0$ is a constant starting state.
[Figure: dependency chain s0 → s1 → s2 → s3 → s4 → L4(θ), with W feeding into each of s1, ..., s4]

What we have here is an ordered network
In an ordered network each state variable is computed one at a time in a specified order (first $s_1$, then $s_2$ and so on)
Now we have
$$\frac{\partial \mathcal{L}_4(\theta)}{\partial W} = \frac{\partial \mathcal{L}_4(\theta)}{\partial s_4} \frac{\partial s_4}{\partial W}$$
We have already seen how to compute $\frac{\partial \mathcal{L}_4(\theta)}{\partial s_4}$ when we studied backprop
But how do we compute $\frac{\partial s_4}{\partial W}$?

Recall that
$$s_4 = \sigma(W s_3 + b)$$
(the input term $U x_4$ of the recurrence is omitted here; it does not depend on W)

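Since $s_3$ itself depends on W, the derivative has an explicit part $\frac{\partial^+ s_4}{\partial W}$ (treating $s_3$ as a constant) and an implicit part (through $s_3$)
$$
\begin{aligned}
\frac{\partial s_4}{\partial W}
&= \underbrace{\frac{\partial^+ s_4}{\partial W}}_{\text{explicit}} + \underbrace{\frac{\partial s_4}{\partial s_3}\frac{\partial s_3}{\partial W}}_{\text{implicit}} \\
&= \frac{\partial^+ s_4}{\partial W} + \frac{\partial s_4}{\partial s_3}\Big[\underbrace{\frac{\partial^+ s_3}{\partial W}}_{\text{explicit}} + \underbrace{\frac{\partial s_3}{\partial s_2}\frac{\partial s_2}{\partial W}}_{\text{implicit}}\Big] \\
&= \frac{\partial^+ s_4}{\partial W} + \frac{\partial s_4}{\partial s_3}\frac{\partial^+ s_3}{\partial W} + \frac{\partial s_4}{\partial s_3}\frac{\partial s_3}{\partial s_2}\Big[\frac{\partial^+ s_2}{\partial W} + \frac{\partial s_2}{\partial s_1}\frac{\partial s_1}{\partial W}\Big] \\
&= \frac{\partial^+ s_4}{\partial W} + \frac{\partial s_4}{\partial s_3}\frac{\partial^+ s_3}{\partial W} + \frac{\partial s_4}{\partial s_3}\frac{\partial s_3}{\partial s_2}\frac{\partial^+ s_2}{\partial W} + \frac{\partial s_4}{\partial s_3}\frac{\partial s_3}{\partial s_2}\frac{\partial s_2}{\partial s_1}\Big[\frac{\partial^+ s_1}{\partial W}\Big]
\end{aligned}
$$
For simplicity we will short-circuit some of the paths
$$\frac{\partial s_4}{\partial W} = \frac{\partial s_4}{\partial s_4}\frac{\partial^+ s_4}{\partial W} + \frac{\partial s_4}{\partial s_3}\frac{\partial^+ s_3}{\partial W} + \frac{\partial s_4}{\partial s_2}\frac{\partial^+ s_2}{\partial W} + \frac{\partial s_4}{\partial s_1}\frac{\partial^+ s_1}{\partial W} = \sum_{k=1}^{4} \frac{\partial s_4}{\partial s_k}\frac{\partial^+ s_k}{\partial W}$$
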
Finally we have
$$\frac{\partial \mathcal{L}_4(\theta)}{\partial W} = \frac{\partial \mathcal{L}_4(\theta)}{\partial s_4} \frac{\partial s_4}{\partial W}$$
$$\frac{\partial s_4}{\partial W} = \sum_{k=1}^{4} \frac{\partial s_4}{\partial s_k}\frac{\partial^+ s_k}{\partial W}$$
$$\therefore \frac{\partial \mathcal{L}_t(\theta)}{\partial W} = \frac{\partial \mathcal{L}_t(\theta)}{\partial s_t} \sum_{k=1}^{t} \frac{\partial s_t}{\partial s_k}\frac{\partial^+ s_k}{\partial W}$$
This algorithm is called backpropagation through time (BPTT) as we backpropagate over all previous time steps

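The sum over explicit derivatives can be checked on a toy case. The sketch below assumes a scalar RNN $s_t = \sigma(w s_{t-1} + b)$ with a logistic σ and no input term, so every derivative is a plain number; the parameter values and function names are made up. Only $\frac{\partial s_t}{\partial w}$ is computed; multiplying by $\frac{\partial \mathcal{L}_t}{\partial s_t}$ would give the loss gradient.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(w, b, s0, T):
    """Return pre-activations a[1..T] and states s[0..T] of s_t = sigmoid(w * s_{t-1} + b)."""
    a, s = [None], [s0]
    for _ in range(T):
        a.append(w * s[-1] + b)
        s.append(sigmoid(a[-1]))
    return a, s


def ds_dw(w, b, s0, t):
    """ds_t/dw = sum over k of (ds_t/ds_k) * (explicit derivative of s_k w.r.t. w)."""
    _, s = forward(w, b, s0, t)
    dsig = [None] + [s[j] * (1.0 - s[j]) for j in range(1, t + 1)]  # sigma'(a_j)
    total = 0.0
    for k in range(1, t + 1):
        explicit = dsig[k] * s[k - 1]      # d+ s_k / dw, with s_{k-1} held constant
        through_time = 1.0
        for j in range(k + 1, t + 1):      # ds_t/ds_k as a product of one-step terms
            through_time *= dsig[j] * w
        total += through_time * explicit
    return total


w, b, s0 = 0.8, 0.1, 0.5
analytic = ds_dw(w, b, s0, 4)

# Central finite difference on s_4 as an independent check.
eps = 1e-6
numeric = (forward(w + eps, b, s0, 4)[1][4] - forward(w - eps, b, s0, 4)[1][4]) / (2 * eps)
print(analytic, numeric)
```

The two values agree closely, which supports reading the formula as a sum over all the points where w enters the computation.
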
Module 14.4: The problem of Exploding and Vanishing Gradients

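We will now focus on $\frac{\partial s_t}{\partial s_k}$ and highlight an important problem in training RNNs using BPTT
$$\frac{\partial s_t}{\partial s_k} = \frac{\partial s_t}{\partial s_{t-1}} \frac{\partial s_{t-1}}{\partial s_{t-2}} \cdots \frac{\partial s_{k+1}}{\partial s_k} = \prod_{j=k}^{t-1} \frac{\partial s_{j+1}}{\partial s_j}$$
Let us look at one such term in the product (i.e., $\frac{\partial s_{j+1}}{\partial s_j}$)
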
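We are interested in $\frac{\partial s_j}{\partial s_{j-1}}$
$$a_j = W s_{j-1} + b, \qquad s_j = \sigma(a_j)$$
$$a_j = [a_{j1}, a_{j2}, a_{j3}, \dots, a_{jd}]$$
$$s_j = [\sigma(a_{j1}), \sigma(a_{j2}), \dots, \sigma(a_{jd})]$$
$$
\frac{\partial s_j}{\partial a_j} =
\begin{bmatrix}
\frac{\partial s_{j1}}{\partial a_{j1}} & \frac{\partial s_{j2}}{\partial a_{j1}} & \frac{\partial s_{j3}}{\partial a_{j1}} & \cdots \\
\frac{\partial s_{j1}}{\partial a_{j2}} & \frac{\partial s_{j2}}{\partial a_{j2}} & \ddots & \\
\vdots & \vdots & \vdots & \frac{\partial s_{jd}}{\partial a_{jd}}
\end{bmatrix}
=
\begin{bmatrix}
\sigma'(a_{j1}) & 0 & 0 & 0 \\
0 & \sigma'(a_{j2}) & 0 & 0 \\
0 & 0 & \ddots & \\
0 & 0 & \dots & \sigma'(a_{jd})
\end{bmatrix}
= \operatorname{diag}(\sigma'(a_j))
$$
$$\frac{\partial s_j}{\partial s_{j-1}} = \frac{\partial s_j}{\partial a_j} \frac{\partial a_j}{\partial s_{j-1}} = \operatorname{diag}(\sigma'(a_j)) W$$
We are interested in the magnitude of $\frac{\partial s_j}{\partial s_{j-1}}$: if it is small (large), $\frac{\partial s_t}{\partial s_k}$ and hence $\frac{\partial \mathcal{L}_t}{\partial W}$ will vanish (explode)
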
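The one-step Jacobian $\operatorname{diag}(\sigma'(a_j)) W$ can be checked numerically on a small case. The sketch assumes a logistic σ, d = 2 and made-up values for W, b and $s_{j-1}$.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def next_state(W, b, s_prev):
    """s_j = sigmoid(W s_{j-1} + b), with W given as a list of rows."""
    return [sigmoid(sum(w * s for w, s in zip(row, s_prev)) + bi) for row, bi in zip(W, b)]


def jacobian(W, b, s_prev):
    """diag(sigma'(a_j)) W; entry [p][i] is d s_jp / d s_{j-1,i}."""
    s = next_state(W, b, s_prev)
    # For the logistic function, sigma'(a) = sigma(a) * (1 - sigma(a)).
    return [[sp * (1.0 - sp) * w for w in row] for sp, row in zip(s, W)]


W = [[0.5, -0.3], [0.8, 0.2]]
b = [0.1, -0.1]
s_prev = [0.6, 0.4]
J = jacobian(W, b, s_prev)

# Central finite differences: perturb one coordinate of s_{j-1} at a time.
eps = 1e-6
max_err = 0.0
for i in range(2):
    plus, minus = list(s_prev), list(s_prev)
    plus[i] += eps
    minus[i] -= eps
    sp, sm = next_state(W, b, plus), next_state(W, b, minus)
    for p in range(2):
        numeric = (sp[p] - sm[p]) / (2 * eps)
        max_err = max(max_err, abs(numeric - J[p][i]))
print(max_err)
```

Each row of W is scaled by a factor of at most 1/4 for the logistic function, which is where the shrinking effect discussed next comes from.
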
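$$\left\lVert \frac{\partial s_j}{\partial s_{j-1}} \right\rVert = \lVert \operatorname{diag}(\sigma'(a_j)) W \rVert \le \lVert \operatorname{diag}(\sigma'(a_j)) \rVert \, \lVert W \rVert$$
∵ $\sigma(a_j)$ is a bounded function (sigmoid, tanh), $\sigma'(a_j)$ is bounded
$$\sigma'(a_j) \le \tfrac{1}{4} = \gamma \quad [\text{if } \sigma \text{ is logistic}]$$
$$\sigma'(a_j) \le 1 = \gamma \quad [\text{if } \sigma \text{ is tanh}]$$
$$\left\lVert \frac{\partial s_j}{\partial s_{j-1}} \right\rVert \le \gamma \lVert W \rVert \le \gamma\lambda$$
$$\left\lVert \frac{\partial s_t}{\partial s_k} \right\rVert = \left\lVert \prod_{j=k+1}^{t} \frac{\partial s_j}{\partial s_{j-1}} \right\rVert \le \prod_{j=k+1}^{t} \gamma\lambda \le (\gamma\lambda)^{t-k}$$
If γλ < 1 the gradient will vanish
If γλ > 1 the gradient could explode
This is known as the problem of vanishing/exploding gradients
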
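To see how quickly the bound $(\gamma\lambda)^{t-k}$ shrinks or grows, it can simply be evaluated for a few values. The sketch assumes γ = 1/4 (logistic) and some made-up values of λ; the function name is illustrative.

```python
def gradient_bound(gamma, lam, steps):
    """Upper bound (gamma * lam) ** steps on the norm of ds_t/ds_k, with steps = t - k."""
    return (gamma * lam) ** steps


gamma = 0.25  # bound on the derivative of the logistic function
for lam in (2.0, 4.0, 8.0):
    print(lam, [gradient_bound(gamma, lam, n) for n in (1, 10, 50)])
```

With γλ = 0.5 the bound is already below 0.001 after 10 steps, so the gradient must vanish. With γλ = 2 the bound grows without limit, but since it is only an upper bound the gradient could explode rather than must.
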
One simple way of avoiding this is to use truncated backpropagation where we restrict the product to τ (< t − k) terms
[Figure: the unrolled network x1, ..., xn with outputs y1, ..., yn and the loss $\mathcal{L}_t$]

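One possible reading of the truncation is to keep only those terms of the BPTT sum whose product $\frac{\partial s_t}{\partial s_k}$ has at most τ factors, i.e. only k ≥ t − τ. The sketch below applies this reading to a toy scalar RNN $s_t = \sigma(w s_{t-1} + b)$; the function names, the parameter values and the exact cut-off rule are assumptions made for illustration.

```python
import math


def forward(w, b, s0, T):
    """Return pre-activations a[1..T] and states s[0..T] of s_t = sigmoid(w * s_{t-1} + b)."""
    a, s = [None], [s0]
    for _ in range(T):
        a.append(w * s[-1] + b)
        s.append(1.0 / (1.0 + math.exp(-a[-1])))
    return a, s


def truncated_ds_dw(w, b, s0, t, tau):
    """Sum of (ds_t/ds_k)(d+ s_k/dw) over k >= t - tau, so each product has at most tau factors."""
    _, s = forward(w, b, s0, t)
    dsig = [None] + [s[j] * (1.0 - s[j]) for j in range(1, t + 1)]  # sigma'(a_j)
    total = 0.0
    for k in range(max(1, t - tau), t + 1):
        through_time = 1.0
        for j in range(k + 1, t + 1):
            through_time *= dsig[j] * w
        total += through_time * dsig[k] * s[k - 1]
    return total


w, b, s0, t = 0.8, 0.1, 0.5, 20
full = truncated_ds_dw(w, b, s0, t, tau=t)    # nothing is dropped
short = truncated_ds_dw(w, b, s0, t, tau=3)   # only the 4 most recent terms
print(full, short)
```

In this toy setting the dropped terms are tiny because each extra factor is well below 1, so the truncated value stays close to the full one while needing far fewer multiplications.
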
Module 14.5: Some Gory Details

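$$\underbrace{\frac{\partial \mathcal{L}_t(\theta)}{\partial W}}_{\in \mathbb{R}^{d \times d}} = \underbrace{\frac{\partial \mathcal{L}_t(\theta)}{\partial s_t}}_{\in \mathbb{R}^{1 \times d}} \sum_{k=1}^{t} \underbrace{\frac{\partial s_t}{\partial s_k}}_{\in \mathbb{R}^{d \times d}} \underbrace{\frac{\partial^+ s_k}{\partial W}}_{\in \mathbb{R}^{d \times d \times d}}$$
We know how to compute $\frac{\partial \mathcal{L}_t(\theta)}{\partial s_t}$ (derivative of $\mathcal{L}_t(\theta)$ (scalar) w.r.t. last hidden layer (vector)) using backpropagation
We just saw a formula for $\frac{\partial s_t}{\partial s_k}$, which is the derivative of a vector w.r.t. a vector
$\frac{\partial^+ s_k}{\partial W}$ is a tensor $\in \mathbb{R}^{d \times d \times d}$, the derivative of a vector $\in \mathbb{R}^d$ w.r.t. a matrix $\in \mathbb{R}^{d \times d}$
How do we compute $\frac{\partial^+ s_k}{\partial W}$? Let us see
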
We just look at one element of this $\frac{\partial^+ s_k}{\partial W}$ tensor
$\frac{\partial^+ s_{kp}}{\partial W_{qr}}$ is the (p, q, r)-th element of the 3d tensor
$$a_k = W s_{k-1} + b$$
$$s_k = \sigma(a_k)$$

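$$a_k = W s_{k-1} + b$$
$$
\begin{bmatrix} a_{k1} \\ a_{k2} \\ \vdots \\ a_{kp} \\ \vdots \\ a_{kd} \end{bmatrix}
=
\begin{bmatrix}
W_{11} & W_{12} & \dots & W_{1d} \\
\vdots & \vdots & \ddots & \vdots \\
W_{p1} & W_{p2} & \dots & W_{pd} \\
\vdots & \vdots & \ddots & \vdots
\end{bmatrix}
\begin{bmatrix} s_{k-1,1} \\ s_{k-1,2} \\ \vdots \\ s_{k-1,p} \\ \vdots \\ s_{k-1,d} \end{bmatrix}
+ b
$$
$$a_{kp} = \sum_{i=1}^{d} W_{pi} s_{k-1,i} + b_p$$
$$s_{kp} = \sigma(a_{kp})$$
Treating $s_{k-1}$ as a constant,
$$\frac{\partial^+ s_{kp}}{\partial W_{qr}} = \frac{\partial s_{kp}}{\partial a_{kp}} \frac{\partial a_{kp}}{\partial W_{qr}} = \sigma'(a_{kp}) \frac{\partial a_{kp}}{\partial W_{qr}}$$
$$\frac{\partial a_{kp}}{\partial W_{qr}} = \frac{\partial \sum_{i=1}^{d} W_{pi} s_{k-1,i}}{\partial W_{qr}} = \begin{cases} s_{k-1,r} & \text{if } p = q \text{ (only the } i = r \text{ term survives)} \\ 0 & \text{otherwise} \end{cases}$$
$$\frac{\partial^+ s_{kp}}{\partial W_{qr}} = \begin{cases} \sigma'(a_{kp})\, s_{k-1,r} & \text{if } p = q \\ 0 & \text{otherwise} \end{cases}$$
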
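The element-wise formula for the explicit derivative can be checked numerically. The sketch assumes a logistic σ, d = 2 and made-up values; the finite-difference check perturbs one entry of W while holding $s_{k-1}$ fixed, which is exactly what "explicit" means here.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def state(W, b, s_prev):
    """s_k = sigmoid(W s_{k-1} + b), with W given as a list of rows."""
    return [sigmoid(sum(w * s for w, s in zip(row, s_prev)) + bi) for row, bi in zip(W, b)]


def explicit_derivative(W, b, s_prev):
    """T[p][q][r] = sigma'(a_kp) * s_{k-1,r} if p == q, else 0."""
    s = state(W, b, s_prev)
    d = len(s)
    return [[[s[p] * (1.0 - s[p]) * s_prev[r] if p == q else 0.0
              for r in range(d)] for q in range(d)] for p in range(d)]


W = [[0.5, -0.3], [0.8, 0.2]]
b = [0.1, -0.1]
s_prev = [0.6, 0.4]
T = explicit_derivative(W, b, s_prev)

# Central finite differences on each W[q][r], with s_prev held constant.
eps = 1e-6
max_err = 0.0
for q in range(2):
    for r in range(2):
        plus = [row[:] for row in W]
        minus = [row[:] for row in W]
        plus[q][r] += eps
        minus[q][r] -= eps
        sp, sm = state(plus, b, s_prev), state(minus, b, s_prev)
        for p in range(2):
            numeric = (sp[p] - sm[p]) / (2 * eps)
            max_err = max(max_err, abs(numeric - T[p][q][r]))
print(max_err)
```

Most entries of the d × d × d tensor are zero: only the slices with p = q are populated, so in practice the tensor never needs to be stored in full.
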
