Recurrent Neural Networks Overview & Visual Captioning
Lecture 17
• Input: No sequence; Output: No sequence
  – Example: “standard” classification / regression problems
• Input: No sequence; Output: Sequence
  – Example: Im2Caption
• Input: Sequence; Output: No sequence
  – Example: sentence classification, multiple-choice question answering
• Input: Sequence; Output: Sequence
  – Example: machine translation, video captioning, open-ended question answering, video question answering
(C) Dhruv Batra Image Credit: Andrej Karpathy
Synonyms
• Recurrent Neural Networks (RNNs)
• Types:
– “Vanilla” RNNs
– Long Short-Term Memory networks (LSTMs)
– Gated Recurrent Units (GRUs)
– …
• Algorithms
– BackProp Through Time (BPTT)
(C) Dhruv Batra Image Credit: Alex Graves and Kevin Gimpel
Even where you might not expect a sequence…
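[Figure: an RNN block with a recurrent connection, taking input x and producing output y.]
$h_t = f_W(h_{t-1}, x_t)$
  – $h_t$: new state
  – $h_{t-1}$: old state
  – $x_t$: input vector at some time step
  – $f_W$: some function with parameters W
With a tanh non-linearity and a linear read-out:
$h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h)$
$y_t = W_{hy} h_t + b_y$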
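The two update equations above can be sketched as a single NumPy function. This is only an illustrative sketch: the function name `rnn_step`, the layer sizes, and the random initialisation are assumptions, not something specified on the slides.

```python
import numpy as np

def rnn_step(h_prev, x, W_hh, W_xh, b_h, W_hy, b_y):
    """One recurrent step: returns the new hidden state h_t and the output y_t."""
    h = np.tanh(W_hh @ h_prev + W_xh @ x + b_h)  # h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h)
    y = W_hy @ h + b_y                            # y_t = W_hy h_t + b_y
    return h, y

# Hypothetical sizes, chosen only for illustration.
hidden_size, input_size, output_size = 5, 3, 2
rng = np.random.default_rng(0)
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))
W_hy = rng.normal(scale=0.1, size=(output_size, hidden_size))
b_h = np.zeros(hidden_size)
b_y = np.zeros(output_size)

h0 = np.zeros(hidden_size)          # initial state
x1 = rng.normal(size=input_size)    # first input vector
h1, y1 = rnn_step(h0, x1, W_hh, W_xh, b_h, W_hy, b_y)
```

Calling `rnn_step` again with `h1` and the next input gives `h2`, and so on; the parameters stay the same across calls.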
[Figure: RNN computational graph unrolled over time: h0 → fW → h1 → fW → h2 → fW → h3 → … → hT, with inputs x1, x2, x3 and the same weights W re-used at every fW.]
Variants of the graph shown:
  – outputs y1, y2, y3, …, yT at every step
  – outputs y1, y2, y3, …, yT, each with its own loss L1, L2, L3, …, LT
  – inputs x1, x2, x3 only, with no per-step outputs
  – a single input x
  – two chains: weights W1 consume x1, x2, x3 up to hT, then weights W2 drive a second chain h1, h2, …
Loss
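A minimal sketch of the unrolled graph with per-step losses: the same parameters are re-used at every time step and the losses L1, …, LT are accumulated into one scalar. The squared-error loss, the function name `unroll`, and all sizes are assumptions made for illustration; the slides do not say which loss is used.

```python
import numpy as np

def unroll(xs, targets, h0, params):
    """Apply the same cell to every input and sum the per-step losses."""
    W_hh, W_xh, b_h, W_hy, b_y = params
    h, total_loss, outputs = h0, 0.0, []
    for x_t, target_t in zip(xs, targets):
        h = np.tanh(W_hh @ h + W_xh @ x_t + b_h)           # same W at every step
        y_t = W_hy @ h + b_y
        total_loss += 0.5 * np.sum((y_t - target_t) ** 2)  # L_t (assumed squared error)
        outputs.append(y_t)
    return outputs, h, total_loss

rng = np.random.default_rng(0)
H, D, K, T = 4, 3, 2, 5  # hidden, input, output sizes and sequence length (hypothetical)
params = (
    rng.normal(scale=0.1, size=(H, H)),  # W_hh
    rng.normal(scale=0.1, size=(H, D)),  # W_xh
    np.zeros(H),                         # b_h
    rng.normal(scale=0.1, size=(K, H)),  # W_hy
    np.zeros(K),                         # b_y
)
xs = rng.normal(size=(T, D))
targets = rng.normal(size=(T, K))
outputs, h_T, loss = unroll(xs, targets, np.zeros(H), params)
```

Dropping the per-step outputs and reading only `h_T` gives the sequence-in, single-output variant; feeding one input and then zeros would be one way to mimic the single-input variant.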
Vocabulary: [h,e,l,o]
Example training sequence: “hello”
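One common way to turn this training sequence into data for a character-level model is to one-hot encode each character and use the following character as the target. The sketch below assumes that setup; the helper names `char_to_ix` and `one_hot` are hypothetical.

```python
import numpy as np

vocab = ["h", "e", "l", "o"]
char_to_ix = {ch: i for i, ch in enumerate(vocab)}

def one_hot(ch):
    """Return a vector with a 1 at the character's vocabulary index."""
    v = np.zeros(len(vocab))
    v[char_to_ix[ch]] = 1.0
    return v

text = "hello"
inputs = [one_hot(ch) for ch in text[:-1]]     # inputs:  h, e, l, l
targets = [char_to_ix[ch] for ch in text[1:]]  # targets: e, l, l, o (as indices)
```

Each input is paired with the character that follows it, so the model is trained to predict the next character at every step.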
Character-level Language Model: Sampling
Vocabulary: [h,e,l,o]
At test-time sample characters one at a time, feed back to model
Softmax outputs (one column per time step, rows in vocabulary order):
  h   .03   .25   .11   .11
  e   .13   .20   .17   .02
  l   .00   .05   .68   .08
  o   .84   .50   .03   .79
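The sampling procedure can be sketched as a short loop: compute a softmax over the vocabulary, draw one character from it, and feed that character back in as the next input. The weights here are random and untrained, and the hidden size and function names are assumptions, so the generated text is meaningless; only the control flow is the point.

```python
import numpy as np

vocab = ["h", "e", "l", "o"]
V, H = len(vocab), 8  # H (hidden size) is an arbitrary choice
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(H, V))
W_hh = rng.normal(scale=0.1, size=(H, H))
W_hy = rng.normal(scale=0.1, size=(V, H))
b_h, b_y = np.zeros(H), np.zeros(V)

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

def sample(seed_char, n_chars):
    """Sample n_chars characters, feeding each sample back as the next input."""
    h = np.zeros(H)
    ix = vocab.index(seed_char)
    out = []
    for _ in range(n_chars):
        x = np.zeros(V)
        x[ix] = 1.0                              # one-hot encode the current character
        h = np.tanh(W_hh @ h + W_xh @ x + b_h)
        p = softmax(W_hy @ h + b_y)              # distribution over the next character
        ix = rng.choice(V, p=p)                  # sample instead of taking the argmax
        out.append(vocab[ix])
    return "".join(out)

generated = sample("h", 4)
```

Sampling rather than always picking the most likely character is what lets the model produce varied outputs from the same starting character.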
([Link]
566867f8291f086)
[Figure: RNN, with panels labelled “train more”.]
[Figure: multilayer RNN; axes: depth, time.]
[Figure: RNN cell: ht-1 and xt are stacked, multiplied by W, and passed through tanh to give ht.]
Backpropagation from $h_t$ to $h_{t-1}$ multiplies by W (actually $W_{hh}^T$)
[Figure: unrolled chain h0 → h1 → h2 → h3 → h4 with inputs x1, x2, x3, x4.]
Computing the gradient with respect to $h_0$ involves many factors of W (and repeated tanh)
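A small numerical sketch of why many factors of W matter: repeatedly multiplying a gradient vector by the transpose of the recurrent matrix makes its norm shrink or grow geometrically, depending on the matrix. The tanh derivative is ignored for simplicity, and the scaled-identity matrices and the function name `backprop_norms` are assumptions chosen so the effect is easy to see; this is not a full backpropagation-through-time implementation.

```python
import numpy as np

def backprop_norms(W_hh, steps):
    """Norm of a gradient vector after each multiplication by W_hh^T."""
    grad = np.ones(W_hh.shape[0])
    norms = []
    for _ in range(steps):
        grad = W_hh.T @ grad          # one step back in time (tanh factor omitted)
        norms.append(np.linalg.norm(grad))
    return norms

H = 4
small = 0.5 * np.eye(H)  # largest singular value 0.5 < 1
large = 1.5 * np.eye(H)  # largest singular value 1.5 > 1

vanishing = backprop_norms(small, 20)  # norms shrink towards zero
exploding = backprop_norms(large, 20)  # norms grow without bound
```

This is the usual argument for why gradients in a plain RNN tend to vanish or explode over long sequences.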
Backpropagation from $c_t$ to $c_{t-1}$ is only an elementwise multiplication by f, with no matrix multiply by W
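This statement can be written out, assuming the standard LSTM cell update. The symbols $i_t$ (input gate) and $g_t$ (candidate values) are not named on the slide and are introduced here only for illustration; the derivative follows the direct cell-state path, holding $h_{t-1}$ fixed.

```latex
c_t = f_t \odot c_{t-1} + i_t \odot g_t
\quad\Longrightarrow\quad
\frac{\partial c_t}{\partial c_{t-1}} = \operatorname{diag}(f_t)

% For comparison, the vanilla RNN update h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h) gives
\frac{\partial h_t}{\partial h_{t-1}} = \operatorname{diag}\!\left(1 - h_t^{2}\right) W_{hh}
```

The LSTM factor is a diagonal matrix of forget-gate values, while the vanilla RNN factor contains the full matrix $W_{hh}$ at every step.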
[Figure: ResNet layer stack, from input to output: Input → 7x7 conv, 64 / 2 → Pool → 6 × (3x3 conv, 64) → 3x3 conv, 128 / 2 → 5 × (3x3 conv, 128) → ... → 6 × (3x3 conv, 64) → Pool → FC 1000 → Softmax.]
Similar to ResNet!
Neural Image Captioning
[Figure: CNN pipeline: Convolution Layer + Non-Linearity → Pooling Layer → Convolution Layer + Non-Linearity → Pooling Layer → Fully-Connected MLP.]
Image Embedding (VGGNet): 4096-dim, followed by a Linear layer feeding the RNN.
[Figure: RNN unrolled over the caption. Inputs: <start>, Two, people, and, two. Each step outputs P(next); the words y1, y2, y3, y4, y5 read “Two people and two horses.”]
Sequence Model Factor Graph
[Figure: factor graph over y1, y2, y3, y4, y5, …]
$P(y_t \mid y_1, \ldots, y_{t-1})$
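The per-step conditional above is the building block of the chain-rule factorisation of a sequence's probability. As a short worked statement, with the sequence length $T$ introduced here as notation:

```latex
P(y_1, \ldots, y_T) = \prod_{t=1}^{T} P(y_t \mid y_1, \ldots, y_{t-1})
\qquad\Longrightarrow\qquad
\log P(y_1, \ldots, y_T) = \sum_{t=1}^{T} \log P(y_t \mid y_1, \ldots, y_{t-1})
```

Each factor corresponds to one P(next) output of the RNN, so the log-likelihood of a caption is the sum of the per-step log-probabilities of its words.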
Fei-Fei Li & Justin Johnson & Serena Yeung, Lecture 10 - May 4, 2017
[Figure: test image; the RNN's first input x0 is the <START> token and its first output is y0.]
before:
h = tanh(Wxh * x + Whh * h)
now:
h = tanh(Wxh * x + Whh * h + Wih * v)
[Figure: the image feature vector v enters the hidden state h0 through Wih; the input x0 is <START> and the output is y0.]
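The change from “before” to “now” amounts to one extra term in the hidden-state update. A minimal sketch follows, assuming v is a 4096-dimensional image feature (the size is borrowed from the VGGNet embedding mentioned earlier) and using hypothetical sizes for the word input and hidden state.

```python
import numpy as np

def captioning_step(x, h, v, W_xh, W_hh, W_ih):
    """h = tanh(Wxh * x + Whh * h + Wih * v), with v the image feature vector."""
    return np.tanh(W_xh @ x + W_hh @ h + W_ih @ v)

rng = np.random.default_rng(0)
H, D_word, D_img = 5, 6, 4096  # hidden and word sizes are arbitrary choices
W_xh = rng.normal(scale=0.01, size=(H, D_word))
W_hh = rng.normal(scale=0.01, size=(H, H))
W_ih = rng.normal(scale=0.01, size=(H, D_img))

x0 = rng.normal(size=D_word)   # stand-in for the <START> token embedding
h = np.zeros(H)                # initial hidden state
v = rng.normal(size=D_img)     # stand-in for the CNN feature of the test image
h0 = captioning_step(x0, h, v, W_xh, W_hh, W_ih)
```

With `v` set to all zeros the function reduces to the “before” update, which makes the role of the extra `Wih * v` term explicit.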
sample!
- A word (“straw”) is sampled from y0 and fed back as the next input, giving h1 and y1.
- sample! A second word (“hat”) is sampled from y1 and fed back, giving h2 and y2.
- sample <END> token => finish.
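The generation loop can be sketched as follows: start from `<START>`, sample a word, feed it back, and stop when `<END>` is drawn. Everything concrete here is an assumption for illustration: the four-word toy vocabulary, the sizes, the random untrained weights, the `max_len` cap, and the choice to inject the image feature only at the first step.

```python
import numpy as np

vocab = ["<START>", "straw", "hat", "<END>"]  # toy vocabulary (assumption)
V, H, D = len(vocab), 8, 16                   # D: image feature size (assumption)
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(H, V))
W_hh = rng.normal(scale=0.1, size=(H, H))
W_ih = rng.normal(scale=0.1, size=(H, D))
W_hy = rng.normal(scale=0.1, size=(V, H))

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

def sample_caption(v, max_len=20):
    """Sample words until <END> is drawn or max_len words have been emitted."""
    h = np.zeros(H)
    ix = vocab.index("<START>")
    words = []
    for t in range(max_len):
        x = np.zeros(V)
        x[ix] = 1.0                               # one-hot encode the previous word
        image_term = W_ih @ v if t == 0 else 0.0  # image injected at step 0 only (assumption)
        h = np.tanh(W_xh @ x + W_hh @ h + image_term)
        ix = rng.choice(V, p=softmax(W_hy @ h))   # sample!
        if vocab[ix] == "<END>":                  # <END> token => finish
            break
        words.append(vocab[ix])
    return words

caption = sample_caption(rng.normal(size=D))
```

The `max_len` cap is a practical safeguard, since an untrained or poorly trained model might never emit `<END>`.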
Captions generated using neuraltalk2
- A cat sitting on a suitcase on the floor
- A cat is sitting on a tree branch
- A dog is running in the grass with a frisbee
- A white teddy bear sitting in the grass
- Two people walking on the beach with surfboards
- A tennis player in action on the court
- Two giraffes standing in a grassy field
- A man riding a dirt bike on a dirt track
- A bird is perched on a tree branch
- A woman is holding a cat in her hand
- A man in a baseball uniform throwing a ball
- A woman standing on a beach holding a surfboard
- A person holding a computer mouse on a desk
From Captions to Visual Concepts and Back. Hao Fang*, Saurabh Gupta*, Forrest Iandola*, Rupesh K. Srivastava*, Li Deng, Piotr Dollár, Jianfeng Gao, Xiaodong He, Margaret Mitchell, John C. Platt, C. Lawrence Zitnick, Geoffrey Zweig. CVPR 2015.
Show, Attend and Tell
[Figures: attention weights $\alpha_{t,i}$.]
- Set up as reinforcement learning problem:
- Action = which area to attend to next
- Reward = log-likelihood of caption w.r.t. target sentence
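A rough sketch of how attention weights over image locations might be computed and used at one time step. The soft version takes a weighted average of location features; the hard version samples a single location, which corresponds to the "action" in the reinforcement-learning framing above. The dot-product scoring, the sizes, and all variable names are simplifying assumptions and are not taken from the slides.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
L, D = 6, 4                         # L image locations, D-dim feature each (assumptions)
features = rng.normal(size=(L, D))  # one feature vector per image location
h_prev = rng.normal(size=D)         # previous hidden state (same size only so a dot product works)

scores = features @ h_prev          # dot-product scoring is an assumption
alpha = softmax(scores)             # attention weights over locations at this time step

soft_context = alpha @ features     # soft attention: weighted average of location features
action = rng.choice(L, p=alpha)     # hard attention: sample which area to attend to next
hard_context = features[action]     # feature of the single sampled location
```

Soft attention keeps everything differentiable, while the sampled choice in hard attention is why a reinforcement-learning style objective is needed to train it.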
Examples
How to Evaluate different captions?