Unsupervised Learning
In a self-organizing map (SOM), the data points in the data set compete for
representation on the map.
SOM training starts by initializing the weight vectors.
From there a sample vector is selected randomly and the map of weight
vectors is searched to find which weight best represents that sample.
Each weight vector has neighboring weights that are close to it.
The weight that is chosen is rewarded by being able to become more like
that randomly selected sample vector.
The neighbors of that weight are also rewarded by being able to become
more like the chosen sample vector.
This allows the map to grow and form different shapes.
Most commonly, the map forms a square, rectangular, hexagonal, or L shape in
2D feature space.
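The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the 1-D map layout, the squared Euclidean distance, the toy data, and the learning-rate and radius schedules are all assumptions chosen for brevity.

```python
import random

def som_step(weights, sample, lr, radius):
    """One SOM update on a 1-D map; returns the index of the winning weight."""
    # Squared Euclidean distance from the sample to every weight vector.
    dists = [sum((w - x) ** 2 for w, x in zip(wv, sample)) for wv in weights]
    bmu = dists.index(min(dists))  # the weight that best represents the sample
    # Reward the winner and its map neighbors: move them toward the sample.
    for i, wv in enumerate(weights):
        if abs(i - bmu) <= radius:
            weights[i] = [w + lr * (x - w) for w, x in zip(wv, sample)]
    return bmu

rng = random.Random(0)
dim, n_nodes, n_steps = 4, 5, 20
# Step 1: initialize the weight vectors (random values, an assumption).
weights = [[rng.random() for _ in range(dim)] for _ in range(n_nodes)]
data = [[1, 0, 0, 0], [0, 0, 1, 1], [1, 1, 0, 0], [0, 0, 0, 1]]  # toy samples

for step in range(n_steps):
    radius = max(0, 2 - step // 5)       # neighborhood shrinks to a single node
    lr = 0.5 * (1 - step / n_steps)      # decaying learning rate (assumed schedule)
    sample = rng.choice(data)            # Step 2: pick a random sample
    som_step(weights, sample, lr, radius)  # Steps 3-4: find winner, update neighbors
```

With `radius=0` only the winning weight moves, which corresponds to the end of training when the neighborhood has shrunk to one node.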
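[Figure: worked SOM example with two output nodes, Y1 and Y2. Labeled weights: W11=0.1, W21=0.2, W31=0.8 (into Y1) and W12=0.9, W22=0.7, W32=0.5, W42=0.3 (into Y2); the values 1, 0, 0, 0 and an unlabeled 0.9 also appear in the figure. Distances: D(1)=0.4, D(2)=2.04.]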
H.W.
Over time the neighborhood will shrink to the size of just one node.
Autoencoders
Unsupervised learning
Representation learning
Autoencoders: The sneaky idea
[Figure: autoencoder with inputs x1 to x5, a two-unit bottleneck (v1, v2), and reconstructed outputs x’1 to x’5]
Autoencoder = Encoder + Decoder
[Figure: the same network, with an input layer x1 to x5, a bottleneck (v1, v2), and an output layer x’1 to x’5]
● Backpropagation
Deep Convolutional Autoencoder
● Similar architecture to AE
● Convolutional layers
● Encoder: Convolution + Leaky ReLU + Batch normalization
● Denoising
● Anomaly detection
● ...
Generation with AEs
[Figure: autoencoder with inputs x1 to x5 and outputs x’1 to x’5; in the final frames only the decoder and its outputs x’1 to x’5 remain]
Sample a point in the latent space and pass it through the decoder.
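Generation as described here can be sketched as follows. This is only an illustration under stated assumptions: the decoder weights `W_dec` and bias `c` are random stand-ins for a trained decoder, the sigmoid output activation is assumed, and the latent point is drawn from a standard normal distribution, which the slides do not specify.

```python
import math
import random

def decode(v, W_dec, c):
    """Decoder half of an autoencoder: x' = sigmoid(W_dec v + c)."""
    return [1 / (1 + math.exp(-(sum(w * vi for w, vi in zip(row, v)) + bj)))
            for row, bj in zip(W_dec, c)]

rng = random.Random(0)
latent_dim, out_dim = 2, 5
# Stand-in for trained decoder parameters (random here, purely illustrative).
W_dec = [[rng.gauss(0, 1) for _ in range(latent_dim)] for _ in range(out_dim)]
c = [0.0] * out_dim

v = [rng.gauss(0, 1) for _ in range(latent_dim)]  # sample a point (v1, v2)
x_new = decode(v, W_dec, c)                       # pass it through the decoder
print(x_new)                                      # five generated values x'1..x'5
```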
Denoising with AEs
[Figure: autoencoder with inputs x1 to x5, bottleneck v1, v2, and outputs x’1 to x’5]
Add noise to the audio, pass the noisy audio through the autoencoder, and measure the reconstruction error of the output against the original audio.
Anomaly detection with AEs
Autoencoder
CS7015 (Deep Learning): Lecture 7
Mitesh M. Khapra
Undercomplete Autoencoder
Overcomplete Autoencoder
$h = g(W x_i + b)$
$\hat{x}_i = f(W^* h + c)$
Let us consider the case when $\dim(h) \geq \dim(x_i)$.
In such a case the autoencoder could learn a trivial encoding by simply copying $x_i$ into $h$ and then copying $h$ into $\hat{x}_i$.
Such an identity encoding is useless in practice as it does not really tell us anything about the important characteristics of the data.
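A small sketch makes the trivial copy concrete. It assumes identity activations for g and f and hand-picked weights (neither is given in the slides); the point is only that, with dim(h) = 4 and dim(x) = 3, a perfect reconstruction is possible without learning anything about the data.

```python
def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * vi for m, vi in zip(row, v)) for row in M]

n, k = 3, 4  # dim(x) = 3, dim(h) = 4, so dim(h) >= dim(x)
# W copies x into the first n hidden units; the extra hidden unit stays at zero.
W = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(k)]
# W_star reads those n hidden units straight back out.
W_star = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(n)]
b, c = [0.0] * k, [0.0] * n

x = [0.3, -1.2, 2.5]
h = [a + bl for a, bl in zip(matvec(W, x), b)]           # h = g(Wx + b), g = identity
x_hat = [a + cj for a, cj in zip(matvec(W_star, h), c)]  # x_hat = f(W*h + c), f = identity
print(x_hat == x)  # zero reconstruction error, yet h is just a copy of x
```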
PCA vs Encoders
[Figure: two panels, labeled Encoder and PCA]
Regularization in Autoencoder
Denoising Autoencoder
A denoising autoencoder simply corrupts the input data using a probabilistic process $P(\tilde{x}_{ij} \mid x_{ij})$ before feeding it to the network, for example:
$P(\tilde{x}_{ij} = 0 \mid x_{ij}) = q$
$P(\tilde{x}_{ij} = x_{ij} \mid x_{ij}) = 1 - q$
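The corruption process above amounts to masking noise and can be sketched as below. The function name `corrupt`, the example vector, and the value q = 0.3 are illustrative assumptions.

```python
import random

def corrupt(x, q, rng):
    """Masking noise: each component is set to 0 with probability q, kept otherwise."""
    return [0.0 if rng.random() < q else xj for xj in x]

rng = random.Random(0)
x = [0.9, 0.1, 0.4, 0.7, 0.5]
x_tilde = corrupt(x, q=0.3, rng=rng)
# The network is fed x_tilde, but the loss compares its output with the clean x.
print(x_tilde)
```

Because the target is the uncorrupted input, copying the input to the output no longer minimizes the loss, which is what makes the corruption act as a regularizer.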
Need for regularization in autoencoders
While poor generalization could happen even in undercomplete autoencoders, it is an even more serious problem for overcomplete autoencoders.
Here (as stated earlier) the model can simply learn to copy $x_i$ to $h$ and then $h$ to $\hat{x}_i$.
To avoid poor generalization, we need to introduce regularization.
The simplest solution is to add an L2-regularization term to the objective function:
$$\min_{\theta, W, W^*, b, c} \; \frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{n} (\hat{x}_{ij} - x_{ij})^2 + \lambda \|\theta\|^2$$
This is very easy to implement and just adds a term $\lambda W$ to the gradient $\frac{\partial \mathcal{L}(\theta)}{\partial W}$ (and similarly for other parameters). Strictly, differentiating $\lambda \|\theta\|^2$ gives $2\lambda W$; the factor of 2 can be absorbed into $\lambda$.
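As a rough sketch of this objective, the snippet below computes the reconstruction term plus the penalty for a tiny hand-made batch, and the extra gradient term contributed by the penalty. The function names, the flat `params` list standing in for θ, and the numbers are all illustrative assumptions; the reconstructions `X_hat` are supplied directly rather than produced by a network.

```python
def objective(X, X_hat, params, lam):
    """(1/m) * sum of squared reconstruction errors + lam * ||theta||^2."""
    m = len(X)
    recon = sum((xh - x) ** 2
                for row, row_hat in zip(X, X_hat)
                for x, xh in zip(row, row_hat)) / m
    penalty = lam * sum(p ** 2 for p in params)
    return recon + penalty

def penalty_grad(params, lam):
    """Gradient of lam * ||theta||^2: 2 * lam * theta (the 2 is often folded into lam)."""
    return [2 * lam * p for p in params]

X = [[1.0, 0.0], [0.0, 1.0]]        # m = 2 inputs, n = 2 features
X_hat = [[0.8, 0.1], [0.2, 0.7]]    # hypothetical reconstructions
params = [0.5, -1.0, 2.0]           # all weights flattened into one list
lam = 0.01

loss = objective(X, X_hat, params, lam)
extra_grad = penalty_grad(params, lam)  # added to the reconstruction gradient
print(loss, extra_grad)
```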
Another trick is to tie the weights of the encoder and decoder, i.e., $W^* = W^T$.
This effectively reduces the capacity of the autoencoder and acts as a regularizer.
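A minimal sketch of weight tying, assuming sigmoid activations for both g and f and arbitrary illustrative weight values: the decoder reuses the transpose of the encoder matrix, so the parameter count drops by one full weight matrix.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def forward_tied(x, W, b, c):
    """Autoencoder forward pass with the decoder weights tied to W transposed."""
    h = [sigmoid(sum(w * xj for w, xj in zip(row, x)) + bl)
         for row, bl in zip(W, b)]
    W_T = list(zip(*W))  # decoder matrix W* = W^T, no separate parameters
    x_hat = [sigmoid(sum(w * hl for w, hl in zip(col, h)) + cj)
             for col, cj in zip(W_T, c)]
    return h, x_hat

n, k = 5, 2                                                # dim(x) = 5, dim(h) = 2
W = [[0.1 * (i + j) for j in range(n)] for i in range(k)]  # k x n, illustrative
b, c = [0.0] * k, [0.0] * n
h, x_hat = forward_tied([1, 0, 0, 1, 0], W, b, c)

untied = 2 * n * k + n + k   # W, W*, b, c
tied = n * k + n + k         # W, b, c
print(untied, tied)          # 27 17
```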
Sparse autoencoder
A hidden neuron with sigmoid activation will have values between 0 and 1.
We say that the neuron is activated when its output is close to 1 and not activated when its output is close to 0.
A sparse autoencoder tries to ensure that the hidden neuron is inactive most of the time.
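If the neuron $l$ is sparse (i.e. mostly inactive) then $\hat{\rho}_l \to 0$, where $\hat{\rho}_l$ is the average activation of neuron $l$ over the training examples.
A sparse autoencoder uses a sparsity parameter $\rho$ (typically very close to 0, say, 0.005) and tries to enforce the constraint $\hat{\rho}_l = \rho$.
One way of ensuring this is to add the following term to the objective function:
$\Omega(\theta) = \sum_{l=1}^{k} \rho \log \frac{\rho}{\hat{\rho}_l} + (1 - \rho) \log \frac{1 - \rho}{1 - \hat{\rho}_l}$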
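The sparsity term can be sketched as below, taking the average activation of each hidden neuron over a batch and summing the per-neuron penalties. The function name and the two small activation matrices are illustrative assumptions; activations are assumed to lie strictly between 0 and 1, as for sigmoid units.

```python
import math

def sparsity_penalty(H, rho):
    """Sum over hidden neurons of
    rho*log(rho/rho_hat_l) + (1-rho)*log((1-rho)/(1-rho_hat_l)),
    where rho_hat_l is the mean activation of neuron l over the batch H."""
    m, k = len(H), len(H[0])
    rho_hat = [sum(H[i][l] for i in range(m)) / m for l in range(k)]
    return sum(rho * math.log(rho / r) + (1 - rho) * math.log((1 - rho) / (1 - r))
               for r in rho_hat)

rho = 0.005
H_sparse = [[0.004, 0.006], [0.006, 0.004]]  # mean activation is about rho
H_dense = [[0.9, 0.8], [0.7, 0.6]]           # neurons active most of the time

print(sparsity_penalty(H_sparse, rho))  # close to 0: constraint satisfied
print(sparsity_penalty(H_dense, rho))   # large: heavily penalized
```

The penalty is zero when every average activation equals ρ and grows as the averages move away from it.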
$\Omega(\theta) = \|J_x(h)\|_F^2$
where $J_x(h)$ is the Jacobian of the encoder.
$\|J_x(h)\|_F^2 = \sum_{j=1}^{n} \sum_{l=1}^{k} \left( \frac{\partial h_l}{\partial x_j} \right)^2$
By putting these two contradicting objectives against each other we en- [...]
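For a concrete feel of this penalty, the sketch below evaluates the squared Frobenius norm of the encoder Jacobian for a sigmoid encoder, where each entry is h_l (1 - h_l) W_lj, and checks it against finite differences. The sigmoid activation, the weights, and the input are assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def encode(x, W, b):
    """h = sigmoid(Wx + b)."""
    return [sigmoid(sum(w * xj for w, xj in zip(row, x)) + bl)
            for row, bl in zip(W, b)]

def contractive_penalty(x, W, b):
    """||J_x(h)||_F^2 for a sigmoid encoder, using dh_l/dx_j = h_l (1 - h_l) W_lj."""
    h = encode(x, W, b)
    return sum((hl * (1 - hl) * w) ** 2 for hl, row in zip(h, W) for w in row)

def numeric_penalty(x, W, b, eps=1e-6):
    """The same quantity from central finite differences, as a sanity check."""
    total = 0.0
    for j in range(len(x)):
        x_plus = x[:j] + [x[j] + eps] + x[j + 1:]
        x_minus = x[:j] + [x[j] - eps] + x[j + 1:]
        h_plus, h_minus = encode(x_plus, W, b), encode(x_minus, W, b)
        total += sum(((p - q) / (2 * eps)) ** 2 for p, q in zip(h_plus, h_minus))
    return total

W = [[0.5, -0.3, 0.8], [0.1, 0.9, -0.4]]  # k = 2 hidden units, n = 3 inputs
b = [0.0, 0.1]
x = [1.0, 0.5, -1.0]
print(contractive_penalty(x, W, b), numeric_penalty(x, W, b))
```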
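Regularization
Weight decay: $\Omega(\theta) = \lambda \|\theta\|^2$
Sparse: $\Omega(\theta) = \sum_{l=1}^{k} \rho \log \frac{\rho}{\hat{\rho}_l} + (1 - \rho) \log \frac{1 - \rho}{1 - \hat{\rho}_l}$
Contractive: $\Omega(\theta) = \sum_{j=1}^{n} \sum_{l=1}^{k} \left( \frac{\partial h_l}{\partial x_j} \right)^2$
[Figure: autoencoder whose input $x_i$ is corrupted to $\tilde{x}_i$ via $P(\tilde{x}_{ij} \mid x_{ij})$, then encoded to $h$ and decoded to $\hat{x}_i$]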
Applications of autoencoders
► Dimensionality Reduction.
► Feature Extraction.
► Image Denoising.
► Image Compression.
► Anomaly Detection.
Image Search:
► Discriminator function
Linear factor methods such as probabilistic PCA, factor analysis, and
independent component analysis
► Using the average of the x and y coordinates we can find the centroid of the data
► Shift the centroid to the origin
► Find the line passing through the origin that best fits all data points
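The three steps above can be sketched for 2-D data as follows. This is a minimal illustration: the function name and the toy points are assumptions, and the best-fit line is taken to be the one minimizing perpendicular distances (the first principal direction), computed with the closed-form angle formula for the 2-D case.

```python
import math

def first_principal_direction(points):
    """Centroid of the data and unit direction of the best-fit line through it."""
    n = len(points)
    cx = sum(p[0] for p in points) / n   # centroid = average x and average y
    cy = sum(p[1] for p in points) / n
    centered = [(x - cx, y - cy) for x, y in points]  # shift centroid to origin
    sxx = sum(x * x for x, _ in centered)
    syy = sum(y * y for _, y in centered)
    sxy = sum(x * y for x, y in centered)
    # Angle of the line through the origin minimizing perpendicular distances.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return (cx, cy), (math.cos(theta), math.sin(theta))

points = [(1, 2), (2, 4), (3, 6), (4, 8)]   # toy data lying on y = 2x
centroid, direction = first_principal_direction(points)
print(centroid, direction)  # direction has slope 2
```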
Thank you !
Let us look at the full architecture.
A CNN is first used to encode the image.
An RNN is then used to decode (generate) a sentence from the encoding.
This is a typical encoder-decoder architecture.
Both the encoder and decoder use a neural network.
[Figure: a CNN encoder feeds an RNN decoder with weights U, V, W; the decoder reads <GO>, A, man, ..., park and emits A, man, throwing, ..., ⟨stop⟩, with output distribution $P(y_t = j \mid y_1^{t-1}, I)$]
Encoder:
$s_0 = \mathrm{CNN}(x_i)$
Decoder:
$s_t = \mathrm{RNN}(s_{t-1}, e(\hat{y}_{t-1}))$
Model:
Encoder:
$h_t = \mathrm{RNN}(h_{t-1}, \mathrm{CNN}(x_{it}))$
Decoder:
$s = h_T$
$P(y \mid I) = \mathrm{softmax}(Vs + b)$
Parameters: $V$, $b$, $W_{conv}$, $U_{enc}$, $W_{enc}$, $b$
Loss:
$\mathcal{L}(\theta) = -\log P(y = \ell \mid \text{Video})$
Algorithm: Gradient descent with backpropagation
[Figure: one CNN per video frame, feeding an RNN]
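The model above can be sketched end to end on toy data. Everything concrete here is an assumption for illustration: the `cnn` function is a one-layer stand-in rather than a real convolutional network, the RNN cell is a plain tanh cell with `U_enc` acting on the frame features and `W_enc` on the previous state, the encoder bias is named `b_enc`, and all weights and frames are random.

```python
import math
import random

def matvec(M, v):
    return [sum(m * vi for m, vi in zip(row, v)) for row in M]

def softmax(z):
    mx = max(z)
    e = [math.exp(zi - mx) for zi in z]
    total = sum(e)
    return [ei / total for ei in e]

def cnn(frame, W_conv):
    """Stand-in for a CNN: one linear map plus ReLU over a flattened frame."""
    return [max(0.0, a) for a in matvec(W_conv, frame)]

def classify_video(frames, W_conv, U_enc, W_enc, b_enc, V, b):
    h = [0.0] * len(b_enc)                      # h_0
    for frame in frames:                        # h_t = RNN(h_{t-1}, CNN(x_t))
        feats = cnn(frame, W_conv)
        pre = [u + w + bl for u, w, bl in
               zip(matvec(U_enc, feats), matvec(W_enc, h), b_enc)]
        h = [math.tanh(p) for p in pre]
    s = h                                       # s = h_T
    return softmax([a + bj for a, bj in zip(matvec(V, s), b)])  # P(y | video)

rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.gauss(0, 0.5) for _ in range(cols)] for _ in range(rows)]

frame_dim, feat_dim, hid_dim, n_classes, T = 6, 4, 3, 2, 5
W_conv = rand_matrix(feat_dim, frame_dim)
U_enc = rand_matrix(hid_dim, feat_dim)
W_enc = rand_matrix(hid_dim, hid_dim)
V = rand_matrix(n_classes, hid_dim)
b_enc, b = [0.0] * hid_dim, [0.0] * n_classes
frames = [[rng.random() for _ in range(frame_dim)] for _ in range(T)]

probs = classify_video(frames, W_conv, U_enc, W_enc, b_enc, V, b)
label = 1                                       # hypothetical true class
loss = -math.log(probs[label])                  # L(theta) = -log P(y = label | video)
print(probs, loss)
```

Training would then minimize this loss over all parameters by gradient descent with backpropagation, which is omitted here.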