
Unsupervised Learning

The document discusses unsupervised learning techniques, focusing on Self-Organizing Maps (SOM) and Autoencoders. SOM is an artificial neural network that maps high-dimensional data to lower dimensions for clustering and dimensionality reduction, while Autoencoders are designed to compress and reconstruct data, learning representations through a bottleneck architecture. Both methods are used for tasks such as generation, denoising, and anomaly detection, with Autoencoders being capable of learning non-linear relationships unlike PCA.

Uploaded by

meghdaves4ds

Unsupervised Learning Algorithm

► Self-Organizing Maps

► Kohonen Self-Organizing Feature Maps
Unsupervised Learning
In many cases we do not have labeled data and need to find the hidden patterns in the given dataset.
To solve such cases in machine learning, we need unsupervised learning techniques.
Self-Organizing Map (or Kohonen Map or SOM)
• A type of artificial neural network
• Inspired by biological models of neural systems from the 1970s
• Follows an unsupervised learning approach
• Trains its network through a competitive learning algorithm
• Used for clustering and mapping (or dimensionality reduction)
• Maps multidimensional data onto a lower-dimensional space, which allows people to reduce complex problems for easier interpretation
• Has two layers: the input layer and the output layer
The Algorithm:

1. Each node's weights are initialized.

2. A vector is chosen at random from the set of training data.

3. Every node is examined to calculate which one's weights are most like the input vector. The winning node is commonly known as the Best Matching Unit (BMU).

4. Then the neighborhood of the BMU is calculated. The number of neighbors decreases over time.

5. The winning weight is rewarded with becoming more like the sample vector. The neighbors also become more like the sample vector. The closer a node is to the BMU, the more its weights get altered, and the farther away the neighbor is from the BMU, the less it learns.

6. Repeat from step 2 for N iterations.
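The six steps above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function name `train_som`, the rectangular grid, the Gaussian neighborhood, and the exponential decay of the radius and learning rate are all assumptions chosen for the sketch, since the slides only state that the neighborhood shrinks over time.

```python
import numpy as np

def train_som(data, grid_shape=(5, 5), n_iters=500, lr0=0.5, sigma0=None, seed=0):
    """Train a SOM and return weights of shape (rows, cols, n_features)."""
    rng = np.random.default_rng(seed)
    rows, cols = grid_shape
    if sigma0 is None:
        sigma0 = max(rows, cols) / 2.0
    # Step 1: initialize each node's weights.
    weights = rng.random((rows, cols, data.shape[1]))
    # Grid coordinates of every node, used to measure distance to the BMU.
    coords = np.stack(
        np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1
    )
    for t in range(n_iters):
        # Step 2: choose a training vector at random.
        x = data[rng.integers(len(data))]
        # Step 3: the BMU is the node whose weights are closest to x.
        dists = np.sum((weights - x) ** 2, axis=-1)
        bmu = np.unravel_index(np.argmin(dists), dists.shape)
        # Step 4: neighborhood radius and learning rate shrink over time
        # (exponential decay is an assumption of this sketch).
        decay = np.exp(-3.0 * t / n_iters)
        sigma = sigma0 * decay
        lr = lr0 * decay
        # Step 5: nodes closer to the BMU on the grid move more towards x.
        grid_d2 = np.sum((coords - np.array(bmu)) ** 2, axis=-1)
        influence = np.exp(-grid_d2 / (2.0 * sigma ** 2))
        weights += lr * influence[..., None] * (x - weights)
    # Step 6 is the loop itself (N = n_iters).
    return weights
```

A call such as `train_som(data, grid_shape=(4, 6))` returns one weight vector per grid node; mapping a sample to the grid position of its BMU gives the low-dimensional representation.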


The Best Matching Unit is found by running through all weight vectors and calculating the distance from each one to the sample vector.
The weight vector with the shortest distance is the winner.
There are numerous ways to determine the distance; the most commonly used is the Euclidean distance.
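Written out, and assuming an input vector x with n components and w_ij denoting the weight from input i to output node j (an indexing chosen here to match the W11, W21, ... and D(1), D(2) labels used in the slides), the squared Euclidean distance and the winner can be expressed as below. Using the squared distance instead of the distance itself does not change which node wins.

```latex
D(j) = \sum_{i=1}^{n} (x_i - w_{ij})^2,
\qquad
\mathrm{BMU} = \arg\min_{j} D(j)
```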
A self-organizing map (SOM) is a type of artificial neural network (ANN) that is trained using unsupervised learning to produce a low-dimensional (typically two-dimensional), discretized representation of the input space of the training samples, called a map, and is therefore a method to do dimensionality reduction.

Self-organizing maps differ from other artificial neural networks as they apply competitive learning as opposed to error-correction learning (such as backpropagation with gradient descent), and in the sense that they use a neighborhood function to preserve the topological properties of the input space.
What really happens in SOM?

Each data point in the data set competes for representation.
The SOM mapping starts by initializing the weight vectors.
From there a sample vector is selected randomly and the map of weight vectors is searched to find which weight best represents that sample.
Each weight vector has neighboring weights that are close to it.
The weight that is chosen is rewarded by being able to become more like that randomly selected sample vector.
The neighbors of that weight are also rewarded by being able to become more like the chosen sample vector.
This allows the map to grow and form different shapes.
Most commonly, they form square, rectangular, hexagonal, or L shapes in 2D feature space.
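[Figure: worked SOM example with two output nodes, Y1 and Y2. Labeled weights: W11 = 0.1, W21 = 0.2, W31 = 0.8, W12 = 0.9, W22 = 0.7, W32 = 0.5, W42 = 0.3. Other values shown: 1, 0, 0, 0 and an unlabeled 0.9. Computed distances: D(1) = 0.4, D(2) = 2.04. Marked H.W. (homework).]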
Neighborhood: over time the neighborhood will shrink to the size of just one node.
Autoencoders

Unsupervised learning

Representation learning
Autoencoders: The sneaky idea

Create an architecture with a bottleneck, which ensures a lower-dimensional representation of the original data.
[Figure: autoencoder with an input layer (x1 to x5), a bottleneck (v1, v2), and an output layer (x’1 to x’5).]
Autoencoder = Encoder + Decoder
Encoder = compress data into lower-dimensional representation (latent space)


Necessary condition to learn a representation

● Data should have dependencies across dimensions

● If dimensions are all independent -> impossible to learn a lower-dimensional representation
Decoder = Decompress representation back to original domain


[Figure: the same autoencoder, with the original data on the input side and its reconstruction on the output side.]


How can we train an autoencoder?

● Backpropagation

● Minimise reconstruction error

[Figure: original data compared with reconstructed data.]
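A minimal NumPy sketch of this training loop follows, assuming a single tanh encoder layer, a linear decoder, full-batch gradient descent, and the mean squared reconstruction error. The function name `train_autoencoder`, the layer sizes, and the hyperparameters are illustrative assumptions. The code uses the row-vector convention (`X @ W`), so `W` here has shape (inputs, hidden).

```python
import numpy as np

def train_autoencoder(X, n_hidden=2, lr=0.1, n_epochs=2000, seed=0):
    """Full-batch gradient descent on the mean squared reconstruction error."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.normal(0.0, 0.1, (n, n_hidden))      # encoder weights
    b = np.zeros(n_hidden)                       # encoder bias
    W_dec = rng.normal(0.0, 0.1, (n_hidden, n))  # decoder weights
    c = np.zeros(n)                              # decoder bias
    losses = []
    for _ in range(n_epochs):
        # Forward pass: encode to the bottleneck, then decode.
        H = np.tanh(X @ W + b)
        X_hat = H @ W_dec + c
        err = X_hat - X
        losses.append(np.sum(err ** 2) / m)
        # Backward pass (backpropagation of the reconstruction error).
        d_out = 2.0 * err / m
        grad_W_dec = H.T @ d_out
        grad_c = d_out.sum(axis=0)
        d_hidden = (d_out @ W_dec.T) * (1.0 - H ** 2)  # tanh derivative
        grad_W = X.T @ d_hidden
        grad_b = d_hidden.sum(axis=0)
        # Gradient descent update.
        W -= lr * grad_W
        b -= lr * grad_b
        W_dec -= lr * grad_W_dec
        c -= lr * grad_c
    return (W, b, W_dec, c), losses
```

On data whose five dimensions depend on two underlying factors, the recorded `losses` should fall as training proceeds, which is the behavior the bottleneck is meant to exploit.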


Expectation from autoencoder

● Sensitive enough to input data to reconstruct it

● Insensitive enough to input data not to overfit it


Deep Autoencoder

[Figure: deep autoencoder with bottleneck units v1 and v2.]
Deep Convolutional Autoencoder

● Similar architecture to AE

● Convolutional layers

● Encoder: Convolution + Leaky ReLU + Batch normalization

● Decoder: Convolution transpose + Leaky ReLU + Batch normalization
What is the point of compressing and decompressing data?

The latent space keeps the most important attributes of the input data

We can leverage the latent space to perform several interesting tasks
Autoencoder applications
● Generation

● Denoising

● Anomaly detection

● ...
Generation with AEs

[Figure: autoencoder with inputs x1 to x5 and outputs x’1 to x’5; in the final frames only the decoder and its outputs x’1 to x’5 remain.]

Sample a point in the latent space and pass it through the decoder
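The sampling step can be sketched as below, assuming a linear decoder with weights `W_dec` and bias `c` that have already been trained, and assuming latent points are drawn uniformly from a box whose bounds were observed on the training data. The names `decode` and `generate`, and the uniform sampling rule, are hypothetical choices for this illustration.

```python
import numpy as np

def decode(v, W_dec, c):
    """Decoder half only: map latent points of shape (batch, k) to data space."""
    return v @ W_dec + c

def generate(n_samples, latent_low, latent_high, W_dec, c, seed=0):
    """Sample latent points inside [latent_low, latent_high] and decode them."""
    rng = np.random.default_rng(seed)
    v = rng.uniform(latent_low, latent_high, size=(n_samples, len(latent_low)))
    return v, decode(v, W_dec, c)
```

Note that the encoder is not used at generation time: only the sampled latent point and the decoder are involved.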
Denoising with AEs

[Figure: autoencoder with inputs x1 to x5, bottleneck units v1 and v2, and outputs x’1 to x’5.]

Add noise to audio

Measure reconstruction error of output against original audio
Anomaly detection with AEs
Autoencoders
Autoencoder

An autoencoder is a special type of feed forward neural network which does the following:
Encodes its input $x_i$ into a hidden representation $h$
Decodes the input again from this hidden representation
The model is trained to minimize a certain loss function which will ensure that $\hat{x}_i$ is close to $x_i$ (we will see some such loss functions soon)

$h = g(W x_i + b)$
$\hat{x}_i = f(W^* h + c)$

CS7015 (Deep Learning): Lecture 7, Mitesh M. Khapra
Undercomplete Autoencoder

Let us consider the case where $\dim(h) < \dim(x_i)$
If we are still able to reconstruct $\hat{x}_i$ perfectly from $h$, then what does it say about $h$?
$h$ is a loss-free encoding of $x_i$. It captures all the important characteristics of $x_i$
Do you see an analogy with PCA?

An autoencoder where $\dim(h) < \dim(x_i)$ is called an undercomplete autoencoder
Overcomplete Autoencoder

Let us consider the case when $\dim(h) \geq \dim(x_i)$
In such a case the autoencoder could learn a trivial encoding by simply copying $x_i$ into $h$ and then copying $h$ into $\hat{x}_i$
Such an identity encoding is useless in practice as it does not really tell us anything about the important characteristics of the data

An autoencoder where $\dim(h) \geq \dim(x_i)$ is called an overcomplete autoencoder
PCA vs Encoders

● Both perform dimensionality reduction

● PCA learns linear relationships

● Encoders can learn non-linear relationships

● Encoder = PCA, if it uses linear activation functions

[Figure: side-by-side comparison of the Encoder and PCA.]
Autoencoder is equivalent to PCA if we
use a linear encoder
use a linear decoder
use squared error loss function
normalize the inputs to
$\hat{x}_{ij} = \frac{1}{\sqrt{m}} \left( x_{ij} - \frac{1}{m} \sum_{k=1}^{m} x_{kj} \right)$

[Figure: autoencoder ($x_i$, $h$, $\hat{x}_i$) beside PCA, with $h \equiv (u_1, u_2)$ and $P^T X^T X P = D$.]
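The PCA side of this equivalence can be checked numerically. The sketch below applies the normalization from the slide and then diagonalizes $X^T X$; with this normalization $X^T X$ is the covariance matrix of the original data, and the eigenvector matrix $P$ satisfies $P^T X^T X P = D$. The function names are illustrative, and the sketch covers only the PCA computation, not the training of a linear autoencoder.

```python
import numpy as np

def normalize_inputs(X):
    """Subtract each column's mean and scale by 1/sqrt(m), as on the slide."""
    m = X.shape[0]
    return (X - X.mean(axis=0)) / np.sqrt(m)

def pca_basis(X_norm):
    """Eigen-decomposition of X^T X, sorted by decreasing eigenvalue."""
    eigvals, P = np.linalg.eigh(X_norm.T @ X_norm)
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], P[:, order]

def project(X_norm, P, k):
    """Keep the first k principal directions (the PCA 'encoding')."""
    return X_norm @ P[:, :k]
```

Computing `P.T @ X_norm.T @ X_norm @ P` should give a diagonal matrix whose entries are the sorted eigenvalues, up to floating-point error.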
Regularization in Autoencoder
Denoising Autoencoder
A denoising autoencoder simply corrupts the input data using a probabilistic process ($P(\tilde{x}_{ij} | x_{ij})$) before feeding it to the network
$P(\tilde{x}_{ij} = 0 | x_{ij}) = q$
$P(\tilde{x}_{ij} = x_{ij} | x_{ij}) = 1 - q$
Aim: the network should not memorize the training data. Hence it will work well on unseen data
How does this help?
This helps because the objective is still to reconstruct the original (uncorrupted) $x_i$
$L = \min \frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{n} (\hat{x}_{ij} - x_{ij})^2$
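The corruption process and the loss can be sketched as follows. Each entry is set to 0 with probability q and kept with probability 1 - q, and the reconstruction is compared against the uncorrupted input. The function names `corrupt` and `denoising_loss` are illustrative, and the network that maps the corrupted input to the reconstruction is left out.

```python
import numpy as np

def corrupt(X, q, rng):
    """Masking noise: P(x_tilde = 0 | x) = q, P(x_tilde = x | x) = 1 - q."""
    keep = rng.random(X.shape) >= q
    return X * keep

def denoising_loss(X_hat, X):
    """(1/m) * sum_i sum_j (x_hat_ij - x_ij)^2 against the uncorrupted X."""
    return np.sum((X_hat - X) ** 2) / X.shape[0]
```

In a training loop the network would receive `corrupt(X, q, rng)` as input while `denoising_loss` is still evaluated against `X` itself.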
Need for Regularization in Autoencoder
While poor generalization could happen even in undercomplete autoencoders, it is an even more serious problem for overcomplete autoencoders
Here (as stated earlier), the model can simply learn to copy $x_i$ to $h$ and then $h$ to $\hat{x}_i$
To avoid poor generalization, we need to introduce regularization
The simplest solution is to add an $L_2$-regularization term to the objective function
$\min_{\theta, W, W^*, b, c} \frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{n} (\hat{x}_{ij} - x_{ij})^2 + \lambda \|\theta\|^2$
This is very easy to implement and just adds a term $\lambda W$ to the gradient $\frac{\partial L(\theta)}{\partial W}$ (up to a constant factor of 2 that can be absorbed into $\lambda$), and similarly for the other parameters
Another trick is to tie the weights of the encoder and decoder, i.e., $W^* = W^T$
This effectively reduces the capacity of the autoencoder and acts as a regularizer
Sparse autoencoder
A hidden neuron with sigmoid activation will have values between 0 and 1
We say that the neuron is activated when its output is close to 1 and not activated when its output is close to 0
A sparse autoencoder tries to ensure that the hidden neuron is inactive most of the time
The average value of the activation of a neuron $l$ is given by
$\hat{\rho}_l = \frac{1}{m} \sum_{i=1}^{m} h(x_i)_l$
If the neuron $l$ is sparse (i.e. mostly inactive) then $\hat{\rho}_l \to 0$
A sparse autoencoder uses a sparsity parameter $\rho$ (typically very close to 0, say, 0.005) and tries to enforce the constraint $\hat{\rho}_l = \rho$
One way of ensuring this is to add the following term to the objective function
$\Omega(\theta) = \sum_{l=1}^{k} \left[ \rho \log \frac{\rho}{\hat{\rho}_l} + (1 - \rho) \log \frac{1 - \rho}{1 - \hat{\rho}_l} \right]$
When will this term reach its minimum value and what is the minimum value?
The function will reach its minimum value(s) when $\hat{\rho}_l = \rho$
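The sparsity term can be evaluated directly to explore the question above. Each summand is the KL divergence between a Bernoulli distribution with mean ρ and one with mean ρ̂_l, which is never negative and equals 0 exactly when the two means agree, so the minimum value of the term is 0. The function names below are illustrative, and `H` is assumed to hold sigmoid activations strictly between 0 and 1.

```python
import numpy as np

def average_activation(H):
    """rho_hat_l = (1/m) * sum_i h(x_i)_l for a (m, k) matrix of activations."""
    return H.mean(axis=0)

def sparsity_penalty(rho, rho_hat):
    """Omega = sum_l [rho*log(rho/rho_hat_l) + (1-rho)*log((1-rho)/(1-rho_hat_l))]."""
    rho_hat = np.asarray(rho_hat, dtype=float)
    return np.sum(
        rho * np.log(rho / rho_hat)
        + (1.0 - rho) * np.log((1.0 - rho) / (1.0 - rho_hat))
    )
```

Evaluating `sparsity_penalty(0.005, rho_hat)` for activations that drift away from 0.005 in either direction shows the penalty growing from its minimum.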
Contractive autoencoder
A contractive autoencoder also tries to prevent an overcomplete autoencoder from learning the identity function
It does so by adding the following regularization term to the loss function
$\Omega(\theta) = \|J_x(h)\|_F^2$
where $J_x(h)$ is the Jacobian of the encoder

[Figure: network with input $x$ (units $j = 1, \dots, n$), hidden layer $h$ (units $l = 1, \dots, k$), and output $\hat{x}$ (units $1, \dots, n$).]
$\|J_x(h)\|_F^2 = \sum_{j=1}^{n} \sum_{l=1}^{k} \left( \frac{\partial h_l}{\partial x_j} \right)^2$
By putting these two contradicting objectives against each other we ensure that $h$ is sensitive to only very important variations as observed in the training data
$L(\theta)$ - capture important variations in data
$\Omega(\theta)$ - do not capture variations in data
Tradeoff - capture only very important variations in the data
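For a concrete case, assume the encoder is $h = \sigma(W x + b)$ with a sigmoid activation (an assumption made for this sketch; the slides leave $g$ unspecified). Then $\partial h_l / \partial x_j = h_l (1 - h_l) W_{lj}$, and the squared Frobenius norm has a closed form. The sketch below computes it and also provides a finite-difference version for comparison; all function names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def contractive_penalty(x, W, b):
    """||J_x(h)||_F^2 for h = sigmoid(W x + b), with W of shape (k, n)."""
    h = sigmoid(W @ x + b)
    J = (h * (1.0 - h))[:, None] * W   # J[l, j] = dh_l / dx_j
    return np.sum(J ** 2)

def numerical_penalty(x, W, b, eps=1e-6):
    """Same quantity via central finite differences, for checking."""
    n = x.size
    J = np.empty((b.size, n))
    for j in range(n):
        dx = np.zeros(n)
        dx[j] = eps
        J[:, j] = (sigmoid(W @ (x + dx) + b) - sigmoid(W @ (x - dx) + b)) / (2 * eps)
    return np.sum(J ** 2)
```

The penalty is small when the hidden units are saturated or the weights are small, which is how it discourages $h$ from reacting to every variation in $x$.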
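Regularization
Weight decay: $\Omega(\theta) = \lambda \|\theta\|^2$
Sparse: $\Omega(\theta) = \sum_{l=1}^{k} \left[ \rho \log \frac{\rho}{\hat{\rho}_l} + (1 - \rho) \log \frac{1 - \rho}{1 - \hat{\rho}_l} \right]$
Contractive: $\Omega(\theta) = \sum_{j=1}^{n} \sum_{l=1}^{k} \left( \frac{\partial h_l}{\partial x_j} \right)^2$

[Figure: denoising setup in which $x_i$ is corrupted to $\tilde{x}_i$ via $P(\tilde{x}_{ij} | x_{ij})$, encoded to $h$, and decoded to $\hat{x}_i$.]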
Applications of autoencoders
► Dimensionality Reduction
► Feature Extraction
► Image Denoising
► Image Compression
► Anomaly Detection
► Image Search
► Missing Value Imputation

GAN Architecture
► Generator function

► Discriminator function
Linear factor methods such as probabilistic PCA, factor analysis, and independent component analysis
► Average the x and y coordinates of the data to find its centroid
► Shift the centroid to the origin
► Find the line through the origin that best fits all the data points
Thank you !

Let us look at the full architecture
A CNN is first used to encode the image
An RNN is then used to decode (generate) a sentence from the encoding
This is a typical encoder decoder architecture
Both the encoder and decoder use a neural network
Alternatively, the encoder's output can be fed to every step of the decoder

[Figure: a CNN encoder feeds an RNN decoder (weights U, V, W; initial state h0). The decoder takes "<GO> A man ... park" as input and emits "A man throwing ... ⟨stop⟩", with output distribution $P(y_t = j | y_1^{t-1}, I)$.]

CS7015 (Deep Learning): Lecture 16

Task: Image captioning
Data: $\{x_i = \text{image}_i, y_i = \text{caption}_i\}_{i=1}^{N}$
Model:
Encoder: $s_0 = CNN(x_i)$
Decoder: $s_t = RNN(s_{t-1}, e(\hat{y}_{t-1}))$
$P(y_t | y_1^{t-1}, I) = softmax(V s_t + b)$
Parameters: $U_{dec}, V, W_{dec}, W_{conv}, b$
Loss: $L(\theta) = \sum_{t=1}^{T} L_t(\theta) = -\sum_{t=1}^{T} \log P(y_t = \ell_t | y_1^{t-1}, I)$
Algorithm: Gradient descent with backpropagation

[Figure: the CNN encoder and RNN decoder from the previous slide, annotated with the per-step loss $L_t(\theta) = -\log P(y_t = j | y_1^{t-1}, fc_7)$.]
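The loss above can be sketched in NumPy, assuming the decoder states $s_1, \dots, s_T$ have already been computed by the RNN and are stacked in a matrix `S`. The function name `sequence_loss` and the array shapes are assumptions of this sketch; the CNN encoder and the RNN itself are not shown.

```python
import numpy as np

def sequence_loss(S, V, b, targets):
    """Sum over t of -log P(y_t = l_t), with P = softmax(V s_t + b).

    S: (T, d) decoder states, V: (vocab, d), b: (vocab,),
    targets: (T,) integer indices of the true words l_t.
    """
    logits = S @ V.T + b
    # Subtract the row-wise maximum for numerical stability.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.sum(log_probs[np.arange(len(targets)), targets])
```

With all-zero `V` and `b` the softmax is uniform, so the loss reduces to T times the log of the vocabulary size, which gives a quick sanity check.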


Task: Video Classification
Data: $\{x_i = \text{Video}_i, y_i = \text{Activity}_i\}_{i=1}^{N}$
Model:
Encoder: $h_t = RNN(h_{t-1}, CNN(x_{it}))$
Decoder: $s = h_T$, $P(y | I) = softmax(V s + b)$
Parameters: $V, b, W_{conv}, U_{enc}, W_{enc}, b$
Loss: $L(\theta) = -\log P(y = \ell | \text{Video})$
Algorithm: Gradient descent with backpropagation

[Figure: each video frame passes through a CNN, an RNN runs over the frame encodings, and the output (o/p) is the activity label, here Surya Namaskar.]


Critic = realness score
Types of distances
• JS divergence
• KL divergence
• Wasserstein distance
• Total Variation distance
