MODULE 4: AUTOENCODERS (AE) & VARIATIONAL AUTOENCODERS (VAE)
Prepared from your uploaded lecture slides and notes.
1. Introduction to Autoencoders
An Autoencoder (AE) is a special type of feedforward neural network used for:
● feature learning
● dimensionality reduction
● data compression
● denoising
Goal:
● reconstruct the input at output layer.
Basic Structure of Autoencoder
Autoencoder contains:
1. Encoder
2. Latent Representation (Hidden Layer)
3. Decoder
Encoder
Encoder converts input into compressed representation.
Formula:
h=g(Wx+b)
Where:
● (x) = input
● (W) = weights
● (b) = bias
● (g) = encoder activation function
● (h) = hidden representation
Decoder
Decoder reconstructs original input from hidden representation.
Formula:
\hat{x}=f(W^{*}h+c)
Where:
● (\hat{x}) = reconstructed input
● (W^*) = decoder weights
● (c) = bias
● (f) = decoder activation function
Working of Autoencoder
Step 1
Input data fed to encoder.
Step 2
Encoder compresses input.
Step 3
Hidden layer stores important features.
Step 4
Decoder reconstructs original data.
Step 5
Loss calculated between original and reconstructed data.
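The five steps above can be sketched as a single forward pass. This is a minimal illustration, not a trained model: the layer sizes, the random weights, the sigmoid choice for g and the linear choice for f are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

n_in, n_hidden = 6, 3  # hypothetical sizes: 6 inputs compressed to 3 hidden units

# Randomly initialised parameters (untrained, for illustration only)
W = rng.normal(scale=0.1, size=(n_hidden, n_in))       # encoder weights W
b = np.zeros(n_hidden)                                 # encoder bias b
W_star = rng.normal(scale=0.1, size=(n_in, n_hidden))  # decoder weights W*
c = np.zeros(n_in)                                     # decoder bias c


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def encode(x):
    # h = g(Wx + b), with g assumed to be the sigmoid
    return sigmoid(W @ x + b)


def decode(h):
    # x_hat = f(W* h + c), with f assumed to be linear (identity)
    return W_star @ h + c


x = rng.normal(size=n_in)         # Step 1: input
h = encode(x)                     # Steps 2-3: compressed hidden representation
x_hat = decode(h)                 # Step 4: reconstruction
loss = np.mean((x_hat - x) ** 2)  # Step 5: reconstruction loss (mean squared error)

print(h.shape, x_hat.shape, loss)
```

Training would repeat this pass and adjust W, b, W* and c to lower the loss.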
2. Types of Autoencoders
1. Undercomplete Autoencoder
2. Overcomplete Autoencoder
3. Denoising Autoencoder
4. Sparse Autoencoder
5. Variational Autoencoder
3. Undercomplete Autoencoder
Condition:
\dim(h) < \dim(x)
Meaning:
hidden layer smaller than input layer.
Advantages
1. Learns compressed representation
2. Removes redundancy
3. Similar to PCA
Important Point
If reconstruction is still accurate:
● hidden representation captures important characteristics of data.
Relation with PCA
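An undercomplete AE behaves similarly to:
Principal Component Analysis (PCA)
Both perform dimensionality reduction. With a linear encoder, a linear decoder and squared-error loss, the AE learns the same subspace as PCA.
Difference:
● PCA is linear
● Autoencoder can learn non-linear features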
4. Overcomplete Autoencoder
Condition:
\dim(h) \ge \dim(x)
Meaning:
hidden layer equal or larger than input.
Problem
Model may simply copy:
x \rightarrow h \rightarrow \hat{x}
This is called:
Identity Mapping
Disadvantage
Does not learn meaningful features.
Needs:
Regularization
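A small sketch of why identity mapping is possible when dim(h) >= dim(x). The weights below are hand-picked rather than learned, and the linear activations and zero biases are assumptions chosen to keep the example short.

```python
import numpy as np

n_in, n_hidden = 3, 5  # overcomplete: dim(h) >= dim(x)

# Hand-picked weights (not learned): the first 3 hidden units copy the input,
# the remaining 2 hidden units stay at zero.
W = np.vstack([np.eye(n_in), np.zeros((n_hidden - n_in, n_in))])  # shape (5, 3)
W_star = W.T                                                      # shape (3, 5)

x = np.array([0.2, -1.5, 3.0])
h = W @ x           # linear encoder, zero bias
x_hat = W_star @ h  # linear decoder, zero bias

print(h)      # first three entries equal x, last two are 0
print(x_hat)  # identical to x
```

The reconstruction error is zero, yet h is just a copy of x, so nothing useful about the data has been learned.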
5. Choice of Activation Functions
For Binary Inputs
Best decoder activation:
Logistic/Sigmoid Function
Because the output remains between 0 and 1.
Sigmoid Formula
\sigma(x)=\frac{1}{1+e^{-x}}
For Real Valued Inputs
Use:
Linear Activation
Formula:
\hat{x}=W^{*}h+c
Reason:
real values are unrestricted.
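The two decoder choices can be compared on the same pre-activation values. The numbers below are made up and stand in for W* h + c.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


# Made-up values standing in for the decoder pre-activation W* h + c
pre_activation = np.array([-4.0, -0.5, 0.0, 2.0, 6.0])

binary_output = sigmoid(pre_activation)  # squashed into (0, 1), suits binary inputs
real_output = pre_activation             # linear decoder, range is unrestricted

print(binary_output)
print(real_output)
```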
6. Loss Function of Autoencoder
Goal:
make reconstructed output close to original input.
Mean Squared Error (MSE)
L=\frac{1}{m}\sum_{i=1}^{m}(\hat{x}_i-x_i)^2
Vector Form (for a single example i)
L_i=(\hat{x}_i-x_i)^T(\hat{x}_i-x_i)
Objective
Minimize reconstruction error.
Training done using:
Backpropagation
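A worked numeric example of the reconstruction loss. The data values are made up; m is taken to be the number of training examples, and the squared error of each example is summed over its features.

```python
import numpy as np

# Made-up data: m = 2 examples with 3 features each
X = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
X_hat = np.array([[0.9, 0.1, 0.8],
                  [0.2, 0.7, 1.0]])

diff = X_hat - X

# Per-example squared error in vector form: (x_hat - x)^T (x_hat - x)
per_example = np.array([d @ d for d in diff])

# Average over the m examples
m = X.shape[0]
loss = per_example.sum() / m

print(per_example)  # approximately [0.06, 0.13]
print(loss)         # approximately 0.095
```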
7. Regularization in Autoencoders
Regularization prevents:
● overfitting
● identity mapping
Especially important in overcomplete autoencoders.
L2 Regularization
Objective Function:
L=\frac{1}{m}\sum_{i=1}^{m}(\hat{x}_i-x_i)^2+\lambda ||\theta||^2
Where:
● (\lambda) = regularization parameter
● (\theta) = all parameters of the network (weights and biases)
Advantages
1. Better generalization
2. Reduces overfitting
3. Controls weight magnitude
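A minimal sketch of the L2-regularized objective. The function name l2_regularized_loss, the data and the parameter values are all hypothetical; the parameter list stands in for \theta.

```python
import numpy as np


def l2_regularized_loss(X, X_hat, params, lam):
    """Reconstruction error averaged over m examples plus lam * ||theta||^2."""
    m = X.shape[0]
    reconstruction = np.sum((X_hat - X) ** 2) / m
    penalty = lam * sum(np.sum(p ** 2) for p in params)
    return reconstruction + penalty


# Made-up data and parameters
X = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
X_hat = np.array([[0.9, 0.1, 0.8],
                  [0.2, 0.7, 1.0]])
W = np.array([[1.0, -2.0],
              [0.5, 0.0]])
b = np.array([0.5])

total = l2_regularized_loss(X, X_hat, [W, b], lam=0.01)
print(total)  # approximately 0.095 + 0.01 * 5.5 = 0.15
```

A larger \lambda puts more weight on keeping the parameters small and less on exact reconstruction.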
Tied Weights
Another regularization technique:
W^* = W^T
Meaning:
decoder weights are transpose of encoder weights.
Advantages
1. Fewer parameters
2. Better learning
3. Reduced complexity
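A sketch of a forward pass with tied weights, together with the parameter count. The layer sizes and the tanh encoder activation are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 784, 64  # hypothetical layer sizes

W = rng.normal(scale=0.01, size=(n_hidden, n_in))  # the only weight matrix
b = np.zeros(n_hidden)
c = np.zeros(n_in)


def forward_tied(x):
    h = np.tanh(W @ x + b)  # encoder uses W (tanh is an assumed choice of g)
    return W.T @ h + c      # decoder reuses W^T instead of a separate W*


# Parameter counts: weights plus both bias vectors
untied_params = 2 * n_in * n_hidden + n_hidden + n_in
tied_params = n_in * n_hidden + n_hidden + n_in

print(untied_params, tied_params)  # 101200 51024
```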
8. Denoising Autoencoder (DAE)
A DAE intentionally corrupts the input before feeding it to the encoder; the loss is still computed against the original clean input.
Goal:
reconstruct original clean input.
Working
Step 1
Add noise to input.
Step 2
Feed noisy input to encoder.
Step 3
Decoder reconstructs clean input.
Corrupted Input:
\tilde{x}
Original Input:
x
Noise Addition
One method:
P(\tilde{x}_{ij}=0|x_{ij})=q
P(\tilde{x}_{ij}=x_{ij}|x_{ij})=1-q
Meaning:
with probability (q) an input entry is set to zero; otherwise it is left unchanged.
Another Method
Gaussian noise:
\tilde{x}=x+N(0,1)
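Both corruption methods can be sketched as follows. The function names mask_corrupt and gaussian_corrupt are hypothetical, and the all-ones input is made up so the effect of masking is easy to see.

```python
import numpy as np

rng = np.random.default_rng(0)


def mask_corrupt(x, q, rng):
    # Each entry is set to 0 with probability q, otherwise kept unchanged
    keep = rng.random(x.shape) >= q
    return x * keep


def gaussian_corrupt(x, rng, std=1.0):
    # Adds zero-mean Gaussian noise; std=1.0 corresponds to N(0, 1)
    return x + rng.normal(loc=0.0, scale=std, size=x.shape)


x = np.ones((1000, 20))                     # made-up clean input
x_masked = mask_corrupt(x, q=0.3, rng=rng)  # about 30% of entries become 0
x_noisy = gaussian_corrupt(x, rng)

print(np.mean(x_masked == 0))  # close to q
```

During training the corrupted array goes into the encoder, while the loss compares the reconstruction with the clean x.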
Why Denoising Helps?
Because the input is corrupted, the model cannot simply memorize and copy it;
it must learn the actual structure and patterns of the data to recover the clean input.
Advantages of DAE
1. Robust feature learning
2. Better generalization
3. Noise removal
4. Improved feature extraction
Applications of DAE
1. Image denoising
2. Speech enhancement
3. Music separation
4. Pattern recognition
Important Observation
As corruption increases:
● filters become more meaningful
● but too much corruption lowers reconstruction quality.
9. Sparse Autoencoder
Sparse AE forces neurons to remain inactive most of the time.
Idea
Neuron activation should mostly remain near 0.
Average Activation
\hat{\rho}_l=\frac{1}{m}\sum_{i=1}^{m}h_l(x_i)
Where:
● (\hat{\rho}_l) = average activation of hidden neuron l over the m training examples
Sparsity Constraint
Desired sparsity:
\rho \approx 0
Typically:
\rho = 0.005
Sparsity Penalty
\Omega(\theta)=\sum_{l=1}^{k}\left[\rho\log\frac{\rho}{\hat{\rho}_l}+(1-\rho)\log\frac{1-\rho}{1-\hat{\rho}_l}\right]
Here k is the number of hidden neurons.
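A minimal sketch of the sparsity penalty. It assumes hidden activations lie in (0, 1), for example from a sigmoid; the function name sparsity_penalty and the small clipping constant eps (added only to avoid log of zero) are not from the notes.

```python
import numpy as np


def sparsity_penalty(H, rho=0.005, eps=1e-8):
    """KL-style penalty summed over the k hidden units.

    H: hidden activations, shape (m examples, k hidden units), values in (0, 1).
    """
    rho_hat = H.mean(axis=0)                  # average activation per hidden unit
    rho_hat = np.clip(rho_hat, eps, 1 - eps)  # numerical safety, an added assumption
    return np.sum(rho * np.log(rho / rho_hat)
                  + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))


H_sparse = np.full((4, 3), 0.005)  # units already at the target sparsity
H_dense = np.full((4, 3), 0.5)     # units active half the time

print(sparsity_penalty(H_sparse))  # approximately 0
print(sparsity_penalty(H_dense))   # clearly positive
```

The penalty is zero when every average activation equals \rho and grows as the activations move away from it.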
Advantages of Sparse AE
1. Learns meaningful features
2. Better representation learning
3. Avoids trivial copying
10. Variational Autoencoder (VAE)
VAE is an advanced generative autoencoder.
Difference:
Instead of mapping input to fixed vector,
VAE maps input to:
Probability Distribution
Goal of VAE
1. Learn latent distribution
2. Generate new samples
Architecture of VAE
1. Encoder
2. Latent Distribution
3. Sampling
4. Decoder
Encoder in VAE
Encoder predicts:
● Mean (\mu)
● Variance (\Sigma)
of latent distribution.
Assumption
The prior over the latent variables is assumed to be a standard Gaussian distribution:
P(z)=N(0,I)
Latent Variable
z is sampled from the learned distribution Q_\theta(z|x), a Gaussian with the mean and variance predicted by the encoder.
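One common way to draw z is to sample standard Gaussian noise and then shift and scale it, often called the reparameterization trick. The sketch below assumes a diagonal covariance and an encoder that outputs the log-variance; both are assumptions, and the values of mu and log_var are made up.

```python
import numpy as np

rng = np.random.default_rng(0)


def sample_latent(mu, log_var, rng):
    # z = mu + sigma * eps with eps ~ N(0, I); assumes diagonal covariance
    sigma = np.exp(0.5 * log_var)
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps


# Made-up encoder outputs for one input: mean and log-variance of a 2-D latent
mu = np.array([1.0, -2.0])
log_var = np.log(np.array([0.25, 4.0]))

# Draw many samples at once by repeating the same mean and log-variance
n = 20000
samples = sample_latent(np.tile(mu, (n, 1)), np.tile(log_var, (n, 1)), rng)

print(samples.mean(axis=0))  # close to mu
print(samples.var(axis=0))   # close to [0.25, 4.0]
```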
Decoder in VAE
Decoder generates the reconstructed sample from z.
VAE Objective
Learn:
1. Compression
2. Generation
Loss Function of VAE
Two parts:
1. Reconstruction Loss
2. KL Divergence Loss
VAE Loss Function
L_i(\theta,\phi)=-E_{z\sim Q_\theta(z|x_i)}[\log P_\phi(x_i|z)]+KL(Q_\theta(z|x_i)||P(z))
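A sketch of how the two terms might be computed for one example. It assumes a diagonal Gaussian Q_\theta(z|x_i), the prior P(z)=N(0,I), a Gaussian decoder with unit variance (so the reconstruction term reduces to squared error, up to a constant) and a single sample of z to estimate the expectation. The function names are hypothetical.

```python
import numpy as np


def kl_to_standard_normal(mu, log_var):
    # Closed form of KL( N(mu, diag(sigma^2)) || N(0, I) )
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)


def vae_loss(x, x_hat, mu, log_var):
    # Reconstruction term: squared error, which matches -log P(x|z) up to a
    # constant under the assumed unit-variance Gaussian decoder
    reconstruction = 0.5 * np.sum((x_hat - x) ** 2)
    return reconstruction + kl_to_standard_normal(mu, log_var)


# Made-up values for a single example
x = np.array([1.0, 2.0])
x_hat = np.array([1.0, 3.0])
mu = np.array([1.0, 0.0])
log_var = np.zeros(2)  # variance 1 in each latent dimension

print(kl_to_standard_normal(mu, log_var))  # 0.5
print(vae_loss(x, x_hat, mu, log_var))     # 0.5 + 0.5 = 1.0
```

The KL term is zero only when the encoder outputs mean 0 and variance 1, which is the prior itself.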
KL Divergence
Measures difference between:
● learned distribution Q_\theta(z|x_i)
● prior distribution P(z)
Why KL Divergence?
Prevents encoder from memorizing each input separately.
Acts as:
Regularizer
Without Reconstruction Loss
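The KL term alone pushes every learned distribution toward the prior, so z carries little information about the input and reconstructions are poor.
Without KL Divergence
Model may cheat by learning very narrow distributions (variance near zero) far from the prior, so it behaves like an ordinary autoencoder and samples drawn from the prior decode poorly.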
With Both Terms
Model:
● reconstructs correctly
● learns smooth latent space
Applications of VAE
1. Image generation
2. Face generation
3. Data augmentation
4. Drug discovery
5. Anomaly detection
Autoencoder vs Variational Autoencoder
Autoencoder                            | Variational Autoencoder
Learns fixed encoding                  | Learns probability distribution
Mainly compression                     | Compression + generation
Deterministic                          | Probabilistic
Cannot generate diverse samples easily | Can generate new samples
Simpler architecture                   | More complex
Advantages of Autoencoders
1. Dimensionality reduction
2. Noise removal
3. Feature learning
4. Data compression
Applications of Autoencoders
1. Image compression
2. Recommendation systems
3. Medical imaging
4. Fraud detection
5. Feature extraction
EXAM READY 5 MARKER ANSWERS
Q1. Explain Autoencoder Architecture.
Answer
An autoencoder is a feedforward neural network used to reconstruct input data.
Architecture contains:
1. Encoder
2. Hidden layer
3. Decoder
Encoder compresses input:
h=g(Wx+b)
Decoder reconstructs input:
\hat{x}=f(W^{*}h+c)
The network minimizes reconstruction loss between input and output.
Applications:
● compression
● denoising
● feature learning
Q2. Differentiate Undercomplete and Overcomplete Autoencoders.
Answer
Undercomplete AE               | Overcomplete AE
\dim(h) < \dim(x)              | \dim(h) \ge \dim(x)
Learns compressed features     | May learn identity mapping
Better representation learning | Needs regularization
Similar to PCA                 | Risk of overfitting
Without regularization, undercomplete autoencoders are generally more useful for feature extraction, because an overcomplete one can simply copy its input.
Q3. Explain Denoising Autoencoder.
Answer
A denoising autoencoder reconstructs clean data from noisy input.
Steps:
1. Add noise to input.
2. Feed noisy data to encoder.
3. Decoder reconstructs original data.
Noise example:
\tilde{x}=x+N(0,1)
Advantages:
1. Robust learning
2. Better generalization
3. Noise removal
Applications:
● image denoising
● speech enhancement
Q4. Explain Sparse Autoencoder.
Answer
Sparse autoencoder forces neurons to remain inactive most of the time.
Goal:
maintain sparse hidden representation.
Average activation:
\hat{\rho}_l=\frac{1}{m}\sum_{i=1}^{m}h_l(x_i)
Sparsity penalty is added to loss function.
Advantages:
1. Meaningful features
2. Better feature extraction
3. Prevents trivial copying
Q5. Explain Variational Autoencoder (VAE).
Answer
VAE is a generative model that learns probability distributions instead of fixed
encodings.
Components:
1. Encoder
2. Latent distribution
3. Sampling
4. Decoder
Encoder predicts:
● mean
● variance
Loss function:
L_i(\theta,\phi)=-E_{z\sim Q_\theta(z|x_i)}[\log P_\phi(x_i|z)]+KL(Q_\theta(z|x_i)||P(z))
Advantages:
1. Generates new samples
2. Smooth latent space
3. Strong generative capability
Applications:
● image generation
● anomaly detection
Q6. Explain Regularization in Autoencoders.
Answer
Regularization prevents overfitting and identity mapping.
Methods:
1. L2 regularization
2. Tied weights
3. Denoising
4. Sparsity constraints
L2 regularization:
L=\frac{1}{m}\sum_{i=1}^{m}(\hat{x}_i-x_i)^2+\lambda ||\theta||^2
Advantages:
1. Better generalization
2. Reduced overfitting
3. Improved feature learning
Q7. Explain Loss Function of VAE.
Answer
VAE loss contains two terms:
1. Reconstruction Loss
2. KL Divergence
Loss function:
L_i(\theta,\phi)=-E_{z\sim Q_\theta(z|x_i)}[\log P_\phi(x_i|z)]+KL(Q_\theta(z|x_i)||P(z))
Reconstruction loss:
ensures accurate output reconstruction.
KL divergence:
ensures latent distribution remains close to normal distribution.
Together they help generate meaningful and diverse samples.