Structure of an Autoencoder
An autoencoder is a neural network that is trained to attempt to
copy its input to its output.
It has a hidden layer h that describes a code used to represent the
input.
The network may be viewed as consisting of two parts:
An encoder function h = f(x)
A decoder that produces a reconstruction r = g(h).
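The two parts can be sketched in a few lines of NumPy. This is only an illustration: the layer sizes, the tanh activation and the random (untrained) weights are assumptions made for the example, not something the slides specify.

```python
import numpy as np

rng = np.random.default_rng(0)

n_inputs, n_code = 8, 3   # hypothetical sizes, chosen for illustration

# Untrained, randomly initialised weights (assumption: one dense layer each).
W_enc = rng.normal(scale=0.1, size=(n_code, n_inputs))
b_enc = np.zeros(n_code)
W_dec = rng.normal(scale=0.1, size=(n_inputs, n_code))
b_dec = np.zeros(n_inputs)

def f(x):
    """Encoder: maps an input x to the code h = f(x)."""
    return np.tanh(W_enc @ x + b_enc)

def g(h):
    """Decoder: maps a code h to the reconstruction r = g(h)."""
    return W_dec @ h + b_dec

x = rng.normal(size=n_inputs)
h = f(x)                                      # code
r = g(h)                                      # reconstruction
reconstruction_error = np.mean((x - r) ** 2)  # what training would minimise
```

Training would adjust the four weight arrays so that the reconstruction error becomes small on the training data.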
Unsupervised Learning
• Data: no labels!
• Goal: Learn the structure of the data
• Traditionally, autoencoders were used for dimensionality
reduction or feature learning.
Autoencoders
An autoencoder is a type of neural network architecture designed
to efficiently compress (encode) input data down to its essential
features, then reconstruct (decode) the original input from this
compressed representation
An autoencoder typically consists of three blocks
Encoder layer to compress the input data into a compressed
representation.
Bottleneck layer or code to represent the compressed input.
Decoder layer to reconstruct the encoded image back to the
original dimension
Autoencoders are designed to learn a lower-dimensional
representation of higher-dimensional data. Applications include
1. Dimensionality Reduction 2. Feature Extraction 3. Image
Denoising 4. Image Compression 5. Image Search 6. Anomaly
Detection 7. Missing Value Imputation
Too much Capacity
• An autoencoder with a one-dimensional code but a very powerful
nonlinear encoder could learn to represent each training example
x(i) with the code i.
• The decoder could learn to map these integer indices back to the
values of specific training examples.
• An autoencoder trained to perform the copying task can fail to
learn anything useful about the dataset if the capacity of the
autoencoder is allowed to become too great.
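The failure mode above can be imitated with a lookup table. In this sketch the "encoder" and "decoder" are hand-written functions rather than trained networks (an assumption made to keep the example short), and the data are random numbers.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(5, 4))   # 5 hypothetical training examples, 4 features

def encode(x):
    """Over-powerful encoder: returns the index i of the training example equal to x."""
    matches = np.where(np.all(np.isclose(X_train, x), axis=1))[0]
    return int(matches[0]) if matches.size else -1   # -1: never seen in training

def decode(i):
    """Decoder as a lookup table from the integer code back to the stored example."""
    return X_train[i] if i >= 0 else np.zeros(X_train.shape[1])

# Perfect copying on the training set ...
train_error = max(np.abs(decode(encode(x)) - x).max() for x in X_train)

# ... but nothing learned about the data: an unseen input is not reconstructed.
x_new = rng.normal(size=4)
new_error = np.abs(decode(encode(x_new)) - x_new).max()
```

The one-dimensional code carries no information about the structure of the inputs, only their position in the training set.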
Autoencoders- Architecture
An autoencoder has a structure very similar to a feedforward neural
network, the primary difference being that the number of neurons
in the output layer is equal to the number of inputs.
What does an autoencoder learn?
Learning g(f(x)) = x everywhere is not useful.
Autoencoders are designed to be unable to copy perfectly.
Autoencoders learn useful properties of the data: they can prioritize
which aspects of the input should be copied.
They can learn stochastic mappings: they go beyond deterministic
functions to mappings p_encoder(h|x) and p_decoder(x|h).
Autoencoders- Training
There are 4 hyperparameters that we need to set when training an
autoencoder:
1. Code size: number of nodes in the middle layer. Choosing a size
smaller than the input dimension results in compression. The smaller
the code size, the higher the compression.
2. Number of layers: the autoencoder can be as deep as we like. In
the figure below we have 2 layers in both the encoder and
decoder, without considering the input and output.
3. Number of nodes per layer: the standard autoencoder
architecture we have shown in examples is called a stacked
autoencoder since the layers are stacked one after another. Usually
stacked autoencoders look like a “sandwich”, with the number of
nodes per layer decreasing with each subsequent layer of the
encoder and increasing back in the decoder. Also, the decoder is
symmetric to the encoder in terms of layer structure. This is not
necessary, and we have total control over these parameters.
4. Loss function: Usual choices are the mean squared error (MSE)
or binary crossentropy. If the input values are in the range [0, 1]
then we typically use crossentropy, otherwise we use the mean
squared error.
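The four hyperparameters can be made concrete with a small sketch. The helper name `sandwich_layer_sizes` and the sizes 784, 128, 64 and 32 are hypothetical choices for illustration; the two loss functions are the standard definitions of MSE and binary cross-entropy.

```python
import numpy as np

def sandwich_layer_sizes(n_inputs, encoder_hidden, code_size):
    """Layer widths of a symmetric stacked autoencoder, from input to output."""
    return [n_inputs, *encoder_hidden, code_size, *reversed(encoder_hidden), n_inputs]

def mse(x, r):
    """Mean squared error between input x and reconstruction r."""
    return float(np.mean((x - r) ** 2))

def binary_crossentropy(x, r, eps=1e-7):
    """Binary cross-entropy; assumes x and r both lie in [0, 1]."""
    r = np.clip(r, eps, 1.0 - eps)   # avoid log(0)
    return float(-np.mean(x * np.log(r) + (1.0 - x) * np.log(1.0 - r)))

# Code size 32, two hidden layers per side, widths shrinking then growing.
sizes = sandwich_layer_sizes(784, [128, 64], 32)

# Both losses on the same toy input/reconstruction pair with values in [0, 1].
x = np.array([0.0, 1.0, 1.0, 0.0])
r = np.array([0.1, 0.9, 0.8, 0.2])
losses = {"mse": mse(x, r), "bce": binary_crossentropy(x, r)}
```

Here `sizes` is `[784, 128, 64, 32, 64, 128, 784]`: the output width always matches the input width, while everything in between is a free design choice.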
Overcomplete Autoencoders
• One way to obtain useful features from the autoencoder is to
constrain h to have smaller dimension than x (the undercomplete case).
• An autoencoder whose code dimension is bigger than the input
dimension is called overcomplete.
• For overcomplete autoencoders, even a linear encoder and linear
decoder can learn to copy the input to the output without learning
anything useful about the data distribution.
• A regularized autoencoder can be nonlinear and overcomplete but
still learn something useful about the data distribution even if the
model capacity is great enough to learn a trivial identity function.
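A minimal sketch of the trivial solution available to a linear overcomplete autoencoder. The weights are written by hand rather than learned, and the sizes 4 and 6 are arbitrary assumptions.

```python
import numpy as np

n_inputs, n_code = 4, 6   # the code is larger than the input: overcomplete

# Encoder copies x into the first n_inputs code units and leaves the rest at zero.
W_enc = np.zeros((n_code, n_inputs))
W_enc[:n_inputs, :] = np.eye(n_inputs)

# Decoder reads the same units back out.
W_dec = W_enc.T

rng = np.random.default_rng(0)
x = rng.normal(size=n_inputs)
r = W_dec @ (W_enc @ x)   # reconstruction equals x for every possible input
```

Reconstruction is perfect for any x, yet the weights encode nothing about the data distribution, which is why overcomplete models need some form of regularization.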
Autoencoders vs Encoder-decoders
Though all autoencoders include both an encoder and a decoder,
not all encoder-decoder models are autoencoders.
Encoder-decoder frameworks are used in a variety of deep learning
models to extract (encode) features of the input data and pass the
extracted features to the decoder for tasks such as classification
or segmentation. In applications of such encoder-decoder models,
the output of the neural network is different from its input.
For example, in segmentation models like U-Net, the encoder
extracts features from the input to determine pixel classification;
using the feature map and those pixel-wise classifications, the
decoder then constructs segmentation masks for each object in the
image. The goal of these encoder-decoder models is to label pixels
by their semantic class. They are trained via supervised learning,
optimizing the model’s predictions against ground truth images.
Autoencoders are flexible neural networks that can be customized
for various tasks. They come in different forms, each with unique
strengths and limitations.
Vanilla Autoencoders: Basic autoencoders that efficiently encode
and decode data.
Denoising Autoencoders: Improved robustness to noise and
irrelevant information.
Sparse Autoencoders: Learn more compact and efficient data
representations.
Contractive Autoencoders: Generate representations less
sensitive to minor data variations.
Variational Autoencoders: Generate new data points that
resemble, in some form, the training data.
The choice of autoencoder depends on the specific task.
INTERPRETATION --UNDERCOMPLETE
An autoencoder is optimized to perform as close to perfect
reconstruction as possible.
In many applications, the goal is to create a reduced set of codings
that adequately represents the inputs X ∈ X. Consequently, we
constrain the hidden layers so that the number of neurons is less
than the number of inputs.
An autoencoder whose internal representation has a smaller
dimensionality than the input data is an undercomplete
autoencoder.
In an undercomplete autoencoder, the compression of the hidden
layers forces the autoencoder to capture the most dominant
features of the input data, and the representation of these signals
is captured in the codings.
Remark: When the autoencoder uses only linear activation
functions and the loss function is MSE, then the autoencoder learns
to span the same subspace as Principal Component Analysis (PCA).
When nonlinear activation functions are used, autoencoders
provide nonlinear generalizations of PCA.
The reduced codings we extract using an undercomplete autoencoder
are sometimes referred to as deep features (DF), and they are similar
in nature to the principal components for PCA.
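The PCA remark can be illustrated without training anything. The sketch below builds the PCA solution from an SVD and shows that a linear autoencoder whose codes are an invertible mixture of the principal components gives exactly the same reconstruction: the subspace is shared, the basis is not. The synthetic data and the mixing matrix A are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # synthetic correlated data
X = X - X.mean(axis=0)                                     # PCA assumes centered data

k = 2
# Rows of Vt are principal directions, ordered by decreasing singular value.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
V_k = Vt[:k].T                      # (5, k) basis of the top-k principal subspace

# PCA viewed as a linear autoencoder: encoder V_k^T, decoder V_k.
recon_pca = (X @ V_k) @ V_k.T

# Any invertible mixing A of the codes yields another linear autoencoder
# with the same reconstruction (same subspace, different basis).
A = np.array([[2.0, 1.0],
              [1.0, 1.0]])          # hypothetical invertible matrix, det = 1
W_enc = V_k @ A                     # encoder weights, applied as X @ W_enc
W_dec = np.linalg.inv(A) @ V_k.T    # decoder weights
recon_mixed = (X @ W_enc) @ W_dec

codes_differ = not np.allclose(X @ V_k, X @ W_enc)
```

This is why the remark says the linear autoencoder spans the same subspace as PCA rather than recovering the principal components themselves.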
Interpretation example
Example. The following example demonstrates an implementation
of a basic undercomplete autoencoder with three fully connected
hidden layers that we apply to find a reduced representation of the
MNIST dataset.
The code layer has only two codings. This reduces
784 features down to two dimensions; although not very realistic, it
allows us to visualize the results and gain some intuition on the
algorithm.
We project the MNIST response variable onto the reduced feature
space and compare our autoencoder to PCA
The figure shows that the nonlinear dimensionality reduction of the
autoencoder can help to isolate the signals in the features better
than PCA.
Autoencoder
The goal of the bottleneck is to prevent the autoencoder from
overfitting to its training data
Without sufficiently limiting the capacity of the bottleneck, the
network tends toward learning the identity function between the
input and output: in other words, it may learn to minimize
reconstruction loss by simply copying the input directly. By forcing
the data to be significantly compressed, the neural network must
learn to retain only the features most essential to reconstruction
If the encoder and decoder have a high enough capacity (that is, if
they are processing large or complex data inputs), then the
autoencoder (even with a bottleneck) may still learn the identity
function anyway, making it useless.
This makes undercomplete autoencoders inflexible and limits their
capacity.
Autoencoder failings
Autoencoders may fail to learn anything useful in the following
cases.
1. Hidden code h has dimension equal to input X.
2. Even in the case of an undercomplete autoencoder, the capacity
of the encoder/decoder is too high (capacity is controlled by depth).
3. Overcomplete case: hidden code h has dimension greater than
input X.
Notice that even in the overcomplete case the autoencoder may
learn useful features from the data.
Denoising Autoencoders
The Denoising AutoEncoder (DAE) is a stochastic version of the
autoencoder in which we train the autoencoder to reconstruct the
input from a corrupted copy of the inputs.
This forces the codings to learn more robust features of the inputs
and prevents them from merely learning the identity function, even
if the number of codings is greater than the number of inputs.
We can think of a denoising autoencoder as having two objectives:
1. try to encode the inputs to preserve the essential signals;
2. try to undo the effects of a corruption process stochastically
applied to the inputs of the autoencoder.
The latter can only be done by capturing the statistical
dependencies between the inputs. Combined, this denoising
procedure allows us to implicitly learn useful properties of the
inputs
The corruption process typically follows one of the following
approaches:
• additive Gaussian noise;
• masking noise: a fraction of the input is randomly chosen and set
to 0; this can be done by manually imputing zeros or ones into the
inputs or adding a dropout layer between the inputs and first
hidden layer;
• salt-and-pepper noise: a fraction of the input is randomly chosen
and randomly set to its minimum or maximum value.
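The three corruption processes can be sketched as follows. The function names and parameter values are hypothetical, and each entry is corrupted independently with probability `fraction`, so the corrupted fraction holds only in expectation (an assumption; choosing an exact count of entries would work equally well).

```python
import numpy as np

def gaussian_noise(x, sigma, rng):
    """Additive Gaussian noise with standard deviation sigma."""
    return x + rng.normal(scale=sigma, size=x.shape)

def masking_noise(x, fraction, rng):
    """Set each entry to 0 independently with probability `fraction`."""
    mask = rng.random(x.shape) < fraction
    return np.where(mask, 0.0, x)

def salt_and_pepper_noise(x, fraction, rng):
    """Set each chosen entry to the minimum or maximum of x, with equal probability."""
    corrupted = x.copy()
    mask = rng.random(x.shape) < fraction
    salt = rng.random(x.shape) < 0.5
    corrupted[mask & salt] = x.max()
    corrupted[mask & ~salt] = x.min()
    return corrupted

rng = np.random.default_rng(0)
x = rng.random((4, 6))              # toy inputs with values in [0, 1)
x_gauss = gaussian_noise(x, 0.1, rng)
x_masked = masking_noise(x, 0.3, rng)
x_sp = salt_and_pepper_noise(x, 0.3, rng)
```

In a denoising setup any of these corrupted arrays would be fed to the encoder, while the clean `x` remains the reconstruction target.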
Training a denoising autoencoder is nearly the same process as
training a regular autoencoder.
The only difference is that we supply the corrupted inputs as the
training inputs and the non-corrupted inputs as the ground truth.
Formally, the DAE is associated with a different loss function than
a "vanilla" autoencoder.
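The difference between the two losses can be written out as below. The symbols are assumptions layered on the notation used earlier (encoder f, decoder g): L is a per-example reconstruction loss such as MSE, x̃ is the corrupted input, and C(x̃ | x) is the corruption process.

```latex
L_{\text{vanilla}} = \mathbb{E}_{x}\, L\big(x,\; g(f(x))\big)
\qquad\qquad
L_{\text{DAE}} = \mathbb{E}_{x}\, \mathbb{E}_{\tilde{x} \sim C(\tilde{x} \mid x)}\, L\big(x,\; g(f(\tilde{x}))\big)
```

In both cases the target is the clean x; only the argument of the encoder changes.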
Regularized Autoencoders
Regularized autoencoders address the shortcomings of
undercomplete autoencoders by introducing regularization
Various techniques exist to prevent autoencoders from learning
the identity function, to reduce overfitting and to improve their
ability to learn useful features or functions
Denoising Autoencoders
Sparse Autoencoders
Contractive Autoencoders
Stochastic Encoders and Decoders
Data concentrates around a low-dimensional manifold or a small set
of such manifolds.
A manifold is a topological space (which may also be a separated
space) that locally resembles real n-dimensional space. For
example, the real coordinate space R^n is the prototypical n-manifold,
and a circle is a compact 1-manifold.
Autoencoders take this idea further and aim to learn the structure
of the manifold.
An important characterization of a manifold is the set of its tangent
planes. At a point x on a d-dimensional manifold, the tangent plane
is given by d basis vectors that span the local directions of variation
allowed on the manifold
• The reconstruction x̂ is computed from the corrupted input x̃.
• The loss function compares the reconstruction x̂ with the noiseless x.
• The autoencoder cannot fully trust each feature of x̃
independently, so it must learn the correlations among x's features.
• Based on those relations, the model becomes more robust, that is,
less prone to changes in the input.
• We are forcing the hidden layer to learn a generalized structure of
the data.
Denoising Autoencoders – process
Sparse Autoencoders
An autoencoder that has been regularized to be sparse must
respond to unique statistical features of the dataset it has been
trained on, rather than simply acting as an identity function.
• In this way, training to perform the copying task with a sparsity
penalty can yield a model that has learned useful features
• One way to achieve actual zeros in h for sparse autoencoders is to
use rectified linear units to produce the code layer.
• With a prior that actually pushes the representations to zero,
such as the L1 penalty λ · ∥h∥1 on the code h, one can indirectly
control the average number of zeros in the representation.
• Limit the capacity of the autoencoder by adding a term to the cost
function penalizing the code for being large
• Special case of variational autoencoder
• Probabilistic model
• Laplace prior corresponds to L1 sparsity penalty
• Dirac variational posterior
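A sketch of the penalized cost for a single-layer sparse autoencoder. It assumes the penalty is applied to the code h (an assumption about what the slide's penalty term denotes), uses a ReLU code layer as suggested above, and takes MSE as the reconstruction term; weights are random and untrained.

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def sparse_autoencoder_loss(x, W_enc, b_enc, W_dec, b_dec, lam):
    """Reconstruction error plus an L1 penalty lam * sum|h_i| on the code."""
    h = relu(W_enc @ x + b_enc)            # ReLU code layer: can output exact zeros
    r = W_dec @ h + b_dec
    reconstruction = np.mean((x - r) ** 2)
    sparsity = lam * np.sum(np.abs(h))     # grows with the size of the code
    return reconstruction + sparsity, h

rng = np.random.default_rng(0)
n_inputs, n_code = 6, 10                   # hypothetical sizes (overcomplete code)
W_enc = rng.normal(size=(n_code, n_inputs))
W_dec = rng.normal(size=(n_inputs, n_code))
b_enc, b_dec = np.zeros(n_code), np.zeros(n_inputs)
x = rng.normal(size=n_inputs)

loss_plain, h = sparse_autoencoder_loss(x, W_enc, b_enc, W_dec, b_dec, lam=0.0)
loss_sparse, _ = sparse_autoencoder_loss(x, W_enc, b_enc, W_dec, b_dec, lam=0.1)
```

Minimizing the penalized loss trades reconstruction quality against the total activation of the code, and λ sets the exchange rate.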
Sparse Autoencoders
Sparse autoencoders are designed to pull out the most influential
feature representations of the input data by using a sparsity
constraint such that only a fraction of the nodes would have
nonzero values.
Since it is impossible to design a neural network with a flexible
number of nodes at its hidden layers, sparse autoencoders work by
penalizing the activation of some neurons in hidden layers. It means
that a penalty directly proportional to the number of neurons
activated is applied to the loss function.
The codes fθ(X) for messages tend to be sparse codes, that is, fθ(X)
is close to zero in most entries. Sparse autoencoders may include
more (rather than fewer) hidden units than inputs, but only a small
number of the hidden units are allowed to be active at the same
time
There are two main ways to enforce sparsity
The k-sparse autoencoder clamps all but the highest-k activations of
the latent code to zero. It does so by inserting a "k-sparse function"
in the latent layer of a standard autoencoder.
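One possible form of the k-sparse function is sketched below. Ranking by the raw activation value (rather than by magnitude) and the arbitrary handling of ties are assumptions; the slides only state that all but the highest-k activations are clamped to zero.

```python
import numpy as np

def k_sparse(h, k):
    """Keep the k largest activations of h and set all others to zero."""
    h = np.asarray(h, dtype=float)
    if k >= h.size:
        return h.copy()
    out = np.zeros_like(h)
    if k <= 0:
        return out
    top = np.argpartition(h, -k)[-k:]   # indices of the k largest entries
    out[top] = h[top]
    return out

code = k_sparse([0.2, 1.5, -0.3, 0.9, 0.1], k=2)
```

In the example only the activations 1.5 and 0.9 survive; the decoder then sees a code with exactly two nonzero entries.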
Single-layer sparse autoencoder
The hidden nodes in bright yellow are activated, while the light
yellow ones are inactive. The activation depends on the input
Sparse Codes: Associative Memory
• The ability to correct faults if false information is given
• To complete information if some parts are missing
• To interpolate information, that means if a pattern is not stored
the most similar stored pattern is determined
--
• The cerebral cortex is a huge associative memory
• or rather a large network of associatively connected
topographical areas
• Associations between patterns are formed by Hebbian learning
--
The patterns are represented by binary sparse vectors
• The presence of a feature is indicated by a one component of the
vector, its absence through a zero component of the vector
• These vectors are always associated in pairs
• This process of the association is called learning
Sparse Codes: Storage Analysis
• We estimate the asymptotic number L of vector pairs (x, y)
which can be stored in an associative memory before it begins to
make mistakes in the retrieval phase.
• It is assumed that both vectors have the same dimension n
• It is also assumed that both vectors are composed of M 1s, which
are equally likely to be in any coordinate of the vector
The weight matrix after learning 20000 test patterns, in which ten
ones were randomly set in a 2000-dimensional vector, represents a
highly loaded matrix with equally distributed weights.
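The storage and retrieval described above can be sketched with a binary associative memory in the style of a Willshaw network. Which exact model the slides refer to is an assumption, and the sizes below are much smaller than the 2000-dimensional example so that the code runs instantly.

```python
import numpy as np

rng = np.random.default_rng(0)
n, M, L = 200, 5, 20    # dimension, ones per vector, stored pairs (hypothetical)

def random_sparse_vector():
    """Binary vector of dimension n with exactly M ones at random positions."""
    v = np.zeros(n, dtype=int)
    v[rng.choice(n, size=M, replace=False)] = 1
    return v

pairs = [(random_sparse_vector(), random_sparse_vector()) for _ in range(L)]

# Hebbian learning with binary weights: a weight becomes 1 once its input
# and output units have been active together in some stored pair.
W = np.zeros((n, n), dtype=int)
for x, y in pairs:
    W |= np.outer(y, x)

def retrieve(x_cue):
    """An output unit fires if it is connected to every active unit of the cue."""
    return (W @ x_cue >= x_cue.sum()).astype(int)

x0, y0 = pairs[0]
y_full = retrieve(x0)

# Completion of missing information: drop one active unit from the cue.
x_partial = x0.copy()
x_partial[np.flatnonzero(x0)[0]] = 0
y_partial = retrieve(x_partial)

load = W.mean()         # fraction of weights set to 1
```

Every stored one of y0 is always recovered; retrieval errors show up as extra ones, and they become more likely as `load` grows with the number of stored pairs.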
Variational autoencoder
Variational AutoEncoders (VAEs) belong to the families of
variational Bayesian methods.
Despite the architectural similarities with basic autoencoders, VAEs
have different goals and a different mathematical formulation.
The fundamental difference between VAEs and other types of
autoencoders is that while most autoencoders learn discrete latent
space models, VAEs learn continuous latent variable models. The
latent space of a VAE is typically composed of a mixture of
distributions instead of a fixed vector.
Given an input dataset x characterized by an unknown probability
function P(x) and a multivariate latent encoding vector z, the
objective is to model the data as a distribution pθ(x), with θ defined
as the set of the network parameters, so that
pθ(x) = ∫ pθ(x, z) dz = ∫ pθ(x | z) pθ(z) dz.
Because the latent variables of the VAE capture attributes as a
probability distribution (that is, the VAE learns the latent
distribution), VAEs are generative models.
By learning to encode important features from the inputs in the
datasets, VAEs can sample points from the latent distribution and
feed them to the decoder to generate new samples that resemble
the original training data
VAEs are useful in applications such as
Image Generation and Synthesis. VAEs allow for image generation
by learning rich latent representations, enabling the creation of high-
quality, diverse, and realistic images. Applications range from
generating art and photorealistic images to enhancing image quality
and data augmentation for computer vision tasks.
Anomaly Detection. In anomaly detection, VAEs excel at learning
the underlying structure of normal data, enabling the identification of
anomalies as deviations from known patterns.
Representation Learning. VAEs facilitate unsupervised learning of
meaningful representations from data. They extract essential features
and capture latent relationships within complex datasets, aiding
downstream tasks like classification, clustering, and recommendation
systems.
Molecular Design. In drug discovery, VAEs assist in generating
novel molecular structures by navigating the chemical space. They aid
in molecule generation, optimization, and de novo molecular design,
accelerating drug development processes
Variational autoencoder- Loss function
One of the key aspects of VAE is the loss function. Most commonly,
it consists of two components:
The reconstruction loss measures how different the
reconstructed data are from the original data. As reconstruction
loss, mean squared error and cross entropy are often used.
The KL-divergence regularizes the latent space and helps keep the
generated data diverse. The KL divergence is
measured between the distribution of latent variables learned by
the VAE (the approximate posterior distribution) and a chosen prior
distribution over the latent variables, typically a standard Gaussian.
The regularized loss function enables VAEs to generate new samples
that resemble the data it was trained on.
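The two-part loss can be sketched as follows. Three assumptions are made that the slides do not state explicitly: the encoder outputs a diagonal Gaussian parameterized by a mean and a log-variance, the prior over the latent variables is a standard normal, and the reconstruction term is a summed squared error. Under those assumptions the KL term has a closed form.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return float(-0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var)))

def vae_loss(x, x_recon, mu, log_var):
    """Reconstruction loss (summed squared error) plus the KL regularizer."""
    reconstruction = float(np.sum((x - x_recon) ** 2))
    return reconstruction + kl_to_standard_normal(mu, log_var)

x = np.array([0.0, 1.0, 0.5])
x_recon = np.array([0.1, 0.9, 0.5])

# Latent distribution equal to the prior: the KL term vanishes.
loss_at_prior = vae_loss(x, x_recon, mu=np.zeros(2), log_var=np.zeros(2))

# Latent mean moved away from the prior: the KL term adds a cost.
loss_shifted = vae_loss(x, x_recon, mu=np.array([1.0, -1.0]), log_var=np.zeros(2))
```

With unit variances the KL term reduces to half the squared norm of the mean, so `loss_shifted` exceeds `loss_at_prior` by exactly 1.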
To draw a latent sample during training, the VAE samples a random
vector ε from the unit Gaussian (in other words, selects a random
starting point from the standard normal distribution), scales it by
the standard deviation of the latent distribution and shifts it by the
mean of the latent distribution: z = µ + σ ⊙ ε.
This process, called the reparameterization trick, avoids direct
sampling from the variational distribution: direct sampling is a
random operation with no derivative, which would block
backpropagation. Moving the randomness into ε keeps z
differentiable with respect to µ and σ, so backpropagation can proceed.
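A minimal NumPy sketch of the reparameterization trick, assuming the usual form z = µ + σ ⊙ ε in which σ is the standard deviation and the encoder outputs a log-variance (both are common conventions, not details given in the slides). NumPy has no automatic differentiation, so the sketch only shows where the randomness enters.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I)."""
    sigma = np.exp(0.5 * log_var)          # standard deviation from log-variance
    eps = rng.standard_normal(mu.shape)    # all randomness lives in eps
    return mu + sigma * eps                # deterministic in mu and sigma

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
log_var = np.array([0.0, np.log(0.25)])    # variances 1 and 0.25

samples = np.stack([reparameterize(mu, log_var, rng) for _ in range(20000)])
sample_mean = samples.mean(axis=0)
sample_std = samples.std(axis=0)
```

The empirical mean and standard deviation of the samples approach µ and σ, which confirms that the shifted and scaled noise follows the intended latent distribution.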
Example. We train a VAE to be used as a generative model for
generating digits. This means that we use the decoder to generate
data similar to the data used to determine the latent space.
We design it using fully-connected encoders and decoders and we
train it using some images from the MNIST dataset.
The input dimension is 784 which is the flattened dimension of
MNIST images (28×28).
In the encoder, the mean µ and variance σ² vectors are our
variational representation vectors.
The final encoder layer outputs the µ and σ² vectors, each of
dimension 2. These continuous vectors define our latent space
distribution, which allows us to sample images in the VAE.
Variational autoencoder- Mathematical formulation