Feedforward Networks – Multilayer Perceptron (MLP)
A feedforward neural network is one of the simplest and most widely used architectures
in deep learning. In this model, information flows strictly in one direction: from the input
layer to one or more hidden layers, and finally to the output layer. There are no cycles or
feedback connections, which makes analysis and implementation comparatively simple. A
Multilayer Perceptron (MLP) is a feedforward network with at least one hidden layer and
nonlinear activation functions.
Conceptually, an MLP is a composition of affine transformations, each followed by a
nonlinearity. Let the input vector be 𝐱 ∈ ℝ^𝑑. A single-hidden-layer MLP with 𝐻 hidden
units and an output layer with 𝐾 units can be written as:
𝐡 = 𝜙(𝑊 (1) 𝐱 + 𝐛 (1) ), 𝐲ˆ = 𝜓(𝑊 (2) 𝐡 + 𝐛 (2) )
Here, 𝑊 (1) ∈ ℝ^(𝐻×𝑑) and 𝑊 (2) ∈ ℝ^(𝐾×𝐻) are weight matrices, 𝐛 (1) ∈ ℝ^𝐻 and
𝐛 (2) ∈ ℝ^𝐾 are bias vectors, 𝜙(⋅) is a hidden layer activation function (e.g., ReLU,
sigmoid, tanh), and 𝜓(⋅) is an output activation (e.g., softmax for multiclass
classification, sigmoid for binary classification).
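The two equations above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming ReLU for 𝜙 and softmax for 𝜓; the sizes d = 4, H = 8, K = 3 and the random weights are arbitrary choices made for the example.

```python
import numpy as np

def relu(z):
    # Hidden activation phi: elementwise max(0, z)
    return np.maximum(0.0, z)

def softmax(z):
    # Output activation psi; subtracting the max avoids overflow in exp
    e = np.exp(z - np.max(z))
    return e / e.sum()

def mlp_forward(x, W1, b1, W2, b2):
    # h = phi(W1 x + b1), y_hat = psi(W2 h + b2)
    h = relu(W1 @ x + b1)
    y_hat = softmax(W2 @ h + b2)
    return h, y_hat

rng = np.random.default_rng(0)
d, H, K = 4, 8, 3                      # illustrative sizes (assumption)
W1, b1 = rng.normal(0.0, 0.5, size=(H, d)), np.zeros(H)
W2, b2 = rng.normal(0.0, 0.5, size=(K, H)), np.zeros(K)
x = rng.normal(size=d)

h, y_hat = mlp_forward(x, W1, b1, W2, b2)
print(h.shape, y_hat.shape, y_hat.sum())
```

The output vector has one entry per class and its entries sum to 1, as expected for a softmax output.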
Architecture and Flow
A typical MLP architecture is:
• Input layer: One node per input feature (e.g., pixels of an image, attributes of a
sample).
• Hidden layer(s): One or more layers with fully connected neurons using nonlinear
activations.
• Output layer: Produces predictions – a single neuron for regression/binary
classification, or multiple neurons (one per class) for multiclass problems.
Information processing proceeds in three steps:
1. Weighted sum at each neuron: 𝑧𝑗 = ∑𝑖 𝑤𝑗𝑖 𝑥𝑖 + 𝑏𝑗
2. Nonlinear activation: ℎ𝑗 = 𝜙(𝑧𝑗 )
3. Propagation: Activated outputs become inputs to the next layer.
Because of the nonlinearity, MLPs can approximate highly complex functions. In fact,
with at least one hidden layer and suitable activation functions, an MLP is a universal
approximator: it can approximate any continuous function on a compact domain to arbitrary
accuracy, given sufficiently many hidden units. The result guarantees that suitable weights
exist, not that training will find them.
Diagram (MLP Architecture)
Think of an MLP diagram as:
• Left column: Input neurons 𝑥1 , 𝑥2 , … , 𝑥𝑑
• Middle columns: One or more hidden layers with neurons $h_1^{(1)}, \dots, h_m^{(L)}$
• Right column: Output neurons 𝑦ˆ1 , … , 𝑦ˆ𝐾
All neurons in one layer are fully connected to neurons in the next layer via directed
edges with weights. There are no connections within a layer or backwards connections.
Important Properties
• Deterministic mapping: For given parameters, the mapping 𝐱 ↦ 𝐲ˆ is deterministic.
• Differentiable: With smooth activations (sigmoid, tanh) the network is differentiable
everywhere; with ReLU, which is piecewise linear, it is differentiable everywhere
except at the kinks, i.e. almost everywhere. In both cases gradient-based training
applies.
• Capacity control: Number of layers and number of units per layer control model
capacity. Too few → underfitting; too many → risk of overfitting without proper
regularization.
Gradient Descent
Gradient Descent (GD) is the fundamental optimization algorithm used to train neural
networks and many other machine learning models. The goal of training is to minimize a
loss function 𝐿(𝜃), where 𝜃 denotes all learnable parameters (weights and biases). GD
updates parameters iteratively in the opposite direction of the gradient of the loss.
Given current parameters 𝜃 (𝑡) , the update rule is:
𝜃 (𝑡+1) = 𝜃 (𝑡) − 𝜂∇𝜃 𝐿(𝜃 (𝑡) )
where 𝜂 > 0 is the learning rate, and ∇𝜃 𝐿 is the gradient of the loss with respect to
parameters.
Intuition
The gradient indicates the direction of steepest increase of the loss. Moving in the
opposite direction reduces the loss. The learning rate determines the step size:
• If 𝜂 is too small, convergence is very slow.
• If 𝜂 is too large, the algorithm may overshoot minima and diverge.
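The effect of the learning rate can be seen on a toy problem. The sketch below assumes the one-dimensional loss L(θ) = θ², whose gradient is 2θ; the function name `gradient_descent` and the three learning rates are illustrative choices, not part of the text above.

```python
def gradient_descent(grad, theta0, lr, steps):
    # Repeatedly apply theta <- theta - lr * grad(theta)
    theta = theta0
    for _ in range(steps):
        theta = theta - lr * grad(theta)
    return theta

# Toy loss L(theta) = theta^2 (assumption); its gradient is 2 * theta
grad = lambda theta: 2.0 * theta

for lr in (0.001, 0.1, 1.1):
    final = gradient_descent(grad, theta0=1.0, lr=lr, steps=50)
    print(f"lr={lr}: |theta| after 50 steps = {abs(final):.3e}")
```

For this loss each step multiplies θ by (1 − 2𝜂), so a tiny 𝜂 barely moves θ, a moderate 𝜂 shrinks it quickly, and any 𝜂 > 1 makes |θ| grow at every step.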
Variants of Gradient Descent
1. Batch Gradient Descent
Uses the entire training set to compute the gradient:
$$\nabla_\theta L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta\, \ell(f_\theta(\mathbf{x}_i), y_i)$$
where 𝑁 is the number of training examples, ℓ is the per-example loss, and
𝑓𝜃 (⋅) is the network output.
2. Stochastic Gradient Descent (SGD)
Uses a single randomly chosen example at each step:
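𝜃 (𝑡+1) = 𝜃 (𝑡) − 𝜂∇𝜃 ℓ(𝑓𝜃 (𝐱𝑖 ), 𝑦𝑖 ), with the gradient evaluated at 𝜃 = 𝜃 (𝑡)
Each update is cheap but noisy. In practice SGD often makes faster progress per unit
of computation than batch GD, and the noise can help it escape shallow local minima.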
3. Mini-batch Gradient Descent
Uses a small batch of examples (e.g., 32/64/128) per step, striking a balance
between efficiency and gradient stability. This is the default in most deep learning
libraries.
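The three variants differ only in how many examples feed each gradient estimate, so one function with a `batch_size` parameter covers all of them. The sketch below is an illustration on synthetic linear-regression data with a mean-squared-error loss; the data, the learning rate, and the name `minibatch_sgd` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 512, 3
X = rng.normal(size=(N, d))
true_w = np.array([2.0, -1.0, 0.5])            # synthetic ground truth (assumption)
y = X @ true_w + 0.01 * rng.normal(size=N)

def minibatch_sgd(X, y, batch_size, lr=0.05, epochs=20, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(n)             # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            residual = X[idx] @ w - y[idx]
            # Gradient of the mean squared error over the current batch
            grad = 2.0 * X[idx].T @ residual / len(idx)
            w -= lr * grad
    return w

w_batch = minibatch_sgd(X, y, batch_size=N)    # batch GD: 1 update per epoch
w_mini = minibatch_sgd(X, y, batch_size=32)    # mini-batch: 16 updates per epoch
w_sgd = minibatch_sgd(X, y, batch_size=1)      # SGD: 512 updates per epoch

for name, w in [("batch", w_batch), ("mini-batch", w_mini), ("sgd", w_sgd)]:
    print(name, np.linalg.norm(w - true_w))
```

With the same number of epochs, the smaller batch sizes perform many more parameter updates, which is why they end up closer to the true weights here.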
Diagram (Loss Surface and Gradient)
Visualize gradient descent as a ball rolling down a curved surface:
• The height of the surface represents the loss.
• The position represents the parameter values.
• The gradient vector points uphill; we step in the opposite direction.
Practical Considerations
• Learning rate scheduling (decay, step, cosine) can improve convergence.
• Momentum-based methods (e.g., SGD with momentum, Adam) accumulate past
gradients to accelerate training and smooth noisy updates.
• Proper initialization and normalization (e.g., batch normalization) interact strongly
with gradient descent performance.
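As an illustration of how accumulating past gradients can speed up training, here is a minimal sketch of SGD with (heavy-ball) momentum on a poorly conditioned quadratic. The velocity form used, the test matrix `A`, and the hyperparameters are assumptions for the example; libraries differ in the exact momentum formulation.

```python
import numpy as np

def sgd_momentum(grad, theta0, lr=0.01, beta=0.9, steps=200):
    theta = np.asarray(theta0, dtype=float).copy()
    velocity = np.zeros_like(theta)
    for _ in range(steps):
        # Velocity is a decaying sum of past gradients
        velocity = beta * velocity + grad(theta)
        theta = theta - lr * velocity
    return theta

# Quadratic loss 0.5 * theta^T A theta with very different curvatures (assumption)
A = np.diag([1.0, 50.0])
grad = lambda theta: A @ theta

plain = sgd_momentum(grad, [1.0, 1.0], beta=0.0)   # beta = 0 is plain GD
heavy = sgd_momentum(grad, [1.0, 1.0], beta=0.9)
print(np.linalg.norm(plain), np.linalg.norm(heavy))
```

Plain gradient descent is held back by the low-curvature direction, while the momentum run gets much closer to the minimum at the origin in the same number of steps.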
Backpropagation
Backpropagation is the algorithm used to efficiently compute the gradients of the
loss function with respect to all parameters in a neural network. It applies the chain rule of
calculus layer by layer, propagating error signals from the output back to earlier layers.
For a network with parameters 𝜃 and loss 𝐿, naive differentiation of each parameter
separately is computationally expensive. Backpropagation exploits the layered structure of
the network to reuse intermediate derivatives.
Forward Pass
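1. Input 𝐱 is passed through each layer 𝑙 = 1, …, 𝐿:
$$\mathbf{a}^{(0)} = \mathbf{x}, \qquad \mathbf{z}^{(l)} = W^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}, \qquad \mathbf{a}^{(l)} = \phi^{(l)}(\mathbf{z}^{(l)})$$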
2. Output 𝐲ˆ = 𝐚(𝐿) is compared with target 𝐲 via a loss 𝐿(𝐲ˆ, 𝐲).
Backward Pass
We compute error signals (often called deltas) starting from the last layer:
𝛿 (𝐿) = ∇𝐳 (𝐿) 𝐿
For each earlier layer 𝑙 = 𝐿−1, …, 1:
𝛿 (𝑙) = (𝑊 (𝑙+1) )⊤ 𝛿 (𝑙+1) ⊙ 𝜙 ′(𝑙) (𝐳 (𝑙) )
where ⊙ denotes element-wise multiplication and 𝜙 ′(𝑙) is the derivative of the
activation.
Once we have 𝛿 (𝑙) , gradients for parameters are:
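$$\frac{\partial L}{\partial W^{(l)}} = \delta^{(l)} (\mathbf{a}^{(l-1)})^{\top}, \qquad \frac{\partial L}{\partial \mathbf{b}^{(l)}} = \delta^{(l)}$$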
These gradients are passed to the optimizer (e.g., gradient descent) to update
weights and biases.
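The forward and backward equations can be checked numerically. The sketch below implements them for a two-layer network, assuming a tanh hidden layer, an identity output, and the loss L = ½‖𝐲ˆ − 𝐲‖²; these choices and the layer sizes are illustrative. It then compares one backpropagated gradient entry with a central finite difference.

```python
import numpy as np

def forward(x, params):
    W1, b1, W2, b2 = params
    z1 = W1 @ x + b1
    a1 = np.tanh(z1)
    z2 = W2 @ a1 + b2
    return a1, z2                      # identity output: y_hat = z2

def loss_fn(x, y, params):
    _, y_hat = forward(x, params)
    return 0.5 * np.sum((y_hat - y) ** 2)

def backward(x, y, params):
    W1, b1, W2, b2 = params
    a1, y_hat = forward(x, params)
    delta2 = y_hat - y                               # dL/dz2 for this loss
    delta1 = (W2.T @ delta2) * (1.0 - a1 ** 2)       # tanh'(z) = 1 - tanh(z)^2
    # dL/dW = delta a_prev^T, dL/db = delta
    return [np.outer(delta1, x), delta1, np.outer(delta2, a1), delta2]

rng = np.random.default_rng(0)
params = [rng.normal(size=(5, 3)), rng.normal(size=5),
          rng.normal(size=(2, 5)), rng.normal(size=2)]
x, y = rng.normal(size=3), rng.normal(size=2)
grads = backward(x, y, params)

# Central finite difference on one entry of W1 as a sanity check
eps = 1e-6
plus = [p.copy() for p in params]
minus = [p.copy() for p in params]
plus[0][1, 2] += eps
minus[0][1, 2] -= eps
numeric = (loss_fn(x, y, plus) - loss_fn(x, y, minus)) / (2 * eps)
analytic = grads[0][1, 2]
print(abs(numeric - analytic))
```

A finite-difference check like this needs two forward passes per parameter, whereas backpropagation obtains all gradients from one forward and one backward pass.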
Diagram (Forward and Backward Flows)
In a diagram:
• Solid arrows from left to right: forward propagation of activations.
• Dashed arrows from right to left: backward propagation of error signals/gradients.
Relationship Between Backpropagation and Gradient Descent
Backpropagation computes gradients; gradient descent (or its variants) uses gradients
to update parameters. They are complementary:
• Backpropagation = differentiation engine.
• Gradient descent = optimization engine.
Empirical Risk Minimization (ERM) and Regularization
In supervised learning, we are given a training dataset:
$$\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$$
We choose a hypothesis class ℋ (e.g., all MLPs with a given architecture) and a
loss function ℓ(𝑦ˆ, 𝑦) (e.g., cross-entropy, MSE). The Empirical Risk Minimization
(ERM) principle says that the learning algorithm should select a hypothesis ℎ ∈ ℋ that
minimizes the empirical risk:
$$R_{\text{emp}}(h) = \frac{1}{N} \sum_{i=1}^{N} \ell(h(\mathbf{x}_i), y_i)$$
In deep learning, ℎ is represented by a neural network 𝑓𝜃 , and ERM amounts to
minimizing the average loss over the training data using gradient-based optimization.
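As a small worked example of the definition, the sketch below evaluates the empirical risk of two hand-written hypotheses under squared-error loss. The four data points and the hypotheses `h_good` and `h_bad` are made up for illustration.

```python
import numpy as np

def empirical_risk(h, loss, X, y):
    # Average per-example loss over the training set
    return float(np.mean([loss(h(x_i), y_i) for x_i, y_i in zip(X, y)]))

squared_error = lambda y_hat, y_true: (y_hat - y_true) ** 2

# Tiny illustrative dataset (assumption): y is roughly equal to x
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.1, 0.9, 2.1, 2.9])

h_good = lambda x: x[0]     # hypothesis h(x) = x
h_bad = lambda x: 0.0       # hypothesis h(x) = 0

risk_good = empirical_risk(h_good, squared_error, X, y)
risk_bad = empirical_risk(h_bad, squared_error, X, y)
print(risk_good, risk_bad)
```

ERM would prefer `h_good` here because its average loss on the training set is lower.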
Overfitting and the Need for Regularization
If the network has very high capacity (many parameters), it can memorize the
training data, achieving near-zero empirical risk but performing poorly on unseen data. This
phenomenon is called overfitting. To combat overfitting, we use regularization, which
constrains the model or modifies the objective to favour simpler hypotheses.
Common Regularization Techniques
1. L2 Regularization (Weight Decay)
Adds a penalty proportional to the squared magnitude of weights:
$$R_{\text{reg}}(\theta) = R_{\text{emp}}(\theta) + \lambda \sum_j w_j^2$$
where 𝜆 > 0 is the regularization coefficient. This discourages large weights and
leads to smoother functions.
2. L1 Regularization
Adds a penalty proportional to the absolute value of weights:
$$R_{\text{reg}}(\theta) = R_{\text{emp}}(\theta) + \lambda \sum_j |w_j|$$
L1 tends to produce sparse models (many weights exactly zero), performing
implicit feature selection.
3. Dropout
During training, randomly “drops” (sets to zero) a fraction of neurons’ activations.
This prevents co-adaptation of features and acts as strong regularization.
4. Early Stopping
Monitor validation loss while training. Stop training when validation performance
starts to degrade (even if training loss is still decreasing). This implements a form
of regularization by limiting effective complexity.
5. Data Augmentation
Enriches the dataset by applying label-preserving transformations (e.g., rotations,
flips for images), reducing overfitting and improving generalization.
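Several of the techniques above reduce to a few lines of code. The sketch below shows the L2 and L1 penalty terms, a dropout mask, and an early-stopping rule. Two details are assumptions not stated above: the dropout uses the common "inverted" scaling by 1/(1 − p) during training so that nothing changes at test time, and early stopping uses a hypothetical `patience` parameter (the number of epochs without improvement that is tolerated).

```python
import numpy as np

def l2_penalty(weights, lam):
    # lam * sum of squared weights over all weight arrays
    return lam * sum(np.sum(W ** 2) for W in weights)

def l1_penalty(weights, lam):
    # lam * sum of absolute weights over all weight arrays
    return lam * sum(np.sum(np.abs(W)) for W in weights)

def dropout(a, p_drop, rng, training=True):
    # Zero each activation with probability p_drop during training only.
    # Inverted scaling (assumption) keeps the expected activation unchanged.
    if not training or p_drop == 0.0:
        return a
    keep = rng.random(a.shape) >= p_drop
    return a * keep / (1.0 - p_drop)

def early_stopping_epoch(val_losses, patience):
    # Returns (epoch at which training stops, epoch with best validation loss)
    best, best_epoch = float("inf"), -1
    for epoch, v in enumerate(val_losses):
        if v < best:
            best, best_epoch = v, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

weights = [np.array([1.0, -2.0])]
print(l2_penalty(weights, 0.1), l1_penalty(weights, 0.1))

rng = np.random.default_rng(0)
dropped = dropout(np.ones(100_000), p_drop=0.5, rng=rng)
print(dropped.mean())

print(early_stopping_epoch([1.0, 0.8, 0.7, 0.72, 0.75, 0.9], patience=2))
```

In a training loop the penalty would be added to the empirical risk before differentiation, and the parameters saved at the best validation epoch would be the ones kept.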
Structural Risk Minimization (SRM)
ERM alone minimizes training error but ignores model complexity. Structural
Risk Minimization (SRM) combines empirical risk with a complexity penalty:
𝑅struct (ℎ) = 𝑅emp (ℎ) + 𝜆 ⋅ Ω(ℎ)
where Ω(ℎ) measures complexity (e.g., norm of weights, VC dimension).
Regularized training of deep networks can be viewed as a practical form of SRM.
Autoencoders
An autoencoder is a neural network trained to reconstruct its input. It is typically
used for unsupervised representation learning, dimensionality reduction, denoising, and
anomaly detection. Instead of predicting external labels, the autoencoder’s target is the
input itself.
Architecture
An autoencoder consists of two main parts:
1. Encoder: Maps input 𝐱 to a lower-dimensional latent representation 𝐳:
𝐳 = 𝑓𝜃 (𝐱)
Typically implemented using one or more layers with decreasing width (a
bottleneck).
2. Decoder: Reconstructs the input from latent vector 𝐳:
𝐱ˆ = 𝑔𝜙 (𝐳)
The network is trained to minimize a reconstruction loss, such as mean squared error:
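$$L(\theta, \phi) = \frac{1}{N} \sum_{i=1}^{N} \lVert \mathbf{x}_i - \hat{\mathbf{x}}_i \rVert^2, \qquad \hat{\mathbf{x}}_i = g_\phi(f_\theta(\mathbf{x}_i))$$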
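To make the encoder, decoder, and reconstruction loss concrete, here is a minimal sketch of a linear autoencoder trained by gradient descent. It assumes purely linear maps with no biases and synthetic 6-dimensional data that lies close to a 2-dimensional subspace; a practical autoencoder would normally use nonlinear layers and a deep learning library.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, k = 200, 6, 2                           # k is the bottleneck size (assumption)
latent = rng.normal(size=(N, k))
mixing = rng.normal(size=(k, d))
X = latent @ mixing + 0.01 * rng.normal(size=(N, d))   # rows are the inputs x_i

W_enc = 0.1 * rng.normal(size=(d, k))         # encoder: z = x W_enc
W_dec = 0.1 * rng.normal(size=(k, d))         # decoder: x_hat = z W_dec

def reconstruction_loss(X, W_enc, W_dec):
    X_hat = (X @ W_enc) @ W_dec
    return float(np.mean(np.sum((X - X_hat) ** 2, axis=1)))

initial_loss = reconstruction_loss(X, W_enc, W_dec)

lr = 0.005
for _ in range(3000):
    Z = X @ W_enc
    X_hat = Z @ W_dec
    err = 2.0 * (X_hat - X) / N               # dL/dX_hat
    grad_dec = Z.T @ err                      # dL/dW_dec
    grad_enc = X.T @ (err @ W_dec.T)          # dL/dW_enc via the chain rule
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

final_loss = reconstruction_loss(X, W_enc, W_dec)
print(initial_loss, final_loss)
```

Because the data has only two dominant directions of variation, a bottleneck of size 2 is enough to reconstruct it with a loss far below the initial value.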
Undercomplete Autoencoder
An undercomplete autoencoder has a latent dimension smaller than the input
dimension. This forces the model to learn a compressed representation that captures the
most salient structure in the data, rather than simply copying inputs. Such representations
can be used as features for downstream tasks (classification, clustering, retrieval).
Diagram (Autoencoder)
• Input layer: 𝐱
• Encoder layers: progressively smaller hidden layers ending in bottleneck 𝐳
• Decoder layers: progressively larger layers reconstructing 𝐱ˆ
Visually, the network looks like an hourglass or “sandglass” architecture, with the
narrowest layer in the centre.
Variants
1. Denoising Autoencoder:
Corrupts the input (e.g., adding noise, masking pixels) and trains the autoencoder to
reconstruct the original clean input. This encourages robust feature learning.
2. Sparse Autoencoder:
Uses sparsity constraints (e.g., L1 penalty on activations) so that only a small subset
of neurons are active for a given input. This leads to more interpretable and localized
features.
3. Variational Autoencoder (VAE):
A probabilistic extension where the encoder outputs parameters of a distribution
over latent variables (mean and variance). VAEs enable generative modelling and
sampling.
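For the denoising variant, the only change to the training objective is that the network sees a corrupted input while the loss is measured against the clean one. The sketch below shows that step in isolation; the function names, the Gaussian noise level, and the masking probability are illustrative assumptions, and `autoencoder` stands for any callable mapping an input array to a reconstruction.

```python
import numpy as np

def corrupt(x, rng, noise_std=0.1, mask_prob=0.2):
    # Add Gaussian noise, then zero out a random subset of entries
    noisy = x + noise_std * rng.normal(size=x.shape)
    keep = rng.random(x.shape) >= mask_prob
    return noisy * keep

def denoising_loss(autoencoder, x_clean, rng):
    # Reconstruct from the corrupted input, compare with the clean input
    x_hat = autoencoder(corrupt(x_clean, rng))
    return float(np.mean((x_hat - x_clean) ** 2))

rng = np.random.default_rng(0)
x = np.ones(1000)
identity = lambda v: v                 # stand-in for a trained network
print(denoising_loss(identity, x, rng))
```

An identity mapping incurs a nonzero loss here, which is the point of the corruption: simply copying the input no longer solves the task.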
Importance
• Provides a powerful framework for unsupervised pretraining of deep networks.
• Learns compact, meaningful latent representations useful for visualization and
downstream tasks.
• Forms a building block for more advanced generative models.
Deep Neural Networks: Difficulty of Training
As we increase the number of layers in a neural network, we obtain a deep neural
network. Deep models can represent extremely complex functions and hierarchical
features, but they are hard to train effectively.
Major Challenges
1. Vanishing and Exploding Gradients
During backpropagation, gradients are repeatedly multiplied by weight matrices and
derivatives of activation functions. The sigmoid derivative lies in (0, 0.25] and the
tanh derivative in (0, 1], and both approach 0 as a unit saturates. Multiplying many
such terms causes gradients to shrink
exponentially, leading to vanishing gradients in early layers. As a result, weights
in the first layers receive almost no updates and fail to learn. Conversely, if
derivatives or weights are large, gradients can blow up, causing exploding
gradients.
2. Poor Conditioning and Local Minima
The optimization landscape of deep networks is highly non-convex with plateaus,
saddle points, and narrow valleys. Gradient descent may get stuck or move very
slowly in such regions.
3. Sensitivity to Initialization
Random initialization can place the model in regions of the parameter space where
gradients are tiny or unstable. Careful initialization schemes (e.g., Xavier/Glorot,
He initialization) are critical for stable training.
4. Overfitting
Deep networks have very high capacity. Without sufficient data and proper
regularization, they can severely overfit the training set.
5. Computational Cost
Many layers and parameters increase computation and memory requirements.
Training deep models demands GPUs/TPUs and optimized implementations.
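The vanishing-gradient effect from item 1 can be observed directly by tracking the norm of the error signal 𝛿 as it is propagated backwards through a deep sigmoid network. The sketch below assumes square layers without biases, Gaussian weights with standard deviation 1/√width, and an upstream gradient of all ones at the output; all of these are illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def delta_norms(depth, width=32, seed=0):
    rng = np.random.default_rng(seed)
    Ws = [rng.normal(0.0, 1.0 / np.sqrt(width), size=(width, width))
          for _ in range(depth)]
    # Forward pass, storing pre-activations z for each layer
    a = rng.normal(size=width)
    zs = []
    for W in Ws:
        z = W @ a
        zs.append(z)
        a = sigmoid(z)
    # Backward pass: start from an all-ones upstream gradient (assumption)
    s = sigmoid(zs[-1])
    delta = np.ones(width) * s * (1.0 - s)
    norms = [np.linalg.norm(delta)]
    for l in range(depth - 2, -1, -1):
        s = sigmoid(zs[l])
        delta = (Ws[l + 1].T @ delta) * s * (1.0 - s)
        norms.append(np.linalg.norm(delta))
    return norms[::-1]          # norms[0] is the first layer, norms[-1] the last

norms = delta_norms(depth=20)
print(f"last layer: {norms[-1]:.3e}, first layer: {norms[0]:.3e}")
```

Each layer multiplies the error signal by a sigmoid derivative of at most 0.25, so after 20 layers the signal reaching the first layer is many orders of magnitude smaller than at the output.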
Mitigation Strategies
• ReLU and Variants: ReLU activation (max(0, 𝑥)) has a derivative of 1 in the
positive regime, helping alleviate vanishing gradients compared to sigmoid/tanh.
• Batch Normalization: Normalizes layer activations, stabilizing distributions across
layers and allowing higher learning rates.
• Residual Connections (ResNets): Shortcut connections that add the input of a
block directly to its output, enabling gradients to flow more easily through many
layers.
• Careful Initialization: Xavier/Glorot for tanh, He initialization for ReLU-based
networks.
• Regularization: Dropout, weight decay, early stopping to control overfitting.
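The pairing of initialization scheme and activation can be illustrated by pushing a random input through a deep ReLU stack and measuring the size of the final activations. The sketch below assumes the common Gaussian forms, variance 2/(fan_in + fan_out) for Xavier/Glorot and 2/fan_in for He, with square layers and no biases; the depth and width are arbitrary.

```python
import numpy as np

def xavier_init(fan_in, fan_out, rng):
    # Glorot/Xavier: variance 2 / (fan_in + fan_out)
    return rng.normal(0.0, np.sqrt(2.0 / (fan_in + fan_out)), size=(fan_out, fan_in))

def he_init(fan_in, fan_out, rng):
    # He: variance 2 / fan_in, intended for ReLU layers
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))

def final_activation_scale(init, depth=30, width=256, seed=0):
    # Root-mean-square activation after `depth` ReLU layers
    rng = np.random.default_rng(seed)
    a = rng.normal(size=width)
    for _ in range(depth):
        a = np.maximum(0.0, init(width, width, rng) @ a)
    return float(np.sqrt(np.mean(a ** 2)))

he_scale = final_activation_scale(he_init)
xavier_scale = final_activation_scale(xavier_init)
print(he_scale, xavier_scale)
```

With ReLU, He initialization keeps the activation scale roughly constant across layers, while the Xavier variance roughly halves the mean squared activation at every layer, so the signal fades with depth.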
Greedy Layer-Wise Training
Before modern optimization and architectural techniques became standard, training
very deep networks directly with gradient descent from random initialization often failed.
A key idea that unlocked deeper architectures was greedy layer-wise training, introduced
by Hinton and co-authors for Deep Belief Networks and later adapted to autoencoders and
other architectures.
Basic Idea
Instead of training all layers simultaneously from scratch, we:
1. Train the first layer to learn a good representation of the input (e.g., as an
autoencoder or Restricted Boltzmann Machine).
2. Freeze the first layer and train the second layer to model the representation produced
by the first layer.
3. Continue this process layer by layer (“greedy” because each step optimizes only the
current layer).
4. After unsupervised pretraining of all layers, stack them to form a deep network and
perform supervised fine-tuning using backpropagation on the full model.
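Steps 1 to 3 can be sketched as a loop that trains one small autoencoder per layer and feeds its frozen hidden activations to the next. The sketch below assumes tanh encoders with linear decoders trained by plain gradient descent on random data; the function names, sizes, and hyperparameters are hypothetical, and the supervised fine-tuning of step 4 is omitted.

```python
import numpy as np

def train_autoencoder_layer(X, n_hidden, lr=0.02, steps=1000, seed=0):
    """Train a one-hidden-layer autoencoder on X; return encoder params and final loss."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W, b = 0.1 * rng.normal(size=(d, n_hidden)), np.zeros(n_hidden)   # encoder
    V, c = 0.1 * rng.normal(size=(n_hidden, d)), np.zeros(d)          # decoder
    for _ in range(steps):
        H = np.tanh(X @ W + b)
        X_hat = H @ V + c
        err = 2.0 * (X_hat - X) / n                 # dL/dX_hat
        dH = (err @ V.T) * (1.0 - H ** 2)           # backprop through tanh
        V -= lr * (H.T @ err)
        c -= lr * err.sum(axis=0)
        W -= lr * (X.T @ dH)
        b -= lr * dH.sum(axis=0)
    X_hat = np.tanh(X @ W + b) @ V + c
    loss = float(np.mean(np.sum((X_hat - X) ** 2, axis=1)))
    return (W, b), loss                             # the decoder is discarded

def greedy_pretrain(X, layer_sizes):
    encoders, inputs = [], X
    for i, n_hidden in enumerate(layer_sizes):
        (W, b), _ = train_autoencoder_layer(inputs, n_hidden, seed=i)
        encoders.append((W, b))
        # Freeze this layer; its activations become the next layer's training data
        inputs = np.tanh(inputs @ W + b)
    return encoders

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 10))
encoders = greedy_pretrain(X, layer_sizes=[8, 4])
print([W.shape for W, _ in encoders])
```

The returned encoder weights would then initialize the hidden layers of the stacked network before an output layer is added and the whole model is fine-tuned on labeled data.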
Why It Helps
• Better Initialization: Pretraining puts network parameters in a region of the
parameter space that already captures structure in the data, making supervised
optimization easier and reducing the risk of bad local minima.
• Information Preservation: Each layer is trained to preserve information about its
input (e.g., autoencoding). This leads to hierarchical representations that retain
useful features.
• Regularization Effect: Acts as a form of regularization, biasing the network
towards feature hierarchies consistent with the unsupervised learning criterion.
Diagram (Layer-Wise Training)
You can visualize the process as:
1. Train layer 1 (input → hidden 1) as an autoencoder or RBM.
2. Use the hidden 1 activations as input to train layer 2 (hidden 1 → hidden 2).
3. Repeat for deeper layers.
4. Add an output layer on top, then fine-tune the whole network using labeled data.
Modern Perspective
With today’s advances (ReLU, batch normalization, residual connections, powerful
optimizers, large labeled datasets), direct end-to-end training of deep networks is often
feasible and standard. However:
• Greedy layer-wise pretraining remains conceptually important and still useful in
low-label or unsupervised settings (e.g., self-supervised learning, representation
learning).
• The general idea of first learning good representations and then fine-tuning for a
specific task continues to influence many modern methods (e.g., pretraining large
models on generic data and fine-tuning on downstream tasks).