UNIT II
LEARNING IN DEEP NETWORKS
Back propagation training, Learning the weights, Chain rule, Stochastic
gradient descent, Sigmoid units and vanishing gradient, Rectified Linear Unit
(ReLU) and its variants - Cross entropy for classification and activation, Batch
learning.
Backpropagation (Backward Propagation of Errors) is a supervised learning algorithm used to train neural
networks by minimizing the loss function.
It computes the gradients of the loss with respect to the weights and biases using the chain rule of calculus;
an optimizer such as gradient descent then uses these gradients to update the parameters iteratively.
Purpose of Backpropagation
• Reduce the difference between predicted output and actual output
• Enable learning in multi-layer (deep) networks
• Optimize model parameters efficiently
Why Backpropagation is Important
• Efficient Weight Update: computes gradients for all parameters in a single backward pass.
• Scalability: works for deep and complex architectures.
• Automated Learning: the network adjusts its weights automatically to reduce error.
• Foundation of Deep Learning: enables training of CNNs, RNNs, Transformers, etc.
Training Process Overview
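• Training with backpropagation consists of two main phases:
• Forward Pass: the input is passed through the network layer by layer to produce a prediction, and the loss is computed.
• Backward Pass: the error is propagated back through the network to compute the gradients.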
Backward Pass (Core of Backpropagation)
• The error is propagated from the output layer back toward the input layer
• Gradients are computed using the chain rule
• Each weight is adjusted based on its contribution to the error
Iterative Learning
• Forward pass → Error calculation → Backward pass → Weight update
• This cycle is repeated over many epochs
• Training continues until the loss stops decreasing or reaches an acceptable value
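The cycle above can be sketched in plain Python for a single sigmoid neuron. This is only an illustrative sketch: the toy dataset (logical OR), the squared-error loss, the learning rate, and the names `w`, `b`, `lr`, and `losses` are assumptions made here, not part of the original notes.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy dataset (assumed for illustration): logical OR
data = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0),
        ((1.0, 0.0), 1.0), ((1.0, 1.0), 1.0)]

w = [0.0, 0.0]   # weights
b = 0.0          # bias
lr = 0.5         # learning rate (assumed value)
losses = []      # total loss recorded once per epoch

for epoch in range(2000):
    grad_w = [0.0, 0.0]
    grad_b = 0.0
    total_loss = 0.0
    for (x1, x2), y in data:
        # Forward pass
        z = w[0] * x1 + w[1] * x2 + b
        a = sigmoid(z)
        # Error calculation (squared error)
        total_loss += 0.5 * (a - y) ** 2
        # Backward pass: dL/dz = dL/da * da/dz
        dz = (a - y) * a * (1.0 - a)
        grad_w[0] += dz * x1
        grad_w[1] += dz * x2
        grad_b += dz
    # Weight update (opposite direction of the gradient)
    w[0] -= lr * grad_w[0]
    w[1] -= lr * grad_w[1]
    b -= lr * grad_b
    losses.append(total_loss)

print("first epoch loss:", losses[0])
print("last epoch loss:", losses[-1])
```

Comparing the first and last entries of `losses` shows whether the repeated cycle is actually reducing the error.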
Advantages                            Challenges / Limitations
1. Easy to implement                  1. Vanishing gradient problem
2. Efficient gradient computation     2. Exploding gradient problem
3. Works with deep networks           3. Overfitting in complex networks
4. Good generalization ability        4. Sensitive to learning rate
5. Scalable to large datasets         5. Requires differentiable functions
• Example of Back Propagation
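As one illustrative example, the sketch below runs a forward and backward pass through a tiny network with one hidden sigmoid unit and one output sigmoid unit, then checks the backpropagated gradients against finite differences. The network shape, the input and target values, the initial weights, and all names are assumptions chosen for this sketch.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w1, w2):
    h = sigmoid(w1 * x)       # hidden activation
    y_hat = sigmoid(w2 * h)   # output activation
    return h, y_hat

def loss(x, y, w1, w2):
    _, y_hat = forward(x, w1, w2)
    return 0.5 * (y_hat - y) ** 2

# Assumed example values
x, y = 1.5, 1.0
w1, w2 = 0.8, -0.4

# Forward pass
h, y_hat = forward(x, w1, w2)

# Backward pass: apply the chain rule from the output back to the input
d_yhat = y_hat - y                      # dL/dy_hat
d_z2 = d_yhat * y_hat * (1.0 - y_hat)   # dL/dz2, where z2 = w2 * h
grad_w2 = d_z2 * h                      # dL/dw2
d_h = d_z2 * w2                         # dL/dh
d_z1 = d_h * h * (1.0 - h)              # dL/dz1, where z1 = w1 * x
grad_w1 = d_z1 * x                      # dL/dw1

# Numerical check with central finite differences
eps = 1e-6
num_w1 = (loss(x, y, w1 + eps, w2) - loss(x, y, w1 - eps, w2)) / (2 * eps)
num_w2 = (loss(x, y, w1, w2 + eps) - loss(x, y, w1, w2 - eps)) / (2 * eps)

print("dL/dw1 backprop vs numeric:", grad_w1, num_w1)
print("dL/dw2 backprop vs numeric:", grad_w2, num_w2)
```

The finite-difference check is a common way to confirm that a hand-written backward pass is correct before using it for training.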
Chain Rule:
• A neural network is a computational graph made of many connected functions.
The chain rule is used to compute how the loss changes with respect to each
weight and bias by multiplying derivatives through these functions.
• Backpropagation is the systematic application of the chain rule to efficiently
calculate gradients, which are used to update the network parameters. Thus, the
chain rule is the core mathematical principle behind learning in deep neural
networks.
• Without the chain rule, the gradients needed to train deep learning models could not be computed efficiently.
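As a small worked instance of the chain rule, consider a single neuron with input x, weight w, bias b, pre-activation z, activation a = σ(z), and a loss L that depends on a. These symbols are chosen here for illustration and are not taken from the original notes.

```latex
\begin{aligned}
z &= w x + b, \qquad a = \sigma(z), \qquad L = L(a) \\
\frac{\partial L}{\partial w}
  &= \frac{\partial L}{\partial a}\cdot\frac{\partial a}{\partial z}\cdot\frac{\partial z}{\partial w}
   = \frac{\partial L}{\partial a}\,\sigma'(z)\,x \\
\frac{\partial L}{\partial b}
  &= \frac{\partial L}{\partial a}\cdot\frac{\partial a}{\partial z}\cdot\frac{\partial z}{\partial b}
   = \frac{\partial L}{\partial a}\,\sigma'(z)
\end{aligned}
```

In a deeper network the same product simply gains one more factor for every layer between the weight and the loss.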
Learning the Weights:
• “Learning the weights” means automatically finding the best values of the
weights and biases of a neural network so that its predictions become accurate.
• Learning the weights is achieved by:
Backpropagation (to compute gradients) + Stochastic Gradient Descent (to
update parameters).
• In short, backpropagation supplies the gradients and SGD uses them to adjust the weights so that the
neural network makes better predictions.
Gradient Descent
• Gradient Descent is an optimization algorithm used in deep learning to
train a neural network by minimizing the loss (error) of the model.
• It helps the model learn by adjusting weights and biases so that predictions
become more accurate.
Why do we need Gradient Descent?
• In deep learning, the model makes predictions and calculates an error using a
loss function.
The goal is to reduce this error as much as possible.
Gradient Descent searches for the parameter values that minimize this error.
How does Gradient Descent work?
• The neural network makes a prediction.
• The error (loss) is calculated.
• The gradient (the slope of the loss with respect to each parameter) is computed using the chain rule of calculus.
• The weights are updated in the opposite direction of the gradient.
• This process repeats until the error becomes very small.
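The steps above can be sketched on a one-parameter problem, using the update w ← w − η · dL/dw. The loss function (w − 3)², the starting point, and the learning rate below are assumptions chosen only to keep the example small.

```python
def loss(w):
    # Assumed toy loss with its minimum at w = 3
    return (w - 3.0) ** 2

def grad(w):
    # Derivative of the toy loss: dL/dw = 2 (w - 3)
    return 2.0 * (w - 3.0)

w = 0.0     # initial weight (assumed)
lr = 0.1    # learning rate, written as eta in formulas (assumed)

for step in range(100):
    # Move against the gradient
    w = w - lr * grad(w)

print("final w:", w)
print("final loss:", loss(w))
```

A learning rate that is too large makes the updates overshoot the minimum, while one that is too small makes the loop converge slowly.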
Batch Gradient Descent
• In Batch Gradient Descent, the gradient is calculated using the entire training dataset.
• How it works
• The model processes all training samples.
• The total loss is computed.
• The gradient is calculated using all data.
• Weights are updated once per epoch.
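A minimal sketch of Batch Gradient Descent on a toy linear regression problem is shown below. The data (points on the line y = 2x + 1), the mean squared error loss, the learning rate, and the variable names are assumptions made for illustration.

```python
# Toy data (assumed for illustration): points on y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]

w, b = 0.0, 0.0
lr = 0.05        # learning rate (assumed value)
n = len(xs)
updates = 0

for epoch in range(2000):
    # Gradient of the mean squared error over the ENTIRE dataset
    grad_w = sum(2.0 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2.0 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    # Exactly one weight update per epoch
    w -= lr * grad_w
    b -= lr * grad_b
    updates += 1

print("w:", w, "b:", b, "updates:", updates)
```

Because every update uses all samples, the number of updates equals the number of epochs.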
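Notation
• η (eta): learning rate
• Δ (uppercase delta): change in a quantity, e.g. Δw is the change applied to a weight
• δ (lowercase delta): commonly used for the error term of a unit in backpropagation
• ∇ (nabla / del operator): gradient
• ∂: partial derivative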
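The symbols above typically appear together in the gradient descent update rule. The form below is the standard textbook notation and is given as an assumption about the formulas these symbols belong to; the weight vector w, the loss L, and the pre-activation z_j of unit j are names introduced here.

```latex
\begin{aligned}
\nabla_{\mathbf{w}} L &= \left( \frac{\partial L}{\partial w_1}, \dots, \frac{\partial L}{\partial w_n} \right) \\
\Delta \mathbf{w} &= -\,\eta\, \nabla_{\mathbf{w}} L \\
\mathbf{w} &\leftarrow \mathbf{w} + \Delta \mathbf{w} \\
\delta_j &= \frac{\partial L}{\partial z_j}
\end{aligned}
```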
Stochastic Gradient Descent (SGD)
• In SGD, the gradient is computed using only one data sample at a time.
• How it works
• Pick one data point.
• Compute loss and gradient.
• Update weights immediately.
• Repeat for all data points.
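The loop above can be sketched as follows. The toy data (points on y = 2x + 1), the squared-error loss, the learning rate, and the fixed random seed are assumptions made for illustration.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Toy data (assumed for illustration): points on y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]

w, b = 0.0, 0.0
lr = 0.02        # learning rate (assumed value)
updates = 0

for epoch in range(500):
    order = list(range(len(xs)))
    random.shuffle(order)            # visit the samples in random order
    for i in order:
        # Pick one data point
        x, y = xs[i], ys[i]
        # Compute the gradient of the squared error for this sample only
        error = w * x + b - y
        grad_w = 2.0 * error * x
        grad_b = 2.0 * error
        # Update the weights immediately
        w -= lr * grad_w
        b -= lr * grad_b
        updates += 1

print("w:", w, "b:", b, "updates:", updates)
```

With 5 samples and 500 epochs this performs 2,500 updates, one per sample per epoch.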
Mini-Batch Gradient Descent
• This is a compromise between Batch GD and SGD:
it uses a small group of samples (mini-batch) for each update.
• Typical mini-batch sizes: 32, 64, 128
• How it works
• Divide dataset into small batches.
• Compute gradient for one batch.
• Update weights.
• Repeat for all batches.
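The mini-batch procedure can be sketched as below. The toy data, the learning rate, and the deliberately tiny `batch_size` of 4 are assumptions chosen so the example stays small; practical batch sizes are larger, as noted above.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Toy data (assumed for illustration): 8 points on y = 2x + 1
xs = [0.5 * i for i in range(8)]
ys = [2.0 * x + 1.0 for x in xs]

w, b = 0.0, 0.0
lr = 0.05         # learning rate (assumed value)
batch_size = 4    # tiny batch size, for illustration only
updates = 0

for epoch in range(2000):
    order = list(range(len(xs)))
    random.shuffle(order)
    # Divide the dataset into small batches
    for start in range(0, len(order), batch_size):
        batch = order[start:start + batch_size]
        # Compute the mean squared error gradient for this batch only
        grad_w = sum(2.0 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
        grad_b = sum(2.0 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
        # Update the weights once per batch
        w -= lr * grad_w
        b -= lr * grad_b
        updates += 1

print("w:", w, "b:", b, "updates:", updates)
```

Each epoch here makes two updates (8 samples in batches of 4), which sits between the single update of Batch GD and the eight updates SGD would make.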
Comparison Table
Feature            Batch GD       SGD         Mini-Batch GD
Data used          Full dataset   1 sample    Small batch
Speed              Slow           Very fast   Fast
Stability          Very stable    Noisy       Balanced
Memory             High           Low         Medium
Used in practice   Rare           Sometimes   Most common
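One way to make the "Data used" row concrete is to count how many weight updates each variant performs in one epoch. The helper `updates_per_epoch`, the dataset size, and the batch size of 64 below are assumptions introduced for this sketch.

```python
import math

def updates_per_epoch(n_samples, batch_size):
    # One update per batch; the last batch may be smaller than batch_size
    return math.ceil(n_samples / batch_size)

n = 100_000  # assumed dataset size

batch_gd = updates_per_epoch(n, n)      # whole dataset in one batch
sgd = updates_per_epoch(n, 1)           # one sample per update
mini_batch = updates_per_epoch(n, 64)   # mini-batches of 64 samples

print("Batch GD:", batch_gd)
print("SGD:", sgd)
print("Mini-Batch GD:", mini_batch)
```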
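Why is Mini-Batch preferred in Deep Learning?
Modern deep learning (CNNs, RNNs, Transformers) uses Mini-Batch Gradient
Descent because it:
• Works well with GPUs, since the samples in a batch can be processed in parallel
• Balances the speed of SGD with the stability of Batch GD
• Handles large datasets that do not fit in memory at once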
Stochastic Gradient Descent (SGD) in Deep Learning
• Stochastic Gradient Descent (SGD) is one of the most important optimization
algorithms used to train neural networks. It updates the model’s weights using
one training example at a time instead of using the whole dataset.
• SGD is widely used in deep learning because it is fast, memory-efficient, and
effective for large datasets.
• Why is it called “Stochastic”?
• The word stochastic means random.
In SGD, one randomly chosen data point is used at a time to compute the gradient and
update the weights.
The updates are therefore noisy, but this noise can help the model escape shallow local minima.
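The noise in the updates can be seen by comparing single-sample gradients with the full-dataset gradient at the same weight. The toy data, the noise level, the one-parameter model, and all names below are assumptions made for this sketch.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Toy data (assumed for illustration): y = 2x + 1 plus Gaussian noise
xs = [i / 10.0 for i in range(50)]
ys = [2.0 * x + 1.0 + random.gauss(0.0, 0.5) for x in xs]

w = 0.0  # current weight (bias omitted to keep the sketch short)

def sample_grad(i):
    # Gradient of the squared error (w * x_i - y_i)^2 with respect to w
    return 2.0 * (w * xs[i] - ys[i]) * xs[i]

per_sample = [sample_grad(i) for i in range(len(xs))]

# Batch GD would use the average over all samples
full_grad = sum(per_sample) / len(per_sample)

# Each SGD step would use one randomly chosen single-sample gradient
picks = [per_sample[random.randrange(len(xs))] for _ in range(5)]

print("full-dataset gradient:", full_grad)
print("five single-sample gradients:", picks)
```

The single-sample gradients scatter around the full-dataset gradient: each one is a rough estimate, but their average over the dataset is the batch gradient.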
How SGD works
• Suppose we have a dataset with 100,000 samples.
• Instead of waiting to process all 100,000 samples (as in Batch Gradient
Descent), SGD:
• Takes one sample
• Computes the loss
• Finds the gradient
• Updates the weights immediately
• Moves to the next sample
• With 100,000 samples, this gives 100,000 weight updates in one training cycle (epoch).
Advantages of SGD
1. Very fast for large datasets
2. Uses less memory
3. Can escape local minima
4. Works well for online learning
Disadvantages of SGD
1. Loss value fluctuates
2. Does not move smoothly toward the minimum
3. Requires careful learning rate tuning
Summary
Stochastic Gradient Descent is a fast and powerful algorithm that updates neural network weights
one data point at a time, allowing efficient learning for large-scale deep learning problems.