Unit2 DeepLearning ComprehensiveNotes
UNIT 2: COMPREHENSIVE NOTES
1. Optimization Techniques
1.1 Gradient Descent — Full Derivation & Variants
1.2 The Delta Rule & Learning Rates
1.3 Batch, Stochastic & Mini-Batch Optimization
1.4 Adaptive Moment Estimation (Adam)
1.5 RMSProp
2. Backpropagation Algorithm
2.1 The Four Fundamental Equations
2.2 Complete Chain Rule Derivation
5. Attention Mechanism
5.1 Query, Key, Value Framework
5.2 Dot-Product (Scaled) Attention
5.3 Additive (Bahdanau) Attention
5.4 Multi-Head Attention
5.5 Self-Attention
Why Optimization?
A neural network learns by minimizing a Loss Function L(W) over millions of parameters W. Optimization
algorithms update these weights iteratively so the network's predictions get closer to the true labels. The
choice of optimizer is one of the most critical engineering decisions in deep learning — it directly affects
training speed, stability, and final accuracy.
Mathematical Formulation:
W_t = W_(t-1) - eta * gradient_L(W_(t-1)), where L(W) = (1/N) * SUM_i L_i(W)
Here eta is the learning rate, the single most important hyperparameter. N is the number of
samples. L is the loss (e.g., MSE or cross-entropy).
Delta rule: Delta_w_i = eta * (target - output) * x_i
This says: adjust each weight by how wrong we were (target - output), scaled by the input and learning
rate. For multi-layer networks, this generalises into backpropagation.
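A minimal sketch of the delta rule for a single linear neuron may make the update concrete. The function name `delta_rule_fit` and the toy data (generated from the hypothetical target y = 2*x1 - 3*x2) are assumptions for illustration, not part of the notes.

```python
def delta_rule_fit(samples, eta=0.1, epochs=200):
    """Train one linear neuron (output = w . x) with the delta rule."""
    n_inputs = len(samples[0][0])
    w = [0.0] * n_inputs
    for _ in range(epochs):
        for x, target in samples:
            output = sum(wi * xi for wi, xi in zip(w, x))
            error = target - output                      # how wrong we were
            # Delta rule: w_i <- w_i + eta * (target - output) * x_i
            w = [wi + eta * error * xi for wi, xi in zip(w, x)]
    return w

# Toy data from the hypothetical target y = 2*x1 - 3*x2 (illustrative only)
data = [((1.0, 0.0), 2.0), ((0.0, 1.0), -3.0), ((1.0, 1.0), -1.0)]
weights = delta_rule_fit(data)
print(weights)
```

Because the toy data is exactly linear, the learned weights should approach 2 and -3; with noisy data the rule would instead settle near the least-squares fit.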
ReduceLROnPlateau | Reduce eta when val_loss stops improving | Practical default choice
Property         | Batch GD        | Stochastic GD            | Mini-Batch GD
Speed per epoch  | Slow (all data) | Fast (1 step)            | Fastest in practice
Convergence      | Smooth but slow | Noisy, can escape minima | Best balance
Typical use case | Small datasets  | Online learning          | Deep learning default
Mini-batch GD combines the best of both worlds: the gradient estimate is accurate enough for stable
convergence, while the batch size allows GPU parallelism. The noise in the gradient (from not seeing all
data) actually helps escape sharp local minima — a phenomenon called implicit regularization. Typical
batch sizes: 32, 64, 128, 256.
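The mini-batch idea can be sketched on a one-dimensional linear regression. This is an illustrative example under stated assumptions: the function name `minibatch_gd`, the noiseless line y = 3x + 1, and the hyperparameter values are all hypothetical choices, not taken from the notes.

```python
import random

def minibatch_gd(xs, ys, eta=0.05, batch_size=4, epochs=200, seed=0):
    """Fit y = w*x + b by mini-batch gradient descent on the MSE loss."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)                        # new sample order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the mean squared error over this mini-batch only
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= eta * grad_w
            b -= eta * grad_b
    return w, b

xs = [i / 10 for i in range(20)]       # 0.0, 0.1, ..., 1.9
ys = [3.0 * x + 1.0 for x in xs]       # noiseless line, illustrative only
w, b = minibatch_gd(xs, ys)
print(round(w, 3), round(b, 3))
```

Setting `batch_size=len(xs)` recovers batch gradient descent and `batch_size=1` recovers stochastic gradient descent, so the same loop covers all three variants.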
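Step 1: Compute gradient: g_t = gradient_L(W_(t-1))
Step 2: Update 1st moment (mean): m_t = beta_1 * m_(t-1) + (1 - beta_1) * g_t
Step 3: Update 2nd moment (uncentred variance): v_t = beta_2 * v_(t-1) + (1 - beta_2) * g_t^2
Step 4: Bias correction: m_hat_t = m_t / (1 - beta_1^t), v_hat_t = v_t / (1 - beta_2^t)
Step 5: Weight update: W_t = W_(t-1) - eta * m_hat_t / (sqrt(v_hat_t) + epsilon)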
Default Hyperparameters (use these unless you have a reason not to):
• Learning rate eta = 0.001
• beta_1 = 0.9 (momentum decay — how much past gradient is remembered)
• beta_2 = 0.999 (variance decay — tracks squared gradient magnitude)
• epsilon = 1e-8 (prevents division by zero)
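The Adam steps can be sketched for a single scalar parameter as below. The second-moment and bias-correction lines follow the standard Adam formulation. The function name `adam_minimize`, the objective f(w) = (w - 3)^2, and the larger learning rate of 0.1 used in the demo call are assumptions made so the toy problem converges in a few hundred steps.

```python
import math

def adam_minimize(grad_fn, w0, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Minimise a scalar function with Adam, given its gradient grad_fn(w)."""
    w, m, v = w0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(w)                              # Step 1: gradient
        m = beta1 * m + (1 - beta1) * g             # Step 2: first moment (mean)
        v = beta2 * v + (1 - beta2) * g * g         # Step 3: second moment
        m_hat = m / (1 - beta1 ** t)                # Step 4: bias correction, needed
        v_hat = v / (1 - beta2 ** t)                #         because m and v start at 0
        w = w - eta * m_hat / (math.sqrt(v_hat) + eps)   # Step 5: weight update
    return w

# Hypothetical objective f(w) = (w - 3)^2 with gradient 2 * (w - 3)
w_final = adam_minimize(lambda w: 2 * (w - 3), w0=0.0, eta=0.1, steps=500)
print(w_final)
```

Early steps move by roughly eta regardless of the gradient's scale, because m_hat / sqrt(v_hat) is close to the sign of the gradient. This is one reason the default learning rate transfers well across problems.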
Adam: Use for quick experimentation, NLP, transformers, and when you need fast convergence.
SGD with momentum + LR scheduler: Often achieves better final accuracy in computer vision (e.g., ResNet
on ImageNet) because the sharper minimum Adam finds may generalise worse.
Rule of thumb: Prototype with Adam, then fine-tune with SGD for production models.
1.5 RMSProp
RMSProp (Root Mean Square Propagation, Hinton 2012) adapts the learning rate for each parameter
independently based on the recent magnitude of gradients. It was specifically designed to work well
with RNNs and non-stationary problems.
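RMSProp divides the learning rate by the square root of a running average of recent squared gradients:
E[g^2]_t = rho * E[g^2]_(t-1) + (1 - rho) * g_t^2, then W_t = W_(t-1) - eta * g_t / (sqrt(E[g^2]_t) + epsilon).
Parameters that have been receiving large gradient updates get a smaller effective learning rate,
preventing them from overshooting. Default rho = 0.9.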
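A small sketch of RMSProp's per-parameter scaling, using the standard update (running average of squared gradients, then division by its square root). The function name `rmsprop_minimize` and the badly conditioned objective f(a, b) = 0.5*a^2 + 50*b^2 are hypothetical choices for illustration.

```python
import math

def rmsprop_minimize(grad_fn, w0, eta=0.01, rho=0.9, eps=1e-8, steps=500):
    """Minimise a function of several parameters with RMSProp."""
    w = list(w0)
    avg_sq = [0.0] * len(w)                 # running average of squared gradients
    for _ in range(steps):
        g = grad_fn(w)
        for i in range(len(w)):
            avg_sq[i] = rho * avg_sq[i] + (1 - rho) * g[i] ** 2
            # Each parameter gets its own effective learning rate
            w[i] -= eta * g[i] / (math.sqrt(avg_sq[i]) + eps)
    return w

# Hypothetical objective f(a, b) = 0.5*a^2 + 50*b^2, gradient (a, 100*b)
final = rmsprop_minimize(lambda w: [w[0], 100.0 * w[1]], w0=[1.0, 1.0])
print(final)
```

Although the second coordinate has gradients 100 times larger, both coordinates take steps of similar size because each gradient is divided by its own running magnitude.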
SECTION 2: BACKPROPAGATION ALGORITHM
Backpropagation is the algorithm that makes training deep networks computationally feasible. It
efficiently computes the gradient of the loss with respect to every weight in the network by applying
the chain rule backwards through the computation graph.
Notation
L = loss | l = layer index | j,k = neuron indices | delta^l_j = error at neuron j in layer l | w^l_jk = weight from
neuron k in layer l-1 to neuron j in layer l | b^l_j = bias | z^l_j = weighted input | a^l_j = activation | sigma =
activation function
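Equation 1 (output error): delta^L_j = (dL/da^L_j) * sigma'(z^L_j), where superscript L is the output layer
How wrong is each output neuron? Multiply how much the loss changes w.r.t. the output by the derivative of
the activation. For MSE + sigmoid: delta^L = (a^L - y) * sigma'(z^L)
Equation 2 (error propagation): delta^l = ((W^(l+1))^T delta^(l+1)) * sigma'(z^l), multiplied element-wise
Pass the error backwards through the weights. The transpose W^T reverses the forward direction.
Equation 3 (bias gradient): dL/db^l_j = delta^l_j
The gradient w.r.t. a bias equals the error at that neuron directly.
Equation 4 (weight gradient): dL/dw^l_jk = a^(l-1)_k * delta^l_j
The gradient w.r.t. a weight = activation of previous layer * error of current layer.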
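The four equations can be checked numerically on a tiny network. The sketch below assumes a 2-2-1 sigmoid network with loss 0.5 * (a - y)^2; the weights, the input, and the helper names are hypothetical values chosen only for illustration. It compares one analytical gradient against a central finite difference.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Forward pass of a 2-layer sigmoid network; returns all intermediate values."""
    z1 = [sum(W1[j][k] * x[k] for k in range(len(x))) + b1[j] for j in range(len(b1))]
    a1 = [sigmoid(z) for z in z1]
    z2 = [sum(W2[j][k] * a1[k] for k in range(len(a1))) + b2[j] for j in range(len(b2))]
    a2 = [sigmoid(z) for z in z2]
    return z1, a1, z2, a2

def loss(x, y, W1, b1, W2, b2):
    a2 = forward(x, W1, b1, W2, b2)[3]
    return 0.5 * sum((a - t) ** 2 for a, t in zip(a2, y))

def backward(x, y, W1, b1, W2, b2):
    """Gradients from the four backpropagation equations (MSE loss, sigmoid units)."""
    z1, a1, z2, a2 = forward(x, W1, b1, W2, b2)
    # Equation 1: output error = dL/da * sigma'(z), using sigma'(z) = a * (1 - a)
    delta2 = [(a2[j] - y[j]) * a2[j] * (1 - a2[j]) for j in range(len(a2))]
    # Equation 2: push the error back through the transposed weights
    delta1 = [sum(W2[j][k] * delta2[j] for j in range(len(delta2))) * a1[k] * (1 - a1[k])
              for k in range(len(a1))]
    # Equation 3: bias gradients are the errors themselves
    # Equation 4: weight gradient = previous-layer activation * current-layer error
    dW2 = [[delta2[j] * a1[k] for k in range(len(a1))] for j in range(len(delta2))]
    dW1 = [[delta1[j] * x[k] for k in range(len(x))] for j in range(len(delta1))]
    return dW1, delta1, dW2, delta2

# Hypothetical small network and sample, chosen only for illustration
W1, b1 = [[0.1, -0.2], [0.4, 0.3]], [0.0, 0.1]
W2, b2 = [[0.5, -0.6]], [0.2]
x, y = [1.0, 0.5], [1.0]

dW1, db1, dW2, db2 = backward(x, y, W1, b1, W2, b2)

# Gradient check on W1[0][1] with a central finite difference
h = 1e-6
W1[0][1] += h
loss_plus = loss(x, y, W1, b1, W2, b2)
W1[0][1] -= 2 * h
loss_minus = loss(x, y, W1, b1, W2, b2)
W1[0][1] += h                                # restore the original weight
numeric = (loss_plus - loss_minus) / (2 * h)
print(dW1[0][1], numeric)
```

If the two printed numbers disagree beyond a small tolerance, the usual culprits are a missing sigma'(z) factor or swapped indices in the weight gradient.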
Vanishing Gradients: With sigmoid, sigma'(z) <= 0.25 everywhere (for tanh the derivative is at most 1, and
both derivatives approach 0 once units saturate). Each layer multiplies the backpropagated error by such a
factor, so across many layers the gradient can shrink exponentially and early layers learn very slowly.
Fix: Use ReLU (sigma'(z) = 1 for z>0), Batch Normalization, residual connections.
Exploding Gradients: Weights are large -> gradients grow exponentially. Fix: Gradient clipping (clip ||g|| to
max_norm), weight regularization, BatchNorm.
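Two quick numerical illustrations, offered as a sketch: the best-case product of sigmoid derivatives across several layers, and gradient clipping by L2 norm. The helper names `sigmoid_prime` and `clip_by_norm` and the example gradient are assumptions for illustration.

```python
import math

def sigmoid_prime(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def clip_by_norm(grad, max_norm):
    """Rescale grad so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]

# Largest possible product of sigmoid derivatives over n layers (all z = 0)
for n_layers in (1, 5, 10, 20):
    print(n_layers, sigmoid_prime(0.0) ** n_layers)

# Hypothetical exploding gradient with norm 50, clipped to norm 5
clipped = clip_by_norm([30.0, 40.0], max_norm=5.0)
print(clipped)
```

Clipping keeps the gradient's direction and only shrinks its length, which is why it is usually preferred over clipping each component separately.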
SECTION 3: EFFECTIVE TRAINING TECHNIQUES
Key Properties:
• Ensemble interpretation: with N droppable units, dropout samples from 2^N weight-sharing sub-networks; test-time scaling approximates the geometric mean of their predictions
• Typical keep probability: p=0.5 for hidden layers, p=0.8 for input layers
• Works best for fully connected layers; less beneficial for BatchNorm-equipped conv layers
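• Inverted dropout (modern): divide activations by the keep probability p during training instead of multiplying by p at test time
• Do NOT use dropout and batch normalization in the same layer: dropout shifts the activation variance between training and test, which conflicts with BatchNorm's running statistics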
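A minimal sketch of inverted dropout on a flat list of activations, assuming p is the keep probability as in the list above. The function name `inverted_dropout` and the constant test activations are hypothetical.

```python
import random

def inverted_dropout(activations, keep_prob, training, rng):
    """Inverted dropout: scale kept units by 1/keep_prob during training only."""
    if not training:
        return list(activations)            # test time: identity, no rescaling
    out = []
    for a in activations:
        keep = rng.random() < keep_prob     # Bernoulli(keep_prob) mask
        out.append(a / keep_prob if keep else 0.0)
    return out

rng = random.Random(0)
acts = [1.0] * 10000
dropped = inverted_dropout(acts, keep_prob=0.5, training=True, rng=rng)
mean_train = sum(dropped) / len(dropped)
mean_test = sum(inverted_dropout(acts, 0.5, training=False, rng=rng)) / len(acts)
print(mean_train, mean_test)
```

The training-time mean stays close to the test-time mean, so the expected activation matches and the inference code needs no dropout-specific scaling.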
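Step 1: Batch mean: mu_B = (1/m) * SUM x_i
Step 2: Batch variance: sigma^2_B = (1/m) * SUM (x_i - mu_B)^2
Step 3: Normalise: x_hat_i = (x_i - mu_B) / sqrt(sigma^2_B + epsilon)
Step 4: Scale and shift: y_i = gamma * x_hat_i + beta (gamma and beta are learnable)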
Benefits of BatchNorm:
• Allows higher learning rates → faster training
• Reduces dependence on careful weight initialisation
• Acts as a regularizer → often reduces need for Dropout
• At test time: uses running mean/variance estimated during training (not batch statistics)
• Placement: Typically after linear/conv layer, before activation (debated in literature)
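A training-mode batch normalization pass for a single feature can be sketched as follows. The mean, normalise, and scale-and-shift steps follow the standard formulation. The function name `batchnorm_forward` and the example batch are assumptions, and the running statistics used at test time are omitted to keep the sketch short.

```python
import math

def batchnorm_forward(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Batch normalization of one feature over a mini-batch (training mode)."""
    m = len(batch)
    mu = sum(batch) / m                                        # batch mean
    var = sum((x - mu) ** 2 for x in batch) / m                # batch variance
    x_hat = [(x - mu) / math.sqrt(var + eps) for x in batch]   # normalise
    out = [gamma * xh + beta for xh in x_hat]                  # learnable scale and shift
    return out, mu, var

# Hypothetical mini-batch of four values for one feature
out, mu, var = batchnorm_forward([2.0, 4.0, 6.0, 8.0])
print(out, mu, var)
```

With gamma = 1 and beta = 0 the output has approximately zero mean and unit variance; learning gamma and beta lets the network undo the normalisation where that helps.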
The attention mechanism is the foundational innovation behind Transformers, GPT, BERT, and most
modern sequence models. Originally proposed for machine translation (Bahdanau et al., 2015), it solves
the bottleneck of encoding long sequences into a fixed-size vector.
Component | Search-engine analogy             | Role in attention
Query (Q) | Search query you type into Google | What information am I looking for?
Key (K)   | Keywords/tags on each webpage     | What information does each position contain?
Value (V) | Actual content of each webpage    | What information gets retrieved if selected?
In a self-attention layer, Q, K, V are all derived from the same input X via learned linear projections:
Q = X * W_Q, K = X * W_K, V = X * W_V. The weight matrices W_Q and W_K (shape d_model x d_k) and W_V
(shape d_model x d_v, often with d_v = d_k) are the learnable parameters.
Step-by-Step Computation:
1. Linear projections: Q=XW_Q, K=XW_K, V=XW_V (shapes: Q and K are [seq_len, d_k], V is [seq_len, d_v])
2. Compute scores: S = QK^T / sqrt(d_k) (shape: [seq_len, seq_len]; every position attends to
every other)
3. Optional: Apply mask to S before the softmax (for decoder: set scores of future positions to -infinity)
4. Apply softmax: A = softmax(S) (rows sum to 1; attention weight distribution)
5. Weighted sum: Output = A * V (shape: [seq_len, d_v])
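The computation can be sketched in plain Python on already-projected Q, K, V (the linear projections are skipped for brevity). The function names and the toy matrices are assumptions for illustration. Note that in this sketch the optional causal mask is applied to the scores before the softmax, so masked positions receive exactly zero weight.

```python
import math

def softmax(row):
    peak = max(row)                                    # subtract max for numerical stability
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V, causal=False):
    """Q, K: [seq_len, d_k]; V: [seq_len, d_v]; returns (output, attention weights)."""
    d_k = len(Q[0])
    scores = [[sum(q * k for q, k in zip(q_row, k_row)) / math.sqrt(d_k) for k_row in K]
              for q_row in Q]
    if causal:
        # Position i may only attend to positions j <= i
        scores = [[s if j <= i else float("-inf") for j, s in enumerate(row)]
                  for i, row in enumerate(scores)]
    weights = [softmax(row) for row in scores]
    output = [[sum(w * v_row[c] for w, v_row in zip(w_row, V)) for c in range(len(V[0]))]
              for w_row in weights]
    return output, weights

# Hypothetical toy inputs: 3 positions, d_k = d_v = 2
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, attn = scaled_dot_product_attention(Q, K, V, causal=True)
print(attn[0], out[0])
```

With the causal mask, the first position can only attend to itself, so its output is simply the first row of V.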
5.3 Additive (Bahdanau) Attention
Bahdanau et al. (2015) introduced attention in the context of neural machine translation. Their
additive attention uses an MLP to compute alignment scores between query and key, making it more
flexible than the dot-product version but computationally heavier.
Mechanism Explained:
score(q, k) = v_a^T * tanh(W_q * q + W_k * k)
W_q and W_k project q and k into the same space; they are added (not multiplied), then passed
through tanh. The vector v_a (learnable) converts the hidden representation to a scalar score. This
extra MLP makes additive attention more flexible than a plain dot product. Both variants score all n^2
query-key pairs, but additive attention evaluates the MLP for every pair instead of using a single
matrix product, so it is slower and more memory-hungry for long sequences.
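A sketch of the additive score for one query-key pair, v_a . tanh(W_q q + W_k k), which is the standard Bahdanau-style form. The function name `additive_score` and all parameter values are hypothetical.

```python
import math

def additive_score(q, k, W_q, W_k, v_a):
    """Bahdanau-style alignment score: v_a . tanh(W_q q + W_k k)."""
    hidden = []
    for row_q, row_k in zip(W_q, W_k):
        proj_q = sum(w * x for w, x in zip(row_q, q))
        proj_k = sum(w * x for w, x in zip(row_k, k))
        hidden.append(math.tanh(proj_q + proj_k))      # projections are added, not multiplied
    return sum(v * h for v, h in zip(v_a, hidden))     # v_a reduces the hidden vector to a scalar

# Hypothetical parameters: hidden size 2, query/key size 2
W_q = [[1.0, 0.0], [0.0, 1.0]]
W_k = [[0.5, 0.0], [0.0, 0.5]]
v_a = [1.0, -1.0]
score = additive_score([1.0, 0.0], [2.0, 0.0], W_q, W_k, v_a)
print(score)
```

To obtain attention weights, this score would be computed for every key and the resulting row passed through a softmax, exactly as with dot-product scores.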
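5.4 Multi-Head Attention
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) * W_O, where head_i = Attention(Q * W_Q^i, K * W_K^i, V * W_V^i)
W_O (d_model x d_model) is an output projection matrix. Typically h=8 heads with d_k = d_v =
d_model/h = 64 for d_model=512. The total parameter count is the same as single-head attention with
full d_model-sized projections: we just split and recombine.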
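The parameter-count claim can be checked with a few lines of arithmetic. The function name `attention_param_count` is hypothetical, and the count assumes projection weights only (no biases) with d_k = d_v = d_model / h.

```python
def attention_param_count(d_model, n_heads):
    """Count projection weights (no biases) in multi-head attention."""
    d_k = d_v = d_model // n_heads
    per_head = d_model * d_k + d_model * d_k + d_model * d_v   # W_Q, W_K, W_V of one head
    output_proj = (n_heads * d_v) * d_model                     # W_O
    return n_heads * per_head + output_proj

multi = attention_param_count(d_model=512, n_heads=8)
single = attention_param_count(d_model=512, n_heads=1)
print(multi, single)
```

Both calls give the same total (4 * 512 * 512 weights), since splitting into heads only partitions the columns of the projection matrices.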
5.5 Self-Attention
In self-attention, Q, K, V all come from the same sequence. Each position attends to all other
positions in the sequence — enabling the model to build rich, long-range representations without any
recurrence (unlike RNNs).
Self-Attention vs Cross-Attention
Self-Attention: Q, K, V from the SAME sequence. Used in the Transformer encoder (and, with a causal
mask, in the decoder).
Cross-Attention: Q from decoder, K and V from encoder output. Used in the Transformer decoder. This is
how the decoder 'looks at' the source sentence while generating translations.
SECTION 6: THEORY PRACTICE QUESTIONS (UNIVERSITY EXAM)
These questions are modelled on SRM IST University exam patterns at Bloom's Levels 2–5
(Understand, Apply, Analyse, Evaluate). Questions marked [12M] require a structured essay with
diagrams/derivations. [8M] require detailed explanations. [2M] = short answers.
Q1. [12M] [Unit 2] [BL4-Analyse] [Optimization]
Derive the complete Adam optimizer update rule from first principles. Explain the role of the first
moment (m_t), second moment (v_t), and bias correction terms. Compare Adam with vanilla SGD and RMSProp
in terms of convergence behaviour and hyperparameter sensitivity. When would you choose SGD over Adam?
Q2. [12M] [Unit 2] [BL3-Apply] [Backprop]
Derive the four fundamental equations of backpropagation using the chain rule. Show how these equations
are applied in a 3-layer neural network with sigmoid activation to update weights. Discuss the vanishing
gradient problem and state at least three solutions with mathematical justification.
Q3. [12M] [Unit 2] [BL4-Analyse] [BatchNorm]
Explain Batch Normalization with its complete algorithm (all 4 steps). Derive why learnable parameters
gamma and beta are necessary. Compare Batch Normalization with Instance Normalization and Group
Normalization; explain when each should be used in a medical imaging pipeline (e.g., MRI segmentation).
Q5. [12M] [Unit 2] [BL5-Evaluate] [Overfitting]
Explain the bias-variance tradeoff in the context of overfitting in deep neural networks. Derive the
expected error decomposition. Describe five strategies to combat overfitting when you have a small
medical dataset (e.g., 500 chest X-rays). Justify each strategy mathematically or empirically.
Q6. [8M] [Unit 2] [BL3-Apply] [GD Variants]
Derive and compare Batch Gradient Descent, Stochastic Gradient Descent, and Mini-Batch Gradient Descent.
Prove that mini-batch GD provides an unbiased estimate of the full batch gradient. Discuss the effect of
batch size on: (a) gradient variance, (b) convergence speed, (c) generalisation performance.
Q7. [8M] [Unit 2] [BL3-Apply] [Dropout]
Explain Dropout regularization with its mathematical formulation. Describe the ensemble interpretation
of Dropout and prove that the test-time weight scaling (W_test = p * W_train) is needed. How does
inverted Dropout differ, and why is it preferred in modern frameworks like PyTorch?
Q8. [8M] [Unit 2] [BL4-Analyse] [Early Stopping]
Explain Early Stopping as a regularization technique. Under what conditions is early stopping
equivalent to L2 regularization? Describe the role of patience and validation loss monitoring. Design a
complete training protocol for a deep learning model on a 1000-sample ECG dataset using early stopping.
Q9. [8M] [Unit 2] [BL4-Analyse] [Multi-Head]
Explain Multi-Head Attention and Self-Attention mechanisms. Derive the output of Multi-Head Attention
with h=2 heads for a sequence of length 4 with d_model=4. Why is self-attention superior to RNNs for
capturing long-range dependencies? Give a medical NLP example (e.g., clinical note understanding).
Q10. [12M] [Unit 2] [BL5-Evaluate] [Pipeline Design]
Critically compare the following optimizers for training a deep CNN on a medical image segmentation
task with 2000 samples and severe class imbalance: (a) SGD with momentum, (b) Adam, (c) RMSProp. Discuss
choice of loss function, learning rate schedule, and regularization strategy. Justify your final
recommended pipeline.
SECTION 7: PRACTICAL / CODING QUESTIONS
Format
These questions test implementation ability and project-level thinking. Each requires working code
(PyTorch or Keras/TensorFlow), analysis of results, and discussion of design choices. They mirror the
kind of questions asked in company coding rounds (Google, Microsoft) and research paper
reproductions.
Q1. [Coding] [NumPy] [Optimizers] [Analysis]
Implement Mini-Batch SGD, Adam, and RMSProp from scratch in NumPy (no deep learning libraries). Apply
all three to minimise f(x,y) = x^2 + 10y^2 from starting point (5, 5). Plot the convergence curves.
Report final loss after 200 steps. Which converges fastest? Explain why based on the mathematical
properties of each optimizer.
Q2. [PyTorch] [MLP] [Regularization] [MNIST]
Using PyTorch, implement a 3-layer MLP on the MNIST dataset. Train three versions: (a) no
regularization, (b) Dropout(p=0.5) on hidden layers, (c) Batch Normalization after each linear layer.
Compare training loss, validation accuracy, and training time. Plot learning curves. Which generalises
best and why?
Q3. [NumPy] [Backprop] [Gradient Check] [Coding]
Implement Backpropagation from scratch in NumPy for a 2-layer network with sigmoid activations. Verify
your implementation with gradient checking (numerical gradient vs analytical gradient, tolerance 1e-5).
Train on a toy XOR dataset and confirm convergence. Report the forward pass, loss, and weight updates
for epoch 1.
Q5. [PyTorch] [Augmentation] [Medical Imaging] [ResNet]
Load a public medical image dataset (e.g., Chest X-Ray14 or ISIC skin lesion). Implement a data
augmentation pipeline using [Link] with at least 6 augmentations. Train a ResNet-18: (a) without
augmentation, (b) with augmentation. Compare validation AUC. Visualise 5 augmented samples before
training.
Q6. [PyTorch] [Gradient Analysis] [BatchNorm] [Visualization]
Demonstrate the vanishing gradient problem: train a 10-layer network with sigmoid activation and plot
the gradient magnitudes at each layer during backpropagation. Then switch to ReLU activation and repeat.
Explain the difference quantitatively. Finally, add Batch Normalization and show how it resolves the
issue.
Q7. [PyTorch] [Early Stopping] [Custom Class] [Training Loop]
Implement early stopping from scratch in PyTorch (no callbacks). Define a custom EarlyStopping class
with parameters: patience=10, min_delta=0.001, restore_best_weights=True. Train a CNN on a small dataset
(CIFAR-10 subset of 2000 samples). Show the training curve with the stopping point marked clearly.
Q8. [PyTorch] [Focal Loss] [Class Imbalance] [Medical]
Build a complete training pipeline in PyTorch that handles class imbalance using Focal Loss (implement
it from scratch). Use a WeightedRandomSampler to oversample minority classes. Test on an imbalanced
dataset (e.g., 90% negative, 10% positive chest X-ray). Compare standard CE loss vs Focal Loss in terms
of sensitivity/specificity.
Q9. [PyTorch] [Normalization] [BatchNorm] [Experiment]
Implement and compare three normalization strategies on the same CNN architecture: BatchNorm,
LayerNorm, and GroupNorm (G=8). Train on CIFAR-10 with batch sizes of [1, 4, 32, 128]. Plot validation
accuracy vs batch size for each normalization type. Explain the observed trends from first principles.
Q10. [Project] [Medical MRI] [End-to-End] [Full Pipeline]
Real-World Project: Design a complete training system for classifying 5 types of brain tumours from MRI
scans (1000 images, highly imbalanced). Your solution must include: (a) data augmentation pipeline
(elastic deformation + colour jitter), (b) Adam + cosine annealing LR scheduler, (c) focal loss with
class weights, (d) early stopping with patience=15, (e) batch norm in each conv block. Report validation
macro-F1 score and confusion matrix. Justify every design choice.
SECTION 8: COMPETITIVE EXAM & RESEARCH QUICK REFERENCE
Nielsen, Neural Networks & DL | Backprop 4 equations, GD intuition (Chapters 1–3) | Free: neuralnetworksanddeeplearning.c
[Link] (Zhang et al.) | Attention (Ch 11), Optimization (Ch 12), BatchNorm (Ch 8.5) | Free: [Link]
Vaswani et al. (2017) | Original Scaled Dot-Product & Multi-Head Attention | arXiv:1706.03762 (free)
Bahdanau et al. (2015) | Original Additive Attention (Align & Translate) | arXiv:1409.0473 (free)
All the best, Geeta! This document covers every topic in Unit 2 of 21BME381T at the depth
needed for university exams, real-world deep learning projects, and competitive exams like
GATE. Work through the theory questions first, then implement the practical ones. The
Goodfellow Ch. 7–8 + [Link] Ch. 11 combination will give you complete mastery.