
Unit2 DeepLearning ComprehensiveNotes

The document provides comprehensive notes on training deep neural networks, covering optimization techniques, backpropagation, effective training methods, and strategies to handle overfitting and small datasets. It includes detailed explanations of various algorithms such as Gradient Descent, Adam, and RMSProp, as well as training techniques like early stopping and dropout. Additionally, it addresses the importance of attention mechanisms and data strategies in deep learning applications in medicine.

Uploaded by

pankhurimithiya

21BME381T

Deep Learning Techniques in Medicine

UNIT 2 — COMPREHENSIVE NOTES

Training Deep Neural Networks

Optimization | Regularization | Attention Mechanism | Data Strategies

Course: 21BME381T — Deep Learning Techniques in Medicine

Unit: Unit 2 — Training Deep Neural Networks (9 Hours)

Prepared for: Geeta, [Link] Biomedical Engineering, SRM IST

Coverage: University Exam + Real-World Projects + Competitive Exams

References: Goodfellow et al. | Nielsen | [Link] | Aggarwal | Calin


Table of Contents

1. Optimization Techniques
1.1 Gradient Descent — Full Derivation & Variants
1.2 The Delta Rule & Learning Rates
1.3 Batch, Stochastic & Mini-Batch Gradient Descent
1.4 Adaptive Moment Estimation (Adam)
1.5 RMSProp

2. Backpropagation Algorithm
2.1 The Four Fundamental Equations
2.2 Full Backpropagation Algorithm

3. Effective Training Techniques


3.1 Early Stopping
3.2 Dropout
3.3 Batch Normalization
3.4 Instance Normalization
3.5 Group Normalization — Comparison Table

4. Handling Overfitting & Small Datasets


4.1 Overfitting — Bias-Variance Tradeoff
4.2 Data Augmentation
4.3 Redesigning the Loss Function
4.4 Generating Synthetic Data

5. Attention Mechanism
5.1 Query, Key, Value Framework
5.2 Dot-Product (Scaled) Attention
5.3 Additive (Bahdanau) Attention
5.4 Multi-Head Attention
5.5 Self-Attention

6. 10 Theory Practice Questions (University Exam)


7. 10 Practical / Coding Questions
8. Competitive Exam & Research Quick Reference
SECTION 1: OPTIMIZATION TECHNIQUES

Why Optimization?

A neural network learns by minimizing a Loss Function L(W) over millions of parameters W. Optimization
algorithms update these weights iteratively so the network's predictions get closer to the true labels. The
choice of optimizer is one of the most critical engineering decisions in deep learning — it directly affects
training speed, stability, and final accuracy.

1.1 Gradient Descent — Full Derivation & Variants


Gradient Descent (GD) is the backbone of all neural network training. The idea is simple: find the
direction in weight space that decreases the loss most steeply, and take a small step in that direction.
Core Intuition: Imagine the loss surface as a hilly landscape. You want to walk downhill (minimise
loss). The gradient points in the uphill direction, so you move opposite to it.

Mathematical Formulation:

Weight Update Rule W_(t+1) = W_t - eta * gradient_L(W_t)

Gradient w.r.t. W gradient_L(W) = (1/N) * SUM[ dL(y_i, y_hat_i) / dW ]

Where eta is the learning rate, the single most important hyperparameter. N is the number of
samples. L is the loss (e.g., MSE or cross-entropy).

Learning Rate Effects:


• eta too large → Loss oscillates or diverges (overshoots the minimum)
• eta too small → Training is extremely slow; might get stuck in local minima
• eta just right → Smooth convergence to a good minimum
• Solution: Learning Rate Schedulers (step decay, cosine annealing, warm restarts)
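
The learning-rate effects listed above can be checked on a toy problem. The following is a minimal sketch, assuming the illustrative loss L(w) = 0.5 * w^2 (so the gradient is simply w); the function names and the three eta values are chosen here for demonstration and do not come from the notes.

```python
import numpy as np

def gradient_descent(grad_fn, w0, eta, steps):
    """Repeat W <- W - eta * grad_L(W) for a fixed number of steps."""
    w = np.asarray(w0, dtype=float)
    for _ in range(steps):
        w = w - eta * grad_fn(w)
    return w

# Toy loss L(w) = 0.5 * w^2 (an assumption for illustration); its gradient is w.
def toy_grad(w):
    return w

w_small = gradient_descent(toy_grad, [4.0], eta=0.01, steps=50)  # too small: barely moves
w_good = gradient_descent(toy_grad, [4.0], eta=0.5, steps=50)    # converges towards 0
w_large = gradient_descent(toy_grad, [4.0], eta=2.5, steps=50)   # too large: diverges
print(abs(w_small[0]), abs(w_good[0]), abs(w_large[0]))
```

On this loss each step multiplies w by (1 - eta), so any eta above 2 makes |w| grow at every step, which is the "overshoot" case from the list.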

1.2 The Delta Rule & Learning Rates


The Delta Rule (Widrow-Hoff Rule) is the precursor to backpropagation, applicable to single-layer
linear networks. It forms the conceptual foundation for gradient-based weight updates.

Delta Rule delta_W = eta * (target - output) * input

This says: adjust weight by how wrong we were (target - output), scaled by the input and learning
rate. For multi-layer networks, this generalises into backpropagation.
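
As a rough illustration of the Delta Rule, the sketch below trains a single linear neuron without a bias term. The three training inputs and the weights used to generate the targets are hypothetical values picked for the example.

```python
import numpy as np

def delta_rule_train(X, targets, eta=0.1, epochs=200):
    """Train one linear neuron with delta_W = eta * (target - output) * input."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, t_i in zip(X, targets):
            output = w @ x_i                     # linear neuron, no activation
            w += eta * (t_i - output) * x_i      # Delta Rule update
    return w

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
true_w = np.array([2.0, -3.0])       # hypothetical weights used to generate targets
targets = X @ true_w
w_learned = delta_rule_train(X, targets)
print(w_learned)
```

Because the targets here are exactly linear in the inputs, the learned weights should approach the generating weights.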

Learning Rate Scheduling Strategies:


Strategy | Formula / Idea | When to Use
Step Decay | eta = eta_0 * gamma^floor(epoch/step) | Simple; fixed milestones
Exponential Decay | eta = eta_0 * e^(-lambda*t) | Smooth, gradual reduction
Cosine Annealing | eta = eta_min + 0.5*(eta_max-eta_min)*(1+cos(pi*t/T)) | State-of-the-art; avoids sharp drops
Warm Restarts | Cosine + periodic reset | Escaping local minima
ReduceLROnPlateau | Reduce eta when val_loss stops improving | Practical default choice
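
The closed-form schedules in the table translate directly into small functions. This is a sketch using only the standard library; the function and argument names are chosen here for readability and are not tied to any framework.

```python
import math

def step_decay(epoch, eta0, gamma, step):
    """eta = eta_0 * gamma^floor(epoch/step)."""
    return eta0 * gamma ** (epoch // step)

def exponential_decay(t, eta0, lam):
    """eta = eta_0 * e^(-lambda*t)."""
    return eta0 * math.exp(-lam * t)

def cosine_annealing(t, eta_min, eta_max, T):
    """eta = eta_min + 0.5*(eta_max-eta_min)*(1+cos(pi*t/T))."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t / T))

for epoch in (0, 10, 25, 50, 100):
    print(epoch,
          step_decay(epoch, eta0=0.1, gamma=0.5, step=10),
          exponential_decay(epoch, eta0=0.1, lam=0.05),
          cosine_annealing(epoch, eta_min=0.0, eta_max=0.1, T=100))
```

Cosine annealing starts at eta_max when t = 0 and reaches eta_min when t = T; a warm restart simply resets t to 0.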

1.3 Batch, Stochastic & Mini-Batch Gradient Descent


The key distinction between GD variants is how many samples are used to compute the gradient
before updating weights. This is arguably the most practically important decision after choosing your
architecture.

Property | Batch GD | Stochastic (SGD) | Mini-Batch GD
Samples per update | All N samples | 1 sample | B samples (32–512)
Gradient quality | Exact, stable | Very noisy | Good approximation
Speed per update | Slow (all data) | Fast (1 sample) | Fast; fastest wall-clock in practice
Memory usage | High | Very low | Moderate
Convergence | Smooth but slow | Noisy, can escape minima | Best balance
Parallelisation (GPU) | Good | Poor | Excellent
Used in practice? | Rarely | Rarely (pure) | Almost always
Typical use case | Small datasets | Online learning | Deep learning default

Key Insight: Why Mini-Batch Wins

Mini-batch GD combines the best of both worlds: the gradient estimate is accurate enough for stable
convergence, while the batch size allows GPU parallelism. The noise in the gradient (from not seeing all
data) actually helps escape sharp local minima — a phenomenon called implicit regularization. Typical
batch sizes: 32, 64, 128, 256.
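
The three variants differ only in the batch size, which the following sketch makes explicit for linear regression with an MSE loss. The synthetic data, the generating weights and the hyperparameters are assumptions for illustration.

```python
import numpy as np

def minibatch_sgd(X, y, batch_size=32, eta=0.1, epochs=50, seed=0):
    """Mini-batch GD for linear regression (loss = 0.5 * mean squared error).

    batch_size=1 gives pure SGD; batch_size=len(X) gives batch GD.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(n)                  # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            err = X[idx] @ w - y[idx]
            grad = X[idx].T @ err / len(idx)        # gradient averaged over the batch
            w -= eta * grad
    return w

data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(256, 3))
true_w = np.array([2.0, -1.0, 0.5])    # hypothetical generating weights
y = X @ true_w                          # noiseless targets
w_hat = minibatch_sgd(X, y)
print(w_hat)
```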

1.4 Adaptive Moment Estimation (Adam)


Adam (Kingma & Ba, 2014) is the de facto standard optimizer in deep learning. It combines the
benefits of two other methods: Momentum (accelerates in the right direction) and RMSProp (adapts
learning rate per parameter). The result is fast, stable training that works well with minimal
hyperparameter tuning.

Adam Algorithm — Step by Step:

Step 1 (compute gradient): g_t = gradient_L(W_(t-1))

Step 2 (update 1st moment, mean): m_t = beta_1 * m_(t-1) + (1 - beta_1) * g_t

Step 3 (update 2nd moment, variance): v_t = beta_2 * v_(t-1) + (1 - beta_2) * g_t^2

Step 4 (bias correction): m_hat_t = m_t / (1 - beta_1^t), v_hat_t = v_t / (1 - beta_2^t)

Step 5 (weight update): W_t = W_(t-1) - eta * m_hat_t / (sqrt(v_hat_t) + epsilon)

Default Hyperparameters (use these unless you have a reason not to):
• Learning rate eta = 0.001
• beta_1 = 0.9 (momentum decay — how much past gradient is remembered)
• beta_2 = 0.999 (variance decay — tracks squared gradient magnitude)
• epsilon = 1e-8 (prevents division by zero)

What does Adam actually do?


m_t is an exponential moving average of past gradients — this gives Adam momentum, smoothing
out noisy gradients. v_t is an exponential moving average of squared gradients — this scales the
learning rate: parameters with large gradients get a smaller effective eta, and parameters with small
gradients get a larger eta. The bias correction terms (Step 4) compensate for the zero-initialisation of
m and v at t=0, which would otherwise bias estimates toward zero in early training.
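
The five Adam steps can be written out in a few lines of NumPy. This is a minimal sketch rather than a reference implementation; the test function f(x, y) = x^2 + 10y^2 is borrowed from the coding questions later in these notes, and eta = 0.1 is an assumed value chosen so the toy problem converges quickly (the default is 0.001).

```python
import numpy as np

def adam_minimise(grad_fn, w0, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    w = np.asarray(w0, dtype=float).copy()
    m = np.zeros_like(w)                         # 1st moment, initialised to zero
    v = np.zeros_like(w)                         # 2nd moment, initialised to zero
    for t in range(1, steps + 1):
        g = grad_fn(w)                           # Step 1: gradient
        m = beta1 * m + (1 - beta1) * g          # Step 2: 1st moment
        v = beta2 * v + (1 - beta2) * g ** 2     # Step 3: 2nd moment
        m_hat = m / (1 - beta1 ** t)             # Step 4: bias correction
        v_hat = v / (1 - beta2 ** t)
        w = w - eta * m_hat / (np.sqrt(v_hat) + eps)   # Step 5: update
    return w

# Gradient of f(x, y) = x^2 + 10*y^2
def quad_grad(w):
    return np.array([2.0 * w[0], 20.0 * w[1]])

w_adam = adam_minimise(quad_grad, w0=[5.0, 5.0], eta=0.1, steps=500)
print(w_adam)
```

At t = 1 the bias-corrected ratio m_hat / sqrt(v_hat) equals the sign of the gradient, so the first step has size eta in every coordinate regardless of the gradient's scale.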

Adam vs SGD — Which to Use?

Adam: Use for quick experimentation, NLP, transformers, and when you need fast convergence.
SGD with momentum + LR scheduler: Often achieves better final accuracy in computer vision (e.g.,
ResNet on ImageNet) because the sharper minimum Adam finds may generalise worse.
Rule of thumb: Prototype with Adam, then fine-tune with SGD for production models.

1.5 RMSProp
RMSProp (Root Mean Square Propagation, Hinton 2012) adapts the learning rate for each parameter
independently based on the recent magnitude of gradients. It was proposed in Hinton's Coursera
lectures as a mini-batch version of Rprop, and it works well with RNNs and non-stationary problems.

Cache update E[g^2]_t = rho * E[g^2]_(t-1) + (1-rho) * g_t^2

Weight update W_t = W_(t-1) - (eta / sqrt(E[g^2]_t + epsilon)) * g_t

RMSProp divides the learning rate by a running average of recent gradient magnitudes. Parameters
that have been receiving large gradient updates get a smaller effective learning rate, preventing them
from overshooting. Default rho = 0.9.
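
A short NumPy sketch of the two RMSProp equations follows. The quadratic test function, starting point, eta and step count are assumptions for illustration.

```python
import numpy as np

def rmsprop_minimise(grad_fn, w0, eta=0.01, rho=0.9, eps=1e-8, steps=1000):
    w = np.asarray(w0, dtype=float).copy()
    cache = np.zeros_like(w)                         # running average E[g^2]
    for _ in range(steps):
        g = grad_fn(w)
        cache = rho * cache + (1 - rho) * g ** 2     # cache update
        w = w - eta / np.sqrt(cache + eps) * g       # weight update
    return w

# Gradient of f(x, y) = x^2 + 10*y^2 (toy problem)
def quad_grad(w):
    return np.array([2.0 * w[0], 20.0 * w[1]])

w_rms = rmsprop_minimise(quad_grad, w0=[5.0, 5.0])
print(w_rms)
```

With a constant eta the iterates settle into a small oscillation around the minimum rather than converging exactly, which is one reason RMSProp is usually paired with a decaying learning rate.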
SECTION 2: BACKPROPAGATION ALGORITHM

Backpropagation is the algorithm that makes training deep networks computationally feasible. It
efficiently computes the gradient of the loss with respect to every weight in the network by applying
the chain rule backwards through the computation graph.

2.1 The Four Fundamental Equations (Nielsen, 2015)

Notation

L = loss | l = layer index | j,k = neuron indices | delta^l_j = error at neuron j in layer l | w^l_jk = weight from
neuron k in layer l-1 to neuron j in layer l | b^l_j = bias | z^l_j = weighted input | a^l_j = activation | sigma =
activation function

BP1 — Output layer error

Formula delta^L_j = (dL/da^L_j) * sigma'(z^L_j)

How wrong is each output neuron? Multiply how much loss changes w.r.t. output by the derivative of
activation. For MSE + sigmoid: delta^L = (a^L - y) * sigma'(z^L)

BP2 — Backpropagate error

Formula delta^l = ((W^{l+1})^T * delta^{l+1}) elementwise_product sigma'(z^l)

Pass error backwards through weights. The transpose W^T reverses the forward direction.

BP3 — Gradient of bias

Formula dL/db^l_j = delta^l_j

The gradient w.r.t. bias equals the error at that neuron directly.

BP4 — Gradient of weight

Formula dL/dw^l_jk = a^{l-1}_k * delta^l_j

The gradient w.r.t. weight = activation of previous layer * error of current layer.

2.2 Full Backpropagation Algorithm


1. FORWARD PASS: Feed input x through the network. Store all z^l and a^l values at each layer.
2. OUTPUT ERROR: Compute delta^L using BP1.
3. BACKWARD PASS: For l = L-1, L-2, ..., 2 — compute delta^l using BP2.
4. GRADIENTS: Compute dL/dw and dL/db using BP3 and BP4.
5. UPDATE: Apply optimizer (GD, Adam, etc.) to update W and b.
6. REPEAT for each mini-batch until convergence.
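
The forward pass, BP1 to BP4, and a numerical gradient check can be sketched for a small two-layer sigmoid network. The layer sizes, the random initial weights, the single training example and the loss 0.5 * ||a - y||^2 are assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(params, x):
    W1, b1, W2, b2 = params
    z1 = W1 @ x + b1
    a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2
    a2 = sigmoid(z2)
    return a1, a2

def loss(params, x, y):
    return 0.5 * np.sum((forward(params, x)[1] - y) ** 2)

def backprop(params, x, y):
    W1, b1, W2, b2 = params
    a1, a2 = forward(params, x)
    delta2 = (a2 - y) * a2 * (1 - a2)            # BP1: sigma'(z) = a * (1 - a)
    delta1 = (W2.T @ delta2) * a1 * (1 - a1)     # BP2
    # BP4 gives the weight gradients (outer products), BP3 the bias gradients
    return [np.outer(delta1, x), delta1, np.outer(delta2, a1), delta2]

rng = np.random.default_rng(0)
params = [rng.normal(size=(3, 2)), rng.normal(size=3),
          rng.normal(size=(1, 3)), rng.normal(size=1)]
x, y = np.array([0.5, -1.0]), np.array([1.0])
grads = backprop(params, x, y)

# Gradient check on one weight using a central difference
h = 1e-6
params[0][0, 0] += h
loss_plus = loss(params, x, y)
params[0][0, 0] -= 2 * h
loss_minus = loss(params, x, y)
params[0][0, 0] += h                             # restore the weight
numeric = (loss_plus - loss_minus) / (2 * h)
print(numeric, grads[0][0, 0])
```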
Vanishing & Exploding Gradients — The Core Problem

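Vanishing Gradients: The sigmoid derivative satisfies sigma'(z) <= 0.25 (with equality only at
z = 0), and the tanh derivative satisfies sigma'(z) <= 1 and falls quickly towards 0 as |z| grows.
BP2 multiplies in one such factor per layer, so in a deep sigmoid/tanh network the gradient can
shrink exponentially with depth and the early layers learn very slowly. Fix: Use ReLU (sigma'(z)
= 1 for z>0), Batch Normalization, residual connections.

Exploding Gradients: When weights are large, the repeated multiplication by W^T in BP2 makes
gradients grow exponentially. Fix: Gradient clipping (rescale g so that ||g|| <= max_norm), weight
regularization, BatchNorm.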
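
To put numbers on both problems, the sketch below prints the best-case bound 0.25^depth for a chain of sigmoid derivatives and shows one common form of gradient clipping (rescaling by the global norm). The helper name and the example gradients are hypothetical.

```python
import numpy as np

# Upper bound on the product of sigmoid derivatives across `depth` layers.
for depth in (5, 10, 20):
    print(depth, 0.25 ** depth)

def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients together so their combined L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads], total_norm

grads = [np.array([3.0, 4.0]), np.array([12.0])]     # global norm = 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(norm_before, norm_after)
```

Clipping by the global norm keeps the direction of the update unchanged and only shortens it.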
SECTION 3: EFFECTIVE TRAINING TECHNIQUES

3.1 Early Stopping


Early stopping halts training when the validation loss stops improving, preventing the model from
memorising the training set. It is the simplest and most universally applicable form of regularization.
• Monitor: Validation loss (not training loss)
• Patience: Number of epochs to wait for improvement before stopping (typical: 5–20)
• Restore best weights: Rollback to the epoch with minimum validation loss
• Goodfellow et al.: Early stopping is equivalent to L2 regularization under certain conditions
• Keras: EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)
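
The patience logic is framework-independent. Below is a minimal sketch in plain Python; the class, its attribute names and the validation-loss sequence are made up for illustration, and a real training loop would also save the model weights whenever a new best loss is recorded.

```python
class EarlyStopping:
    """Track the best validation loss and signal when patience runs out."""

    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.best_epoch = -1
        self.wait = 0

    def step(self, epoch, val_loss):
        """Return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss, self.best_epoch, self.wait = val_loss, epoch, 0
            return False
        self.wait += 1
        return self.wait >= self.patience

val_losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.73, 0.74]   # hypothetical values
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, val_loss in enumerate(val_losses):
    if stopper.step(epoch, val_loss):
        stopped_at = epoch
        break
print(stopped_at, stopper.best_epoch, stopper.best_loss)
```

Here the loss last improves at epoch 2, so with patience=3 training stops at epoch 5 and the weights from epoch 2 would be restored.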

3.2 Dropout (Srivastava et al., 2014)


Dropout randomly deactivates neurons during training with probability (1-p). This forces the network
to learn redundant representations — no single neuron can rely on its neighbours, so every neuron
must be independently useful.

Dropout mask r_j ~ Bernoulli(p) — r_j is 1 with prob p, 0 otherwise

Masked activation a_tilde^l = r^l elementwise_product a^l

Test time scaling W_test = p * W_train (multiply weights by keep-prob p)

Key Properties:
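• Ensemble interpretation: Dropout samples from up to 2^N weight-sharing sub-networks (N = number
of droppable units); test-time weight scaling approximates the geometric mean of their predictions
• Typical keep probability: p=0.5 for hidden layers, p=0.8 for input layers
• Works best for fully connected layers; less beneficial for BatchNorm-equipped conv layers
• Inverted dropout (modern): divide activations by p during training, so no scaling is needed at
test time
• Take care when combining dropout and batch normalization in the same block: dropout changes the
activation variance between training and test, which can conflict with BatchNorm's running statistics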
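
Inverted dropout takes only a few lines. This sketch follows the notes' convention that p is the keep probability; the function name and the all-ones test activations are assumptions for the example.

```python
import numpy as np

def dropout_forward(a, keep_prob, training, rng):
    """Inverted dropout: mask and rescale during training, identity at test time."""
    if not training:
        return a
    mask = rng.random(a.shape) < keep_prob       # r_j ~ Bernoulli(keep_prob)
    return a * mask / keep_prob                  # rescale so the expected value is unchanged

rng = np.random.default_rng(0)
a = np.ones((1000, 100))
out_train = dropout_forward(a, keep_prob=0.5, training=True, rng=rng)
out_test = dropout_forward(a, keep_prob=0.5, training=False, rng=rng)
print(out_train.mean(), out_test.mean())
```

Note that PyTorch's nn.Dropout(p) takes the probability of dropping a unit, i.e. one minus the keep probability used here.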

3.3 Batch Normalization (Ioffe & Szegedy, 2015)


Batch Normalization normalises the input to each layer across the mini-batch. The original paper
motivates this as reducing internal covariate shift (the changing distribution of layer inputs as
weights update); whatever the exact mechanism, it stabilises training in practice. It is one of the
most impactful innovations in modern DL.

BatchNorm Algorithm (per mini-batch B):

Step 1 (batch mean): mu_B = (1/m) * SUM x_i (for i in batch B)

Step 2 (batch variance): sigma^2_B = (1/m) * SUM (x_i - mu_B)^2

Step 3 (normalise): x_hat_i = (x_i - mu_B) / sqrt(sigma^2_B + epsilon)

Step 4 (scale & shift): y_i = gamma * x_hat_i + beta (gamma, beta are learnable)

Why gamma and beta (scale & shift)?


After normalisation, x_hat has zero mean and unit variance. But this might destroy useful learned
representations. gamma and beta are learnable parameters that allow the network to undo the
normalisation if needed — giving it full flexibility.

Benefits of BatchNorm:
• Allows higher learning rates → faster training
• Reduces dependence on careful weight initialisation
• Acts as a regularizer → often reduces need for Dropout
• At test time: uses running mean/variance estimated during training (not batch statistics)
• Placement: Typically after linear/conv layer, before activation (debated in literature)
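
The four training-time BatchNorm steps can be sketched for a batch of feature vectors as follows. The input distribution and shapes are assumptions, and the running mean/variance used at test time is deliberately left out to keep the example short.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Training-mode BatchNorm for x of shape (batch, features)."""
    mu = x.mean(axis=0)                          # Step 1: batch mean per feature
    var = x.var(axis=0)                          # Step 2: batch variance (1/m form)
    x_hat = (x - mu) / np.sqrt(var + eps)        # Step 3: normalise
    return gamma * x_hat + beta                  # Step 4: scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))
y_plain = batchnorm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
y_shifted = batchnorm_forward(x, gamma=2.0 * np.ones(4), beta=np.ones(4))
print(y_plain.mean(axis=0), y_plain.std(axis=0))
```

With gamma = 1 and beta = 0 each feature comes out with mean about 0 and standard deviation about 1; other values of gamma and beta move the output to standard deviation gamma and mean beta.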

3.4 & 3.5 Instance, Group Normalization — Comparison


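BatchNorm degrades when batch sizes are very small (e.g., object detection with batch=1), because
the batch mean and variance become noisy estimates. Alternative normalization strategies compute
statistics over different axes of an (N, C, H, W) activation tensor: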

Normalization | Normalises Over | Best For | Batch Size Sensitivity
Batch Norm | Batch + spatial (N, H, W) | Classification, large batches | High (fails at B=1)
Layer Norm | All channels + spatial (C, H, W) | NLP, Transformers, RNNs | None
Instance Norm | Spatial only per channel (H, W) | Style transfer, image generation | None
Group Norm | Channels in groups + spatial | Object detection, small batch | None (B=1 works)
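
The four schemes differ only in the axes over which the mean and variance are taken. A small NumPy sketch for an (N, C, H, W) tensor follows; the tensor shape and the group count G = 2 are assumptions, and the learnable gamma/beta are omitted.

```python
import numpy as np

def normalise(x, axes, eps=1e-5):
    """Subtract the mean and divide by the standard deviation over the given axes."""
    mu = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
N, C, H, W, G = 4, 8, 5, 5, 2
x = rng.normal(loc=3.0, size=(N, C, H, W))

batch_norm = normalise(x, (0, 2, 3))        # per channel, over batch + spatial
layer_norm = normalise(x, (1, 2, 3))        # per sample, over channels + spatial
instance_norm = normalise(x, (2, 3))        # per sample and channel, over spatial
grouped = x.reshape(N, G, C // G, H, W)     # split channels into G groups
group_norm = normalise(grouped, (2, 3, 4)).reshape(N, C, H, W)
print(batch_norm.shape, group_norm.shape)
```

Only the batch-norm line includes axis 0, which is why the other three behave identically at any batch size.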
SECTION 4: OVERFITTING & SMALL DATASET STRATEGIES

4.1 Overfitting — Bias-Variance Tradeoff


Overfitting occurs when a model learns the training data too well — including its noise — and fails to
generalise to new data. It is the central challenge in machine learning.

Decomposition Expected Error = Bias^2 + Variance + Irreducible Noise

Property | High Bias (Underfitting) | High Variance (Overfitting) | Ideal
Train error | High | Low | Low
Val error | High | High | Low
Train-Val gap | Small | Large | Small
Model | Too simple | Too complex | Just right
Fix | More capacity, features | Regularize, more data | N/A

4.2 Data Augmentation


Data augmentation artificially increases the effective dataset size by applying label-preserving
transformations to existing samples.

Standard Image Augmentations:


• Geometric: Random crop, horizontal/vertical flip, rotation (±15°), translation, zoom
• Colour: Brightness, contrast, saturation jitter; random grayscale; colour channel dropout
• Noise: Gaussian noise, Cutout (random rectangular erasure), GridDistortion
• Advanced: MixUp (linearly interpolate two images + labels), CutMix (paste regions between
images)
• Medical-specific: Elastic deformation, intensity normalisation, random k-space masking (MRI)
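
Two of the augmentations above, a random horizontal flip and MixUp, can be sketched in NumPy. The tiny arrays stand in for images, and drawing the MixUp weight from Beta(0.4, 0.4) is an assumed setting rather than something specified in these notes.

```python
import numpy as np

def random_horizontal_flip(image, rng, prob=0.5):
    """Flip an (H, W) or (H, W, C) image left-right with probability prob."""
    return image[:, ::-1] if rng.random() < prob else image

def mixup(x1, y1, x2, y2, lam):
    """MixUp: blend two images and their one-hot labels with the same weight lam."""
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

rng = np.random.default_rng(0)
img_a, img_b = np.ones((4, 4)), np.zeros((4, 4))
label_a, label_b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
lam = rng.beta(0.4, 0.4)                     # assumed Beta(alpha, alpha) with alpha = 0.4
x_mix, y_mix = mixup(img_a, label_a, img_b, label_b, lam)
flipped = random_horizontal_flip(np.arange(6).reshape(2, 3), rng, prob=1.0)
print(lam, y_mix, flipped)
```

Whether a flip is label-preserving depends on the task; in medical images where left/right laterality matters it may not be.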

Class Imbalance Solutions:


• Oversampling minority class (duplicate/augment minority samples)
• Undersampling majority class
• SMOTE: Synthetic Minority Oversampling — interpolate between existing minority samples
• Weighted loss function: Assign higher loss to misclassified minority class samples
• class_weight parameter in Keras/sklearn handles this automatically

4.3 Redesigning the Loss Function


When data is limited or imbalanced, modifying the loss function can be more powerful than collecting
more data. Here are key alternatives:
Loss Function Formula Best For

Cross Entropy -SUM y_i * log(p_i) Standard classification

Focal Loss -(1-p_t)^gamma * log(p_t) Class imbalance; object detection

Dice Loss 1 - 2|X∩Y| / (|X|+|Y|) Medical image segmentation

Label Smoothing CE -SUM [(1-eps)*y_i + eps/K] * log(p_i) Overconfidence; small data

Triplet Loss max(d(a,p)-d(a,n)+margin, 0) Metric learning; few-shot

Contrastive Loss y*d^2 + (1-y)*max(m-d,0)^2 Similarity learning
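
Two of the losses in the table, Focal Loss (binary form, without the optional class-weighting factor) and Dice Loss, can be sketched in NumPy as follows. The probabilities, labels and masks are made-up examples, and the small eps terms are added here only for numerical safety.

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, eps=1e-7):
    """Binary focal loss. p: predicted probability of class 1, y: labels in {0, 1}."""
    p = np.clip(p, eps, 1 - eps)
    p_t = np.where(y == 1, p, 1 - p)             # probability assigned to the true class
    return np.mean(-((1 - p_t) ** gamma) * np.log(p_t))

def dice_loss(pred, target, eps=1e-7):
    """1 - 2|X n Y| / (|X| + |Y|) for (soft) binary masks."""
    intersection = np.sum(pred * target)
    return 1 - (2 * intersection + eps) / (np.sum(pred) + np.sum(target) + eps)

p = np.array([0.9, 0.2, 0.7])
y = np.array([1, 0, 1])
cross_entropy = -np.mean(np.log(np.where(y == 1, p, 1 - p)))
mask = np.array([[1.0, 0.0], [1.0, 1.0]])
print(focal_loss(p, y), cross_entropy, dice_loss(mask, mask), dice_loss(mask, 1 - mask))
```

Setting gamma = 0 recovers ordinary cross-entropy; larger gamma shrinks the contribution of examples that are already classified confidently.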

4.4 Generating Synthetic Data


• GANs (Generative Adversarial Networks): Generate realistic synthetic images for rare conditions
• VAEs (Variational Autoencoders): Sample from learned latent distribution
• Simulation: Physics-based simulators for surgical robots and scanners, or rendering pipelines such as BlenderProc
• Transfer Learning from related domains (e.g., natural images → medical images via fine-tuning)
• Domain Randomisation: Vary rendering parameters widely so real data falls within distribution
SECTION 5: ATTENTION MECHANISM

The attention mechanism is the foundational innovation behind Transformers and the models built on
them, such as GPT and BERT. Originally proposed for machine translation (Bahdanau et al., 2015), it
solves the bottleneck of encoding long sequences into a fixed-size vector.

5.1 Query, Key, Value (Q, K, V) Framework


The QKV framework is a beautiful abstraction. Think of it like a search engine:

Component | Analogy | Mathematical Role
Query (Q) | Search query you type into Google | What information am I looking for?
Key (K) | Keywords/tags on each webpage | What information does each position contain?
Value (V) | Actual content of each webpage | What information gets retrieved if selected?

In a self-attention layer, Q, K, V are all derived from the same input X via learned linear projections: Q
= X * W_Q, K = X * W_K, V = X * W_V. The weight matrices W_Q and W_K (shape d_model x d_k) and W_V
(shape d_model x d_v) are the learnable parameters; in practice d_v is usually set equal to d_k.

5.2 Scaled Dot-Product Attention


Scaled Dot-Product Attention (Vaswani et al., 2017 — 'Attention Is All You Need') is the most
computationally efficient attention variant and the basis of all Transformer models.

Score function score(Q, K) = Q * K^T / sqrt(d_k)

Attention weights alpha = softmax(Q * K^T / sqrt(d_k))

Context vector Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V

Why scale by sqrt(d_k)?


For large d_k, the dot products Q·K grow large in magnitude, pushing the softmax into regions of very
small gradients (saturation). Dividing by sqrt(d_k) keeps the variance of the dot products at 1
regardless of dimensionality, preventing gradient vanishing through the softmax. Mathematically: if Q
and K are vectors of i.i.d. components with mean 0, variance 1, their dot product has variance d_k —
dividing by sqrt(d_k) gives variance 1.

Step-by-Step Computation:
1. Linear projections: Q=XW_Q, K=XW_K (shapes: [seq_len, d_k]), V=XW_V (shape: [seq_len, d_v])
2. Compute scores: S = QK^T / sqrt(d_k) (shape: [seq_len, seq_len]; every position attends to
every position)
3. Optional: Apply mask to S before the softmax (for the decoder, set the scores of future positions
to -inf so they receive zero weight)
4. Apply softmax: A = softmax(S) (rows sum to 1, giving an attention weight distribution)
5. Weighted sum: Output = A * V (shape: [seq_len, d_v])
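
The computation can be sketched in NumPy as follows. The sequence length, dimensions and random projection matrices are assumptions for the example; the optional mask is applied to the scores before the softmax, and a large negative number stands in for -inf.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)      # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v); mask: boolean, True = may attend."""
    d_k = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d_k)                   # scores, shape (seq_len, seq_len)
    if mask is not None:
        S = np.where(mask, S, -1e9)              # masked scores get ~zero weight
    A = softmax(S, axis=-1)                      # each row sums to 1
    return A @ V, A

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 8, 4
X = rng.normal(size=(seq_len, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out, A = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)

causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))   # decoder-style mask
out_c, A_c = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V, mask=causal)
print(out.shape, A.sum(axis=1))
```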
5.3 Additive (Bahdanau) Attention
Bahdanau et al. (2015) introduced attention in the context of neural machine translation. Their
additive attention uses an MLP to compute alignment scores between query and key, making it more
flexible than the dot-product version but computationally heavier.

Score (additive) score(q, k) = v_a^T * tanh(W_q * q + W_k * k)

Attention weights alpha_i = softmax(score(q, k_i))

Context vector c = SUM alpha_i * v_i

Mechanism Explained:
W_q and W_k project q and k into the same space; they are added (not multiplied), then passed
through tanh. The vector v_a (learnable) converts the hidden representation to a scalar score. This
extra MLP gives additive attention more expressiveness than dot-product attention for small d_k, but
requires O(n^2 * d) operations, making it slower for long sequences.
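
For comparison, additive scoring for a single query can be sketched like this. All dimensions and the random parameter values are assumptions; in a trained model W_q, W_k and v_a would be learned.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def additive_attention(q, K, V, W_q, W_k, v_a):
    """q: (d_q,), K: (n, d_key), V: (n, d_v), W_q: (d_h, d_q), W_k: (d_h, d_key), v_a: (d_h,)."""
    hidden = np.tanh(W_q @ q + K @ W_k.T)        # (n, d_h): projections added, then tanh
    scores = hidden @ v_a                        # (n,): one scalar score per key
    alpha = softmax(scores)                      # attention weights over the n keys
    return alpha @ V, alpha                      # context vector c = SUM alpha_i * v_i

rng = np.random.default_rng(0)
n, d_q, d_key, d_v, d_h = 6, 4, 5, 3, 8
q = rng.normal(size=d_q)
K = rng.normal(size=(n, d_key))
V = rng.normal(size=(n, d_v))
W_q = rng.normal(size=(d_h, d_q))
W_k = rng.normal(size=(d_h, d_key))
v_a = rng.normal(size=d_h)
context, alpha = additive_attention(q, K, V, W_q, W_k, v_a)
print(context.shape, alpha.sum())
```

Unlike the dot-product score, the query and key may have different dimensions here, since each is projected into the shared d_h-dimensional space first.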

Dot-Product vs Additive Attention — Complete Comparison:

Property | Dot-Product (Vaswani 2017) | Additive (Bahdanau 2015)
Score function | Q * K^T / sqrt(d_k) | v^T * tanh(W_q*q + W_k*k)
Complexity | O(n^2 * d) | O(n^2 * d), but larger constant
Speed | Faster (matrix multiply) | Slower (MLP per pair)
For large d_k | Needs scaling factor | Naturally bounded
Expressiveness | Lower (linear) | Higher (non-linear via tanh)
Used in | Transformers, BERT, GPT | Seq2Seq NMT, early NLP
Parameter count | Just projection matrices | Additional W_q, W_k, v_a
Masking support | Yes (add -inf before softmax) | Yes

5.4 Multi-Head Attention


Instead of performing a single attention function, Multi-Head Attention (MHA) runs h attention heads
in parallel, each learning to attend to different aspects of the input (e.g., one head for syntax, another
for semantics).

Each head head_i = Attention(Q*W_Q^i, K*W_K^i, V*W_V^i)

Multi-head output MHA(Q,K,V) = Concat(head_1,...,head_h) * W_O

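W_O (d_model x d_model) is an output projection matrix. Typically h=8 heads with d_k = d_v =
d_model/h = 64 for d_model=512. Because each head works in a d_model/h-dimensional subspace, the
total parameter count and compute are similar to single-head attention with full
d_model-dimensional projections: we just split and recombine.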
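
A compact NumPy sketch of multi-head self-attention follows. It uses the common trick of one d_model x d_model projection per Q, K, V that is then split into h heads, which is equivalent to h separate d_model x d_k projections; the dimensions and random weights are assumptions for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, W_Q, W_K, W_V, W_O, h):
    """X: (n, d_model); W_Q, W_K, W_V, W_O: (d_model, d_model); h divides d_model."""
    n, d_model = X.shape
    d_k = d_model // h

    def split_heads(M):                           # (n, d_model) -> (h, n, d_k)
        return M.reshape(n, h, d_k).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_Q), split_heads(X @ W_K), split_heads(X @ W_V)
    S = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (h, n, n): scores per head
    A = softmax(S, axis=-1)
    heads = A @ V                                 # (h, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)   # Concat(head_1..head_h)
    return concat @ W_O, A

rng = np.random.default_rng(0)
n, d_model, h = 10, 64, 4
X = rng.normal(size=(n, d_model))
W_Q, W_K, W_V, W_O = (rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
                      for _ in range(4))
out, A = multi_head_self_attention(X, W_Q, W_K, W_V, W_O, h)
print(out.shape, A.shape)
```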
5.5 Self-Attention
In self-attention, Q, K, V all come from the same sequence. Each position attends to all other
positions in the sequence — enabling the model to build rich, long-range representations without any
recurrence (unlike RNNs).

Self-Attention vs Cross-Attention

Self-Attention: Q, K, V from the SAME sequence. Used in the Transformer encoder and, with a causal
mask, in the decoder.
Cross-Attention: Q from decoder, K and V from encoder output. Used in the Transformer decoder; this
is how the decoder 'looks at' the source sentence while generating translations.
SECTION 6: THEORY PRACTICE QUESTIONS — UNIVERSITY EXAM

How to Use This Section

These questions are modelled on SRM IST University exam patterns at Bloom's Levels 2–5
(Understand, Apply, Analyse, Evaluate). Questions marked [12M] require a structured essay with
diagrams/derivations. [8M] require detailed explanations. [2M] = short answers.

Q1. [12M] [Unit 2] [BL4-Analyse] [Optimization]
Derive the complete Adam optimizer update rule from first principles. Explain the role of the first
moment (m_t), second moment (v_t), and bias correction terms. Compare Adam with vanilla SGD and
RMSProp in terms of convergence behaviour and hyperparameter sensitivity. When would you choose
SGD over Adam?

Q2. [12M] [Unit 2] [BL3-Apply] [Backprop]
Derive the four fundamental equations of backpropagation using the chain rule. Show how these
equations are applied in a 3-layer neural network with sigmoid activation to update weights. Discuss
the vanishing gradient problem and state at least three solutions with mathematical justification.

Q3. [12M] [Unit 2] [BL4-Analyse] [BatchNorm]
Explain Batch Normalization with its complete algorithm (all 4 steps). Derive why learnable
parameters gamma and beta are necessary. Compare Batch Normalization with Instance Normalization
and Group Normalization — explain when each should be used in a medical imaging pipeline (e.g., MRI
segmentation).

Q4. [12M] [Unit 2] [BL4-Analyse] [Attention]
Illustrate Scaled Dot-Product Attention with complete mathematical derivation. Explain why scaling
by 1/sqrt(d_k) is mathematically necessary. Compare it with Additive (Bahdanau) Attention in terms
of computational complexity, expressiveness, and suitability for medical NLP applications.

Q5. [12M] [Unit 2] [BL5-Evaluate] [Overfitting]
Explain the bias-variance tradeoff in the context of overfitting in deep neural networks. Derive the
expected error decomposition. Describe five strategies to combat overfitting when you have a small
medical dataset (e.g., 500 chest X-rays). Justify each strategy mathematically or empirically.

Q6. [8M] [Unit 2] [BL3-Apply] [GD Variants]
Derive and compare Batch Gradient Descent, Stochastic Gradient Descent, and Mini-Batch Gradient
Descent. Prove that mini-batch GD provides an unbiased estimate of the full batch gradient. Discuss
the effect of batch size on: (a) gradient variance, (b) convergence speed, (c) generalisation
performance.

Q7. [8M] [Unit 2] [BL3-Apply] [Dropout]
Explain Dropout regularization with its mathematical formulation. Describe the ensemble
interpretation of Dropout and prove that the test-time weight scaling (W_test = p * W_train) is
needed. How does inverted Dropout differ, and why is it preferred in modern frameworks like PyTorch?

Q8. [8M] [Unit 2] [BL4-Analyse] [Early Stopping]
Explain Early Stopping as a regularization technique. Under what conditions is early stopping
equivalent to L2 regularization? Describe the role of patience and validation loss monitoring.
Design a complete training protocol for a deep learning model on a 1000-sample ECG dataset using
early stopping.

Q9. [8M] [Unit 2] [BL4-Analyse] [Multi-Head]
Explain Multi-Head Attention and Self-Attention mechanisms. Derive the output of Multi-Head
Attention with h=2 heads for a sequence of length 4 with d_model=4. Why is self-attention superior
to RNNs for capturing long-range dependencies? Give a medical NLP example (e.g., clinical note
understanding).

Q10. [12M] [Unit 2] [BL5-Evaluate] [Pipeline Design]
Critically compare the following optimizers for training a deep CNN on a medical image segmentation
task with 2000 samples and severe class imbalance: (a) SGD with momentum, (b) Adam, (c) RMSProp.
Discuss choice of loss function, learning rate schedule, and regularization strategy. Justify your
final recommended pipeline.
SECTION 7: PRACTICAL / CODING QUESTIONS

Format

These questions test implementation ability and project-level thinking. Each requires working code
(PyTorch or Keras/TensorFlow), analysis of results, and discussion of design choices. They mirror the
kind of questions asked in company coding rounds (Google, Microsoft) and research paper
reproductions.

Q1. [Coding] [NumPy] [Optimizers] [Analysis]
Implement Mini-Batch SGD, Adam, and RMSProp from scratch in NumPy (no deep learning libraries).
Apply all three to minimise f(x,y) = x^2 + 10y^2 from starting point (5, 5). Plot the convergence
curves. Report final loss after 200 steps. Which converges fastest? Explain why based on the
mathematical properties of each optimizer.

Q2. [PyTorch] [MLP] [Regularization] [MNIST]
Using PyTorch, implement a 3-layer MLP on the MNIST dataset. Train three versions: (a) no
regularization, (b) Dropout(p=0.5) on hidden layers, (c) Batch Normalization after each linear
layer. Compare training loss, validation accuracy, and training time. Plot learning curves. Which
generalises best and why?

Q3. [NumPy] [Backprop] [Gradient Check] [Coding]
Implement Backpropagation from scratch in NumPy for a 2-layer network with sigmoid activations.
Verify your implementation with gradient checking (numerical gradient vs analytical gradient,
tolerance 1e-5). Train on a toy XOR dataset and confirm convergence. Report the forward pass, loss,
and weight updates for epoch 1.

Q4. [PyTorch] [Attention] [Transformer] [Visualisation]
Implement Scaled Dot-Product Attention and Multi-Head Attention in PyTorch from scratch (without
using [Link]). Test on a random input tensor of shape [batch=2, seq_len=10, d_model=64] with h=4
heads. Verify output shape is [2, 10, 64]. Visualise the attention weight matrix for one head as a
heatmap.

Q5. [PyTorch] [Augmentation] [Medical Imaging] [ResNet]
Load a public medical image dataset (e.g., Chest X-Ray14 or ISIC skin lesion). Implement a data
augmentation pipeline using [Link] with at least 6 augmentations. Train a ResNet-18: (a) without
augmentation, (b) with augmentation. Compare validation AUC. Visualise 5 augmented samples before
training.

Q6. [PyTorch] [Gradient Analysis] [BatchNorm] [Visualization]
Demonstrate the vanishing gradient problem: train a 10-layer network with sigmoid activation and
plot the gradient magnitudes at each layer during backpropagation. Then switch to ReLU activation
and repeat. Explain the difference quantitatively. Finally, add Batch Normalization and show how it
resolves the issue.

Q7. [PyTorch] [Early Stopping] [Custom Class] [Training Loop]
Implement early stopping from scratch in PyTorch (no callbacks). Define a custom EarlyStopping class
with parameters: patience=10, min_delta=0.001, restore_best_weights=True. Train a CNN on a small
dataset (CIFAR-10 subset of 2000 samples). Show the training curve with the stopping point marked
clearly.

Q8. [PyTorch] [Focal Loss] [Class Imbalance] [Medical]
Build a complete training pipeline in PyTorch that handles class imbalance using Focal Loss
(implement it from scratch). Use a WeightedRandomSampler to oversample minority classes. Test on an
imbalanced dataset (e.g., 90% negative, 10% positive chest X-ray). Compare standard CE loss vs Focal
Loss in terms of sensitivity/specificity.

Q9. [PyTorch] [Normalization] [BatchNorm] [Experiment]
Implement and compare three normalization strategies on the same CNN architecture: BatchNorm,
LayerNorm, and GroupNorm (G=8). Train on CIFAR-10 with batch sizes of [1, 4, 32, 128]. Plot
validation accuracy vs batch size for each normalization type. Explain the observed trends from
first principles.

Q10. [Project] [Medical MRI] [End-to-End] [Full Pipeline]
Real-World Project: Design a complete training system for classifying 5 types of brain tumours from
MRI scans (1000 images, highly imbalanced). Your solution must include: (a) data augmentation
pipeline (elastic deformation + colour jitter), (b) Adam + cosine annealing LR scheduler, (c) focal
loss with class weights, (d) early stopping with patience=15, (e) batch norm in each conv block.
Report validation macro-F1 score and confusion matrix. Justify every design choice.
SECTION 8: COMPETITIVE EXAM & RESEARCH QUICK REFERENCE

Key Formulas — Flashcard Style


Adam W -= eta * m_hat / (sqrt(v_hat) + eps); beta1=0.9, beta2=0.999, eps=1e-8

RMSProp W -= (eta/sqrt(E[g^2]+eps)) * g; default rho=0.9

BatchNorm y = gamma*(x-mu_B)/sqrt(var_B+eps) + beta

Scaled Dot-Attn Attention(Q,K,V) = softmax(QK^T/sqrt(d_k)) * V

Additive Attn score(q,k) = v^T * tanh(W_q*q + W_k*k)

Dropout (train) a_tilde = r elementwise a, r~Bernoulli(p)

Dropout (test) W_test = p * W_train

Focal Loss FL(p_t) = -(1-p_t)^gamma * log(p_t), gamma=2 default

Bias-Variance E[L] = Bias^2 + Variance + sigma_noise^2

BP Eq 1 delta^L = nabla_a(C) elementwise sigma'(z^L)

BP Eq 2 delta^l = ((W^{l+1})^T delta^{l+1}) elementwise sigma'(z^l)

BP Eq 3/4 dC/db = delta; dC/dW_jk = a^{l-1}_k * delta^l_j

Must-Know Facts for GATE / GRE / Research Interviews


01. Adam was proposed by Diederik Kingma & Jimmy Ba at ICLR 2015 (arXiv:1412.6980)
02. Batch Normalization: Ioffe & Szegedy, ICML 2015 — 'Accelerating Deep Network Training'
03. Dropout: Srivastava et al., JMLR 2014 — keeps prob p, test scaling = p
04. 'Attention Is All You Need': Vaswani et al., NeurIPS 2017 — introduced the Transformer
05. Bahdanau Attention: 'Neural Machine Translation by Jointly Learning to Align and Translate',
2015
06. Adam's bias correction is needed because m_0=0, v_0=0 causes underestimation at t=1,2,...
07. SGD + Momentum + LR decay often outperforms Adam in final accuracy on ImageNet
classification
08. GroupNorm (Wu & He, 2018) is superior to BatchNorm when batch size < 8 (e.g., object
detection)
09. Instance Norm (Ulyanov, 2017) is the standard for neural style transfer
10. Multi-Head Attention with h=8 heads is standard; each head uses d_k = d_model/h
11. Gradient Clipping: clip grad norm to value (e.g., 1.0) — standard for RNNs and Transformers
12. Focal Loss: Lin et al., ICCV 2017 — introduced for RetinaNet; gamma=2 is standard default
13. The scaling factor 1/sqrt(d_k) was introduced by Vaswani et al., motivated by a variance
argument, to prevent softmax saturation for large d_k
14. [Link] Chapter 11 has complete, runnable PyTorch implementations of all attention variants
15. BERT and GPT use self-attention as their core operation (ELMo, by contrast, is built on bidirectional LSTMs); Unit 2 is their foundation

Reference Book Map — Recap


Book | Best Topics in Unit 2 | Access
Goodfellow et al. — Deep Learning (2016) | GD, Adam, BatchNorm, Dropout, Overfitting — Chapters 7 & 8 | Free: [Link]
Nielsen — Neural Networks & DL | Backprop 4 equations, GD intuition — Chapters 1–3 | Free: neuralnetworksanddeeplearning.c
[Link] — Zhang et al. | Attention (Ch 11), Optimization (Ch 12), BatchNorm (Ch 8.5) | Free: [Link]
Aggarwal — Neural Networks (2018) | Dropout, BatchNorm, Regularization — Chapter 3–4 | Prescribed textbook
Calin — DL Architectures (2020) | Mathematical proofs: GD convergence, BP derivations | Prescribed textbook
Vaswani et al. (2017) | Original Scaled Dot-Product & Multi-Head Attention | arXiv:1706.03762 (free)
Bahdanau et al. (2015) | Original Additive Attention — Align & Translate | arXiv:1409.0473 (free)

All the best, Geeta! This document covers every topic in Unit 2 of 21BME381T at the depth
needed for university exams, real-world deep learning projects, and competitive exams like
GATE. Work through the theory questions first, then implement the practical ones. The
Goodfellow Ch. 7–8 + [Link] Ch. 11 combination will give you complete mastery. ■
