Unit2 DeepLearning ComprehensiveNotes

The document provides comprehensive notes on training deep neural networks, covering optimization techniques, backpropagation, effective training methods, and strategies to handle overfitting and small datasets. It includes detailed explanations of various algorithms such as Gradient Descent, Adam, and RMSProp, as well as training techniques like early stopping and dropout. Additionally, it addresses the importance of attention mechanisms and data strategies in deep learning applications in medicine.

Uploaded by

pankhurimithiya

21BME381T

Deep Learning Techniques in Medicine

UNIT 2: COMPREHENSIVE NOTES

Training Deep Neural Networks

Optimization | Regularization | Attention Mechanism | Data Strategies

Course: 21BME381T, Deep Learning Techniques in Medicine

Unit: Unit 2, Training Deep Neural Networks (9 Hours)

Prepared for: Geeta, [Link] Biomedical Engineering, SRM IST

Coverage: University Exam + Real-World Projects + Competitive Exams

References: Goodfellow et al. | Nielsen | [Link] | Aggarwal | Calin

Table of Contents

1. Optimization Techniques
1.1 Gradient Descent — Full Derivation & Variants
1.2 The Delta Rule & Learning Rates
1.3 Batch, Stochastic & Mini-Batch Optimization
1.4 Adaptive Moment Estimation (Adam)
1.5 RMSProp

2. Backpropagation Algorithm
2.1 The Four Fundamental Equations
2.2 Complete Chain Rule Derivation

3. Effective Training Techniques

3.1 Early Stopping
3.2 Dropout
3.3 Batch Normalization
3.4 Instance Normalization
3.5 Group Normalization — Comparison Table

4. Handling Overfitting & Small Datasets

4.1 Overfitting — Bias-Variance Tradeoff
4.2 Data Augmentation
4.3 Redesigning the Loss Function
4.4 Generating Synthetic Data

5. Attention Mechanism
5.1 Query, Key, Value Framework
5.2 Dot-Product (Scaled) Attention
5.3 Additive (Bahdanau) Attention
5.4 Multi-Head Attention
5.5 Self-Attention

6. 10 Theory Practice Questions (University Exam)

7. 10 Practical / Coding Questions
8. Competitive Exam & Research Quick Reference
SECTION 1: OPTIMIZATION TECHNIQUES

Why Optimization?

A neural network learns by minimizing a Loss Function L(W) over millions of parameters W. Optimization
algorithms update these weights iteratively so the network's predictions get closer to the true labels. The
choice of optimizer is one of the most critical engineering decisions in deep learning — it directly affects
training speed, stability, and final accuracy.

1.1 Gradient Descent — Full Derivation & Variants

Gradient Descent (GD) is the backbone of all neural network training. The idea is simple: find the
direction in weight space that decreases the loss most steeply, and take a small step in that direction.
Core Intuition: Imagine the loss surface as a hilly landscape. You want to walk downhill (minimise
loss). Gradient gives the uphill direction, so you move opposite to it.

Mathematical Formulation:

Weight Update Rule W_(t+1) = W_t - eta * gradient_L(W_t)

Gradient w.r.t. W gradient_L(W) = (1/N) * SUM[ dL(y_i, y_hat_i) / dW ]

Where eta is the learning rate, often regarded as the single most important hyperparameter. N is the
number of samples. L is the loss (e.g., MSE or cross-entropy).

Learning Rate Effects:

• eta too large → Loss oscillates or diverges (overshoots the minimum)
• eta too small → Training is extremely slow; might get stuck in local minima
• eta just right → Smooth convergence to a good minimum
• Solution: Learning Rate Schedulers (step decay, cosine annealing, warm restarts)
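
The effect of the learning rate can be seen on a toy problem. The following is a minimal sketch, assuming the one-dimensional loss L(w) = w^2 (chosen only for illustration); the function names are hypothetical and not part of the original notes.

```python
def gradient_descent(grad, w0, eta, steps):
    """Apply W_(t+1) = W_t - eta * grad(W_t) for a fixed number of steps."""
    w = w0
    history = [w]
    for _ in range(steps):
        w = w - eta * grad(w)
        history.append(w)
    return history


def grad_quadratic(w):
    """Gradient of the illustrative loss L(w) = w**2."""
    return 2.0 * w


# Same start point, three learning rates.
small = gradient_descent(grad_quadratic, w0=5.0, eta=0.01, steps=50)
good = gradient_descent(grad_quadratic, w0=5.0, eta=0.3, steps=50)
large = gradient_descent(grad_quadratic, w0=5.0, eta=1.1, steps=50)

print(f"eta=0.01 -> |w| = {abs(small[-1]):.4f}")
print(f"eta=0.3  -> |w| = {abs(good[-1]):.2e}")
print(f"eta=1.1  -> |w| = {abs(large[-1]):.2e}")
```

For this loss each step multiplies w by (1 - 2*eta), so eta = 0.01 shrinks w slowly, eta = 0.3 converges quickly, and eta = 1.1 makes |w| grow at every step.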

1.2 The Delta Rule & Learning Rates

The Delta Rule (Widrow-Hoff Rule) is the precursor to backpropagation, applicable to single-layer
linear networks. It forms the conceptual foundation for gradient-based weight updates.

Delta Rule delta_W = eta * (target - output) * input

This says: adjust weight by how wrong we were (target - output), scaled by the input and learning
rate. For multi-layer networks, this generalises into backpropagation.
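
As an illustration, the delta rule can train a single linear neuron on noise-free data. This is a sketch under stated assumptions: the target rule y = 2*x1 - 3*x2 + 1, the learning rate, and the helper name `train_delta_rule` are all hypothetical choices made for the example.

```python
import random


def train_delta_rule(samples, eta=0.1, epochs=200, seed=0):
    """Train a single linear neuron y = w.x + b with the delta rule."""
    rng = random.Random(seed)
    n_inputs = len(samples[0][0])
    w = [0.0] * n_inputs
    b = 0.0
    data = list(samples)
    for _ in range(epochs):
        rng.shuffle(data)
        for x, target in data:
            output = sum(wi * xi for wi, xi in zip(w, x)) + b
            error = target - output
            # delta_W = eta * (target - output) * input
            w = [wi + eta * error * xi for wi, xi in zip(w, x)]
            # The bias is treated as a weight on a constant input of 1.
            b += eta * error
    return w, b


# Toy data generated from the assumed rule y = 2*x1 - 3*x2 + 1.
samples = [((x1, x2), 2 * x1 - 3 * x2 + 1)
           for x1 in (0.0, 0.5, 1.0) for x2 in (0.0, 0.5, 1.0)]
w, b = train_delta_rule(samples)
print([round(v, 3) for v in w], round(b, 3))
```

Because the data are exactly linear, the learned weights approach the generating coefficients.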

Learning Rate Scheduling Strategies:

Strategy | Formula / Idea | When to Use
Step Decay | eta = eta_0 * gamma^floor(epoch/step) | Simple; fixed milestones
Exponential Decay | eta = eta_0 * e^(-lambda*t) | Smooth, gradual reduction
Cosine Annealing | eta = eta_min + 0.5(eta_max-eta_min)(1+cos(pi*t/T)) | State-of-the-art; avoids sharp drops
Warm Restarts | Cosine + periodic reset | Escaping local minima
ReduceLROnPlateau | Reduce eta when val_loss stops improving | Practical default choice
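
The closed-form schedules in the table above translate directly into code. A minimal sketch (the function names and the sample values are illustrative assumptions):

```python
import math


def step_decay(eta0, gamma, step, epoch):
    """eta = eta_0 * gamma^floor(epoch / step)."""
    return eta0 * gamma ** (epoch // step)


def exponential_decay(eta0, lam, t):
    """eta = eta_0 * e^(-lambda * t)."""
    return eta0 * math.exp(-lam * t)


def cosine_annealing(eta_min, eta_max, t, T):
    """eta = eta_min + 0.5 * (eta_max - eta_min) * (1 + cos(pi * t / T))."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t / T))


for epoch in (0, 10, 25, 50):
    print(epoch,
          step_decay(0.1, 0.5, 10, epoch),
          exponential_decay(0.1, 0.05, epoch),
          cosine_annealing(0.0, 0.1, epoch, 50))
```

Cosine annealing starts at eta_max for t = 0 and reaches eta_min at t = T, passing through the midpoint at t = T/2.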

1.3 Batch, Stochastic & Mini-Batch Gradient Descent

The key distinction between GD variants is how many samples are used to compute the gradient
before updating weights. This is arguably the most practically important decision after choosing your
architecture.

Property | Batch GD | Stochastic (SGD) | Mini-Batch GD
Samples per update | All N samples | 1 sample | B samples (32–512)
Gradient quality | Exact, stable | Very noisy | Good approximation
Cost per update | High (all data) | Very low (1 sample) | Low; fastest wall-clock training in practice
Memory usage | High | Very low | Moderate
Convergence | Smooth but slow | Noisy, can escape minima | Best balance
Parallelisation (GPU) | Good | Poor | Excellent
Used in practice? | Rarely | Rarely (pure) | Almost always
Typical use case | Small datasets | Online learning | Deep learning default

Key Insight: Why Mini-Batch Wins

Mini-batch GD combines the best of both worlds: the gradient estimate is accurate enough for stable
convergence, while the batch size allows GPU parallelism. The noise in the gradient (from not seeing all
data) actually helps escape sharp local minima — a phenomenon called implicit regularization. Typical
batch sizes: 32, 64, 128, 256.
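
A small NumPy sketch of mini-batch gradient descent on linear regression, assuming synthetic data and an MSE loss (the data, learning rate, and names are illustrative). It also checks numerically that averaging the gradients of equal-sized, non-overlapping mini-batches recovers the full-batch gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
N, B = 256, 32
X = rng.normal(size=(N, 3))
true_w = np.array([1.5, -2.0, 0.5])          # assumed ground-truth weights
y = X @ true_w + 0.1 * rng.normal(size=N)


def mse_grad(w, Xb, yb):
    """Gradient of (1/n) * sum((Xb @ w - yb)**2) with respect to w."""
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)


w = np.zeros(3)
full_grad = mse_grad(w, X, y)
batch_grads = [mse_grad(w, X[i:i + B], y[i:i + B]) for i in range(0, N, B)]
mean_batch_grad = np.mean(batch_grads, axis=0)

eta = 0.05
for epoch in range(20):
    order = rng.permutation(N)                # reshuffle every epoch
    for start in range(0, N, B):
        idx = order[start:start + B]
        w = w - eta * mse_grad(w, X[idx], y[idx])

print(np.round(w, 2))
```

Each individual mini-batch gradient is noisy, but its expectation over random batches equals the full gradient, which is why the updates still make progress on average.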

1.4 Adaptive Moment Estimation (Adam)

Adam (Kingma & Ba, 2014) is the de facto standard optimizer in deep learning. It combines the
benefits of two other methods: Momentum (accelerates in the right direction) and RMSProp (adapts
learning rate per parameter). The result is fast, stable training that works well with minimal
hyperparameter tuning.

Adam Algorithm — Step by Step:

Step 1 (compute gradient): g_t = gradient_L(W_(t-1))

Step 2 (update 1st moment, mean): m_t = beta_1 * m_(t-1) + (1 - beta_1) * g_t

Step 3 (update 2nd moment, uncentred variance): v_t = beta_2 * v_(t-1) + (1 - beta_2) * g_t^2

Step 4 (bias correction): m_hat_t = m_t / (1 - beta_1^t), v_hat_t = v_t / (1 - beta_2^t)

Step 5 (weight update): W_t = W_(t-1) - eta * m_hat_t / (sqrt(v_hat_t) + epsilon)

Default Hyperparameters (use these unless you have a reason not to):
• Learning rate eta = 0.001
• beta_1 = 0.9 (momentum decay — how much past gradient is remembered)
• beta_2 = 0.999 (variance decay — tracks squared gradient magnitude)
• epsilon = 1e-8 (prevents division by zero)

What does Adam actually do?

m_t is an exponential moving average of past gradients — this gives Adam momentum, smoothing
out noisy gradients. v_t is an exponential moving average of squared gradients — this scales the
learning rate: parameters with large gradients get a smaller effective eta, and parameters with small
gradients get a larger eta. The bias correction terms (Step 4) compensate for the zero-initialisation of
m and v at t=0, which would otherwise bias estimates toward zero in early training.
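
The five steps can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the test function f(x, y) = x^2 + 10*y^2, the learning rate 0.1, and the name `adam_minimise` are assumptions made for the example.

```python
import numpy as np


def adam_minimise(grad_fn, w0, eta=0.001, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=1000):
    """Run Adam for a fixed number of steps and return the final weights."""
    w = np.asarray(w0, dtype=float).copy()
    m = np.zeros_like(w)                      # 1st moment, initialised to 0
    v = np.zeros_like(w)                      # 2nd moment, initialised to 0
    for t in range(1, steps + 1):
        g = grad_fn(w)                        # Step 1
        m = beta1 * m + (1 - beta1) * g       # Step 2
        v = beta2 * v + (1 - beta2) * g ** 2  # Step 3
        m_hat = m / (1 - beta1 ** t)          # Step 4
        v_hat = v / (1 - beta2 ** t)
        w = w - eta * m_hat / (np.sqrt(v_hat) + eps)  # Step 5
    return w


def grad_f(w):
    """Gradient of the illustrative function f(x, y) = x^2 + 10*y^2."""
    return np.array([2.0 * w[0], 20.0 * w[1]])


first_step = adam_minimise(grad_f, w0=[5.0, 5.0], eta=0.1, steps=1)
w_final = adam_minimise(grad_f, w0=[5.0, 5.0], eta=0.1, steps=1000)
print(first_step, w_final)
```

After bias correction the very first update has magnitude close to eta in every coordinate, regardless of the gradient scale. That is why both coordinates move by about 0.1 even though their gradients differ by a factor of 10.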

Adam vs SGD — Which to Use?

Adam: Use for quick experimentation, NLP, transformers, and when you need fast convergence. SGD
with momentum + LR scheduler: Often achieves better final accuracy in computer vision (e.g., ResNet
on ImageNet) because the sharper minimum Adam finds may generalise worse. Rule of thumb:
Prototype with Adam, then fine-tune with SGD for production models.

1.5 RMSProp
RMSProp (Root Mean Square Propagation, Hinton 2012) adapts the learning rate for each parameter
independently based on the recent magnitude of its gradients. It was introduced as a mini-batch
adaptation of Rprop and works well on non-stationary problems and with RNNs.

Cache update E[g^2]_t = rho * E[g^2]_(t-1) + (1-rho) * g_t^2

Weight update W_t = W_(t-1) - (eta / sqrt(E[g^2]_t + epsilon)) * g_t

RMSProp divides the learning rate by a running average of recent gradient magnitudes. Parameters
that have been receiving large gradient updates get a smaller effective learning rate, preventing them
from overshooting. Default rho = 0.9.
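
A minimal NumPy sketch of the two RMSProp equations above, with epsilon inside the square root as written there. The test function, step count, and learning rate are illustrative assumptions.

```python
import numpy as np


def rmsprop_minimise(grad_fn, w0, eta=0.01, rho=0.9, eps=1e-8, steps=500):
    """Run RMSProp for a fixed number of steps and return the final weights."""
    w = np.asarray(w0, dtype=float).copy()
    cache = np.zeros_like(w)                  # running average E[g^2]
    for _ in range(steps):
        g = grad_fn(w)
        cache = rho * cache + (1 - rho) * g ** 2
        w = w - eta / np.sqrt(cache + eps) * g
    return w


def grad_f(w):
    """Gradient of the illustrative function f(x, y) = x^2 + 10*y^2."""
    return np.array([2.0 * w[0], 20.0 * w[1]])


eta = 0.05
one_step = rmsprop_minimise(grad_f, [5.0, 5.0], eta=eta, steps=1)
w_final = rmsprop_minimise(grad_f, [5.0, 5.0], eta=eta, steps=500)
print(one_step, w_final)
```

Unlike Adam, this form has no bias correction, so the first step is larger than eta: the cache starts at zero and holds only (1 - rho) * g^2 after one update. With a constant eta the iterate ends up hovering near the minimum rather than converging exactly.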
SECTION 2: BACKPROPAGATION ALGORITHM

Backpropagation is the algorithm that makes training deep networks computationally feasible. It
efficiently computes the gradient of the loss with respect to every weight in the network by applying
the chain rule backwards through the computation graph.

2.1 The Four Fundamental Equations (Nielsen, 2015)

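Notation: z^l = W^l * a^{l-1} + b^l is the weighted input to layer l; a^l = sigma(z^l) is its activation;
delta^l_j = dL/dz^l_j is the error of neuron j in layer l. As a superscript, L denotes the output layer;
in derivatives such as dL/da, L denotes the loss.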

BP1 — Output layer error

Formula delta^L_j = (dL/da^L_j) * sigma'(z^L_j)

How wrong is each output neuron? Multiply how much loss changes w.r.t. output by the derivative of
activation. For MSE + sigmoid: delta^L = (a^L - y) * sigma'(z^L)

BP2 — Backpropagate error

Formula delta^l = ((W^{l+1})^T * delta^{l+1}) elementwise_product sigma'(z^l)

Pass error backwards through weights. The transpose W^T reverses the forward direction.

BP3 — Gradient of bias

Formula dL/db^l_j = delta^l_j

The gradient w.r.t. bias equals the error at that neuron directly.

BP4 — Gradient of weight

Formula dL/dw^l_jk = a^{l-1}_k * delta^l_j

The gradient w.r.t. weight = activation of previous layer * error of current layer.

2.2 Full Backpropagation Algorithm

1. FORWARD PASS: Feed input x through the network. Store all z^l and a^l values at each layer.
2. OUTPUT ERROR: Compute delta^L using BP1.
3. BACKWARD PASS: For l = L-1, L-2, ..., 2 — compute delta^l using BP2.
4. GRADIENTS: Compute dL/dw and dL/db using BP3 and BP4.
5. UPDATE: Apply optimizer (GD, Adam, etc.) to update W and b.
6. REPEAT for each mini-batch until convergence.
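
The forward pass, output error, backward pass, and gradient steps can be sketched for a tiny two-layer sigmoid network in NumPy. The layer sizes, the loss 0.5 * ||a - y||^2, and all names are assumptions made for this example. A central-difference gradient check confirms the analytical gradient for one weight.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def forward(params, x):
    """Forward pass; returns z and a for both layers."""
    W1, b1, W2, b2 = params
    z1 = W1 @ x + b1
    a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2
    a2 = sigmoid(z2)
    return z1, a1, z2, a2


def loss(params, x, y):
    """Assumed loss: 0.5 * squared error, so dL/da = a - y."""
    return 0.5 * np.sum((forward(params, x)[3] - y) ** 2)


def backprop(params, x, y):
    """Return [dW1, db1, dW2, db2] using BP1 to BP4."""
    W1, b1, W2, b2 = params
    z1, a1, z2, a2 = forward(params, x)
    delta2 = (a2 - y) * a2 * (1 - a2)          # BP1, sigma'(z) = a * (1 - a)
    delta1 = (W2.T @ delta2) * a1 * (1 - a1)   # BP2
    return [np.outer(delta1, x), delta1,       # BP4, BP3 for layer 1
            np.outer(delta2, a1), delta2]      # BP4, BP3 for layer 2


def numerical_grad_W1(params, x, y, i, j, h=1e-6):
    """Central-difference estimate of dL/dW1[i, j]."""
    plus = [p.copy() for p in params]
    minus = [p.copy() for p in params]
    plus[0][i, j] += h
    minus[0][i, j] -= h
    return (loss(plus, x, y) - loss(minus, x, y)) / (2 * h)


rng = np.random.default_rng(0)
params = [rng.normal(size=(3, 2)), rng.normal(size=3),
          rng.normal(size=(1, 3)), rng.normal(size=1)]
x, y = np.array([0.5, -1.0]), np.array([1.0])

grads = backprop(params, x, y)
numeric = numerical_grad_W1(params, x, y, 0, 1)
print(grads[0][0, 1], numeric)
```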
Vanishing & Exploding Gradients — The Core Problem

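Vanishing Gradients: For the sigmoid, sigma'(z) <= 0.25 for every z (the maximum is reached at z = 0);
for tanh the derivative is at most 1 and falls quickly as |z| grows. Each layer multiplies the
backpropagated error by such a factor (BP2), so after many layers the gradient can shrink
exponentially and the early layers learn very slowly. Fix: Use ReLU (sigma'(z) = 1 for z>0), Batch
Normalization, residual connections.

Exploding Gradients: When weights are large, the same repeated multiplication makes gradients grow
exponentially. Fix: Gradient clipping (rescale g so that ||g|| <= max_norm), weight regularization,
BatchNorm.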
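
A short sketch of both effects, using only the standard library. It shows the largest possible product of sigmoid derivatives across a given depth (weights are ignored, which is a simplifying assumption) and a hypothetical helper `clip_by_norm` that rescales a gradient vector to a maximum norm.

```python
import math


def sigmoid_prime(z):
    """Derivative of the sigmoid; its maximum value is 0.25 at z = 0."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


def clip_by_norm(grad, max_norm):
    """Rescale grad so that its Euclidean norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]


# Best-case activation factor after `depth` sigmoid layers.
for depth in (1, 5, 10, 20):
    print(depth, sigmoid_prime(0.0) ** depth)

clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)
print(clipped)
```

Clipping by norm keeps the direction of the gradient and changes only its length.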
SECTION 3: EFFECTIVE TRAINING TECHNIQUES

3.1 Early Stopping

Early stopping halts training when the validation loss stops improving, preventing the model from
memorising the training set. It is one of the simplest and most widely applicable forms of regularization.
• Monitor: Validation loss (not training loss)
• Patience: Number of epochs to wait for improvement before stopping (typical: 5–20)
• Restore best weights: Rollback to the epoch with minimum validation loss
• Goodfellow et al.: Early stopping is equivalent to L2 regularization under certain conditions
• Keras: EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)
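
The monitoring logic is framework-independent. Below is a minimal sketch of a hypothetical `EarlyStopping` helper written in plain Python (it is not the Keras callback); the validation losses are made-up numbers and the `state` dictionary stands in for real model weights.

```python
class EarlyStopping:
    """Track validation loss and signal when patience is exhausted."""

    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.best_epoch = None
        self.best_state = None
        self.bad_epochs = 0

    def step(self, epoch, val_loss, state):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.best_state = dict(state)   # snapshot to restore later
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Illustrative validation curve: improves, then starts to overfit.
val_losses = [1.0, 0.8, 0.6, 0.55, 0.56, 0.58, 0.57, 0.60, 0.62]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, val_loss in enumerate(val_losses):
    state = {"epoch": epoch}                # stand-in for model weights
    if stopper.step(epoch, val_loss, state):
        stopped_at = epoch
        break

print(stopped_at, stopper.best_epoch, stopper.best_loss)
```

In a real training loop the snapshot would be a deep copy of the model parameters, restored after the loop ends.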

3.2 Dropout (Srivastava et al., 2014)

Dropout randomly deactivates each neuron during training with probability (1-p), where p is the keep
probability. This forces the network to learn redundant representations: no single neuron can rely on
its neighbours being present, so each must be useful on its own.

Dropout mask r_j ~ Bernoulli(p) — r_j is 1 with prob p, 0 otherwise

Masked activation a_tilde^l = r^l elementwise_product a^l

Test time scaling W_test = p * W_train (multiply weights by keep-prob p)

Key Properties:
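• Ensemble interpretation: Dropout implicitly trains up to 2^N weight-sharing sub-networks (N = number of droppable units); test-time weight scaling approximates the geometric mean of their predictions
• Typical keep probability: p=0.5 for hidden layers, p=0.8 for input layers
• Works best for fully connected layers; less beneficial for BatchNorm-equipped conv layers
• Inverted dropout (modern): divide by p during training instead of multiplying by p at test time
• Be cautious about combining dropout and batch normalization in the same block: they often interact poorly, because dropout shifts the activation variance between training and test, which mismatches BatchNorm's running statistics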
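
Inverted dropout can be sketched in a few lines of NumPy. The function name, the keep probability of 0.8, and the all-ones activations are assumptions chosen so the scaling is easy to see.

```python
import numpy as np


def dropout_forward(a, keep_prob, rng, training=True):
    """Inverted dropout: mask and rescale in training, identity at test time."""
    if not training:
        return a
    mask = rng.random(a.shape) < keep_prob    # r ~ Bernoulli(keep_prob)
    return a * mask / keep_prob               # rescale so E[output] = a


rng = np.random.default_rng(0)
a = np.ones((1000, 100))
train_out = dropout_forward(a, 0.8, rng, training=True)
test_out = dropout_forward(a, 0.8, rng, training=False)

print(train_out.mean(), test_out.mean())
```

Because the surviving activations are divided by the keep probability during training, the expected activation matches the test-time value and no weight rescaling is needed at inference.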

3.3 Batch Normalization (Ioffe & Szegedy, 2015)

Batch Normalization normalises the input to each layer across the mini-batch. It was originally
motivated as a fix for internal covariate shift (the changing distribution of layer inputs as weights
update), although the exact reason it stabilises training is still debated. It is one of the most
impactful innovations in modern DL.

BatchNorm Algorithm (per mini-batch B):

Step 1 (batch mean): mu_B = (1/m) * SUM x_i (for i in batch B)

Step 2 (batch variance): sigma^2_B = (1/m) * SUM (x_i - mu_B)^2

Step 3 (normalise): x_hat_i = (x_i - mu_B) / sqrt(sigma^2_B + epsilon)

Step 4 (scale & shift): y_i = gamma * x_hat_i + beta (gamma, beta are learnable)

Why gamma and beta (scale & shift)?

After normalisation, x_hat has zero mean and unit variance. But this might destroy useful learned
representations. gamma and beta are learnable parameters that allow the network to undo the
normalisation if needed — giving it full flexibility.

Benefits of BatchNorm:
• Allows higher learning rates → faster training
• Reduces dependence on careful weight initialisation
• Acts as a regularizer → often reduces need for Dropout
• At test time: uses running mean/variance estimated during training (not batch statistics)
• Placement: Typically after linear/conv layer, before activation (debated in literature)
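
A minimal NumPy sketch of the four BatchNorm steps for fully connected features, including the running statistics used at test time. The class name, the momentum of 0.1, and eps = 1e-5 are assumptions for illustration; gamma and beta are kept fixed here rather than learned.

```python
import numpy as np


class BatchNorm1D:
    """Forward-only batch normalisation over inputs of shape [batch, features]."""

    def __init__(self, num_features, eps=1e-5, momentum=0.1):
        self.gamma = np.ones(num_features)    # scale (learnable in practice)
        self.beta = np.zeros(num_features)    # shift (learnable in practice)
        self.eps = eps
        self.momentum = momentum
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)

    def forward(self, x, training=True):
        if training:
            mu = x.mean(axis=0)               # Step 1
            var = x.var(axis=0)               # Step 2 (1/m form)
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mu
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var
        else:
            mu, var = self.running_mean, self.running_var
        x_hat = (x - mu) / np.sqrt(var + self.eps)   # Step 3
        return self.gamma * x_hat + self.beta        # Step 4


rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(64, 4))
bn = BatchNorm1D(4)
y_train = bn.forward(x, training=True)
y_eval = bn.forward(x, training=False)

print(y_train.mean(axis=0).round(6), y_train.var(axis=0).round(4))
```

In training mode the output has approximately zero mean and unit variance per feature. In evaluation mode the stored running statistics are used instead, so the result does not depend on the other samples in the batch.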

3.4 & 3.5 Instance, Group Normalization — Comparison

BatchNorm fails when batch sizes are very small (e.g., object detection with batch=1). Alternative
normalization strategies compute statistics over different axes:

Normalization | Normalises Over | Best For | Batch Size Sensitivity
Batch Norm | Batch + spatial (N, H, W) | Classification, large batches | High (fails at B=1)
Layer Norm | All channels + spatial (C, H, W) | NLP, Transformers, RNNs | None
Instance Norm | Spatial only per channel (H, W) | Style transfer, image generation | None
Group Norm | Channels in groups + spatial | Object detection, small batch | None (B=1 works)
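
The four variants differ only in the axes over which the mean and variance are taken. The NumPy sketch below assumes an input tensor of shape [N, C, H, W] and omits the learnable scale and shift; the helper names are hypothetical.

```python
import numpy as np


def normalise(x, axes, eps=1e-5):
    """Subtract the mean and divide by the std computed over `axes`."""
    mu = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)


def group_norm(x, num_groups, eps=1e-5):
    """Normalise over (channels within a group, H, W) for each sample."""
    n, c, h, w = x.shape
    grouped = x.reshape(n, num_groups, c // num_groups, h, w)
    return normalise(grouped, axes=(2, 3, 4), eps=eps).reshape(n, c, h, w)


rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=(1, 8, 4, 4))    # batch size 1

batch_norm = normalise(x, axes=(0, 2, 3))     # over N, H, W per channel
layer_norm = normalise(x, axes=(1, 2, 3))     # over C, H, W per sample
instance_norm = normalise(x, axes=(2, 3))     # over H, W per sample and channel
gn = group_norm(x, num_groups=4)              # 4 groups of 2 channels

print(batch_norm.shape, layer_norm.shape, instance_norm.shape, gn.shape)
```

Group normalization interpolates between the other two per-sample schemes: one group reproduces layer normalization, and one group per channel reproduces instance normalization.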
SECTION 4: OVERFITTING & SMALL DATASET STRATEGIES

4.1 Overfitting — Bias-Variance Tradeoff

Overfitting occurs when a model learns the training data too well — including its noise — and fails to
generalise to new data. It is the central challenge in machine learning.

Decomposition Expected Error = Bias^2 + Variance + Irreducible Noise

Property | High Bias (Underfitting) | High Variance (Overfitting) | Ideal
Train error | High | Low | Low
Val error | High | High | Low
Train-Val gap | Small | Large | Small
Model | Too simple | Too complex | Just right
Fix | More capacity, features | Regularize, more data | N/A

4.2 Data Augmentation

Data augmentation artificially increases the effective dataset size by applying label-preserving
transformations to existing samples.

Standard Image Augmentations:

• Geometric: Random crop, horizontal/vertical flip, rotation (±15°), translation, zoom
• Colour: Brightness, contrast, saturation jitter; random grayscale; colour channel dropout
• Noise: Gaussian noise, Cutout (random rectangular erasure), GridDistortion
• Advanced: MixUp (linearly interpolate two images + labels), CutMix (paste regions between
images)
• Medical-specific: Elastic deformation, intensity normalisation, random k-space masking (MRI)
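
Two of the augmentations listed above, horizontal flip and MixUp, can be sketched with NumPy arrays standing in for images. The 4x4 random "images", the one-hot labels, and the Beta(0.2, 0.2) mixing distribution are assumptions for illustration.

```python
import numpy as np


def horizontal_flip(image):
    """Mirror an image of shape [H, W] left to right."""
    return image[:, ::-1]


def mixup(x1, y1, x2, y2, lam):
    """Linearly interpolate two samples and their one-hot labels."""
    x = lam * x1 + (1.0 - lam) * x2
    y = lam * y1 + (1.0 - lam) * y2
    return x, y


rng = np.random.default_rng(0)
img_a, img_b = rng.random((4, 4)), rng.random((4, 4))
label_a, label_b = np.array([1.0, 0.0]), np.array([0.0, 1.0])

lam = rng.beta(0.2, 0.2)                      # mixing coefficient in [0, 1]
mixed_img, mixed_label = mixup(img_a, label_a, img_b, label_b, lam)
flipped = horizontal_flip(img_a)

print(lam, mixed_label)
```

The mixed label is a soft label whose entries still sum to 1, so it can be used directly with a cross-entropy loss.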

Class Imbalance Solutions:

• Oversampling minority class (duplicate/augment minority samples)
• Undersampling majority class
• SMOTE: Synthetic Minority Oversampling — interpolate between existing minority samples
• Weighted loss function: Assign higher loss to misclassified minority class samples
• class_weight parameter in Keras/sklearn handles this automatically
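
One common way to derive class weights is the inverse-frequency heuristic n_samples / (n_classes * count), which is the formula scikit-learn documents for class_weight='balanced'. A minimal sketch with a hypothetical helper and a made-up 90/10 label split:

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


labels = [0] * 90 + [1] * 10                  # 90% negative, 10% positive
weights = balanced_class_weights(labels)
print(weights)
```

The minority class receives a weight nine times larger than the majority class, so each class contributes equally to the total weighted loss.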

4.3 Redesigning the Loss Function

When data is limited or imbalanced, modifying the loss function is often cheaper than collecting
more data and can improve results substantially. Here are key alternatives:

Loss Function | Formula | Best For
Cross Entropy | -SUM y_i * log(p_i) | Standard classification
Focal Loss | -(1-p_t)^gamma * log(p_t) | Class imbalance; object detection
Dice Loss | 1 - 2|X∩Y| / (|X|+|Y|) | Medical image segmentation
Label Smoothing CE | -SUM [(1-eps)y_i + eps/K] log(p_i) | Overconfidence; small data
Triplet Loss | max(d(a,p)-d(a,n)+margin, 0) | Metric learning; few-shot
Contrastive Loss | yd^2 + (1-y)max(m-d,0)^2 | Similarity learning
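
Two of the losses in the table above, sketched in plain Python for a single prediction (focal loss) and for flat lists of mask values (Dice loss). The small smoothing constant `eps` in the Dice loss is an added assumption to avoid division by zero; it is not part of the formula in the table.

```python
import math


def cross_entropy(p_t):
    """Cross entropy for the probability assigned to the true class."""
    return -math.log(p_t)


def focal_loss(p_t, gamma=2.0):
    """FL(p_t) = -(1 - p_t)^gamma * log(p_t)."""
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def dice_loss(pred, target, eps=1e-7):
    """1 - 2|X n Y| / (|X| + |Y|) for soft or binary masks given as flat lists."""
    intersection = sum(p * t for p, t in zip(pred, target))
    return 1.0 - (2.0 * intersection + eps) / (sum(pred) + sum(target) + eps)


for p_t in (0.9, 0.5, 0.1):
    print(p_t, round(cross_entropy(p_t), 4), round(focal_loss(p_t), 4))

print(dice_loss([1, 1, 0, 0], [1, 1, 0, 0]), dice_loss([1, 0], [0, 1]))
```

With gamma = 2, a well-classified example (p_t = 0.9) is down-weighted by a factor of 100 relative to cross entropy, while a badly classified one (p_t = 0.1) keeps most of its loss. Setting gamma = 0 recovers plain cross entropy.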

4.4 Generating Synthetic Data

• GANs (Generative Adversarial Networks): Generate realistic synthetic images for rare conditions
• VAEs (Variational Autoencoders): Sample from learned latent distribution
• Simulation: Physics-based simulators for surgical robots, scanners (e.g., BlenderProc for medical)
• Transfer Learning from related domains (e.g., natural images → medical images via fine-tuning)
• Domain Randomisation: Vary rendering parameters widely so real data falls within distribution
SECTION 5: ATTENTION MECHANISM

The attention mechanism is the foundational innovation behind Transformers, GPT, BERT, and many
other modern AI systems. Originally proposed for machine translation (Bahdanau et al., 2015), it
removes the bottleneck of encoding a long input sequence into a single fixed-size vector.

5.1 Query, Key, Value (Q, K, V) Framework

The QKV framework is a beautiful abstraction. Think of it like a search engine:

Component | Analogy | Mathematical Role
Query (Q) | Search query you type into Google | What information am I looking for?
Key (K) | Keywords/tags on each webpage | What information does each position contain?
Value (V) | Actual content of each webpage | What information gets retrieved if selected?

In a self-attention layer, Q, K, V are all derived from the same input X via learned linear projections:
Q = X * W_Q, K = X * W_K, V = X * W_V. The weight matrices W_Q and W_K (shape d_model x d_k) and
W_V (shape d_model x d_v, often with d_v = d_k) are the learnable parameters.

5.2 Scaled Dot-Product Attention

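Scaled Dot-Product Attention (Vaswani et al., 2017, 'Attention Is All You Need') can be computed with
highly optimised matrix multiplication, which makes it faster and more memory-efficient in practice
than additive attention. It is the attention function used in the original Transformer and the models
derived from it.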

Score function score(Q, K) = Q * K^T / sqrt(d_k)

Attention weights alpha = softmax(Q * K^T / sqrt(d_k))

Context vector Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V

Why scale by sqrt(d_k)?

For large d_k, the dot products Q·K grow large in magnitude, pushing the softmax into regions of very
small gradients (saturation). Dividing by sqrt(d_k) keeps the variance of the dot products at 1
regardless of dimensionality, preventing gradient vanishing through the softmax. Mathematically: if Q
and K are vectors of i.i.d. components with mean 0, variance 1, their dot product has variance d_k —
dividing by sqrt(d_k) gives variance 1.
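
The variance argument can be checked numerically. The sketch below assumes standard normal components and d_k = 64 (illustrative choices) and estimates the variance of the dot product before and after scaling.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, n_pairs = 64, 20000

# Independent query/key vectors with zero-mean, unit-variance components.
q = rng.normal(size=(n_pairs, d_k))
k = rng.normal(size=(n_pairs, d_k))

dots = np.sum(q * k, axis=1)                  # one dot product per pair
raw_var = dots.var()
scaled_var = (dots / np.sqrt(d_k)).var()

print(round(raw_var, 2), round(scaled_var, 3))
```

The unscaled variance comes out close to d_k and the scaled variance close to 1, up to sampling error.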

Step-by-Step Computation:
1. Linear projections: Q=XW_Q, K=XW_K, V=XW_V (shapes: Q and K are [seq_len, d_k]; V is [seq_len, d_v])
2. Compute scores: S = QK^T / sqrt(d_k) (shape: [seq_len, seq_len]; every position attends to
every other)
3. Optional: Apply mask by adding -inf to the disallowed entries of S (for the decoder, this prevents attending to future positions)
4. Apply softmax: A = softmax(S) (rows sum to 1, giving an attention weight distribution; masked entries get weight 0)
5. Weighted sum: Output = A * V (shape: [seq_len, d_v])
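
A NumPy sketch of this computation for a single sequence, with an optional causal mask applied to the scores before the softmax. The random projection matrices and the sizes (seq_len = 5, d_model = 8, d_k = d_v = 4) are assumptions for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def scaled_dot_product_attention(Q, K, V, causal=False):
    """Return (output, weights) for Q, K of shape [n, d_k] and V of shape [n, d_v]."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # [n, n]
    if causal:
        # True above the diagonal marks future positions.
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    weights = softmax(scores, axis=-1)
    return weights @ V, weights


rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 8, 4
X = rng.normal(size=(seq_len, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out, A = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
out_c, A_c = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V, causal=True)

print(out.shape, A.sum(axis=-1))
```

With the causal mask, every weight above the diagonal is exactly zero, so position i only mixes values from positions 0 to i.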
5.3 Additive (Bahdanau) Attention
Bahdanau et al. (2015) introduced attention in the context of neural machine translation. Their
additive attention uses an MLP to compute alignment scores between query and key, making it more
flexible than the dot-product version but computationally heavier.

Score (additive) score(q, k) = v_a^T * tanh(W_q * q + W_k * k)

Attention weights alpha_i = softmax(score(q, k_i))

Context vector c = SUM alpha_i * v_i

Mechanism Explained:
W_q and W_k project q and k into the same space; they are added (not multiplied), then passed
through tanh. The vector v_a (learnable) converts the hidden representation to a scalar score. This
one-hidden-layer MLP makes the score function non-linear, and hence more expressive than a plain
dot product. Both mechanisms need O(n^2 * d) operations, but additive attention cannot be reduced to
a single optimised matrix multiplication, so it is slower in practice, especially for long sequences.
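
Additive attention for a single query can be sketched in NumPy as follows. The dimensions and the random parameters are assumptions made for the example; in a trained model W_q, W_k and v_a would be learned.

```python
import numpy as np


def additive_attention(q, keys, values, W_q, W_k, v_a):
    """score(q, k_i) = v_a^T tanh(W_q q + W_k k_i), softmax over i, weighted sum."""
    hidden = np.tanh(W_q @ q + keys @ W_k.T)  # [n, d_h], W_q @ q is broadcast
    scores = hidden @ v_a                     # [n]
    scores = scores - scores.max()            # for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    context = weights @ values                # [d_v]
    return context, weights


rng = np.random.default_rng(0)
n, d_q, d_key, d_v, d_h = 6, 4, 5, 3, 8
q = rng.normal(size=d_q)
keys = rng.normal(size=(n, d_key))
values = rng.normal(size=(n, d_v))
W_q = rng.normal(size=(d_h, d_q))
W_k = rng.normal(size=(d_h, d_key))
v_a = rng.normal(size=d_h)

context, alpha = additive_attention(q, keys, values, W_q, W_k, v_a)
print(context.shape, alpha.round(3))
```

Note that the query and the keys may have different dimensions here, because each is projected into the shared hidden space before the addition.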

Dot-Product vs Additive Attention — Complete Comparison:

Property | Dot-Product (Vaswani 2017) | Additive (Bahdanau 2015)
Score function | Q * K^T / sqrt(d_k) | v_a^T * tanh(W_q * q + W_k * k)
Complexity | O(n^2 * d) | O(n^2 * d), but larger constant
Speed | Faster (matrix multiply) | Slower (MLP per pair)
For large d_k | Needs scaling factor | Naturally bounded
Expressiveness | Lower (linear) | Higher (non-linear via tanh)
Used in | Transformers, BERT, GPT | Seq2Seq NMT, early NLP
Parameter count | Just projection matrices | Additional W_q, W_k, v_a
Masking support | Yes (add -inf before softmax) | Yes

5.4 Multi-Head Attention

Instead of performing a single attention function, Multi-Head Attention (MHA) runs h attention heads
in parallel, each learning to attend to different aspects of the input (e.g., one head for syntax, another
for semantics).

Each head head_i = Attention(Q * W_Q^i, K * W_K^i, V * W_V^i)

Multi-head output MHA(Q,K,V) = Concat(head_1,...,head_h) * W_O

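W_O (d_model x d_model) is an output projection matrix. Typically h=8 heads with d_k = d_v =
d_model/h = 64 for d_model=512. Because each head works in a reduced dimension, the total
computational cost is similar to that of single-head attention with full dimensionality: the model
dimension is split across heads and recombined.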
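
A NumPy sketch of multi-head self-attention for one sequence. It assumes the common implementation trick of using one d_model x d_model projection per Q, K, V and slicing the result into h heads, which is equivalent to h separate d_model x d_k matrices. The sizes (seq_len = 10, d_model = 64, h = 4) and random weights are illustrative.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_self_attention(X, W_Q, W_K, W_V, W_O, h):
    """X: [seq_len, d_model]; returns (output [seq_len, d_model], weights [h, n, n])."""
    seq_len, d_model = X.shape
    d_k = d_model // h

    def split_heads(M):
        # [seq_len, d_model] -> [h, seq_len, d_k]
        return M.reshape(seq_len, h, d_k).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_Q), split_heads(X @ W_K), split_heads(X @ W_V)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)     # [h, n, n]
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                   # [h, n, d_k]
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_O, weights


rng = np.random.default_rng(0)
seq_len, d_model, h = 10, 64, 4
X = rng.normal(size=(seq_len, d_model))
W_Q, W_K, W_V, W_O = (rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
                      for _ in range(4))

out, attn = multi_head_self_attention(X, W_Q, W_K, W_V, W_O, h)
print(out.shape, attn.shape)
```

Each of the h weight matrices in `attn` is a separate attention pattern over the same sequence, and any one of them can be plotted as a heatmap.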
5.5 Self-Attention
In self-attention, Q, K, V all come from the same sequence. Each position attends to all other
positions in the sequence — enabling the model to build rich, long-range representations without any
recurrence (unlike RNNs).

Self-Attention vs Cross-Attention

Self-Attention: Q, K, V from the SAME sequence. Used in the Transformer encoder and, with a causal
mask, in the decoder.
Cross-Attention: Q from the decoder, K and V from the encoder output. Used in the Transformer
decoder; this is how the decoder 'looks at' the source sentence while generating translations.
SECTION 6: THEORY PRACTICE QUESTIONS (UNIVERSITY EXAM)

How to Use This Section

These questions are modelled on SRM IST University exam patterns at Bloom's Levels 2–5
(Understand, Apply, Analyse, Evaluate). Questions marked [12M] require a structured essay with
diagrams/derivations. [8M] require detailed explanations. [2M] = short answers.

Q1. [12M] [Unit 2] [BL4-Analyse] [Optimization]
Derive the complete Adam optimizer update rule from first principles. Explain the role of the first moment (m_t), second moment (v_t), and bias correction terms. Compare Adam with vanilla SGD and RMSProp in terms of convergence behaviour and hyperparameter sensitivity. When would you choose SGD over Adam?

Q2. [12M] [Unit 2] [BL3-Apply] [Backprop]
Derive the four fundamental equations of backpropagation using the chain rule. Show how these equations are applied in a 3-layer neural network with sigmoid activation to update weights. Discuss the vanishing gradient problem and state at least three solutions with mathematical justification.

Q3. [12M] [Unit 2] [BL4-Analyse] [BatchNorm]
Explain Batch Normalization with its complete algorithm (all 4 steps). Derive why learnable parameters gamma and beta are necessary. Compare Batch Normalization with Instance Normalization and Group Normalization, and explain when each should be used in a medical imaging pipeline (e.g., MRI segmentation).

Q4. [12M] [Unit 2] [BL4-Analyse] [Attention]
Illustrate Scaled Dot-Product Attention with complete mathematical derivation. Explain why scaling by 1/sqrt(d_k) is mathematically necessary. Compare it with Additive (Bahdanau) Attention in terms of computational complexity, expressiveness, and suitability for medical NLP applications.

Q5. [12M] [Unit 2] [BL5-Evaluate] [Overfitting]
Explain the bias-variance tradeoff in the context of overfitting in deep neural networks. Derive the expected error decomposition. Describe five strategies to combat overfitting when you have a small medical dataset (e.g., 500 chest X-rays). Justify each strategy mathematically or empirically.

Q6. [8M] [Unit 2] [BL3-Apply] [GD Variants]
Derive and compare Batch Gradient Descent, Stochastic Gradient Descent, and Mini-Batch Gradient Descent. Prove that mini-batch GD provides an unbiased estimate of the full batch gradient. Discuss the effect of batch size on: (a) gradient variance, (b) convergence speed, (c) generalisation performance.

Q7. [8M] [Unit 2] [BL3-Apply] [Dropout]
Explain Dropout regularization with its mathematical formulation. Describe the ensemble interpretation of Dropout and prove that the test-time weight scaling (W_test = p * W_train) is needed. How does inverted Dropout differ, and why is it preferred in modern frameworks like PyTorch?

Q8. [8M] [Unit 2] [BL4-Analyse] [Early Stopping]
Explain Early Stopping as a regularization technique. Under what conditions is early stopping equivalent to L2 regularization? Describe the role of patience and validation loss monitoring. Design a complete training protocol for a deep learning model on a 1000-sample ECG dataset using early stopping.

Q9. [8M] [Unit 2] [BL4-Analyse] [Multi-Head]
Explain Multi-Head Attention and Self-Attention mechanisms. Derive the output of Multi-Head Attention with h=2 heads for a sequence of length 4 with d_model=4. Why is self-attention superior to RNNs for capturing long-range dependencies? Give a medical NLP example (e.g., clinical note understanding).

Q10. [12M] [Unit 2] [BL5-Evaluate] [Pipeline Design]
Critically compare the following optimizers for training a deep CNN on a medical image segmentation task with 2000 samples and severe class imbalance: (a) SGD with momentum, (b) Adam, (c) RMSProp. Discuss choice of loss function, learning rate schedule, and regularization strategy. Justify your final recommended pipeline.
SECTION 7: PRACTICAL / CODING QUESTIONS

Format

These questions test implementation ability and project-level thinking. Each requires working code
(PyTorch or Keras/TensorFlow), analysis of results, and discussion of design choices. They mirror the
kind of questions asked in company coding rounds (Google, Microsoft) and research paper
reproductions.

Q1. [Coding] [NumPy] [Optimizers] [Analysis]
Implement Mini-Batch SGD, Adam, and RMSProp from scratch in NumPy (no deep learning libraries). Apply all three to minimise f(x,y) = x^2 + 10y^2 from starting point (5, 5). Plot the convergence curves. Report final loss after 200 steps. Which converges fastest? Explain why based on the mathematical properties of each optimizer.

Q2. [PyTorch] [MLP] [Regularization] [MNIST]
Using PyTorch, implement a 3-layer MLP on the MNIST dataset. Train three versions: (a) no regularization, (b) Dropout(p=0.5) on hidden layers, (c) Batch Normalization after each linear layer. Compare training loss, validation accuracy, and training time. Plot learning curves. Which generalises best and why?

Q3. [NumPy] [Backprop] [Gradient Check] [Coding]
Implement Backpropagation from scratch in NumPy for a 2-layer network with sigmoid activations. Verify your implementation with gradient checking (numerical gradient vs analytical gradient, tolerance 1e-5). Train on a toy XOR dataset and confirm convergence. Report the forward pass, loss, and weight updates for epoch 1.

Q4. [PyTorch] [Attention] [Transformer] [Visualisation]
Implement Scaled Dot-Product Attention and Multi-Head Attention in PyTorch from scratch (without using [Link]). Test on a random input tensor of shape [batch=2, seq_len=10, d_model=64] with h=4 heads. Verify output shape is [2, 10, 64]. Visualise the attention weight matrix for one head as a heatmap.

Q5. [PyTorch] [Augmentation] [Medical Imaging] [ResNet]
Load a public medical image dataset (e.g., Chest X-Ray14 or ISIC skin lesion). Implement a data augmentation pipeline using [Link] with at least 6 augmentations. Train a ResNet-18: (a) without augmentation, (b) with augmentation. Compare validation AUC. Visualise 5 augmented samples before training.

Q6. [PyTorch] [Gradient Analysis] [BatchNorm] [Visualization]
Demonstrate the vanishing gradient problem: train a 10-layer network with sigmoid activation and plot the gradient magnitudes at each layer during backpropagation. Then switch to ReLU activation and repeat. Explain the difference quantitatively. Finally, add Batch Normalization and show how it resolves the issue.

Q7. [PyTorch] [Early Stopping] [Custom Class] [Training Loop]
Implement early stopping from scratch in PyTorch (no callbacks). Define a custom EarlyStopping class with parameters: patience=10, min_delta=0.001, restore_best_weights=True. Train a CNN on a small dataset (CIFAR-10 subset of 2000 samples). Show the training curve with the stopping point marked clearly.

Q8. [PyTorch] [Focal Loss] [Class Imbalance] [Medical]
Build a complete training pipeline in PyTorch that handles class imbalance using Focal Loss (implement it from scratch). Use a WeightedRandomSampler to oversample minority classes. Test on an imbalanced dataset (e.g., 90% negative, 10% positive chest X-ray). Compare standard CE loss vs Focal Loss in terms of sensitivity/specificity.

Q9. [PyTorch] [Normalization] [BatchNorm] [Experiment]
Implement and compare three normalization strategies on the same CNN architecture: BatchNorm, LayerNorm, and GroupNorm (G=8). Train on CIFAR-10 with batch sizes of [1, 4, 32, 128]. Plot validation accuracy vs batch size for each normalization type. Explain the observed trends from first principles.

Q10. [Project] [Medical MRI] [End-to-End] [Full Pipeline]
Real-World Project: Design a complete training system for classifying 5 types of brain tumours from MRI scans (1000 images, highly imbalanced). Your solution must include: (a) data augmentation pipeline (elastic deformation + colour jitter), (b) Adam + cosine annealing LR scheduler, (c) focal loss with class weights, (d) early stopping with patience=15, (e) batch norm in each conv block. Report validation macro-F1 score and confusion matrix. Justify every design choice.
SECTION 8: COMPETITIVE EXAM & RESEARCH QUICK REFERENCE

Key Formulas — Flashcard Style

Adam: W -= eta * m_hat / (sqrt(v_hat) + eps); beta1=0.9, beta2=0.999, eps=1e-8

RMSProp: W -= (eta/sqrt(E[g^2]+eps)) * g; default rho=0.9

BatchNorm: y = gamma*(x-mu_B)/sqrt(var_B+eps) + beta

Scaled Dot-Attn: Attention(Q,K,V) = softmax(QK^T/sqrt(d_k)) * V

Additive Attn: score(q,k) = v_a^T * tanh(W_q * q + W_k * k)

Dropout (train): a_tilde = r elementwise_product a, r~Bernoulli(p)

Dropout (test): W_test = p * W_train

Focal Loss: FL(p_t) = -(1-p_t)^gamma * log(p_t), gamma=2 default

Bias-Variance: E[L] = Bias^2 + Variance + sigma_noise^2

BP Eq 1: delta^L = (dL/da^L) elementwise_product sigma'(z^L)

BP Eq 2: delta^l = ((W^{l+1})^T * delta^{l+1}) elementwise_product sigma'(z^l)

BP Eq 3/4: dL/db^l_j = delta^l_j; dL/dw^l_jk = a^{l-1}_k * delta^l_j

Must-Know Facts for GATE / GRE / Research Interviews

01. Adam was proposed by Diederik Kingma & Jimmy Ba at ICLR 2015 (arXiv:1412.6980)
02. Batch Normalization: Ioffe & Szegedy, ICML 2015 — 'Accelerating Deep Network Training'
03. Dropout: Srivastava et al., JMLR 2014 — keeps prob p, test scaling = p
04. 'Attention Is All You Need': Vaswani et al., NeurIPS 2017 — introduced the Transformer
05. Bahdanau Attention: 'Neural Machine Translation by Jointly Learning to Align and Translate',
2015
06. Adam's bias correction is needed because m_0=0, v_0=0 causes underestimation at t=1,2,...
07. SGD + Momentum + LR decay often outperforms Adam in final accuracy on ImageNet
classification
08. GroupNorm (Wu & He, 2018) is superior to BatchNorm when batch size < 8 (e.g., object
detection)
09. Instance Norm (Ulyanov et al., 2016) is the standard for neural style transfer
10. Multi-Head Attention with h=8 heads is standard; each head uses d_k = d_model/h
11. Gradient Clipping: clip grad norm to value (e.g., 1.0) — standard for RNNs and Transformers
12. Focal Loss: Lin et al., ICCV 2017 — introduced for RetinaNet; gamma=2 is standard default
13. The scaling factor 1/sqrt(d_k) was introduced by Vaswani et al. with a variance argument: it
prevents softmax saturation for large d_k
14. [Link] Chapter 11 has complete, runnable PyTorch implementations of all attention variants
15. BERT and GPT use self-attention as their core operation, so Unit 2 is their foundation (ELMo, by contrast, is built on bidirectional LSTMs)

Reference Book Map — Recap

Book | Best Topics in Unit 2 | Access
Goodfellow et al., Deep Learning (2016) | GD, Adam, BatchNorm, Dropout, Overfitting (Chapters 7 & 8) | Free: [Link]
Nielsen, Neural Networks & DL | Backprop 4 equations, GD intuition (Chapters 1–3) | Free: neuralnetworksanddeeplearning.c
[Link], Zhang et al. | Attention (Ch 11), Optimization (Ch 12), BatchNorm (Ch 8.5) | Free: [Link]
Aggarwal, Neural Networks (2018) | Dropout, BatchNorm, Regularization (Chapters 3–4) | Prescribed textbook
Calin, DL Architectures (2020) | Mathematical proofs: GD convergence, BP derivations | Prescribed textbook
Vaswani et al. (2017) | Original Scaled Dot-Product & Multi-Head Attention | arXiv:1706.03762 (free)
Bahdanau et al. (2015) | Original Additive Attention (Align & Translate) | arXiv:1409.0473 (free)

All the best, Geeta! This document covers every topic in Unit 2 of 21BME381T at the depth
needed for university exams, real-world deep learning projects, and competitive exams like
GATE. Work through the theory questions first, then implement the practical ones. The
Goodfellow Ch. 7–8 + [Link] Ch. 11 combination will give you complete mastery.
