
Deep Learning Notes Complete

This comprehensive study guide covers various modern neural network architectures including CNNs, RNNs, LSTMs, GRUs, Transformers, and Mixture of Experts (MoE), detailing their functions, applications, and limitations. It provides insights into how to choose the right architecture based on input data type and size, emphasizing the importance of understanding each model's strengths and weaknesses. The document serves as a resource for active recall and interview preparation in the field of deep learning.

Uploaded by

ipsita lahiri
Copyright
© All Rights Reserved

Deep Learning Architecture

A Comprehensive Study Guide

CNN · RNN · LSTM · GRU · Transformers · MoE

These notes consolidate everything across multiple sources on modern neural network architectures —
what they are, how they work, when to use them, and how they compare. Written for active recall and
interview prep.
1. Foundations — What Is a Neural Network?
A neural network is a computational model loosely inspired by how the human brain processes
information. More precisely, it is a function approximator — a mathematical system that learns a
mapping from inputs to outputs by adjusting millions of internal parameters (weights) based on data.

Every neural network, regardless of architecture, has three fundamental components:

Layer | Role | Example
Input Layer | Receives raw data: images, numbers, text tokens, coordinates | 21 landmark (x,y) pairs → 42 numbers
Hidden Layers | Where learning happens. Each layer transforms the previous layer's output into increasingly abstract representations | LSTM layers, convolutional layers
Output Layer | Produces the final answer: a class score, probability distribution, or continuous value | 6 gesture class scores via softmax

Dense vs Sparse Networks


Not all neural networks use all their parameters for every input. This distinction matters enormously for
efficiency.

Aspect | Dense Networks | Sparse Networks
Connectivity | Every neuron connects to every neuron in the next layer. All parameters activate for every input. Like turning on every light in a building. | Only a subset of neurons (or experts) activates for any given input. Like turning on lights only in the rooms you're using.
Cost | High energy and compute cost per input. Thorough but slower. | Much less compute per input. Scales to massive parameter counts.
Example | Standard fully connected layers | Mixtral 8x7B: about 47B total parameters but only about 13B active per token, roughly 3.6x fewer than a dense model of the same size. GPT-4 is suspected to be sparse as well, but OpenAI has not confirmed this.
2. Convolutional Neural Networks (CNN)
One-line summary: CNN = Image Expert. Specialized filters that scan spatial data looking for patterns.

What Problem Does CNN Solve?


Before CNNs, image classification required manually engineering features (edge detectors, color
histograms). CNNs learn these features automatically from data by applying small learnable filters across
the image.

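The key insight: a cat's ear looks the same whether it's in the top-left or bottom-right of the image. CNNs
exploit this by sharing filter weights across the entire image, so a pattern learned in one place is detected
everywhere (translation equivariance). Pooling then adds a degree of translation invariance.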

How CNNs Work — Step by Step


• Convolution: A small filter (e.g. 3×3 pixels) slides across the image. At each position it computes a
dot product with the image patch beneath it. This produces a feature map showing where that pattern
is present.
• Activation (ReLU): Apply ReLU (max(0,x)) to introduce non-linearity. Kills negative values, keeps
positive signals.
• Pooling: Shrink the feature map (e.g. take the max value in each 2×2 region). Reduces spatial
dimensions while keeping important features. Makes the network robust to small translations.
• Repeat: Stack multiple conv+pool layers. Early layers detect edges and textures. Later layers detect
shapes, then objects.
• Flatten + Classify: After enough conv layers, flatten the feature maps into a vector and pass through
fully connected layers to produce class scores.
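
The convolution, ReLU and pooling steps above can be sketched in plain Python. This is a minimal illustration, not how a framework implements convolution: the 6x6 toy image and the 3x3 vertical-edge filter are made-up values, and a real CNN would learn its filters from data.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1).

    Each output value is the dot product of the kernel with the patch under it.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]


def relu(fmap):
    """Zero out negative responses, keep positive ones."""
    return [[max(0.0, v) for v in row] for row in fmap]


def max_pool_2x2(fmap):
    """Keep the largest value in each non-overlapping 2x2 region."""
    return [[max(fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1])
             for j in range(0, len(fmap[0]) - 1, 2)]
            for i in range(0, len(fmap) - 1, 2)]


# Toy 6x6 image (assumption): dark left half (0), bright right half (1).
image = [[0, 0, 0, 1, 1, 1] for _ in range(6)]
# Hand-written vertical-edge filter (assumption; real filters are learned).
kernel = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]

feature_map = relu(conv2d_valid(image, kernel))  # 4x4 map
pooled = max_pool_2x2(feature_map)               # 2x2 map
```

The feature map responds only in the columns where the dark-to-bright edge sits, and pooling halves each spatial dimension while keeping that response.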

Why NOT CNN for Our Project?


Our gesture project uses MediaPipe landmark coordinates — 21 (x,y) pairs, not raw pixels. CNNs are
designed to find spatial patterns in 2D pixel grids. Our input has no spatial grid structure — there's nothing
to convolve over. Feeding coordinates into a CNN would be like using a hammer to turn a screw.

Real-World Applications
• Face recognition in your phone's unlock system
• Instagram/Snapchat filters — real-time face landmark detection
• Medical imaging: X-ray analysis, tumour detection, MRI interpretation
• Self-driving cars: object detection, lane recognition, traffic sign classification
• Google Photos: automatic scene and object tagging
• Quality control in manufacturing: defect detection on production lines
Limitations
• Poor at sequential/temporal data — no memory of previous inputs
• Requires large labelled datasets to train effectively
• Computationally expensive for very high-resolution images
3. Recurrent Neural Networks (RNN)
One-line summary: RNN = Sequence & Memory Network. Processes data step by step while
maintaining a hidden state that carries information forward.

What Problem Does RNN Solve?


Standard feedforward networks (FNNs) and CNNs treat each input independently. They have no notion of
time or order. But many real problems are inherently sequential — the meaning of a word depends on the
words before it; a stock price depends on its history; a gesture depends on how the hand moved over time.

RNNs solve this by maintaining a hidden state — a vector that gets updated at each timestep and carries
information from previous steps into the current one.

How RNNs Work


At each timestep t, the RNN takes two inputs: the current input x_t and the hidden state from the previous
step h_{t-1}. It produces a new hidden state h_t and optionally an output y_t.

h_t = tanh(W_h · h_{t-1} + W_x · x_t + b)

The same weights (W_h, W_x, b) are shared across all timesteps. This is what makes it 'recurrent': the
same function is applied repeatedly with its own output fed back in.
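
The recurrence above can be written out directly. The sketch below uses plain Python lists; the sizes (input dimension 1, hidden dimension 2) and all weight values are arbitrary assumptions chosen only to make the loop concrete.

```python
import math


def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One recurrent step: h_t = tanh(W_h . h_prev + W_x . x_t + b)."""
    h_t = []
    for i in range(len(b)):
        z = b[i]
        z += sum(W_h[i][j] * h_prev[j] for j in range(len(h_prev)))
        z += sum(W_x[i][k] * x_t[k] for k in range(len(x_t)))
        h_t.append(math.tanh(z))
    return h_t


# Arbitrary illustrative weights (assumption): input dim 1, hidden dim 2.
W_x = [[0.5], [-0.3]]
W_h = [[0.1, 0.2], [0.0, 0.4]]
b = [0.0, 0.1]

h = [0.0, 0.0]                        # initial hidden state
for x_t in [[1.0], [0.5], [-1.0]]:    # a 3-step input sequence
    h = rnn_step(x_t, h, W_x, W_h, b)  # same weights reused at every step
```

The final `h` depends on all three inputs, which is the "memory" the section describes.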

The Vanishing Gradient Problem — Why Basic RNNs Fail


During backpropagation through time (BPTT), the gradient is multiplied at each timestep by a factor that
combines the recurrent weights with the tanh derivative, which is at most 1 and usually smaller. When the
magnitude of this factor stays below 1, the gradient shrinks exponentially as it propagates back through time.

After a few tens of timesteps (20-30 is a common rule of thumb), the gradient becomes effectively zero, and
the network can no longer learn dependencies from that far back. This is called the vanishing gradient
problem. For long sequences like a paragraph of text or a long piece of music, vanilla RNNs simply forget
the beginning by the time they reach the end.
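
A small numeric sketch makes the exponential shrinkage concrete. It assumes a scalar tanh RNN, where the per-step gradient factor is w_h · (1 − h²); the recurrent weight 0.9 and the hidden value 0.5 are illustrative assumptions, not measurements from any trained model.

```python
def gradient_scale(w_h, h_values):
    """Product of per-step factors w_h * (1 - h^2) for a scalar tanh RNN."""
    scale = 1.0
    for h in h_values:
        scale *= w_h * (1.0 - h * h)  # d tanh(z)/dz = 1 - tanh(z)^2
    return scale


# Assumed values: hidden state hovers around 0.5, recurrent weight is 0.9.
for steps in (5, 10, 30):
    print(steps, gradient_scale(0.9, [0.5] * steps))
```

Each step multiplies the gradient by 0.675 under these assumptions, so the printed values fall off geometrically with the number of steps.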

Real-World Applications
• Autocomplete and text prediction (early versions)
• Stock price prediction over short windows
• Early speech recognition systems
• Weather forecasting over short time horizons

Limitations
• Vanishing gradient — forgets long-range dependencies
• Sequential processing prevents parallelization — slow to train
• Struggles with longer sequences (often beyond a few tens of steps)
• Can be unstable during training
4. Long Short-Term Memory (LSTM)
One-line summary: LSTM = Long-Term Memory Expert. An advanced RNN that uses gating
mechanisms to selectively remember and forget, which largely mitigates the vanishing gradient problem.

What Problem Does LSTM Solve?


LSTM was designed specifically to address the vanishing gradient problem of vanilla RNNs. Introduced by
Hochreiter & Schmidhuber in 1997, it adds a second memory mechanism called the cell state: a
separate 'conveyor belt' of information that runs through the entire sequence with only minor linear
interactions, allowing gradients to flow largely unchanged over hundreds of timesteps.

The Three Gates — Heart of LSTM


LSTMs control information flow through three learned gates. Each gate is a sigmoid layer (output 0–1) that
acts as a soft switch.

Gate | What it controls | Intuition
Forget Gate f_t | How much of the previous cell state to keep. Output 0 = forget everything, 1 = keep everything. | Reading about a new character in a story: forget the old character's gender/name.
Input Gate i_t | Which new information to write into the cell state. | The new character's name and role get stored.
Output Gate o_t | Which part of the cell state to expose as the hidden state output. | Only output what's relevant for predicting the next word.
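
The three gates can be seen working together in a single-unit (scalar) LSTM step. This is a sketch of the standard LSTM update equations; the parameter dictionary `p` and its key names are hypothetical conveniences, and a real layer uses weight matrices rather than scalars.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, p):
    """Scalar LSTM step. p maps a gate name to (w_x, w_h, bias)."""
    def pre(name):
        w_x, w_h, bias = p[name]
        return w_x * x_t + w_h * h_prev + bias

    f_t = sigmoid(pre("forget"))       # how much of the old cell state to keep
    i_t = sigmoid(pre("input"))        # how much new information to write
    o_t = sigmoid(pre("output"))       # how much of the cell state to expose
    g_t = math.tanh(pre("candidate"))  # candidate values to write

    c_t = f_t * c_prev + i_t * g_t     # additive cell-state update
    h_t = o_t * math.tanh(c_t)         # hidden state seen by the next layer
    return h_t, c_t


# Hypothetical parameters: forget gate wide open, input gate almost closed.
keep_params = {
    "forget": (0.0, 0.0, 10.0),
    "input": (0.0, 0.0, -10.0),
    "output": (0.0, 0.0, 10.0),
    "candidate": (1.0, 0.0, 0.0),
}
h_t, c_t = lstm_step(x_t=1.0, h_prev=0.0, c_prev=0.7, p=keep_params)
```

With the forget gate near 1 and the input gate near 0, the cell state passes through almost unchanged, which is the conveyor-belt behaviour described above.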

Why LSTM for Our Gesture Project


Our swipe gestures are sequences of 20 frames of landmark coordinates. A swipe_up looks like: wrist
y=0.85, 0.80, 0.73, 0.65, 0.55, 0.44..., a smooth decrease in y over time (in image coordinates y grows
downward, so a falling y means the hand is moving up). The LSTM processes these 20 timesteps
sequentially, building up a hidden state that captures the trajectory. By the 20th frame, that hidden state is
a compressed summary of the entire movement pattern, which the final fully connected layer maps to one
of 6 gesture classes.

We got 97.9% validation accuracy on 1,205 samples with a 2-layer LSTM (hidden_size=128,
dropout=0.3) trained for 50 epochs.

Real-World Applications
• Netflix automatic subtitle generation
• Siri and Alexa early speech recognition layers
• Algorithmic trading systems — market movement prediction
• Machine translation (pre-transformer era)
• Sentiment analysis on long text
• Options pricing models (your project — LSTM on historical option contract data)

Limitations
• More complex than vanilla RNNs — slower to train
• Still sequential — can't be fully parallelized
• Struggles with very long sequences (1000+ steps) — transformers do better here
• Higher memory consumption due to cell state + hidden state
5. Gated Recurrent Unit (GRU)
One-line summary: GRU = Simplified LSTM. Combines the forget and input gates into a single update
gate — fewer parameters, faster training, often similar performance.

How GRU Differs from LSTM


GRU was proposed by Cho et al. in 2014. It simplifies LSTM by merging the cell state and hidden state
into one, and reducing three gates to two:

• Update Gate: Decides how much of the past information to keep vs how much of the new information
to write. Combines LSTM's forget + input gates.
• Reset Gate: Decides how much of the past hidden state to expose when computing the new
candidate hidden state. Controls how much past context influences the new update.
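
A scalar GRU step shows how the two gates interact. This is a sketch under assumptions: the parameter dictionary `p` is hypothetical, and the update gate here is the fraction of the old state that is kept. Papers and libraries differ on whether z or 1 − z plays that role.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gru_step(x_t, h_prev, p):
    """Scalar GRU step. p maps a gate name to (w_x, w_h, bias)."""
    def pre(name, h):
        w_x, w_h, bias = p[name]
        return w_x * x_t + w_h * h + bias

    z_t = sigmoid(pre("update", h_prev))                # keep old vs write new
    r_t = sigmoid(pre("reset", h_prev))                 # how much past to expose
    h_cand = math.tanh(pre("candidate", r_t * h_prev))  # candidate hidden state
    # Convention assumed here: z_t is the share of the old state that is kept.
    return z_t * h_prev + (1.0 - z_t) * h_cand


# Hypothetical parameters with the update gate pinned near 1 (keep the past).
keep_params = {
    "update": (0.0, 0.0, 10.0),
    "reset": (0.0, 0.0, 0.0),
    "candidate": (1.0, 1.0, 0.0),
}
h_kept = gru_step(x_t=1.0, h_prev=0.3, p=keep_params)
```

There is no separate cell state: the single hidden state is both the memory and the output.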

LSTM vs GRU — When to Choose Which


Choose LSTM when | Choose GRU when
Complex, long sequences where accuracy matters most | Computational efficiency matters, or dataset is smaller
You have enough compute and training time | You want faster iteration / experimentation
Tasks with very long-range dependencies | Similar performance to LSTM is acceptable with fewer parameters
6. Transformers
One-line summary: Transformers = Attention Is All You Need. Processes entire sequences
simultaneously using self-attention instead of sequential hidden states. Powers ChatGPT, Claude,
BERT, GPT-4.

The Revolution — 2017 'Attention Is All You Need'


Before 2017, state-of-the-art sequence models were LSTMs. They worked well but had a fundamental
bottleneck: sequential processing. To understand word 100 in a sentence, you had to process words 1
through 99 one at a time. This prevented parallelization and made training on large datasets extremely
slow.

Vaswani et al. (Google Brain, 2017) threw out the recurrence entirely and replaced it with self-attention —
a mechanism that allows every position in the sequence to directly attend to every other position
simultaneously, regardless of distance.

Self-Attention — The Core Idea


For each token (word/position) in the sequence, self-attention computes three vectors: a Query (what am I
looking for?), a Key (what do I contain?), and a Value (what do I contribute?).

The attention output is computed as: Attention(Q, K, V) = softmax(Q · K^T / √d_k) · V. The softmax term holds
the attention weights between pairs of tokens. The result: each token gets a weighted sum of all tokens'
values (its own included), weighted by how relevant they are.

Example: In 'The cat sat on the mat', when processing 'sat', self-attention simultaneously looks at 'cat'
(who's sitting?) and 'mat' (sitting on what?) — not step by step, but all at once.
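
Scaled dot-product attention is short enough to write out in full. The sketch below uses plain Python lists for a single head; the two-token Q, K and V values are made up for illustration, and in a real transformer they come from learned projections of the token embeddings.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Return (outputs, weights) for softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        w = softmax(scores)  # one row of the n x n attention matrix
        weights.append(w)
        outputs.append([sum(w[j] * V[j][c] for j in range(len(V)))
                        for c in range(len(V[0]))])
    return outputs, weights


# Made-up 2-token example (assumption): each query matches its own key best.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
outputs, weights = attention(Q, K, V)
```

Every row of `weights` sums to 1, and the weight matrix has one entry per token pair, which is where the O(n²) cost comes from.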

Multi-Head Attention
Instead of one set of Q/K/V projections, transformers use multiple 'heads' in parallel — each learning to
attend to different types of relationships simultaneously. One head might learn syntactic structure, another
semantic similarity, another co-reference. Their outputs are concatenated and projected.

Positional Encoding
Since transformers process all positions in parallel, they have no inherent sense of order. Positional
encodings (sinusoidal functions of position in the original paper; learned position embeddings are also
common) are added to the input embeddings to inject order information.
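
The sinusoidal scheme from the original paper can be sketched as follows: even dimensions use a sine and odd dimensions a cosine, with wavelengths that grow geometrically across dimensions. The function name and the small sizes used here are assumptions for illustration.

```python
import math


def positional_encoding(num_positions, d_model):
    """PE[pos][2i] = sin(pos / 10000^(2i/d)), PE[pos][2i+1] = cos(same angle)."""
    pe = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe


pe = positional_encoding(num_positions=4, d_model=4)
# Each row is added element-wise to the token embedding at that position.
```

Because every position gets a distinct pattern, the model can tell token order apart even though all positions are processed in parallel.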

Why Transformers Won


• Parallelizable: the entire sequence is processed at once on GPU, so training is far faster than for
LSTMs on the same data.
• Better at long-range dependencies: direct attention between any two positions, no gradient decay
over distance.
• Scales well: more data and more compute have kept yielding better models, whereas LSTMs tend to
plateau earlier.
• Transfer learning works very well: pretrain on massive data, fine-tune on small task-specific data.

What Powers What


• ChatGPT, Claude, GPT-4 — Transformer (GPT = Generative Pre-trained Transformer)
• Google Search — BERT (Bidirectional Encoder Representations from Transformers)
• GitHub Copilot — Transformer fine-tuned on code
• Google Translate (current) — Transformer encoder-decoder
• Grammarly — Transformer for grammar/style

Limitations
• Quadratic attention cost — self-attention is O(n²) in sequence length. 1000-token sequence = 1M
attention scores.
• Requires enormous compute and memory for training
• High-quality data dependency — more sensitive to data quality than LSTMs
• Overkill for simple tasks — don't use a transformer to classify 6 gestures from 1,205 samples
7. Mixture of Experts (MoE)
One-line summary: MoE = Team of Specialists. Multiple expert subnetworks + a router that decides
which 1-2 experts handle each input. Massive model capacity at fraction of compute cost.

The Core Idea


Instead of one large dense model that activates all parameters for every input, MoE has many 'expert'
networks and a gating/routing network that selects which experts to use for each token.

• Input arrives: a question or piece of text needs processing
• Router decides: a gating network selects the 1-2 experts most suited for this input
• Experts process: only the selected experts activate; the rest stay dormant
• Combine results: outputs are weighted and combined into the final answer
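
The four steps above can be sketched as top-k routing. Everything concrete here is an assumption for illustration: the experts are toy functions that just scale the input, the router is a single linear layer with made-up weights, and renormalising the gate with a softmax over the chosen experts is one common choice rather than the only one.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, router_weights, experts, top_k=2):
    """Route x to the top_k experts and mix their outputs by gate weight."""
    # Router: one logit per expert.
    logits = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in router_weights]
    # Keep only the top_k experts; the rest are never evaluated.
    top = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:top_k]
    gate = softmax([logits[e] for e in top])  # renormalise over chosen experts
    out = [0.0] * len(x)
    for g, e in zip(gate, top):
        y = experts[e](x)
        out = [o + g * yi for o, yi in zip(out, y)]
    return out, top


# Toy experts (assumption): expert k multiplies its input by k.
experts = [lambda v, k=k: [k * vi for vi in v] for k in range(1, 5)]
# Made-up router weights (assumption): 4 experts, input dimension 2.
router_weights = [[0.1, 0.0], [2.0, 0.0], [0.5, 0.0], [1.5, 0.0]]

out, chosen = moe_forward([1.0, 0.0], router_weights, experts)
```

Only two of the four toy experts run for this input, which is the source of the compute saving.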

Why This Matters — The Economics


Mixtral 8x7B has 8 expert networks per MoE layer but activates only 2 of them for each token. Total
parameters: about 47B. Active parameters per token: about 13B. It therefore has the capacity of a 47B
model at roughly the inference compute of a 13B model, about 3.6x fewer active parameters per token. All
47B parameters still have to be held in memory.

GPT-4 is suspected (not confirmed by OpenAI) to use a similar design, which would help explain how it can
be strong at coding, creative writing, math, and science at once. A dense model of equivalent quality would
cost considerably more to run.

Real MoE Models


• Mixtral 8x7B: Mistral AI, open weights, 8 experts with 2 active per token
• Switch Transformer: Google, landmark MoE paper
• DBRX: Databricks
• DeepSeek-V2: strong open-weights MoE
• GPT-4: suspected MoE architecture (unconfirmed by OpenAI)
8. Master Comparison Table

Parameter | CNN | RNN | LSTM | GRU | Transformer
Primary Use | Images / spatial | Short sequences | Long sequences | Efficient sequences | NLP, vision, everything
Processing | Spatial (parallel over positions) | Sequential | Sequential with memory | Sequential with memory | Fully parallel
Long-Range Deps | No | Poor (vanishing grad) | Good (cell state) | Good | Excellent (direct attention)
Training Speed | Fast | Slow | Slower than RNN | Faster than LSTM | Fast (parallel GPU)
Memory Usage | Medium | Low | High (2 states) | Medium | Very high (attention maps)
Parameters | Medium | Few | More (3 gates) | Fewer (2 gates) | Massive (multi-head)
Scalability | High | Low | Moderate | Moderate | Very high
Best For | Face unlock, medical imaging | Simple text prediction | Options pricing, speech, translation | Same as LSTM, faster | ChatGPT, BERT, code generation
9. How to Choose the Right Architecture
The most important skill isn't knowing what each architecture is — it's knowing which one to reach for
given a problem. Use this 4-question framework:

Question 1: What is your input data?


• Images / video frames / spatial grids → Start with CNN
• Short sequences (text, time series, <100 steps) → RNN or LSTM (vanilla RNNs fade beyond a few tens of steps)
• Long sequences (>100 steps) or complex temporal patterns → LSTM or GRU
• Very long sequences (1000+ tokens) or language understanding → Transformer
• Mixed/multimodal data → Probably need multiple architectures combined
• Clean coordinate data (like our landmarks) → LSTM wins easily over CNN

Question 2: How much data do you have?


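• Small (<10K samples) → Prefer pre-trained models + transfer learning, or a small model on compact
features (our LSTM on landmarks was trained on 1,205 samples). Avoid training large models from scratch.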
• Medium (10K–1M) → Fine-tune existing architectures. LSTM/GRU work well here.
• Large (>1M) → Train from scratch is feasible. Transformers shine.
• Very large (>100M) → Custom architectures, MoE, serious infrastructure needed.

Question 3: What is your compute budget?


• Limited (laptop CPU) → CNNs for vision, GRUs for sequences. Avoid transformers.
• Medium (single GPU) → LSTMs, smaller transformers (BERT-tiny), fine-tuned models.
• Large (multi-GPU) → Full transformers, consider MoE.
• Enterprise (100s of GPUs) → Custom MoE architectures, foundation model training.

Question 4: What are your latency requirements?


• Real-time (milliseconds) → Optimized CNNs, small LSTMs, quantized models. Transformers too
slow for most real-time.
• Interactive (under 1 second) → Most architectures work. LSTM at 28fps on CPU is fine (our
project).
• Batch processing (minutes okay) → Use the most accurate option regardless of speed.

Common Beginner Mistakes


Mistake 1: 'Transformers solve everything'
Reality: They're overkill for simple tasks and expensive to run. We got 97.9% accuracy with an LSTM on
1,205 samples. A transformer would likely need far more data and compute to match that result.
Mistake 2: 'I need to build everything from scratch'
Reality: Use pre-trained models and transfer learning. MediaPipe's hand landmark model is pre-trained
by Google. We built on top of it, not instead of it.

Mistake 3: 'More parameters = better results'


Reality: More parameters = more data requirements, more compute cost, and often worse
generalization on small datasets.
10. Our Project Stack — Gesture-Controlled Game
This section maps every concept from these notes to the actual code written for the
GestureControlledGame project.

Full Pipeline
Stage | What happens | Architecture used | File
1. Capture | Webcam frame read as NumPy array (H×W×3 BGR) | OpenCV | [Link]
2. Detect | MediaPipe runs pre-trained CNN-like model to find hand in frame, returns 21 (x,y,z) landmarks | MediaPipe (Google pre-trained) | All files
3. Collect | Save 20-frame sequences of 42 landmark numbers each with gesture label to CSV | N/A (data collection) | collect_data.py
4. Train | 2-layer LSTM processes (20, 42) sequences → final hidden state → FC layer → 6 class scores. 97.9% val acc. | LSTM (PyTorch) | train_lstm.py
5. Predict | Sliding window of last 20 frames → normalize → LSTM → softmax → gesture label if conf > 85% | LSTM inference | predict_live.py
6. Game (WIP) | Gesture → snake direction. DQN RL agent plays autonomously in AI mode. | DQN (Reinforcement Learning) | snake_game.py (TBD)
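
The prediction stage (sliding window, softmax, confidence threshold) can be sketched without any ML library. This is not the project's predict_live.py: `dummy_model` is a hypothetical stand-in for the trained LSTM, all label names other than swipe_up are placeholders, and the normalisation step is omitted. The window length of 20 and the 85% threshold come from the pipeline description.

```python
from collections import deque
import math

WINDOW = 20       # frames per sequence (from the pipeline description)
THRESHOLD = 0.85  # minimum confidence to emit a gesture (from the description)


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_gesture(window, model, labels):
    """Return a label only when the window is full and confidence > THRESHOLD."""
    if len(window) < WINDOW:
        return None
    probs = softmax(model(list(window)))
    best = max(range(len(probs)), key=lambda i: probs[i])
    return labels[best] if probs[best] > THRESHOLD else None


def dummy_model(frames):
    """Hypothetical stand-in for the trained LSTM: always returns fixed logits."""
    return [5.0, 0.0, 0.0, 0.0, 0.0, 0.0]


# Only "swipe_up" appears in the notes; the other names are placeholders.
LABELS = ["swipe_up", "class_1", "class_2", "class_3", "class_4", "class_5"]

window = deque(maxlen=WINDOW)  # oldest frame drops out automatically
results = []
for _ in range(25):
    window.append([0.0] * 42)  # one frame = 21 (x,y) pairs = 42 numbers
    results.append(predict_gesture(window, dummy_model, LABELS))
```

Using `deque(maxlen=WINDOW)` keeps the window at 20 frames with no manual trimming, and the threshold prevents low-confidence frames from moving the snake.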

Resume-Level Description
"Built a real-time gesture-controlled Snake game using MediaPipe hand tracking, a 2-layer LSTM
sequence classifier (97.9% val accuracy, 1,205 training samples, 6 gesture classes), and a DQN
reinforcement learning agent trained to play Snake autonomously. Pipeline: webcam → landmark
extraction → sliding-window LSTM inference → game control, running at 28 FPS on CPU."
11. Quick Reference — Interview Cheat Sheet

Question | Answer
What's the vanishing gradient problem? | During backprop through time, gradients are multiplied by a per-step factor. If that factor is below 1 in magnitude, they shrink exponentially. After a few tens of steps the gradient is ≈ 0 and the network can't learn long-range dependencies. LSTM mitigates this with the cell state highway.
How does LSTM solve vanishing gradient? | The cell state is updated additively: the new cell state is the old one scaled by the forget gate, plus gated new information. Along this path the gradient is scaled only by the forget gate, not by a weight matrix and tanh derivative at every step, so it survives as long as the forget gate stays near 1. The forget gate controls how much to preserve.
CNN vs RNN in one sentence? | CNN finds spatial patterns in fixed-size grids (images). RNN finds temporal patterns in variable-length sequences (time series, text). CNN has no memory; RNN does.
Why transformers over LSTMs for NLP? | LSTMs are sequential and can't be parallelized across timesteps. Transformers process all positions simultaneously via self-attention. This enables training on massive datasets in reasonable time. Also better at very long-range dependencies.
What is self-attention? | Each token computes Query, Key, Value vectors. Attention weights = softmax(Q·K^T / √d_k). Output = weighted sum of all Values. Every token directly attends to every other token: O(n²) but parallelizable.
LSTM vs GRU? | GRU merges cell state + hidden state and uses 2 gates instead of 3. Fewer parameters, faster training, often similar accuracy. Choose LSTM for accuracy-critical tasks; GRU for efficiency.
What is MoE? | Multiple specialist networks + a router. For each input, router selects 1-2 experts. Others stay dormant. Enables massive model capacity (47B params) at a fraction of the compute cost (13B active). Used by Mixtral; suspected in GPT-4.
Why LSTM not CNN for gestures? | Our input is 21 (x,y) coordinate pairs per frame, not pixel grids. CNN convolves over spatial structure that doesn't exist here. LSTM is the right tool: it learns temporal trajectory patterns from the sequence of coordinate frames.

Deep Learning Notes · GestureControlledGame Project · IITB · 2026


Deep Learning Architecture
Supplement: Foundations, Training & Modern Techniques
Backprop · Loss Functions · Optimizers · Regularization · VAEs · GANs · Diffusion · GNNs · RLHF · LoRA · UAT

These sections extend the original notes with the five missing pillars of deep learning: mathematical
fundamentals, training methodology, additional architectures, modern fine-tuning techniques, and theoretical
grounding. Same format — written for active recall and interview prep.
Section 12

Fundamentals
Backpropagation · Loss Functions · Optimizers · Activation Functions

12.1 — Backpropagation
Backpropagation is the algorithm that makes neural networks learn. It computes the gradient of the loss with
respect to every weight in the network by applying the chain rule of calculus layer by layer, from output back to
input.

The forward pass:

• Input flows through each layer: z = Wx + b, then a = activation(z).
• At the output, a loss function compares predictions to ground truth.

The backward pass (backprop):

• Compute ∂L/∂output. Then propagate gradients backwards through each layer.
• Chain rule: ∂L/∂W = ∂L/∂a · ∂a/∂z · ∂z/∂W
• Each weight receives its gradient: how much it contributed to the error.

Weight update (gradient descent):

W ← W − η · ∂L/∂W where η = learning rate
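
The chain rule and the update rule can be traced by hand on the smallest possible network: one sigmoid neuron with a squared-error loss. The starting values for w, b, x, y and the learning rate are arbitrary assumptions; a framework's autograd performs the same bookkeeping automatically.

```python
import math


def forward(w, b, x):
    z = w * x + b                   # linear step: z = Wx + b
    a = 1.0 / (1.0 + math.exp(-z))  # sigmoid activation
    return z, a


def loss_and_grads(w, b, x, y):
    """Squared-error loss and its gradients via the chain rule."""
    _, a = forward(w, b, x)
    loss = 0.5 * (a - y) ** 2
    dL_da = a - y                   # dL/da
    da_dz = a * (1.0 - a)           # derivative of the sigmoid
    dL_dw = dL_da * da_dz * x       # dz/dw = x
    dL_db = dL_da * da_dz           # dz/db = 1
    return loss, dL_dw, dL_db


# Arbitrary illustrative values (assumption).
w, b, x, y, lr = 0.5, 0.0, 2.0, 1.0, 0.1
losses = []
for _ in range(100):
    loss, dw, db = loss_and_grads(w, b, x, y)
    losses.append(loss)
    w -= lr * dw                    # W <- W - eta * dL/dW
    b -= lr * db
```

A finite-difference check, (L(w + ε) − L(w − ε)) / 2ε, is a handy way to confirm hand-derived gradients like these.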

Computational graphs: Modern frameworks (PyTorch, TensorFlow) build a dynamic graph of operations during
the forward pass, then traverse it in reverse during backprop. This is why PyTorch's autograd just works — every
operation is recorded.

12.2 — Loss Functions


Loss Function | When to Use / Formula
Mean Squared Error (MSE) | Regression. L = (1/n)Σ(y − ŷ)². Penalises large errors heavily.
Cross-Entropy | Classification. L = −Σ y·log(ŷ). Matches softmax output.
Binary Cross-Entropy | Binary classification. L = −[y·log(ŷ) + (1−y)·log(1−ŷ)]
Categorical Cross-Entropy | Multi-class. Standard loss for CNNs, LSTMs with softmax.
KL Divergence | VAEs, distributional matching. KL(P||Q) = Σ P·log(P/Q)
Huber Loss | Regression robust to outliers. L1 for large errors, L2 for small.
Contrastive / Triplet | Metric learning, embeddings. Pulls similar pairs together.
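
Several of the formulas in the table translate almost one-to-one into code. The sketch below uses plain Python; the small `eps` inside the logarithms is an added numerical safeguard (an assumption, not part of the formulas), and the Huber threshold `delta=1.0` is just a conventional default.

```python
import math


def mse(y_true, y_pred):
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def cross_entropy(y_true, y_pred, eps=1e-12):
    """y_true is one-hot (or a distribution); y_pred are softmax probabilities."""
    return -sum(y * math.log(p + eps) for y, p in zip(y_true, y_pred))


def binary_cross_entropy(y, p, eps=1e-12):
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))


def huber(y, p, delta=1.0):
    """Quadratic for small errors, linear for large ones."""
    err = abs(y - p)
    return 0.5 * err ** 2 if err <= delta else delta * (err - 0.5 * delta)


def kl_divergence(P, Q, eps=1e-12):
    return sum(p * math.log((p + eps) / (q + eps)) for p, q in zip(P, Q) if p > 0)
```

Note how the Huber loss grows linearly once the error exceeds `delta`, which is what makes it less sensitive to outliers than MSE.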

12.3 — Optimizers
The optimizer decides how to update weights using the computed gradients. Vanilla gradient descent uses a fixed
step in gradient direction — but real training is more nuanced.

Optimizer | Key Idea & Formula
SGD (Stochastic GD) | Update per mini-batch, not full dataset. Fast but noisy. W ← W − η·∇L
SGD + Momentum | Accumulates velocity v in gradient direction. Reduces oscillation. v ← βv + ∇L; W ← W − η·v
RMSProp | Scales η by running average of squared gradients. Handles sparse gradients well.
Adam (Adaptive Moment) | Combines momentum + RMSProp. Keeps running mean (m) and uncentred variance (v) of gradients.
AdamW | Adam with decoupled weight decay (applied directly to the weights rather than through the gradient). Preferred for transformers.
Learning Rate Schedulers | Adjust η over training: StepLR (decay every k epochs), CosineAnnealing, OneCycleLR
Interview tip: Adam is the default starting point. Use AdamW for transformers. SGD + momentum often achieves
better final accuracy if you tune carefully (e.g. ResNet ImageNet training).
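
To make the update rules concrete, here is a sketch of SGD with momentum and Adam minimising a one-dimensional toy function f(w) = (w − 3)². The toy objective, the step counts and the hyperparameter values are assumptions picked for illustration; the Adam defaults (β1 = 0.9, β2 = 0.999, ε = 1e-8) are the commonly used ones.

```python
import math


def grad(w):
    """Gradient of the toy objective f(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)


def sgd_momentum(grad_fn, w, lr=0.1, beta=0.9, steps=200):
    v = 0.0
    for _ in range(steps):
        v = beta * v + grad_fn(w)  # accumulate velocity
        w -= lr * v
    return w


def adam(grad_fn, w, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g      # running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g  # running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)         # bias correction for early steps
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w


w_momentum = sgd_momentum(grad, w=0.0)
w_adam = adam(grad, w=0.0)
```

Both runs start at w = 0 and should end close to the minimum at w = 3. Adam's step size is normalised by the gradient scale, so it is less sensitive to the raw magnitude of the gradients.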

12.4 — Activation Functions


Activation functions introduce non-linearity. Without them, stacking layers is mathematically equivalent to a single
linear layer — the network can only learn linear mappings.

Activation | Formula & Properties
Sigmoid | σ(x) = 1/(1+e^(−x)). Output: (0,1). Saturates → vanishing gradients. Use mainly at the output for binary classification, or inside gates.
Tanh | tanh(x) = (e^x − e^(−x))/(e^x + e^(−x)). Output: (−1,1). Still saturates but zero-centred. Used in RNNs and LSTMs.
ReLU | max(0,x). Fast, sparse activation. Problem: 'dying ReLU', where neurons get stuck at 0 if their input is always negative.
Leaky ReLU | max(0.01x, x). Fixes dying ReLU by allowing small negative gradient.
ELU | x if x>0 else α(e^x − 1). Smooth, negative saturation. Better than Leaky ReLU in some cases.
GELU | x·Φ(x). Gaussian-gated. Used in BERT, GPT, transformers. Smoother than ReLU.
Swish | x·σ(x). Self-gated. Used in EfficientNet, newer models.
Softmax | e^(x_i)/Σ_j e^(x_j). Converts logits to probability distribution. Use at the output for multi-class classification.
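
The formulas in the table map directly onto short functions. This sketch uses only the standard library; the default slope 0.01 and α = 1.0 are the conventional values, and GELU is written in its exact form with the Gaussian CDF.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    return max(0.0, x)


def leaky_relu(x, slope=0.01):
    return x if x > 0 else slope * x


def elu(x, alpha=1.0):
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def gelu(x):
    # x * Phi(x), where Phi is the standard normal CDF
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def swish(x):
    return x * sigmoid(x)


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

Comparing `relu(-2.0)` with `leaky_relu(-2.0)` shows the difference that matters for dying ReLU: one returns exactly zero, the other keeps a small negative signal.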
Section 13

Training Concepts
Overfitting · Regularization · Data Splits · Augmentation · Hyperparameter Tuning

13.1 — Overfitting, Underfitting & the Bias-Variance Tradeoff


The central challenge of ML is generalisation — performing well on data the model hasn't seen. The bias-variance
tradeoff describes the two failure modes:

Aspect | Underfitting (High Bias) | Overfitting (High Variance)
What happens | Model too simple. Fails even on training data. | Model memorises training data. Fails on test data.
Train loss | High | Very low
Val/Test loss | High | Much higher than train
Fix | More capacity, more epochs, better features | Regularisation, more data, dropout, early stopping
Example | Linear model on non-linear data | 10-layer network on 100 samples


Sweet spot: You want low bias and low variance — a model complex enough to capture the real pattern but not so
complex it memorises noise. Validation loss curve is your compass.

13.2 — Regularisation Techniques


Technique | How It Works
L2 / Weight Decay | Adds λ·||W||² to loss. Penalises large weights, keeps them small. Equivalent to a Gaussian prior on the weights.
L1 Regularisation | Adds λ·||W||₁ to loss. Promotes sparsity: many weights go exactly to 0. Useful for feature selection.
Dropout | During training, randomly zero out neurons with probability p (typically 0.2–0.5). Forces the network not to rely on any single neuron.
Batch Normalisation | Normalise activations across the mini-batch: (x−µ)/σ, then scale/shift with learned γ,β. Stabilises and speeds up training.
Layer Normalisation | Same idea but normalise across features instead of batch, so it does not depend on batch size. Used in transformers.
Early Stopping | Monitor validation loss. Stop training when val loss stops improving. Keep the checkpoint with the best validation loss.
Data Augmentation | Artificially expand training set (see 13.3). Prevents memorisation of specific examples.
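
Dropout is simple enough to sketch directly. The version below is the common "inverted dropout" variant, which rescales the surviving activations by 1/(1 − p) during training so that nothing needs to change at inference time. That rescaling is an assumption about the variant, since the table only describes the zeroing step.

```python
import random


def dropout(activations, p, training=True, rng=random):
    """Inverted dropout: zero each unit with probability p, rescale the rest."""
    if not training or p == 0.0:
        return list(activations)  # inference: pass through unchanged
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)            # fixed seed so the sketch is repeatable
train_out = dropout([1.0] * 10000, p=0.5, rng=rng)
eval_out = dropout([1.0, 2.0, 3.0], p=0.5, training=False)
```

Because of the rescaling, the expected value of each activation is the same in training and evaluation mode.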

13.3 — Data Augmentation


Data augmentation applies random transformations to training examples to artificially expand the dataset. The
model sees more variety without collecting new data.

For images (CV tasks):

• Random horizontal/vertical flip


• Random crop and resize (zoom simulation)
• Colour jitter: brightness, contrast, saturation, hue shifts
• Random rotation (±15°)
• Gaussian noise, blur
• Cutout / Random Erasing: mask rectangular patches to force robustness
• Mixup: interpolate two training images and their labels linearly
• CutMix: paste a patch from image B into image A, mix labels proportionally
For sequences / NLP tasks:

• Synonym replacement: swap words with semantically similar alternatives


• Random insertion / deletion of words
• Back-translation: translate to another language and back
• Token masking (used in BERT pre-training: mask 15% of tokens)
• Time-series: add Gaussian noise, time-warp, window slicing

13.4 — Train / Validation / Test Splits


Proper data splitting is critical to getting honest performance estimates.

Split | Purpose & Typical Size
Training set (60–80%) | Model sees this data and updates weights from it.
Validation set (10–20%) | Used to tune hyperparameters, choose architecture, do early stopping. Model never trains on it.
Test set (10–20%) | Evaluated once at the very end to report final performance. Never use it for any decisions.

Cross-validation: When data is scarce, use k-fold CV. Split data into k folds, train on k−1 folds, validate on the
remaining fold. Rotate k times and average results. 5-fold and 10-fold are most common. Gives more reliable
estimates but is k× slower.
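
The fold rotation can be sketched as an index generator. This minimal version does not shuffle or stratify (both are usually advisable in practice), and the function name is a made-up helper rather than a library API.

```python
def k_fold_indices(n_samples, k):
    """Return k (train_indices, val_indices) pairs covering every sample once."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, val))
        start += size
    return folds


folds = k_fold_indices(n_samples=10, k=3)
# Train one model per fold, then average the k validation scores.
```

Each sample lands in exactly one validation fold, so every data point is used for validation once and for training k − 1 times.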

Stratified splits: For imbalanced datasets, ensure each split has the same class distribution as the full dataset.
Use sklearn's StratifiedKFold.

13.5 — Hyperparameter Tuning


Method | How It Works
Grid Search | Try every combination of a predefined parameter grid. Exhaustive but expensive. Only feasible for a few hyperparameters.
Random Search | Sample random combinations. Empirically as good as grid search in fewer tries (Bergstra & Bengio, 2012).
Bayesian Optimisation | Build a probabilistic model of the loss landscape. Uses past results to choose next trials.
Learning Rate Finder | Sweep LR from very low to very high over a few batches. Plot loss vs LR. Pick an LR just before the loss starts to blow up.
Population-Based Training | Train many models in parallel. Periodically copy weights from better-performing models and perturb their hyperparameters.

Key hyperparameters to tune: learning rate (most important), batch size, network depth/width, dropout rate,
weight decay, number of epochs.
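
Random search fits in a few lines. In this sketch the search space, the `toy_objective` function and its optimum are all made up for illustration; in practice the objective would train a model and return its validation loss. The learning rate is sampled log-uniformly, which is the usual choice because useful values span several orders of magnitude.

```python
import math
import random


def random_search(objective, space, n_trials, seed=0):
    """Sample configurations at random and keep the one with the lowest score."""
    rng = random.Random(seed)
    best_cfg, best_score = None, float("inf")
    for _ in range(n_trials):
        cfg = {
            "lr": 10 ** rng.uniform(*space["log10_lr"]),  # log-uniform sample
            "dropout": rng.uniform(*space["dropout"]),
            "hidden_size": rng.choice(space["hidden_size"]),
        }
        score = objective(cfg)
        if score < best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score


def toy_objective(cfg):
    """Hypothetical stand-in for 'train and return validation loss'."""
    return (math.log10(cfg["lr"]) + 3.0) ** 2 + (cfg["dropout"] - 0.3) ** 2


# Illustrative search space (assumption).
space = {"log10_lr": (-5, -1), "dropout": (0.0, 0.5), "hidden_size": [32, 64, 128]}
best_cfg, best_score = random_search(toy_objective, space, n_trials=50)
```

Swapping `toy_objective` for a real training run is the only change needed to use the same loop on an actual model.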
Section 14

Additional Architectures
Autoencoders · VAEs · GANs · Diffusion · GNNs · ResNets · U-Net

14.1 — Autoencoders
One-line summary: Compress input into a small representation (latent code), then reconstruct it. Learn to keep
only the essential information.

An autoencoder has two parts: an Encoder that maps input X → latent vector z (a bottleneck), and a Decoder that
maps z → X̂ (reconstruction). Trained with reconstruction loss (MSE or cross-entropy between X and X̂).

Variant | Key Idea
Vanilla Autoencoder | Deterministic encoder. Latent space has no structure, so you can't sample from it.
Denoising Autoencoder | Add noise to input, train to reconstruct the clean version. Forces robust representations.
Sparse Autoencoder | Add sparsity penalty to latent layer. Most neurons inactive. Learns disentangled features.
Variational Autoencoder (VAE) | Encoder outputs µ and σ. Sample z ~ N(µ,σ²). Structured latent space → generative model.

14.2 — Variational Autoencoders (VAE)


One-line summary: VAE = Autoencoder that learns a structured, continuous latent space you can sample from to
generate new data.

The key innovation: instead of encoding to a single point, the encoder outputs a mean µ and variance σ². During
training, z is sampled from N(µ, σ²) using the reparameterisation trick: z = µ + σ·ε where ε ~ N(0, I). This moves the
randomness into ε, which does not depend on the network's parameters, so gradients can flow through µ and σ
despite the sampling step.
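
A minimal sketch of the reparameterisation trick in plain Python. It assumes the encoder outputs the log-variance rather than σ directly (a common convention, not something stated in these notes), and the function name reparameterise is illustrative.

```python
import math
import random

def reparameterise(mu, log_var, rng):
    """z = mu + sigma * eps per dimension, with eps ~ N(0, 1) and sigma = exp(log_var / 2)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]

# Sanity check on a 1-D latent with mu = 2 and sigma = 0.5 (values chosen for the demo).
rng = random.Random(0)
samples = [reparameterise([2.0], [math.log(0.25)], rng)[0] for _ in range(20000)]
mean = sum(samples) / len(samples)
std = math.sqrt(sum((s - mean) ** 2 for s in samples) / len(samples))
```

The sample mean and standard deviation should come out close to 2 and 0.5, confirming that z is distributed as N(µ, σ²) even though the only random draw is ε.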

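VAE Loss (negative ELBO):

L = Reconstruction Loss + β · KL(N(µ,σ²) || N(0,I))

With β = 1 this is the standard VAE objective; β > 1 gives the β-VAE variant, which weights the regulariser more heavily.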

The KL term regularises the latent space toward a standard normal. This is what makes it smooth and interpolable
— nearby points in latent space decode to similar outputs.
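
For a diagonal Gaussian encoder the KL term has a closed form, KL = ½ Σ (µ² + σ² − log σ² − 1). The sketch below computes it and the full loss. It assumes a log-variance parameterisation and a summed squared-error reconstruction term; both are choices made for illustration.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, I) ) for a diagonal Gaussian, summed over dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))

def vae_loss(x, x_hat, mu, log_var, beta=1.0):
    """Reconstruction loss (summed squared error) plus beta times the KL term."""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return recon + beta * kl_to_standard_normal(mu, log_var)

# The KL term is zero exactly when the encoder already outputs a standard normal.
kl_zero = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])
kl_shifted = kl_to_standard_normal([1.0], [0.0])
```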

Applications: Image generation, anomaly detection (high reconstruction error = anomaly), molecule generation in
drug discovery, face interpolation.

14.3 — Generative Adversarial Networks (GANs)


One-line summary: Two networks in competition — a Generator that creates fake data and a Discriminator that
tries to spot the fakes. They improve each other through adversarial training.

The training loop:

1. Generator G takes random noise z and produces fake image G(z).


2. Discriminator D takes real images and G(z), outputs probability of being real.
3. D is trained to output 1 for real, 0 for fake.
4. G is trained to make D output 1 for its fakes (fooling D).
5. Repeat until G produces indistinguishable images.
Minimax objective:
min_G max_D E[log D(x)] + E[log(1 − D(G(z)))]
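
The objective can be turned into per-batch losses as in the sketch below. The inputs d_real and d_fake are assumed to be lists of discriminator probabilities of equal length. The non-saturating generator loss is not the minimax form above; it is the practical alternative suggested in the original GAN paper and is included only for comparison.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Negative of what D maximises: -mean[ log D(x) + log(1 - D(G(z))) ]."""
    n = len(d_real)
    return -sum(math.log(r) + math.log(1.0 - f) for r, f in zip(d_real, d_fake)) / n

def generator_loss_minimax(d_fake):
    """What G minimises in the minimax objective: mean log(1 - D(G(z)))."""
    return sum(math.log(1.0 - f) for f in d_fake) / len(d_fake)

def generator_loss_non_saturating(d_fake):
    """Alternative generator loss: minimise -mean log D(G(z))."""
    return -sum(math.log(f) for f in d_fake) / len(d_fake)

# When D cannot tell real from fake it outputs 0.5 everywhere.
d_loss_at_equilibrium = discriminator_loss([0.5, 0.5], [0.5, 0.5])
```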

GAN Variant              Innovation
DCGAN                    Deep Convolutional GAN. First stable image GAN. Uses batch norm, no FC layers.
StyleGAN / StyleGAN2     NVIDIA. Photorealistic faces. Separates style at each resolution level.
CycleGAN                 Image-to-image translation without paired data. Horse ↔ Zebra, Photo ↔ Painting.
Conditional GAN (cGAN)   Generator and Discriminator conditioned on class label. Control what gets generated.
Pix2Pix                  Paired image translation. Sketch → photo, satellite → map.


GAN training challenges: Mode collapse (G generates only a few types of output), vanishing gradients for G when D
becomes too strong, and general training instability. Requires careful hyperparameter tuning.

14.4 — Diffusion Models


One-line summary: Gradually add Gaussian noise to data until it becomes pure noise, then train a neural network
to reverse this process. Stable Diffusion, DALL-E 2, Imagen all use this.

Forward process (fixed, not learned):

q(x_t | x_{t-1}) = N(√(1−β_t)·x_{t-1}, β_t·I) for t = 1…T

Noise is added gradually over T steps (typically T=1000). x_T is approximately pure Gaussian noise N(0,I).
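
Because each step is Gaussian, x_t can be sampled directly from x_0 without iterating: x_t = √ᾱ_t·x_0 + √(1−ᾱ_t)·ε, where ᾱ_t is the product of (1−β_s) for s ≤ t. The sketch below assumes the linear β schedule from 1e-4 to 0.02 used in DDPM; the function names are illustrative.

```python
import math
import random

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1 ... beta_T."""
    return [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]

def cumulative_alpha_bar(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def q_sample(x0, t, alpha_bar, rng):
    """Sample x_t from x_0 in one shot; t is 1-indexed as in the text."""
    a = alpha_bar[t - 1]
    return [math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for v in x0]

betas = linear_beta_schedule(1000)
alpha_bar = cumulative_alpha_bar(betas)
x_T = q_sample([1.0, -1.0], 1000, alpha_bar, random.Random(0))
```

With this schedule ᾱ_T is very close to zero, so almost nothing of x_0 survives in x_T, which matches the statement that x_T is approximately pure Gaussian noise.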

Reverse process (learned):

A U-Net (see 14.7) is trained to predict the noise ε added at each step, then subtract it: p_θ(x_{t-1} | x_t) =
N(µ_θ(x_t, t), Σ_θ).

Why diffusion beats GANs for image quality:

• Training is stable — no adversarial game, just noise prediction (regression).


• Covers the full data distribution — no mode collapse.
• Can be guided by text embeddings (CLIP) for text-to-image generation.
Key models: DDPM (Ho et al., 2020), Stable Diffusion (Rombach et al., latent diffusion), DALL-E 2 (OpenAI),
Imagen (Google), Midjourney.

14.5 — Graph Neural Networks (GNNs)


One-line summary: Neural networks that operate on graph-structured data — nodes, edges, and their
relationships. CNNs and LSTMs assume grids or sequences; GNNs handle arbitrary connectivity.

Core operation: message passing. Each node aggregates information from its neighbours, updates its own
representation, and repeats for k layers.

h_v^(k) = UPDATE(h_v^(k-1), AGGREGATE({h_u^(k-1) : u ∈ N(v)}))
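
A minimal sketch of one message-passing round on a toy graph. It assumes AGGREGATE is the mean of neighbour features and UPDATE is a plain average of a node's own features with that mean; real GNN layers use learned weights and non-linearities instead.

```python
def message_passing_layer(features, neighbours):
    """One round: aggregate neighbour features by mean, then average with own features."""
    new = {}
    for v, h_v in features.items():
        nbrs = neighbours[v]
        if nbrs:
            agg = [sum(features[u][d] for u in nbrs) / len(nbrs) for d in range(len(h_v))]
        else:
            agg = [0.0] * len(h_v)  # isolated node: nothing to aggregate
        new[v] = [0.5 * (own + a) for own, a in zip(h_v, agg)]
    return new

# Toy path graph a - b - c with 1-dimensional features; only node a starts non-zero.
neighbours = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
h0 = {"a": [1.0], "b": [0.0], "c": [0.0]}
h1 = message_passing_layer(h0, neighbours)
h2 = message_passing_layer(h1, neighbours)
```

After one layer node c still sees nothing from a; after two layers a's signal has reached it. This is why k layers give each node a k-hop receptive field.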

GNN Variant                   Key Idea
GCN (Graph Conv Net)          Spectral graph convolution. Approximated as mean of neighbour features.
GraphSAGE                     Samples fixed neighbourhood, concatenates own + aggregated features.
GAT (Graph Attention Net)     Attention weights on edges: different neighbours contribute differently.
GIN (Graph Isomorphism Net)   Provably as powerful as the Weisfeiler-Leman graph isomorphism test.

Applications: Drug discovery (molecules = graphs), social network analysis, recommendation systems,
knowledge graphs, traffic prediction, protein structure prediction (AlphaFold uses attention similar to GNNs).

14.6 — ResNets & Skip Connections


One-line summary: Add shortcut connections that skip one or more layers. Allows training networks 100+ layers
deep by giving gradients a highway back to early layers.

Without skip connections, very deep networks suffer from degradation — adding more layers makes accuracy
worse, even on training data (not overfitting — the network simply can't learn the identity mapping through many
non-linear layers).

Residual block:

F(x) = H(x) − x → Output = F(x) + x (skip connection)

The network only needs to learn the residual F(x) = H(x) − x. If the optimal mapping is near-identity, learning F(x) ≈
0 is much easier than learning H(x) ≈ x through non-linear layers.

Why gradients flow: The skip connection adds the input directly to the output, so ∂L/∂x = ∂L/∂output · (1 + ∂F/∂x).
The +1 term gives the gradient a direct path that does not shrink with depth, so it is far less likely to vanish, even
across 100+ layers.
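
A scalar toy calculation shows the effect of the +1 term. It assumes every layer has the same small local derivative (0.01), which is a deliberate simplification; real networks have Jacobian matrices rather than scalars.

```python
def backprop_factor(local_derivs, residual):
    """Product of per-layer gradient factors: d for a plain layer, (1 + d) for a residual block."""
    factor = 1.0
    for d in local_derivs:
        factor *= (1.0 + d) if residual else d
    return factor

derivs = [0.01] * 50  # 50 layers, each with local derivative 0.01 (assumed)
plain = backprop_factor(derivs, residual=False)  # 0.01 ** 50, vanishingly small
resid = backprop_factor(derivs, residual=True)   # 1.01 ** 50, stays of order 1
```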

ResNet variants: ResNet-18/34/50/101/152 (original), Wide ResNets (more channels), ResNeXt (grouped
convolutions), DenseNet (connect every layer to all subsequent layers).

14.7 — U-Net
One-line summary: Encoder-decoder CNN with skip connections between encoder and decoder at each
resolution level. Designed for image segmentation — outputs a pixel-wise label map.

The U-shape comes from: the encoder (left side) downsamples the image through conv+pool layers, capturing
semantic content. The decoder (right side) upsamples back to the original resolution. Skip connections
concatenate encoder feature maps to the corresponding decoder level, preserving fine spatial detail that
downsampling would otherwise lose.

Application                    Why U-Net
Medical image segmentation     Segment tumours and organs in MRI/CT scans.
Satellite image segmentation   Identify buildings, roads, water bodies pixel by pixel.
Diffusion model denoiser       The noise-prediction network in DDPM/Stable Diffusion is a U-Net.
Cell segmentation              Original use case (Ronneberger et al., 2015): segment and track individual cells in microscopy images.


Section 15

Modern Techniques
RLHF · LoRA · Quantization · Embeddings · Contrastive Learning

15.1 — RLHF: How ChatGPT Was Fine-Tuned


One-line summary: RLHF (Reinforcement Learning from Human Feedback) aligns language models with human
preferences by training a reward model on human comparisons, then using PPO to optimise the LLM against that
reward.

The three-stage pipeline:

Stage 1 — Supervised Fine-Tuning (SFT): Start with a pretrained LLM (e.g. GPT-3.5). Fine-tune it on a curated
dataset of high-quality prompt-response pairs written by human trainers. This teaches the model the desired format
and basic helpfulness.

Stage 2 — Reward Model Training: Show human labellers pairs of model responses to the same prompt. They
rank which response is better. Train a separate reward model (another LLM with a scalar output head) to predict
these human preferences.

Stage 3 — PPO Fine-Tuning: Use the reward model as a proxy for human preference. Fine-tune the SFT model
with PPO (Proximal Policy Optimisation) to maximise reward. A KL divergence penalty against the SFT model
keeps the policy from drifting too far, which limits reward hacking and mode collapse.
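
The two quantities at the heart of stages 2 and 3 can be sketched as below. The pairwise loss is the standard Bradley-Terry style objective for preference data; the KL coefficient of 0.1 and the use of a single-sample log-probability difference as the KL estimate are assumptions for illustration.

```python
import math

def reward_model_loss(r_chosen, r_rejected):
    """Stage 2 pairwise loss: -log sigmoid(r_chosen - r_rejected)."""
    return math.log1p(math.exp(-(r_chosen - r_rejected)))

def kl_penalised_reward(reward, logp_policy, logp_sft, kl_coef=0.1):
    """Stage 3 reward: reward-model score minus a penalty for drifting from the SFT model."""
    return reward - kl_coef * (logp_policy - logp_sft)

# The loss is small when the reward model already ranks the preferred response higher.
loss_correct = reward_model_loss(2.0, 0.0)
loss_wrong = reward_model_loss(0.0, 2.0)
```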

Why it works: Pretraining teaches knowledge. SFT teaches format. RLHF teaches values — what humans
actually prefer (helpful, harmless, honest). The reward model captures nuanced human judgements that are hard
to specify as a simple loss function.

Variants: DPO (Direct Preference Optimisation) — skips the reward model entirely, directly optimises preferences
using a closed-form objective. Simpler and often works as well.
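
A sketch of the DPO loss for a single preference pair. The arguments are assumed to be summed log-probabilities of the preferred (w) and rejected (l) responses under the policy and under the frozen reference model; β = 0.1 is an illustrative value.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigmoid( beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)] )."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return math.log1p(math.exp(-margin))

# If the policy equals the reference, the margin is zero and the loss is log 2.
loss_at_reference = dpo_loss(-5.0, -6.0, -5.0, -6.0)
# Raising the preferred response's log-probability relative to the reference lowers the loss.
loss_improved = dpo_loss(-3.0, -6.0, -5.0, -6.0)
```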

15.2 — LoRA: Low-Rank Adaptation


One-line summary: Fine-tune LLMs by adding small trainable rank-decomposition matrices to frozen pretrained
weights. Up to 10,000× fewer trainable parameters (the figure reported for GPT-3 175B), similar quality.

Full fine-tuning of a 7B-parameter model requires storing gradients and optimiser state for all 7B parameters, which is
expensive in memory and compute. LoRA builds on the hypothesis that weight updates during fine-tuning have low
intrinsic rank. So instead of updating W directly:

W' = W + ∆W = W + A·B where A ∈ R^(d×r), B ∈ R^(r×k), r << min(d,k)

W is frozen. Only A and B are trained. With rank r=8 for a 4096×4096 weight matrix: full params = 16.7M, LoRA
params = 65K — a 256× reduction.
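
The arithmetic can be checked in a few lines. The helper name lora_param_counts is illustrative; the shapes follow the definitions of A and B given above.

```python
def lora_param_counts(d, k, r):
    """Return (full parameter count, LoRA parameter count, reduction factor) for a d x k matrix."""
    full = d * k          # every entry of W
    lora = d * r + r * k  # A is d x r, B is r x k
    return full, lora, full / lora

full, lora, ratio = lora_param_counts(4096, 4096, 8)
```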

Technique Description

LoRA Original. Add low-rank matrices to attention weight matrices (Q, K, V, O projections).

QLoRA LoRA on a 4-bit quantised model. Fine-tune 65B models on a single 48GB GPU.

LoRA+ Different learning rates for A and B matrices. Small improvement.

DoRA Decomposes weight into magnitude and direction. More expressive than LoRA.

Adapter layers Insert small trainable bottleneck layers between frozen transformer blocks. Older alterna
Practical tip: Apply LoRA to Q and V projections in attention. r=8 or r=16 covers most use cases. α (scaling factor)
typically set to 2×r.

15.3 — Quantisation
One-line summary: Represent model weights (and optionally activations) with fewer bits. A 7B FP32 model takes
28GB. The same model in 4-bit takes ~4GB — fits on a laptop GPU.

Format                  Bits per weight / Size (7B model)
FP32 (full precision)   32 bits / ~28 GB. Training standard.
FP16 / BF16             16 bits / ~14 GB. Inference default on modern GPUs.
INT8                    8 bits / ~7 GB. Minimal quality loss with post-training quantisation.
INT4 / NF4              4 bits / ~4 GB. Used in QLoRA, bitsandbytes. Some quality loss.
GGUF (llama.cpp)        Mixed precision (2–8 bit per layer). CPU inference. Used by Ollama.
Two main approaches:

• Post-Training Quantisation (PTQ): Quantise after training, using a small calibration dataset. Fast, no
retraining. Tools: bitsandbytes, GPTQ, AWQ.
• Quantisation-Aware Training (QAT): Simulate quantisation during training. Better quality but requires
retraining. Used for edge deployment (mobile, embedded).
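
A minimal sketch of the size arithmetic and of the simplest quantisation scheme, symmetric absmax INT8. This is a teaching example under stated assumptions (one scale per tensor, no calibration data); it is not how bitsandbytes, GPTQ, or AWQ work internally.

```python
def model_size_gb(n_params, bits):
    """Raw weight storage in GB (1 GB = 1e9 bytes), ignoring scales and other overhead."""
    return n_params * bits / 8 / 1e9

def quantise_absmax_int8(weights):
    """Map floats to integers in [-127, 127] using a single per-tensor scale."""
    scale = (max(abs(w) for w in weights) / 127.0) or 1.0  # avoid a zero scale
    return [round(w / scale) for w in weights], scale

def dequantise(q, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [v * scale for v in q]

weights = [0.5, -1.0, 0.25]  # toy weights, assumed for the demo
q, scale = quantise_absmax_int8(weights)
restored = dequantise(q, scale)
size_fp32 = model_size_gb(7e9, 32)
size_int4 = model_size_gb(7e9, 4)
```

The raw 4-bit figure is 3.5 GB; per-block scales and any layers kept in higher precision push the practical size toward the ~4 GB quoted above. The round-trip error of each weight is at most half the scale.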

15.4 — Embeddings
One-line summary: Dense vector representations of discrete objects (words, sentences, images, users) where
semantic similarity = geometric proximity in vector space.
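
"Geometric proximity" is usually measured with cosine similarity. The sketch below uses made-up 3-dimensional vectors purely for illustration; real embeddings have hundreds or thousands of dimensions and come from a trained model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings (not produced by any real model).
toy = {"cat": [0.9, 0.1, 0.0], "kitten": [0.8, 0.2, 0.1], "car": [0.0, 0.1, 0.9]}
sim_cat_kitten = cosine_similarity(toy["cat"], toy["kitten"])
sim_cat_car = cosine_similarity(toy["cat"], toy["car"])
```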

Word embeddings:

• Word2Vec (2013): Predict a word from its context (CBOW) or context from word (Skip-gram). Typically produces
300-dim vectors. Famous result: king − man + woman ≈ queen.
• GloVe: Global Vectors. Co-occurrence matrix factorisation. Similar to Word2Vec, different training objective.
• Contextual embeddings (BERT, GPT): The same word gets different vectors in different contexts. 'Bank'
near 'river' ≠ 'bank' near 'money'. Much more powerful.
Sentence / document embeddings:

• Sentence-BERT: Fine-tuned BERT for semantic similarity. Pooled token outputs (mean pooling by default, CLS token as an option) = sentence vector.
• OpenAI text-embedding-3: 1536-dim (small) or 3072-dim (large) embeddings. Strong retrieval performance.
Applications:

• Semantic search / RAG (Retrieval-Augmented Generation)


• Recommendation systems (embed users and items in shared space)
• Clustering similar documents without labels
• Anomaly detection (outliers are far from cluster centres)
• Cross-modal retrieval: CLIP embeds images and text in the same space

15.5 — Contrastive Learning & CLIP


One-line summary: Train representations by pulling similar pairs close and pushing dissimilar pairs apart in
embedding space — without explicit labels.
SimCLR / Self-supervised contrastive learning:

1. Take an image. Create two augmented views (crop, flip, colour jitter).
2. Encode both views through the same network. They should be similar.
3. All other images in the batch are 'negatives' — push their embeddings apart.
NT-Xent loss:

L_{i,j} = −log[ exp(sim(z_i, z_j)/τ) / Σ_{k≠i} exp(sim(z_i, z_k)/τ) ]
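
A plain-Python sketch of the NT-Xent loss for one positive pair (i, j): the negative log of a softmax over the temperature-scaled similarities between z_i and every other embedding in the batch. It assumes sim is cosine similarity; τ = 0.5 and the toy embeddings are illustrative.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def nt_xent(z, i, j, tau=0.5):
    """Loss for positive pair (i, j); the denominator sums over all k != i."""
    logits = {k: cosine(z[i], z[k]) / tau for k in range(len(z)) if k != i}
    log_denominator = math.log(sum(math.exp(v) for v in logits.values()))
    return log_denominator - logits[j]

# Toy batch: embeddings 0 and 1 are two views of one image, 2 and 3 of another.
z = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
loss_true_pair = nt_xent(z, 0, 1)
loss_wrong_pair = nt_xent(z, 0, 2)
```

The loss is low when the positive pair is the most similar entry in the batch and high otherwise, which is what pulls views of the same image together.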

CLIP (Contrastive Language-Image Pretraining — OpenAI 2021):

CLIP is trained on 400M image-text pairs from the internet. The objective: make the embedding of an image and its
caption similar; make all other image-text combinations dissimilar. This is contrastive learning across modalities.

Why CLIP is powerful:

• Zero-shot classification: compute similarity between image and text labels like 'a photo of a cat'.
• Cross-modal retrieval: search images with text queries.
• Foundation for text-to-image (Stable Diffusion uses CLIP to encode text prompts).
Section 16

Theory
Universal Approximation · Initialization · Gradient Flow Analysis

16.1 — Universal Approximation Theorem (UAT)


Statement: A feedforward neural network with a single hidden layer containing a sufficient number of neurons can
approximate any continuous function on a compact subset of R^n to arbitrary precision — given a non-polynomial
activation function.

What it means in practice:

• The capacity exists: Even shallow networks are theoretically expressive enough. The question is how many
neurons that requires — potentially exponentially many.
• Depth helps: Deeper networks can represent the same functions exponentially more efficiently than shallow
ones (depth-efficiency theorems).
• Learnability ≠ Approximation: UAT says the function exists, not that gradient descent will find it. Training
dynamics, initialisation, and optimisation matter enormously in practice.
• Generalisation not guaranteed: A network that memorises its training data fits the training set perfectly, yet
UAT says nothing about generalisation to unseen data.
Extended UAT (Hornik, 1991): Any continuous, non-constant, bounded, and monotone activation function allows
universal approximation. This covers sigmoid and tanh. ReLU is unbounded, so it requires a slightly different
formulation (Sonoda & Murata, 2017).

16.2 — Weight Initialisation Strategies


If all weights start at zero, all neurons in a layer compute identical outputs and receive identical gradients, so symmetry
is never broken and the network never learns. If weights are too large, activations explode. If they are too small,
activations and gradients shrink toward zero before reaching early layers.

Initialisation Formula & When to Use

Random (Gaussian) W ~ N(0, 0.01). Only works for very shallow networks. Vanishes in deep nets.

Xavier / Glorot W ~ U(−√(6/(n_in+n_out)), +√(6/(n_in+n_out))). Designed for tanh/sigmoid. Keeps varian

He / Kaiming W ~ N(0, 2/n_in), i.e. std √(2/n_in). Designed for ReLU. Accounts for the fact that ReLU kills ~half of ac

Orthogonal Init W initialised as random orthogonal matrix. Good for RNNs — preserves gradient norms

Pre-trained weights Transfer learning. Fine-tune from ImageNet weights. Best for limited data.
The core principle:

Good initialisation keeps the variance of activations and gradients approximately constant across layers. This is
derived by ensuring Var(output) = Var(input) for each layer, which gives the constraints for Xavier and He
initialisation.
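
The two formulas reduce to a single number per layer, as the sketch below shows. The function names are illustrative; the variance of a uniform distribution on (−a, a) is a²/3, which is how the Xavier limit translates into a variance of 2/(n_in + n_out).

```python
import math

def xavier_uniform_limit(n_in, n_out):
    """Bound a for W ~ U(-a, a) under Xavier/Glorot initialisation."""
    return math.sqrt(6.0 / (n_in + n_out))

def he_normal_std(n_in):
    """Standard deviation for He/Kaiming normal initialisation (variance 2 / n_in)."""
    return math.sqrt(2.0 / n_in)

a = xavier_uniform_limit(256, 256)
xavier_variance = a * a / 3.0  # variance of U(-a, a)
he_std = he_normal_std(512)
```

Compared with a Xavier-style variance of 1/n_in, He initialisation doubles the variance to compensate for ReLU zeroing roughly half of the activations.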

16.3 — Gradient Flow Analysis


Gradient flow describes how gradients move through the network during backpropagation. Diagnosing gradient
problems is an essential practical skill.

The two failure modes:


Problem Cause & Symptoms & Fix

Vanishing Gradients Cause: Multiplying many values < 1 (sigmoid/tanh saturation, deep networks). Symptom

Exploding Gradients Cause: Multiplying many values > 1 (large weights, poorly initialised RNNs). Symptom: L
Gradient clipping (for RNNs):

if ||g|| > threshold: g ← g · (threshold / ||g||)
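
The clipping rule above translates directly into code. This sketch treats the gradient as a flat list of floats, which is an assumption for illustration; frameworks apply the same rescaling across all parameter tensors.

```python
import math

def clip_by_norm(grad, threshold):
    """Rescale grad so its L2 norm is at most threshold; leave it unchanged otherwise."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > threshold:
        scale = threshold / norm
        return [g * scale for g in grad]
    return list(grad)

clipped = clip_by_norm([3.0, 4.0], threshold=1.0)    # norm 5 is rescaled to norm 1
untouched = clip_by_norm([0.3, 0.4], threshold=1.0)  # norm 0.5 is below the threshold
```

Clipping by norm preserves the direction of the gradient and only shortens the step, unlike clipping each component independently.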

How to diagnose gradient problems in practice:

• Log the L2 norm of gradients per layer during training (in PyTorch, for example, by calling .norm() on each parameter's .grad tensor).
• Plot gradient norms vs layer depth. Should be roughly constant.
• As a rough rule of thumb: if early layers have norms ~0 → vanishing; if any layer has norms > 100 → exploding.
• Use forward hooks to track activations and backward hooks to track gradients in PyTorch.
Batch normalisation's role in gradient flow:

By normalising activations at each layer, BatchNorm keeps them in a well-scaled range, which helps stop gradients
from exploding or vanishing. It was originally motivated as a fix for internal covariate shift, but Santurkar et al. (2018)
argue that its main effect is to make the loss landscape smoother, which is why BatchNorm allows higher learning rates.

Deep Learning Notes — Supplement · GestureControlledGame Project · IITB · 2026
