Deep Learning Notes Complete
These notes consolidate everything across multiple sources on modern neural network architectures —
what they are, how they work, when to use them, and how they compare. Written for active recall and
interview prep.
1. Foundations — What Is a Neural Network?
A neural network is a computational model loosely inspired by how the human brain processes
information. More precisely, it is a function approximator — a mathematical system that learns a
mapping from inputs to outputs by adjusting millions of internal parameters (weights) based on data.
Layer | What it does | Example
Input Layer | Receives raw data: images, numbers, text tokens, coordinates | 21 landmark (x,y) pairs → 42 numbers
Hidden Layers | Where learning happens. Each layer transforms the previous layer's output into increasingly abstract representations | LSTM layers, convolutional layers
Output Layer | Produces the final answer: a class score, probability distribution, or continuous value | 6 gesture class scores via softmax
Dense networks | Sparse networks
Every neuron connects to every neuron in the next layer. All parameters activate for every input. Like turning on every light in a building. | Only a subset of neurons activate for any given input. Like turning on lights only in rooms you're using.
High energy consumption. Thorough but slower. | Much more energy efficient. Scales to massive sizes.
Example: standard fully connected layers | Example: Mixtral 8x7B has 47B total parameters but only about 13B active per token, roughly 3.6x less compute than a dense model of the same size. GPT-4 is suspected to be sparse as well.
2. Convolutional Neural Networks (CNN)
One-line summary: CNN = Image Expert. Specialized filters that scan spatial data looking for patterns.
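The key insight: a cat's ear looks the same whether it's in the top-left or bottom-right of the image. CNNs
exploit this by sharing filter weights across the entire image, so a pattern is detected wherever it appears.
Strictly, the convolution is translation-equivariant (shift the input and the feature map shifts with it);
pooling then adds approximate translation invariance.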
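A minimal NumPy sketch of weight sharing, assuming a single hand-rolled 3×3 filter and a toy 8×8 image. The helper name `conv2d_valid` and all sizes are illustrative and not taken from these notes. It shows that moving the pattern moves the filter response by the same amount:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide one shared kernel over the image (cross-correlation, no padding)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

rng = np.random.default_rng(0)
kernel = rng.normal(size=(3, 3))           # one filter, reused at every position

image = np.zeros((8, 8))
image[2:4, 2:4] = 1.0                      # a small "pattern" near the top-left
shifted = np.roll(image, shift=(1, 2), axis=(0, 1))  # same pattern, moved down 1 and right 2

response = conv2d_valid(image, kernel)
response_shifted = conv2d_valid(shifted, kernel)
```

Because the same kernel is applied at every position, the response map for the shifted image is the shifted response map.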
Real-World Applications
• Face recognition in your phone's unlock system
• Instagram/Snapchat filters — real-time face landmark detection
• Medical imaging: X-ray analysis, tumour detection, MRI interpretation
• Self-driving cars: object detection, lane recognition, traffic sign classification
• Google Photos: automatic scene and object tagging
• Quality control in manufacturing: defect detection on production lines
Limitations
• Poor at sequential/temporal data — no memory of previous inputs
• Requires large labelled datasets to train effectively
• Computationally expensive for very high-resolution images
3. Recurrent Neural Networks (RNN)
One-line summary: RNN = Sequence & Memory Network. Processes data step by step while
maintaining a hidden state that carries information forward.
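Feed-forward networks and CNNs have no memory of previous inputs, so they cannot model order in a sequence.
RNNs solve this by maintaining a hidden state: a vector that gets updated at each timestep and carries
information from previous steps into the current one.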
h_t = tanh(W_h · h_{t-1} + W_x · x_t + b)
The same weights (W_h, W_x, b) are shared across all timesteps. This is what makes it 'recurrent': the
same function is applied repeatedly with its own output fed back in.
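The recurrence can be sketched in a few lines of NumPy. The sizes, random weights and the function name `rnn_forward` are assumptions chosen for illustration:

```python
import numpy as np

def rnn_forward(xs, W_h, W_x, b):
    """Apply the same tanh update at every timestep and return all hidden states."""
    h = np.zeros(W_h.shape[0])             # initial hidden state
    states = []
    for x_t in xs:                         # strictly sequential: step t needs step t-1
        h = np.tanh(W_h @ h + W_x @ x_t + b)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
hidden_size, input_size, T = 4, 3, 5
W_h = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
W_x = rng.normal(scale=0.5, size=(hidden_size, input_size))
b = np.zeros(hidden_size)
xs = rng.normal(size=(T, input_size))

states = rnn_forward(xs, W_h, W_x, b)      # shape (T, hidden_size)
```

The Python loop over timesteps is the reason training cannot be parallelised across the sequence.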
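During backpropagation through time, the gradient is multiplied by a similar factor at every timestep. When
that factor is below 1 the gradient shrinks exponentially: after roughly 20-30 timesteps it is effectively
zero and the network can no longer learn dependencies from that far back. This is called the vanishing
gradient problem. For long sequences like a paragraph of text or a long piece of music, vanilla RNNs simply
forget the beginning by the time they reach the end.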
Real-World Applications
• Autocomplete and text prediction (early versions)
• Stock price prediction over short windows
• Early speech recognition systems
• Weather forecasting over short time horizons
Limitations
• Vanishing gradient — forgets long-range dependencies
• Sequential processing prevents parallelization — slow to train
• Struggles with sequences longer than ~30 steps
• Can be unstable during training
4. Long Short-Term Memory (LSTM)
One-line summary: LSTM = Long-Term Memory Expert. An advanced RNN that uses gating
mechanisms to selectively remember and forget, solving the vanishing gradient problem.
Gate | Symbol | What it does | Example
Forget Gate | f_t | How much of the previous cell state to keep. Output 0 = forget everything, 1 = keep everything. | Reading about a new character in a story: forget the old character's gender/name.
Input Gate | i_t | Which new information to write into the cell state. | The new character's name and role gets stored.
Output Gate | o_t | Which part of the cell state to expose as the hidden state output. | Only output what's relevant for predicting the next word.
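A sketch of one LSTM step with the three gates, assuming the standard formulation in which all gates read the concatenation of the previous hidden state and the current input. The stacked weight layout, the tiny sizes and the saturated biases are assumptions made purely to make the gate behaviour visible:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """W has shape (4*hidden, hidden+input); rows ordered [forget, input, output, candidate]."""
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f_t = sigmoid(z[0:hidden])                 # forget gate: how much of c_prev to keep
    i_t = sigmoid(z[hidden:2 * hidden])        # input gate: how much new content to write
    o_t = sigmoid(z[2 * hidden:3 * hidden])    # output gate: how much of the cell to expose
    g_t = np.tanh(z[3 * hidden:])              # candidate content
    c_t = f_t * c_prev + i_t * g_t             # additive cell-state update
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

hidden, n_in = 3, 2
W = np.zeros((4 * hidden, hidden + n_in))
b = np.zeros(4 * hidden)
b[0:hidden] = 20.0               # forget gate saturated near 1 (keep everything)
b[hidden:2 * hidden] = -20.0     # input gate saturated near 0 (write nothing)
b[3 * hidden:] = 1.0             # candidate is non-zero, but the input gate blocks it

c_prev = np.array([0.5, -1.0, 2.0])
h_t, c_t = lstm_step(np.ones(n_in), np.zeros(hidden), c_prev, W, b)
```

With the forget gate open and the input gate closed, the cell state passes through the step essentially unchanged.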
We got 97.9% validation accuracy on 1,205 samples with a 2-layer LSTM (hidden_size=128,
dropout=0.3) trained for 50 epochs.
Real-World Applications
• Netflix automatic subtitle generation
• Siri and Alexa early speech recognition layers
• Algorithmic trading systems — market movement prediction
• Machine translation (pre-transformer era)
• Sentiment analysis on long text
• Options pricing models (your project — LSTM on historical option contract data)
Limitations
• More complex than vanilla RNNs — slower to train
• Still sequential — can't be fully parallelized
• Struggles with very long sequences (1000+ steps) — transformers do better here
• Higher memory consumption due to cell state + hidden state
5. Gated Recurrent Unit (GRU)
One-line summary: GRU = Simplified LSTM. Combines the forget and input gates into a single update
gate — fewer parameters, faster training, often similar performance.
• Update Gate: Decides how much of the past information to keep vs how much of the new information
to write. Combines LSTM's forget + input gates.
• Reset Gate: Decides how much of the past hidden state to expose when computing the new
candidate hidden state. Controls how much past context influences the new update.
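The two gates can be sketched as follows. Conventions differ between papers and libraries on whether the update gate multiplies the old state or the candidate; this sketch assumes h_t = (1 − z_t)·h_{t-1} + z_t·h̃_t, and all weights and sizes are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_z, W_r, W_h, b_z, b_r, b_h):
    hx = np.concatenate([h_prev, x_t])
    z_t = sigmoid(W_z @ hx + b_z)              # update gate: old state vs new candidate
    r_t = sigmoid(W_r @ hx + b_r)              # reset gate: how much past feeds the candidate
    h_cand = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]) + b_h)
    return (1.0 - z_t) * h_prev + z_t * h_cand  # single state vector, no separate cell state

rng = np.random.default_rng(0)
hidden, n_in = 3, 2
W_z, W_r, W_h = (rng.normal(scale=0.1, size=(hidden, hidden + n_in)) for _ in range(3))
b_r = np.zeros(hidden)
b_h = np.zeros(hidden)
h_prev = np.array([0.3, -0.7, 0.9])
x_t = rng.normal(size=n_in)

# Update gate forced near 0: the old hidden state is carried over.
h_keep = gru_step(x_t, h_prev, W_z, W_r, W_h, np.full(hidden, -20.0), b_r, b_h)
# Update gate forced near 1: the hidden state is replaced by the candidate.
h_write = gru_step(x_t, h_prev, W_z, W_r, W_h, np.full(hidden, 20.0), b_r, b_h)
```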
Choose LSTM when | Choose GRU when
Complex, long sequences where accuracy matters most | Computational efficiency matters, or dataset is smaller
You have enough compute and training time | You want faster iteration / experimentation
Tasks with very long-range dependencies | Similar performance to LSTM is acceptable with fewer parameters
6. Transformers
One-line summary: Transformers = Attention Is All You Need. Processes entire sequences
simultaneously using self-attention instead of sequential hidden states. Powers ChatGPT, Claude,
BERT, GPT-4.
Vaswani et al. (Google Brain, 2017) threw out the recurrence entirely and replaced it with self-attention —
a mechanism that allows every position in the sequence to directly attend to every other position
simultaneously, regardless of distance.
The attention output is computed as: softmax(Q · K^T / √d_k) · V. The softmax term holds the attention scores
between every pair of tokens. The result: each token gets a weighted sum of all tokens' values, weighted by
how relevant they are.
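A minimal single-head sketch of scaled dot-product attention in NumPy. The random Q, K, V matrices stand in for learned projections of six token embeddings and are assumptions for illustration only:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)    # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (n, n): one score per pair of tokens
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ V, weights                # weighted sum of value vectors

rng = np.random.default_rng(0)
n_tokens, d_k, d_v = 6, 8, 8                   # e.g. "The cat sat on the mat"
Q = rng.normal(size=(n_tokens, d_k))
K = rng.normal(size=(n_tokens, d_k))
V = rng.normal(size=(n_tokens, d_v))

output, weights = scaled_dot_product_attention(Q, K, V)
```

The `weights` matrix has n × n entries, which is where the quadratic cost in sequence length comes from.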
Example: In 'The cat sat on the mat', when processing 'sat', self-attention simultaneously looks at 'cat'
(who's sitting?) and 'mat' (sitting on what?) — not step by step, but all at once.
Multi-Head Attention
Instead of one set of Q/K/V projections, transformers use multiple 'heads' in parallel — each learning to
attend to different types of relationships simultaneously. One head might learn syntactic structure, another
semantic similarity, another co-reference. Their outputs are concatenated and projected.
Positional Encoding
Since transformers process all positions in parallel, they have no inherent sense of order. Positional
encodings (sinusoidal functions of position) are added to the input embeddings to inject order information.
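A sketch of the sinusoidal encoding from the original transformer paper, where even dimensions use sine and odd dimensions use cosine with wavelengths set by 10000^(2i/d_model). The function name and the assumption that `d_model` is even are choices made for this example:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) matrix; assumes d_model is even."""
    positions = np.arange(seq_len)[:, None]            # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]           # even dimension indices 2i
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positional_encoding(seq_len=10, d_model=16)
# The encoding is added to the token embeddings: x = embeddings + pe
```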
Limitations
• Quadratic attention cost — self-attention is O(n²) in sequence length. 1000-token sequence = 1M
attention scores.
• Requires enormous compute and memory for training
• High-quality data dependency — more sensitive to data quality than LSTMs
• Overkill for simple tasks — don't use a transformer to classify 6 gestures from 1,205 samples
7. Mixture of Experts (MoE)
One-line summary: MoE = Team of Specialists. Multiple expert subnetworks + a router that decides
which 1-2 experts handle each input. Massive model capacity at fraction of compute cost.
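This kind of sparse routing is one suggested reason why GPT-4 (suspected, not confirmed, to use MoE) can be
simultaneously great at coding, creative writing, math, and science. As a rough estimate, a dense model of
equivalent quality might cost ~10x more to run.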
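A toy sketch of top-k routing. Real MoE layers use feed-forward networks as experts and add load-balancing terms; here each expert is a plain matrix, and the names `moe_forward` and `router_W` are hypothetical:

```python
import numpy as np

def moe_forward(x, router_W, experts, top_k=2):
    logits = router_W @ x                          # one routing score per expert
    chosen = np.argsort(logits)[-top_k:]           # indices of the top-k experts
    gate = np.exp(logits[chosen] - logits[chosen].max())
    gate = gate / gate.sum()                       # softmax over the chosen experts only
    # Only the chosen experts run; the rest stay dormant for this input.
    out = sum(g * (experts[e] @ x) for g, e in zip(gate, chosen))
    return out, chosen

rng = np.random.default_rng(0)
d_model, n_experts = 4, 8
router_W = rng.normal(size=(n_experts, d_model))
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]
x = rng.normal(size=d_model)

y, chosen = moe_forward(x, router_W, experts, top_k=2)
active_fraction = len(chosen) / n_experts          # 2 of 8 experts used for this input
```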
Full Pipeline
Stage What happens Architecture used File
Resume-Level Description
"Built a real-time gesture-controlled Snake game using MediaPipe hand tracking, a 2-layer LSTM
sequence classifier (97.9% val accuracy, 1,205 training samples, 6 gesture classes), and a DQN
reinforcement learning agent trained to play Snake autonomously. Pipeline: webcam → landmark
extraction → sliding-window LSTM inference → game control, running at 28 FPS on CPU."
11. Quick Reference — Interview Cheat Sheet
Question | Answer
What's the vanishing gradient problem? | During backprop through time, gradients are multiplied at each step. If <1, they shrink exponentially. By step 30+, gradient ≈ 0 and the network can't learn long-range dependencies. LSTM solves this with the cell state highway.
How does LSTM solve vanishing gradient? | The cell state flows through the network with only additive interactions (not multiplicative). Gradients flow back through addition, not multiplication, so they don't vanish. The forget gate controls how much to preserve.
CNN vs RNN in one sentence? | CNN finds spatial patterns in fixed-size grids (images). RNN finds temporal patterns in variable-length sequences (time series, text). CNN has no memory; RNN does.
Why transformers over LSTMs for NLP? | LSTMs are sequential and can't be parallelized. Transformers process all positions simultaneously via self-attention. This enables training on massive datasets in reasonable time. Also better at very long-range dependencies.
What is self-attention? | Each token computes Query, Key, Value vectors. Attention score = softmax(Q·K^T / √d_k). Output = weighted sum of all Values. Every token directly attends to every other token: O(n²) but parallelizable.
LSTM vs GRU? | GRU merges cell state + hidden state and uses 2 gates instead of 3. Fewer parameters, faster training, often similar accuracy. Choose LSTM for accuracy-critical tasks; GRU for efficiency.
What is MoE? | Multiple specialist networks + a router. For each input, router selects 1-2 experts. Others stay dormant. Enables massive model capacity (47B params) at fraction of compute cost (13B active). Powers GPT-4 (suspected), Mixtral.
Why LSTM not CNN for gestures? | Our input is 21 (x,y) coordinate pairs, not pixel grids. CNN convolves over spatial structure that doesn't exist here. LSTM is the right tool: it learns temporal trajectory patterns from the sequence of coordinate frames.
These sections extend the original notes with the five missing pillars of deep learning: mathematical
fundamentals, training methodology, additional architectures, modern fine-tuning techniques, and theoretical
grounding. Same format — written for active recall and interview prep.
Section 12
Fundamentals
Backpropagation · Loss Functions · Optimizers · Activation Functions
12.1 — Backpropagation
Backpropagation is the algorithm that makes neural networks learn. It computes the gradient of the loss with
respect to every weight in the network by applying the chain rule of calculus layer by layer, from output back to
input.
Computational graphs: Modern frameworks (PyTorch, TensorFlow) build a dynamic graph of operations during
the forward pass, then traverse it in reverse during backprop. This is why PyTorch's autograd just works — every
operation is recorded.
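To make the chain rule concrete, here is a hand-written backward pass for a tiny two-layer network, checked against a finite-difference estimate. The architecture (tanh hidden layer, squared-error loss) and all sizes are assumptions chosen for brevity; a framework's autograd would produce the same gradients automatically:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)             # input
t = 0.7                            # target
W1 = rng.normal(size=(4, 3))       # hidden-layer weights
w2 = rng.normal(size=4)            # output-layer weights

def forward(W1, w2, x, t):
    a1 = np.tanh(W1 @ x)           # hidden activations
    y = w2 @ a1                    # scalar prediction
    loss = 0.5 * (y - t) ** 2
    return loss, a1, y

# Forward pass, then chain rule from the loss back to each weight.
loss, a1, y = forward(W1, w2, x, t)
dy = y - t                         # dL/dy
dw2 = dy * a1                      # dL/dw2
da1 = dy * w2                      # dL/da1
dz1 = da1 * (1.0 - a1 ** 2)        # tanh'(z) = 1 - tanh(z)^2
dW1 = np.outer(dz1, x)             # dL/dW1

# Finite-difference check on a single weight.
eps = 1e-5
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += eps
W1_minus[0, 0] -= eps
numeric = (forward(W1_plus, w2, x, t)[0] - forward(W1_minus, w2, x, t)[0]) / (2 * eps)
```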
Mean Squared Error (MSE) | Regression. L = (1/n)·Σ(y − ŷ)². Penalises large errors heavily.
Categorical Cross-Entropy | Multi-class. Standard loss for CNNs, LSTMs with softmax.
Huber Loss | Regression robust to outliers. L1 for large errors, L2 for small.
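The three losses can be written directly in NumPy. This is a sketch under common conventions: the cross-entropy takes softmax probabilities and an integer class label, and the Huber threshold `delta=1.0` is an assumed default:

```python
import numpy as np

def mse(y, y_hat):
    return float(np.mean((y - y_hat) ** 2))

def cross_entropy(probs, label):
    """Negative log-probability assigned to the true class."""
    return float(-np.log(probs[label]))

def huber(y, y_hat, delta=1.0):
    r = np.abs(y - y_hat)
    quadratic = 0.5 * r ** 2                   # L2-like for small errors
    linear = delta * (r - 0.5 * delta)         # L1-like for large errors
    return float(np.mean(np.where(r <= delta, quadratic, linear)))

mse_value = mse(np.array([1.0, 2.0]), np.array([1.0, 4.0]))
ce_value = cross_entropy(np.array([0.25, 0.75]), label=1)
huber_small = huber(np.array([0.0]), np.array([0.5]))
huber_large = huber(np.array([0.0]), np.array([3.0]))
```

For an error of 3, the squared-error term would be 4.5 per sample (using the ½r² form), while Huber gives 2.5: large residuals are penalised linearly.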
12.3 — Optimizers
The optimizer decides how to update weights using the computed gradients. Vanilla gradient descent uses a fixed
step in gradient direction — but real training is more nuanced.
SGD (Stochastic GD) | Update per mini-batch, not full dataset. Fast but noisy. W ← W − η·∇L
Adam (Adaptive Moment) | Combines momentum + RMSProp. Keeps running mean (m) and variance (v) of gradients
AdamW | Adam + decoupled weight decay (the decay is applied directly to the weights instead of being folded into the gradient as L2 regularisation). Preferred for transformers (G
Learning Rate Schedulers | Reduce η over training: StepLR (decay every k epochs), CosineAnnealing, OneCycleLR
Interview tip: Adam is the default starting point. Use AdamW for transformers. SGD + momentum often achieves
better final accuracy if you tune carefully (e.g. ResNet ImageNet training).
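A scalar sketch of the Adam update with bias correction, applied to the toy objective f(w) = (w − 3)². The objective, the learning rate of 0.1 and the step count are assumptions for illustration; the β and ε values are the commonly used defaults:

```python
import math

def adam_minimise(grad_fn, w0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    w, m, v = w0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g        # running mean of gradients (momentum)
        v = beta2 * v + (1 - beta2) * g * g    # running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)           # bias correction for the zero initialisation
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

# Gradient of f(w) = (w - 3)^2 is 2 * (w - 3); the minimum is at w = 3.
w_final = adam_minimise(lambda w: 2.0 * (w - 3.0), w0=0.0)
```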
Sigmoid | σ(x) = 1/(1+e^(−x)). Output: (0,1). Saturates → vanishing gradients. Use only in output for
Tanh | tanh(x) = (e^x − e^(−x))/(e^x + e^(−x)). Output: (−1,1). Still saturates but zero-centred. Used in
ReLU | max(0,x). Fast, sparse activation. Problem: 'dying ReLU', neurons stuck at 0 if input al
Leaky ReLU | max(0.01x, x). Fixes dying ReLU by allowing small negative gradient.
ELU | x if x>0 else α(e^x − 1). Smooth, negative saturation. Better than Leaky ReLU in some cases
GELU | x·Φ(x). Gaussian-gated. Used in BERT, GPT, transformers. Smoother than ReLU.
Softmax | e^(x_i)/Σ_j e^(x_j). Converts logits to probability distribution. Always use at output for multi-class
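For quick reference, the same activations as NumPy one-liners. The defaults (`slope=0.01`, `alpha=1.0`) are assumed values, and GELU uses the exact form x·Φ(x) via the error function rather than the tanh approximation found in some libraries:

```python
import math
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def gelu(x):
    # Phi(x) = 0.5 * (1 + erf(x / sqrt(2))) is the standard normal CDF.
    return x * 0.5 * (1.0 + np.vectorize(math.erf)(x / math.sqrt(2.0)))

def softmax(x):
    e = np.exp(x - np.max(x))                  # shift by the max for numerical stability
    return e / e.sum()

x = np.array([-1.0, 0.0, 2.0])
outputs = {f.__name__: f(x) for f in (sigmoid, np.tanh, relu, leaky_relu, elu, gelu, softmax)}
```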
Section 13
Training Concepts
Overfitting · Regularization · Data Splits · Augmentation · Hyperparameter Tuning
L2 / Weight Decay | Adds λ·||W||² to loss. Penalises large weights, keeps them small. Equivalent to Gaussian
L1 Regularisation | Adds λ·||W||₁ to loss. Promotes sparsity: many weights go exactly to 0. Useful for fea
Dropout | During training, randomly zero out neurons with probability p (typically 0.2–0.5). Forces r
Batch Normalisation | Normalise activations across the mini-batch: (x−µ)/σ, then scale/shift with learned γ,β. St
Layer Normalisation | Same idea but normalise across features instead of batch. Used in transformers (batch s
Early Stopping | Monitor validation loss. Stop training when val loss stops improving. Saves the checkpoint
Data Augmentation | Artificially expand training set (see 13.3). Prevents memorisation of specific examples.
Training set (60–80%) | Model sees this data and updates weights from it.
Validation set (10–20%) | Used to tune hyperparameters, choose architecture, do early stopping. Model never train
Test set (10–20%) | Evaluated once at the very end to report final performance. Never use for any decisions
Cross-validation: When data is scarce, use k-fold CV. Split data into k folds, train on k−1 folds, validate on the
remaining fold. Rotate k times and average results. 5-fold and 10-fold are most common. Gives more reliable
estimates but is k× slower.
Stratified splits: For imbalanced datasets, ensure each split has the same class distribution as the full dataset.
Use sklearn's StratifiedKFold.
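The fold rotation can be sketched without any library beyond NumPy. This is plain (unstratified) k-fold with a shuffled index array; the helper name `k_fold_indices` is hypothetical, and the sample count reuses the 1,205 figure from the gesture dataset only as an example:

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs; every sample is validated exactly once."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    folds = np.array_split(indices, k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

splits = list(k_fold_indices(n_samples=1205, k=5))
# Typical use: train one model per split, then average the k validation scores.
```

For imbalanced classes the split would additionally need to preserve class proportions within each fold, which this sketch does not do.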
Grid Search | Try every combination of a predefined parameter grid. Exhaustive but expensive. Only fe
Random Search | Sample random combinations. Empirically as good as grid search in fewer tries (Bergstr
Bayesian Optimisation | Build a probabilistic model of the loss landscape. Uses past results to choose next trials
Learning Rate Finder | Sweep LR from very low to very high over a few batches. Plot loss vs LR. Pick LR just b
Population-Based Training | Train many models in parallel. Periodically copy weights from better-performing models
Key hyperparameters to tune: learning rate (most important), batch size, network depth/width, dropout rate,
weight decay, number of epochs.
Section 14
Additional Architectures
Autoencoders · VAEs · GANs · Diffusion · GNNs · ResNets · U-Net
14.1 — Autoencoders
One-line summary: Compress input into a small representation (latent code), then reconstruct it. Learn to keep
only the essential information.
An autoencoder has two parts: an Encoder that maps input X → latent vector z (a bottleneck), and a Decoder that
maps z → X̂ (reconstruction). Trained with reconstruction loss (MSE or cross-entropy between X and X̂).
Vanilla Autoencoder | Deterministic encoder. Latent space has no structure, so you can't sample from it.
Denoising Autoencoder | Add noise to input, train to reconstruct the clean version. Forces robust representations.
Sparse Autoencoder | Add sparsity penalty to latent layer. Most neurons inactive. Learns disentangled features
Variational Autoencoder (VAE) | Encoder outputs µ and σ. Sample z ~ N(µ, σ²). Structured latent space → generative model
The key innovation: instead of encoding to a single point, the encoder outputs a mean µ and variance σ². During
training, z is sampled from N(µ, σ²) using the reparameterisation trick: z = µ + σ·ε where ε ~ N(0,1). This makes
gradients flow through the sampling operation.
The KL term regularises the latent space toward a standard normal. This is what makes it smooth and interpolable
— nearby points in latent space decode to similar outputs.
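A NumPy sketch of the two VAE-specific pieces: the reparameterised sample and the closed-form KL divergence between N(µ, σ²) and the standard normal. Parameterising the encoder output as log σ² (`log_var`) is a common convention assumed here, not something stated in these notes:

```python
import numpy as np

def reparameterise(mu, log_var, rng):
    eps = rng.standard_normal(mu.shape)        # noise is sampled outside the network
    sigma = np.exp(0.5 * log_var)
    return mu + sigma * eps                    # differentiable with respect to mu and sigma

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dimensions."""
    return float(-0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var)))

rng = np.random.default_rng(0)
mu = np.full(100_000, 2.0)
log_var = np.full(100_000, np.log(0.25))       # sigma = 0.5
z = reparameterise(mu, log_var, rng)

kl_zero = kl_to_standard_normal(np.zeros(2), np.zeros(2))
kl_shifted = kl_to_standard_normal(np.array([1.0, 0.0]), np.zeros(2))
```

The KL term is zero exactly when the encoder outputs µ = 0 and σ = 1, and grows as the posterior moves away from the prior.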
Applications: Image generation, anomaly detection (high reconstruction error = anomaly), molecule generation in
drug discovery, face interpolation.
DCGAN | Deep Convolutional GAN. First stable image GAN. Uses batch norm, no FC layers.
StyleGAN / StyleGAN2 | NVIDIA. Photorealistic faces. Separates style at each resolution level.
CycleGAN | Image-to-image translation without paired data. Horse ↔ Zebra, Photo ↔ Painting.
Conditional GAN (cGAN) | Generator and Discriminator conditioned on class label. Control what gets generated.
Forward process: noise is added gradually over T steps (typically T=1000). x_T is approximately pure Gaussian
noise N(0,I).
Reverse process: a U-Net (see 14.7) is trained to predict the noise ε added at each step so that it can be
removed one step at a time: p_θ(x_{t-1} | x_t) = N(x_{t-1}; µ_θ(x_t, t), Σ_θ(x_t, t)).
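A sketch of the forward noising process in closed form, assuming the DDPM-style linear β schedule from 1e-4 to 0.02 (the schedule and helper names are assumptions; the notes do not specify one). It also shows why predicting ε is enough to recover an estimate of the clean sample:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)             # assumed linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)            # cumulative fraction of signal kept

def q_sample(x0, t, eps):
    """Jump straight to step t: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

def predict_x0(x_t, t, eps_pred):
    """Invert q_sample given a noise prediction (here the true noise stands in for the U-Net)."""
    return (x_t - np.sqrt(1.0 - alpha_bar[t]) * eps_pred) / np.sqrt(alpha_bar[t])

rng = np.random.default_rng(0)
x0 = rng.normal(size=(4, 4))                   # stand-in for an image
eps = rng.standard_normal(x0.shape)
x_mid = q_sample(x0, 500, eps)
x0_recovered = predict_x0(x_mid, 500, eps)
```

By the final step almost no signal remains (`alpha_bar[-1]` is tiny), which is why x_T is approximately pure noise.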
Core operation: message passing. Each node aggregates information from its neighbours, updates its own
representation, and repeats for k layers.
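One round of message passing can be sketched as a mean over each node's neighbourhood followed by a linear update. The self-loops, the ReLU and the three-node path graph are assumptions chosen to keep the example checkable by hand:

```python
import numpy as np

def message_passing_layer(A, H, W):
    A_hat = A + np.eye(A.shape[0])             # add self-loops so a node keeps its own features
    degree = A_hat.sum(axis=1, keepdims=True)
    aggregated = (A_hat @ H) / degree          # mean over neighbours (and self)
    return np.maximum(0.0, aggregated @ W)     # linear update + ReLU

# Path graph 0 - 1 - 2 with one scalar feature per node.
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
H = np.array([[1.0], [2.0], [3.0]])
W = np.array([[1.0]])                          # identity weight to expose the aggregation

H1 = message_passing_layer(A, H, W)            # after 1 layer: 1-hop information
H2 = message_passing_layer(A, H1, W)           # after 2 layers: 2-hop information
```

After k layers each node's representation depends on everything within k hops, which is the sense in which the operation "repeats for k layers".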
GCN (Graph Conv Net) | Spectral graph convolution. Approximated as mean of neighbour features.
GAT (Graph Attention Net) | Attention weights on edges: different neighbours contribute differently.
GIN (Graph Isomorphism Net) | Provably as powerful as the Weisfeiler-Leman graph isomorphism test.
Applications: Drug discovery (molecules = graphs), social network analysis, recommendation systems,
knowledge graphs, traffic prediction, protein structure prediction (AlphaFold uses attention similar to GNNs).
Without skip connections, very deep networks suffer from degradation — adding more layers makes accuracy
worse, even on training data (not overfitting — the network simply can't learn the identity mapping through many
non-linear layers).
Residual block: H(x) = F(x) + x, where x is the block input and F(x) is the output of the block's stacked layers.
The network only needs to learn the residual F(x) = H(x) − x. If the optimal mapping is near-identity, learning F(x) ≈
0 is much easier than learning H(x) ≈ x through non-linear layers.
Why gradients flow: The skip connection adds the input directly to the output, so ∂L/∂x = ∂L/∂output · (1 + ∂F/∂x).
The +1 term ensures gradients never vanish completely, even across 100+ layers.
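A scalar toy model makes the effect of the +1 term visible. Each "layer" is F(x) = tanh(w·x) with the same weight, an assumption made only so the end-to-end derivative can be accumulated by hand:

```python
import math

def end_to_end_gradient(depth, w, x, residual):
    """Derivative of the final output with respect to the first input, via the chain rule."""
    grad = 1.0
    for _ in range(depth):
        pre = w * x
        local = w * (1.0 - math.tanh(pre) ** 2)    # dF/dx for F(x) = tanh(w * x)
        if residual:
            x = x + math.tanh(pre)                 # output = x + F(x)
            grad *= 1.0 + local                    # the skip path contributes the "+1"
        else:
            x = math.tanh(pre)                     # output = F(x)
            grad *= local
    return grad

plain_grad = end_to_end_gradient(depth=50, w=0.5, x=1.0, residual=False)
residual_grad = end_to_end_gradient(depth=50, w=0.5, x=1.0, residual=True)
```

In the plain stack every factor is at most 0.5, so the product collapses towards zero; with the skip connection every factor is at least 1.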
ResNet variants: ResNet-18/34/50/101/152 (original), Wide ResNets (more channels), ResNeXt (grouped
convolutions), DenseNet (connect every layer to all subsequent layers).
14.7 — U-Net
One-line summary: Encoder-decoder CNN with skip connections between encoder and decoder at each
resolution level. Designed for image segmentation — outputs a pixel-wise label map.
The U-shape comes from: the encoder (left side) downsamples the image through conv+pool layers, capturing
semantic content. The decoder (right side) upsamples back to the original resolution. Skip connections
concatenate encoder feature maps to the corresponding decoder level, preserving fine spatial detail that
downsampling would otherwise lose.
Medical image segmentation | Original use case: segment tumours in MRI/CT scans.
Satellite image segmentation | Identify buildings, roads, water bodies pixel by pixel.
Section 15
Modern Techniques
RLHF · LoRA · Quantization · Embeddings · Contrastive Learning
Stage 1 — Supervised Fine-Tuning (SFT): Start with a pretrained LLM (e.g. GPT-3.5). Fine-tune it on a curated
dataset of high-quality prompt-response pairs written by human trainers. This teaches the model the desired format
and basic helpfulness.
Stage 2 — Reward Model Training: Show human labellers pairs of model responses to the same prompt. They
rank which response is better. Train a separate reward model (another LLM with a scalar output head) to predict
these human preferences.
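Reward models of this kind are commonly trained with a pairwise (Bradley-Terry style) loss, −log σ(r_chosen − r_rejected). The notes do not name the loss, so treat this as an assumed but standard choice; the function name is hypothetical:

```python
import math

def pairwise_preference_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(margin)), written as log(1 + exp(-margin))."""
    margin = reward_chosen - reward_rejected
    return math.log1p(math.exp(-margin))

loss_agrees = pairwise_preference_loss(2.0, 0.0)     # model ranks the preferred answer higher
loss_tied = pairwise_preference_loss(0.0, 0.0)       # model is indifferent
loss_disagrees = pairwise_preference_loss(0.0, 2.0)  # model ranks the rejected answer higher
```

The loss only depends on the difference between the two scalar rewards, so the reward model learns a ranking rather than an absolute scale.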
Stage 3 — PPO Fine-Tuning: Use the reward model as a proxy for human preference. Fine-tune the SFT model
with PPO (Proximal Policy Optimisation) to maximise reward. A KL divergence penalty against the SFT model
prevents the model from drifting too far (reward hacking / mode collapse).
Why it works: Pretraining teaches knowledge. SFT teaches format. RLHF teaches values — what humans
actually prefer (helpful, harmless, honest). The reward model captures nuanced human judgements that are hard
to specify as a simple loss function.
Variants: DPO (Direct Preference Optimisation) — skips the reward model entirely, directly optimises preferences
using a closed-form objective. Simpler and often works as well.
Full fine-tuning a 7B parameter model requires storing and updating 7B gradients, which is expensive in memory
and compute. LoRA observes that weight updates during fine-tuning have low intrinsic rank. So instead of
updating W directly, it learns a low-rank update:
W' = W + B·A, with B ∈ ℝ^(d×r), A ∈ ℝ^(r×k), and r ≪ min(d, k)
W is frozen. Only A and B are trained. With rank r=8 for a 4096×4096 weight matrix: full params = 4096² ≈ 16.8M,
LoRA params = 8 × (4096 + 4096) ≈ 65.5K, a 256× reduction.
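The parameter arithmetic and the forward pass can be checked in a few lines. The small 16×16 demo matrices, the zero initialisation of B and the α/r scaling follow common LoRA practice and are assumptions of this sketch:

```python
import numpy as np

# Parameter count for one 4096 x 4096 weight matrix at rank 8.
d, k, r = 4096, 4096, 8
full_params = d * k                            # 16,777,216
lora_params = r * (d + k)                      # 65,536
reduction = full_params / lora_params          # 256.0

def lora_forward(x, W, A, B, alpha, r):
    """Frozen path W @ x plus the scaled low-rank path B @ (A @ x)."""
    return W @ x + (alpha / r) * (B @ (A @ x))

rng = np.random.default_rng(0)
d_s, k_s, r_s = 16, 16, 2                      # small sizes for the demo
W = rng.normal(size=(d_s, k_s))                # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r_s, k_s))    # trainable, small random init
B = np.zeros((d_s, r_s))                       # trainable, zero init: no change at step 0
x = rng.normal(size=k_s)

y = lora_forward(x, W, A, B, alpha=2 * r_s, r=r_s)
```

With B initialised to zero the adapted layer starts out identical to the pretrained one, so fine-tuning begins from the original behaviour.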
Technique | Description
LoRA | Original. Add low-rank matrices to attention weight matrices (Q, K, V, O projections).
QLoRA | LoRA on a 4-bit quantised model. Fine-tune 65B models on a single 48GB GPU.
DoRA | Decomposes weight into magnitude and direction. More expressive than LoRA.
Adapter layers | Insert small trainable bottleneck layers between frozen transformer blocks. Older alternative
Practical tip: Apply LoRA to Q and V projections in attention. r=8 or r=16 covers most use cases. α (scaling factor)
typically set to 2×r.
15.3 — Quantisation
One-line summary: Represent model weights (and optionally activations) with fewer bits. A 7B FP32 model takes
28GB. The same model in 4-bit takes ~4GB — fits on a laptop GPU.
INT4 / NF4 | 4 bits / ~4 GB. Used in QLoRA, bitsandbytes. Some quality loss.
GGUF ([Link]) | Mixed precision (2–8 bit per layer). CPU inference. Used by Ollama.
Two main approaches:
• Post-Training Quantisation (PTQ): Quantise after training, using a small calibration dataset. Fast, no
retraining. Tools: bitsandbytes, GPTQ, AWQ.
• Quantisation-Aware Training (QAT): Simulate quantisation during training. Better quality but requires
retraining. Used for edge deployment (mobile, embedded).
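A sketch of the simplest post-training scheme, symmetric absmax INT8 quantisation of one weight tensor, plus the memory arithmetic behind the sizes quoted above. Production tools such as GPTQ or AWQ are considerably more sophisticated; the helper names here are hypothetical:

```python
import numpy as np

def quantise_int8(w):
    scale = np.max(np.abs(w)) / 127.0          # map the largest magnitude to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantise(q, scale):
    return q.astype(np.float64) * scale

def model_size_gb(n_params, bits_per_weight):
    return n_params * bits_per_weight / 8 / 1e9

rng = np.random.default_rng(0)
w = rng.normal(size=1000)
q, scale = quantise_int8(w)
max_error = float(np.max(np.abs(w - dequantise(q, scale))))   # bounded by scale / 2

fp32_gb = model_size_gb(7e9, 32)               # 28.0
int4_gb = model_size_gb(7e9, 4)                # 3.5 before any overhead
```

The raw 4-bit figure is 3.5 GB; scales, zero-points and any layers kept in higher precision account for the gap to the ~4 GB quoted in practice.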
15.4 — Embeddings
One-line summary: Dense vector representations of discrete objects (words, sentences, images, users) where
semantic similarity = geometric proximity in vector space.
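"Geometric proximity" is usually measured with cosine similarity. The sketch below uses hand-made 2-D toy vectors (axes loosely meaning royalty and gender), not learned embeddings, purely to show the arithmetic behind analogy queries:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical toy embeddings: (royalty, gender).
toy = {
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "man": (0.0, 1.0),
    "woman": (0.0, -1.0),
    "apple": (-1.0, 0.2),
}

# king - man + woman, then find the nearest remaining word by cosine similarity.
query = tuple(k - m + w for k, m, w in zip(toy["king"], toy["man"], toy["woman"]))
candidates = {word: vec for word, vec in toy.items() if word not in ("king", "man", "woman")}
nearest = max(candidates, key=lambda word: cosine(query, candidates[word]))
```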
Word embeddings:
• Word2Vec (2013): Predict a word from its context (CBOW) or context from word (Skip-gram). Produces
300-dim vectors. Famous result: king − man + woman ≈ queen.
• GloVe: Global Vectors. Co-occurrence matrix factorisation. Similar to Word2Vec, different training objective.
• Contextual embeddings (BERT, GPT): The same word gets different vectors in different contexts. 'Bank'
near 'river' ≠ 'bank' near 'money'. Much more powerful.
Sentence / document embeddings:
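• Sentence-BERT: Fine-tuned BERT for semantic similarity. The sentence vector is a pooled output: mean
pooling over token embeddings by default, with the CLS token as an alternative.
• OpenAI text-embedding-3: up to 3072-dim embeddings (the large variant). Strong retrieval performance.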
Applications:
1. Take an image. Create two augmented views (crop, flip, colour jitter).
2. Encode both views through the same network. They should be similar.
3. All other images in the batch are 'negatives' — push their embeddings apart.
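NT-Xent loss: for a positive pair (i, j), ℓ(i,j) = −log [ exp(sim(z_i, z_j)/τ) / Σ_{k≠i} exp(sim(z_i, z_k)/τ) ],
where sim is cosine similarity, τ is a temperature, and the sum runs over the other 2N−1 embeddings in the batch.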
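A NumPy sketch of NT-Xent as used in SimCLR-style training: cosine similarities scaled by a temperature, with each embedding's augmented partner as the positive and every other embedding in the batch as a negative. The row layout (rows 2k and 2k+1 are the two views of image k) and τ = 0.5 are assumptions of this example:

```python
import numpy as np

def nt_xent_loss(z, temperature=0.5):
    """z has shape (2N, d); rows 2k and 2k+1 are two views of the same image."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarity via unit vectors
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)                     # a sample is never compared with itself
    n = z.shape[0]
    positives = np.arange(n) ^ 1                       # partner index: 0<->1, 2<->3, ...
    log_prob = sim - np.log(np.sum(np.exp(sim), axis=1, keepdims=True))
    return float(-np.mean(log_prob[np.arange(n), positives]))

e1, e2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
aligned = np.stack([e1, e1, e2, e2])       # both views of each image agree
mismatched = np.stack([e1, e2, e1, e2])    # views of the same image disagree

loss_aligned = nt_xent_loss(aligned)
loss_mismatched = nt_xent_loss(mismatched)
```

The loss is low when the two views of an image land close together and far from the other images in the batch.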
CLIP is trained on 400M image-text pairs from the internet. The objective: make the embedding of an image and its
caption similar; make all other image-text combinations dissimilar. This is contrastive learning across modalities.
• Zero-shot classification: compute similarity between image and text labels like 'a photo of a cat'.
• Cross-modal retrieval: search images with text queries.
• Foundation for text-to-image (Stable Diffusion uses CLIP to encode text prompts).
Section 16
Theory
Universal Approximation · Initialization · Gradient Flow Analysis
• The capacity exists: Even shallow networks are theoretically expressive enough. The question is how many
neurons that requires — potentially exponentially many.
• Depth helps: Deeper networks can represent the same functions exponentially more efficiently than shallow
ones (depth-efficiency theorems).
• Learnability ≠ Approximation: UAT says the function exists, not that gradient descent will find it. Training
dynamics, initialisation, and optimisation matter enormously in practice.
• Generalisation not guaranteed: A network that memorises training data technically approximates the
training distribution perfectly. UAT says nothing about generalisation to unseen data.
Extended UAT (Hornik, 1991): Any continuous non-constant, bounded, and monotone activation function allows
universal approximation. This covers sigmoid and tanh. ReLU requires slightly different formulation (Sonoda &
Murata, 2017).
Random (Gaussian) | W ~ N(0, 0.01). Only works for very shallow networks. Vanishes in deep nets.
He / Kaiming | W ~ N(0, 2/n_in), i.e. standard deviation √(2/n_in). Designed for ReLU. Accounts for the fact that ReLU kills ~half of activations
Orthogonal Init | W initialised as random orthogonal matrix. Good for RNNs: preserves gradient norms
Pre-trained weights | Transfer learning. Fine-tune from ImageNet weights. Best for limited data.
The core principle:
Good initialisation keeps the variance of activations and gradients approximately constant across layers. This is
derived by ensuring Var(output) = Var(input) for each layer, which gives the constraints for Xavier and He
initialisation.
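The variance-preservation argument can be checked empirically by pushing random inputs through a deep ReLU stack. The depth, width and the reading of the small Gaussian init as standard deviation 0.01 are assumptions of this sketch:

```python
import numpy as np

def final_mean_square(init_std_fn, depth=20, width=512, seed=0):
    """Mean squared activation after `depth` ReLU layers with the given weight std."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((width, 64))               # 64 random input vectors
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * init_std_fn(width)
        a = np.maximum(0.0, W @ a)                     # linear layer + ReLU
    return float(np.mean(a ** 2))

he_scale = final_mean_square(lambda n_in: np.sqrt(2.0 / n_in))   # He / Kaiming
small_scale = final_mean_square(lambda n_in: 0.01)               # fixed small Gaussian
```

With He initialisation the activation scale stays of order 1 after 20 layers, while the fixed small standard deviation shrinks the signal by a constant factor per layer until it is numerically negligible.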
Vanishing Gradients | Cause: Multiplying many values < 1 (sigmoid/tanh saturation, deep networks). Symptom
Exploding Gradients | Cause: Multiplying many values > 1 (large weights, poorly initialised RNNs). Symptom: L
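Gradient clipping (for RNNs): if the gradient norm ‖g‖ exceeds a threshold τ, rescale g ← g · τ/‖g‖ so the
update direction is kept but its size is capped.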
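Clipping by global norm can be sketched as follows. The helper name is hypothetical, and the small constant in the denominator is an assumption added to avoid division by zero:

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients together if their combined L2 norm exceeds max_norm."""
    total_norm = float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads], total_norm

grads = [np.array([3.0, 4.0]), np.array([0.0])]        # global norm is 5
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
untouched, _ = clip_by_global_norm(grads, max_norm=10.0)
```

Because every gradient is multiplied by the same factor, the update direction is unchanged and only the step size is limited.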
• Log the L2 norm of gradients per layer during training (PyTorch: [Link]()).
• Plot gradient norms vs layer depth. Should be roughly constant.
• If early layers have norms ~0 → vanishing. If any layer has norms > 100 → exploding.
• Use gradient hooks in PyTorch to track activations and gradients.
Batch normalisation's role in gradient flow:
By normalising activations at each layer, BatchNorm keeps activation scales stable, which helps prevent
gradients from exploding or vanishing. It was originally motivated as a fix for internal covariate shift, but
Santurkar et al. (2018) argue that its main effect is a smoother loss landscape, which is why BatchNorm
allows higher learning rates.