Advanced Artificial Intelligence - Complete Course Notes
Course Code: EI6523803
Total Hours: 45 Hours (9 hours per unit)
UNIT I: INTRODUCTION TO DEEP LEARNING & PERCEPTRONS (9 Hours)
1.1 Understanding Deep Learning
What is Deep Learning?
Deep Learning is a subset of machine learning that uses artificial neural networks with multiple layers (hence
"deep") to progressively extract higher-level features from raw input. Unlike traditional machine learning where
you manually engineer features, deep learning automatically learns hierarchical representations of data.
Real-World Example: When you upload a photo to Facebook and it automatically tags your friends, that's deep
learning at work. The network learns to recognize faces through multiple layers - first detecting edges, then
facial features like eyes and nose, then complete faces, and finally identifying specific individuals.
The Need for Deep Learning:
Traditional machine learning algorithms often plateau once they have seen enough data. Deep learning models,
especially large neural networks, typically keep improving as you provide more training data. This
makes them well suited for:
Image and video recognition
Natural language processing
Speech recognition
Autonomous vehicles
Medical diagnosis
Applications Across Industries:
1. Healthcare: Detecting diseases from X-rays and MRI scans, in some studies with accuracy matching or
exceeding that of human radiologists
2. Finance: Fraud detection, algorithmic trading, credit scoring
3. Retail: Recommendation systems (Netflix, Amazon), inventory management
4. Automotive: Self-driving cars using computer vision and sensor fusion
5. Entertainment: Deepfakes, content generation, game AI
1.2 Neural Networks: The Foundation
Biological Inspiration:
The artificial neural network mimics how neurons in your brain work. Each biological neuron:
Receives signals through dendrites
Processes the signal in the cell body
Sends output through the axon if the signal is strong enough
Similarly, an artificial neuron:
Receives weighted inputs
Sums them up and adds a bias
Applies an activation function
Outputs the result
Mathematical Representation:
For a single neuron:
z = (w₁ × x₁) + (w₂ × x₂) + ... + (wₙ × xₙ) + b
output = activation_function(z)
Where:
x₁, x₂, ..., xₙ are inputs
w₁, w₂, ..., wₙ are weights
b is bias
z is the weighted sum
Example Calculation:
Let's say you're predicting if a student will pass an exam based on:
Hours studied (x₁ = 5)
Previous test score (x₂ = 75)
With weights w₁ = 0.3, w₂ = 0.02, and bias b = -2:
z = (0.3 × 5) + (0.02 × 75) + (-2)
z = 1.5 + 1.5 - 2 = 1.0
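The calculation above can be reproduced in a few lines of Python. This is a minimal sketch: the function names are illustrative, and the sigmoid at the end is an assumed choice of activation (the example itself stops at z).

```python
import math

def neuron_weighted_sum(inputs, weights, bias):
    """Return z = sum(w_i * x_i) + b for a single neuron."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias

def sigmoid(z):
    """Logistic activation, used here only as an illustrative choice."""
    return 1.0 / (1.0 + math.exp(-z))

# Values from the exam example: hours studied = 5, previous score = 75.
z = neuron_weighted_sum([5, 75], [0.3, 0.02], -2)
pass_probability = sigmoid(z)
print(f"z = {z:.2f}, sigmoid(z) = {pass_probability:.3f}")
```

With a sigmoid activation, z = 1.0 maps to roughly 0.73, which could be read as a 73% chance of passing.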
1.3 Perceptron: The Simplest Neural Network
Concepts of Perceptron:
A perceptron is the fundamental building block of neural networks, invented by Frank Rosenblatt in 1958. It's a
binary classifier that makes decisions by weighing evidence.
How Perceptron Works:
1. Input Layer: Receives features
2. Weights: Each input has an associated weight showing its importance
3. Summation: Calculates weighted sum
4. Activation: Applies a step function (outputs 0 or 1)
Bias in Neural Networks:
Bias is like the y-intercept in a linear equation. It allows the activation function to be shifted left or right, which
is critical for fitting the data better.
Think of it this way: If you're deciding whether to go to a party:
Weights determine how much you care about each factor (friends going, location, time)
Bias represents your general enthusiasm for parties (some people are naturally more inclined to go)
Perceptron Learning Algorithm:
The perceptron learns through a simple update rule:
For each training example (x, y):
    prediction = step_function(w · x + b)
    if prediction ≠ y:
        w = w + η × (y - prediction) × x
        b = b + η × (y - prediction)
Where η (eta) is the learning rate (typically 0.01 to 0.3)
Practical Example:
Let's build a perceptron to classify if it's a good day for ice cream:
Input 1: Temperature (normalized 0-1)
Input 2: Is it sunny? (1=yes, 0=no)
Initial weights: w₁ = 0.5, w₂ = 0.3, bias = -0.4
Training Example 1: Temp=0.8, Sunny=1, Label=1 (Good day)
z = (0.5 × 0.8) + (0.3 × 1) - 0.4 = 0.4 + 0.3 - 0.4 = 0.3
prediction = step(0.3) = 1 ✓ (Correct!)
Training Example 2: Temp=0.2, Sunny=0, Label=0 (Bad day)
z = (0.5 × 0.2) + (0.3 × 0) - 0.4 = 0.1 - 0.4 = -0.3
prediction = step(-0.3) = 0 ✓ (Correct!)
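The update rule and the ice cream example can be sketched together as follows. The helper names, the learning rate of 0.1, and the epoch count are assumptions made for illustration. Because both training examples are already classified correctly, the weights should come back unchanged.

```python
def step(z):
    """Step activation: 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0

def predict(weights, bias, x):
    return step(sum(w * xi for w, xi in zip(weights, x)) + bias)

def train_perceptron(samples, weights, bias, lr=0.1, epochs=10):
    """Apply the perceptron update rule and return (weights, bias)."""
    weights = list(weights)
    for _ in range(epochs):
        for x, y in samples:
            error = y - predict(weights, bias, x)
            if error != 0:  # only misclassified examples change the weights
                weights = [w + lr * error * xi for w, xi in zip(weights, x)]
                bias = bias + lr * error
    return weights, bias

# (temperature, sunny) -> label, taken from the two worked examples.
samples = [([0.8, 1], 1), ([0.2, 0], 0)]
weights, bias = train_perceptron(samples, [0.5, 0.3], -0.4)
predictions = [predict(weights, bias, x) for x, _ in samples]
print(weights, bias, predictions)
```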
1.4 Advantages and Disadvantages of Deep Learning
Advantages:
1. Automatic Feature Learning: No need for manual feature engineering
2. Scalability: Performance improves with more data
3. Versatility: Works across multiple domains (vision, speech, text)
4. Transfer Learning: Models trained on one task can be adapted to others
5. End-to-End Learning: Can learn directly from raw data to output
Disadvantages:
1. Data Hungry: Requires massive amounts of labeled data
2. Computational Cost: Needs powerful GPUs for training
3. Black Box Nature: Hard to interpret why a model made a specific decision
4. Overfitting Risk: Can memorize training data instead of learning patterns
5. Long Training Time: Complex models can take days or weeks to train
1.5 Logic Gates and Simple Networks
Implementing AND, OR, NAND, XOR with Perceptrons:
AND Gate:
Inputs: x₁, x₂ ∈ {0, 1}
Weights: w₁ = 1, w₂ = 1
Bias: b = -1.5
Truth Table:
x₁ | x₂ | z = x₁ + x₂ - 1.5 | output
0 | 0 | -1.5 | 0
0 | 1 | -0.5 | 0
1 | 0 | -0.5 | 0
1 | 1 | 0.5 | 1
OR Gate:
Weights: w₁ = 1, w₂ = 1
Bias: b = -0.5
The threshold is lower, so any 1 input triggers output
NAND Gate:
Weights: w₁ = -1, w₂ = -1
Bias: b = 1.5
Negative weights invert the AND gate
XOR Problem - The Perceptron's Limitation:
XOR (exclusive OR) cannot be solved by a single perceptron because it is not linearly separable. Marvin Minsky
and Seymour Papert highlighted this limitation in their 1969 book Perceptrons, which is often cited as a
contributing factor to the first "AI Winter."
XOR Truth Table:
x₁ | x₂ | output
0 | 0 | 0
0 | 1 | 1
1 | 0 | 1
1 | 1 | 0
You cannot draw a single straight line to separate the 1s from the 0s on a 2D plot.
Solution: Multi-layer networks! By combining multiple perceptrons, we can solve XOR:
First layer: Create OR and NAND gates
Second layer: Combine their outputs with an AND gate
This discovery led to the development of multi-layer perceptrons and backpropagation.
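As a quick check, here is a minimal sketch of the gates as perceptrons, using the weights and biases listed above. XOR uses one standard two-layer construction, XOR(x₁, x₂) = AND(OR(x₁, x₂), NAND(x₁, x₂)). The function names are illustrative.

```python
def perceptron(x1, x2, w1, w2, b):
    """Single perceptron with a step activation."""
    return 1 if w1 * x1 + w2 * x2 + b >= 0 else 0

def and_gate(x1, x2):
    return perceptron(x1, x2, 1, 1, -1.5)

def or_gate(x1, x2):
    return perceptron(x1, x2, 1, 1, -0.5)

def nand_gate(x1, x2):
    return perceptron(x1, x2, -1, -1, 1.5)

def xor_gate(x1, x2):
    # First layer: OR and NAND. Second layer: AND of the two results.
    return and_gate(or_gate(x1, x2), nand_gate(x1, x2))

xor_table = {(a, b): xor_gate(a, b) for a in (0, 1) for b in (0, 1)}
for inputs, output in xor_table.items():
    print(inputs, output)
```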
UNIT II: INTRODUCTION TO NEURAL NETWORKS & TRAINING (9 Hours)
2.1 Neural Network Architecture
From Perceptron to Neural Network:
A neural network is simply multiple perceptrons organized in layers:
1. Input Layer: Receives raw data
2. Hidden Layers: Perform intermediate computations (this is where the "deep" in deep learning comes from)
3. Output Layer: Produces final predictions
Why Multiple Layers Matter:
Each layer learns increasingly abstract representations:
Layer 1: In image recognition, might detect edges and curves
Layer 2: Combines edges into shapes and textures
Layer 3: Recognizes parts (eyes, wheels, windows)
Layer 4: Identifies complete objects (faces, cars, buildings)
2.2 Activation Functions
Activation functions introduce non-linearity into neural networks. Without them, no matter how many layers
you stack, the network would just be a linear function!
Step Function (Used in Perceptron):
f(x) = 1 if x ≥ 0
f(x) = 0 if x < 0
Problem: The gradient is zero everywhere except at x = 0 (where it is undefined), so gradient descent gets no signal
Sigmoid Function:
σ(x) = 1 / (1 + e^(-x))
Range: (0, 1)
Use Case: Binary classification, output probabilities
Problem: Vanishing gradient for extreme values
Gradient: σ'(x) = σ(x) × (1 - σ(x))
Example:
If x = 0: σ(0) = 1/(1+1) = 0.5
If x = 2: σ(2) = 1/(1+e^(-2)) ≈ 0.88
If x = -2: σ(-2) = 1/(1+e^2) ≈ 0.12
Tanh (Hyperbolic Tangent):
tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
Range: (-1, 1)
Advantage: Zero-centered, making learning easier
Problem: Still suffers from vanishing gradient
ReLU (Rectified Linear Unit):
ReLU(x) = max(0, x)
Range: [0, ∞)
Advantages:
Simple and fast to compute
Doesn't saturate for positive values
Sparse activation (many neurons output zero)
Problem: Dying ReLU (neurons can get stuck at zero)
Most popular: The default choice for hidden layers in most modern deep networks
Example:
ReLU(-3) = 0
ReLU(0) = 0
ReLU(5) = 5
Softmax (Output Layer for Multi-class):
softmax(xi) = e^(xi) / Σ(e^(xj))
Converts logits into probabilities that sum to 1.
Example: Three-class classification
Logits: [2.0, 1.0, 0.1]
e^2.0 = 7.39
e^1.0 = 2.72
e^0.1 = 1.11
Sum = 11.22
Probabilities:
Class 1: 7.39/11.22 = 0.659 (65.9%)
Class 2: 2.72/11.22 = 0.242 (24.2%)
Class 3: 1.11/11.22 = 0.099 (9.9%)
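The activation functions in this section are short enough to write out directly. The sketch below uses only the standard library. Subtracting the largest logit inside softmax is an added numerical-stability trick, not part of the formula above, and it does not change the result.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    return max(0.0, x)

def softmax(logits):
    """Convert logits to probabilities that sum to 1."""
    largest = max(logits)  # stability trick: shift so the largest exponent is 0
    exps = [math.exp(v - largest) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])
print(round(sigmoid(2), 2), round(math.tanh(0.0), 2), relu(-3), relu(5))
```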
The Training Problem:
Training a neural network means finding the best weights and biases that minimize prediction errors. This is an
optimization problem.
Loss Functions:
Mean Squared Error (MSE) - Regression:
MSE = (1/n) Σ(yi - ŷi)²
Where:
yi = actual value
ŷi = predicted value
n = number of samples
Cross-Entropy Loss - Classification:
L = -Σ yi × log(ŷi)
For binary classification:
L = -[y × log(ŷ) + (1-y) × log(1-ŷ)]
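Both loss functions can be sketched in plain Python. The function names and sample values below are illustrative, and the small eps clamp is an added safeguard against log(0), not part of the formulas.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error over n samples."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Multi-class cross-entropy for one sample (y_true is one-hot)."""
    return -sum(y * math.log(max(p, eps)) for y, p in zip(y_true, y_pred))

def binary_cross_entropy(y, p, eps=1e-12):
    """Binary cross-entropy for one sample with predicted probability p."""
    p = min(max(p, eps), 1 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

regression_loss = mse([1.0, 2.0], [1.5, 2.5])
multiclass_loss = cross_entropy([1, 0, 0], [0.659, 0.242, 0.099])
binary_loss = binary_cross_entropy(1, 0.9)
print(regression_loss, round(multiclass_loss, 3), round(binary_loss, 3))
```

Note how cross-entropy only looks at the probability assigned to the correct class: the more confident the correct prediction, the closer the loss is to zero.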
Gradient Descent Algorithm:
Imagine you're hiking down a mountain in fog. You can only see a few feet around you. How do you reach the
bottom? Take steps in the steepest downward direction!
Update Rule:
w_new = w_old - η × (∂L/∂w)
Where:
η = learning rate (step size)
∂L/∂w = gradient (slope of loss function)
Learning Rate Importance:
Too small: Training is painfully slow
Too large: You might overshoot the minimum and diverge
Just right: Typically 0.001 to 0.1
Example Calculation:
Let's minimize f(x) = x² starting at x = 10, learning rate η = 0.1
Iteration 1:
f'(10) = 2×10 = 20
x_new = 10 - 0.1×20 = 8
Iteration 2:
f'(8) = 2×8 = 16
x_new = 8 - 0.1×16 = 6.4
Iteration 3:
f'(6.4) = 12.8
x_new = 6.4 - 0.1×12.8 = 5.12
... continues until x ≈ 0
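The same iterations can be run in code. This is a minimal sketch in which the function name and the step count of 50 are arbitrary choices.

```python
def gradient_descent(grad, x0, lr, steps):
    """Repeatedly apply x_new = x_old - lr * grad(x_old); return every x visited."""
    x = x0
    history = [x]
    for _ in range(steps):
        x = x - lr * grad(x)
        history.append(x)
    return history

# f(x) = x^2, so f'(x) = 2x.
history = gradient_descent(lambda x: 2 * x, x0=10.0, lr=0.1, steps=50)
print([round(x, 2) for x in history[:4]])
print(history[-1])
```

Each step multiplies x by (1 - 2η) = 0.8, so x shrinks geometrically toward the minimum at 0.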
Variants of Gradient Descent:
1. Batch Gradient Descent: Uses entire dataset for each update (slow but stable)
2. Stochastic Gradient Descent (SGD): Uses one sample at a time (fast but noisy)
3. Mini-Batch Gradient Descent: Uses small batches (best of both worlds) - typically 32, 64, or 128 samples
2.4 Backpropagation: The Heart of Neural Network Training
The Chain Rule Connection:
Backpropagation applies the chain rule of calculus to compute gradients efficiently. It answers: "How much did
each weight contribute to the final error?"
Forward Pass:
1. Input data flows through the network
2. Each layer computes activations
3. Final layer produces predictions
4. Compute loss
Backward Pass:
1. Start from the loss
2. Compute gradient of loss with respect to output
3. Propagate gradients backward through each layer
4. Update weights using these gradients
Mathematical Flow:
For a simple network: Input → Hidden → Output
Forward:
h = σ(W1 × x + b1) [hidden layer]
y = σ(W2 × h + b2) [output layer]
L = (y - target)² [loss]
Backward:
∂L/∂W2 = ∂L/∂y × ∂y/∂W2
∂L/∂W1 = ∂L/∂y × ∂y/∂h × ∂h/∂W1
Concrete Example:
Network: 2 inputs → 2 hidden → 1 output
Given:
Input: [0.5, 0.8]
Target: 0.9
Weights randomly initialized
Forward Pass:
h1 = σ(0.5×w11 + 0.8×w12) = 0.6
h2 = σ(0.5×w21 + 0.8×w22) = 0.7
output = σ(0.6×w31 + 0.7×w32) = 0.5
Loss = (0.9 - 0.5)² = 0.16
Backward Pass:
Calculate how output error affects each weight
Update: w_new = w_old - η × gradient
The beauty is that this scales to millions of parameters!
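To make the backward pass concrete, here is a minimal sketch of the 2 → 2 → 1 network with sigmoid units and squared-error loss. The starting weights are hypothetical (the example above leaves them unspecified) and biases are omitted. It also adds a finite-difference check, a common way to verify hand-derived gradients.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(params, x):
    """2 inputs -> 2 sigmoid hidden units -> 1 sigmoid output (no biases)."""
    w11, w12, w21, w22, w31, w32 = params
    h1 = sigmoid(w11 * x[0] + w12 * x[1])
    h2 = sigmoid(w21 * x[0] + w22 * x[1])
    out = sigmoid(w31 * h1 + w32 * h2)
    return h1, h2, out

def loss(params, x, target):
    return (target - forward(params, x)[2]) ** 2

def backward(params, x, target):
    """Gradient of the loss for each weight, via the chain rule."""
    w11, w12, w21, w22, w31, w32 = params
    h1, h2, out = forward(params, x)
    d_out = 2 * (out - target) * out * (1 - out)  # dL/d(output pre-activation)
    d_h1 = d_out * w31 * h1 * (1 - h1)            # dL/d(hidden 1 pre-activation)
    d_h2 = d_out * w32 * h2 * (1 - h2)            # dL/d(hidden 2 pre-activation)
    return [d_h1 * x[0], d_h1 * x[1],
            d_h2 * x[0], d_h2 * x[1],
            d_out * h1, d_out * h2]

x, target = [0.5, 0.8], 0.9
params = [0.1, 0.4, -0.3, 0.2, 0.5, -0.6]  # hypothetical starting weights
grads = backward(params, x, target)

# Finite-difference check: nudge one weight at a time and measure the loss change.
eps = 1e-6
numeric = []
for i in range(len(params)):
    bumped = list(params)
    bumped[i] += eps
    numeric.append((loss(bumped, x, target) - loss(params, x, target)) / eps)

# One gradient descent step: w_new = w_old - lr * gradient.
lr = 0.5
new_params = [p - lr * g for p, g in zip(params, grads)]
print(loss(params, x, target), loss(new_params, x, target))
```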
2.5 Multi-Layer Perceptron (MLP) Architecture
Structure:
An MLP consists of:
Input layer: One neuron per feature
Hidden layers: Typically 1-5 layers with 10-1000 neurons each
Output layer: Depends on task (1 for regression, K for K-class classification)
Fully Connected (Dense) Layers:
Every neuron in one layer connects to every neuron in the next layer.
Example Architecture for MNIST Digit Recognition:
Input: 784 neurons (28×28 pixel image)
Hidden 1: 128 neurons with ReLU
Hidden 2: 64 neurons with ReLU
Output: 10 neurons with Softmax (digits 0-9)
Total parameters: 109,386 (100,480 + 8,256 + 650, counting weights and biases)
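The parameter count of a fully connected network follows from its layer sizes: each layer has (inputs × outputs) weights plus one bias per output. A small sketch (the function name is illustrative):

```python
def mlp_parameter_count(layer_sizes):
    """Weights plus biases for a stack of fully connected layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

total = mlp_parameter_count([784, 128, 64, 10])
print(total)
```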
When to Use MLPs:
Tabular data (structured data with rows and columns)
When you don't need spatial/sequential structure
Quick baseline models
Limitations:
Doesn't preserve spatial structure (bad for images)
Can't handle variable-length inputs (bad for text)
Requires fixed-size inputs
2.6 Hands-On Training Process
Complete Training Loop:
python
# Pseudocode for training a neural network.
# model, loss_function, backpropagate, optimizer, and evaluate are placeholders;
# batch.inputs, batch.labels, and optimizer.update are illustrative names.
for epoch in range(num_epochs):
    for batch in training_data:
        # 1. Forward Pass
        predictions = model(batch.inputs)
        # 2. Compute Loss
        loss = loss_function(predictions, batch.labels)
        # 3. Backward Pass
        gradients = backpropagate(loss)
        # 4. Update Weights
        optimizer.update(model.parameters, gradients)
    # 5. Validation (once per epoch)
    val_loss = evaluate(model, validation_data)
    print(f"Epoch {epoch}: Train Loss={loss}, Val Loss={val_loss}")
Key Training Concepts:
Epochs: One complete pass through the entire training dataset
Batch Size: Number of samples processed before updating weights
Overfitting vs Underfitting:
Underfitting: Model too simple, high training and validation error
Good Fit: Low training and validation error
Overfitting: Model memorizes training data, low training error but high validation error
Prevention Techniques:
1. Dropout: Randomly disable neurons during training
2. L2 Regularization: Add penalty for large weights
3. Early Stopping: Stop training when validation loss stops improving
4. Data Augmentation: Create variations of training data
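Of these techniques, early stopping is the easiest to sketch without a framework. The function below is a hypothetical helper: it scans a list of per-epoch validation losses and reports the epoch at which training would stop for a given patience. The loss values are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the epoch index at which training stops.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, current in enumerate(val_losses):
        if current < best:
            best = current
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1  # never triggered: ran all epochs

val_losses = [0.90, 0.70, 0.60, 0.62, 0.61, 0.65, 0.70]  # made-up curve
stop_epoch = early_stopping_epoch(val_losses, patience=3)
print(stop_epoch)
```

In practice you would also keep a copy of the weights from the best epoch (epoch 2 here) and restore them after stopping.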
UNIT III: CONVOLUTIONAL NEURAL NETWORK (9 Hours)
3.1 Why CNNs? The Problem with Regular Neural Networks for Images
The Dimensionality Challenge:
Consider a small 28×28 grayscale image:
Using MLP: 784 input neurons
Add color (RGB): 28×28×3 = 2,352 neurons
High-resolution image (1000×1000×3): 3,000,000 input neurons!
For a 1000-neuron hidden layer:
Parameters = 3,000,000 × 1,000 = 3 billion weights
This is just the first layer!
Problems with MLPs for Images:
1. Too many parameters → Overfitting, memory issues
2. No spatial awareness → Doesn't understand that nearby pixels are related
3. Not translation invariant → Can't recognize an object if it moves slightly
CNN Solution:
CNNs exploit three key ideas:
1. Local Connectivity: Neurons only connect to a small region
2. Parameter Sharing: Use the same filter across the entire image
3. Spatial Hierarchy: Build complex features from simple ones
3.2 Convolutional Layers: The Core Building Block
How Convolution Works:
Think of convolution as sliding a small window (filter/kernel) across the image, performing element-wise
multiplication and summation.
Example: 3×3 Filter on 5×5 Image
Result: Feature Map showing edges in the image
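The sliding-window operation can be sketched in plain Python. The 5×5 image and the vertical-edge filter below are hypothetical values chosen for illustration. As in most deep learning libraries, the filter is applied without flipping (strictly speaking, cross-correlation).

```python
def conv2d_valid(image, kernel):
    """Slide a square kernel over a square image (no padding, stride 1)."""
    k = len(kernel)
    out_size = len(image) - k + 1
    output = []
    for r in range(out_size):
        row = []
        for c in range(out_size):
            # Element-wise multiply the window by the kernel, then sum.
            total = 0
            for i in range(k):
                for j in range(k):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output

# Hypothetical image: dark on the left (0), bright on the right (1).
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hypothetical filter that responds to dark-to-bright vertical edges.
vertical_edge = [[-1, 0, 1],
                 [-1, 0, 1],
                 [-1, 0, 1]]

feature_map = conv2d_valid(image, vertical_edge)
for row in feature_map:
    print(row)
```

The feature map is 3×3, with large values where the window straddles the edge and zeros where the region is uniform.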
Multiple Filters:
Modern CNNs use many filters per layer:
First layer: 32-64 filters learning basic patterns (edges, colors)
Middle layers: 128-512 filters learning textures and patterns
Deep layers: 512+ filters learning complex objects
Calculation Example:
Input: 32×32×3 (color image)
Filter: 5×5×3 with 64 filters
Output: 28×28×64
Why 28×28?
Output size = (Input - Filter + 2×Padding) / Stride + 1
= (32 - 5 + 0) / 1 + 1
= 28
Parameters in this layer:
Each filter: 5×5×3 = 75 weights + 1 bias = 76 parameters
Total: 76 × 64 filters = 4,864 parameters
Compare to fully connected: 32×32×3 × output_size = millions!
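The output-size formula and the parameter count can be wrapped in two small helper functions (names are illustrative). Integer division rounds down, which is the usual convention when the stride does not divide evenly.

```python
def conv_output_size(input_size, filter_size, padding=0, stride=1):
    """Spatial output size of a convolution along one dimension."""
    return (input_size - filter_size + 2 * padding) // stride + 1

def conv_parameter_count(filter_size, in_channels, num_filters):
    """Each filter has filter_size^2 * in_channels weights plus one bias."""
    return (filter_size * filter_size * in_channels + 1) * num_filters

out_size = conv_output_size(32, 5)
params = conv_parameter_count(5, 3, 64)
print(out_size, params)
```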
3.3 Padding: Preserving Spatial Dimensions
The Shrinking Problem:
Without padding, each convolution reduces image size:
32×32 → 28×28 → 24×24 → ...
After a few layers, you're left with tiny feature maps!
Types of Padding:
Valid Padding (No Padding):
Output size decreases
Edge pixels used less frequently
Same Padding:
Add zeros around the border
Output size = Input size (with stride=1)
Every pixel is considered equally
Example:
Original 5×5 image with padding=1:
0000000
0111000
0011100
0001110
0001100
0011000
0000000
Now you can apply 3×3 filter and maintain 5×5 output!
3.4 Stride: Controlling Output Size
Stride = Step Size:
Stride = 1: Slide filter one pixel at a time (default)
Stride = 2: Skip every other position
Stride = 3: Skip two positions
Impact on Output:
Input: 32×32
Filter: 3×3
Padding: 0
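Stride=1: Output = floor((32-3)/1) + 1 = 30 → 30×30
Stride=2: Output = floor((32-3)/2) + 1 = 15 → 15×15
Stride=3: Output = floor((32-3)/3) + 1 = 10 → 10×10
(When the division is not exact, round down: the filter stops before it would run past the edge.)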
Trade-off:
Larger stride: Smaller output, fewer computations, might miss details
Smaller stride: Larger output, more computations, captures fine details
Common Practice: Use stride=1 in convolutional layers, use pooling layers for downsampling
3.5 Pooling Operations: Dimension Reduction
Purpose:
1. Reduce spatial dimensions (downsampling)
2. Reduce computational load
3. Make features more robust to small translations
4. Prevent overfitting
Max Pooling (Most Common):
Take the maximum value in each region.
Example: 2×2 Max Pooling with Stride=2
Input (4×4):        Output (2×2):
1 3 2 4
2 1 8 3      →      3 8
5 6 1 2             6 7
4 2 7 1
Each 2×2 region becomes a single value (the max)
Intuition: "Is this feature present anywhere in this region?"
Average Pooling:
Takes the average instead of maximum. Less common but useful for:
Global context
Smoother downsampling
Letting every value in the region contribute (not just the maximum)
Example: 2×2 Average Pooling
Input (4×4):
1 3 2 4
2 1 8 3
5 6 1 2
4 2 7 1
Output (2×2):
(1+3+2+1)/4 = 1.75    (2+4+8+3)/4 = 4.25
(5+6+4+2)/4 = 4.25    (1+2+7+1)/4 = 2.75
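Both pooling examples can be checked with one small function. This is a sketch: the names are illustrative, and the reducer argument selects between max and average pooling.

```python
def pool2d(matrix, size=2, stride=2, reducer=max):
    """Apply `reducer` to each size x size window of a 2D list."""
    output = []
    for r in range(0, len(matrix) - size + 1, stride):
        row = []
        for c in range(0, len(matrix[0]) - size + 1, stride):
            window = [matrix[r + i][c + j]
                      for i in range(size) for j in range(size)]
            row.append(reducer(window))
        output.append(row)
    return output

def mean(values):
    return sum(values) / len(values)

grid = [[1, 3, 2, 4],
        [2, 1, 8, 3],
        [5, 6, 1, 2],
        [4, 2, 7, 1]]

max_pooled = pool2d(grid, reducer=max)
avg_pooled = pool2d(grid, reducer=mean)
print(max_pooled)
print(avg_pooled)
```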
Global Average Pooling:
Average the entire feature map into a single value per channel. Used before the final classification layer.
3.6 Complete CNN Architecture
Typical CNN Structure:
Input Image (224×224×3)
↓
[CONV + ReLU] → [CONV + ReLU] → [POOL]
↓ (Learn low-level features)
[CONV + ReLU] → [CONV + ReLU] → [POOL]
↓ (Learn mid-level features)
[CONV + ReLU] → [CONV + ReLU] → [POOL]
↓ (Learn high-level features)
[FLATTEN]
↓
[Fully Connected + ReLU] → [Fully Connected + ReLU]
↓
[Output Layer + Softmax]
Example: LeNet-5 (Classic Architecture)
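Designed by Yann LeCun for handwritten digit recognition. The layout below is the common modernized form;
the original 1998 network used tanh-style activations and average-pooling (subsampling) layers rather than
ReLU and max pooling: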
Input: 32×32×1 (grayscale)
↓
Conv1: 6 filters (5×5) → 28×28×6 → ReLU
Pool1: Max pooling (2×2) → 14×14×6
↓
Conv2: 16 filters (5×5) → 10×10×16 → ReLU
Pool2: Max pooling (2×2) → 5×5×16
↓
Flatten: 5×5×16 = 400 neurons
FC1: 120 neurons → ReLU
FC2: 84 neurons → ReLU
Output: 10 neurons (digits 0-9) → Softmax
Modern CNN Architecture Principles:
1. Deeper is better (to a point): 20-200+ layers in modern networks
2. Small filters work best: Mostly 3×3, sometimes 5×5
3. Increase channels as you go deeper: 64 → 128 → 256 → 512
4. Reduce spatial size gradually: Use pooling or strided convolutions
5. Add skip connections: Helps training very deep networks (ResNet)
3.7 Famous CNN Architectures
AlexNet (2012):
8 layers, 60 million parameters
Won ImageNet competition by huge margin
Popularized ReLU and Dropout
VGGNet (2014):
16-19 layers
Uses only 3×3 filters
Simple and uniform architecture
138 million parameters
ResNet (2015):
Up to 152 layers!
Introduces skip connections
Greatly eases the vanishing-gradient and degradation problems in very deep networks
Calculation Example: Parameters in Conv Layer
Input: 64×64×3
Conv Layer: 32 filters of size 3×3
Parameters per filter = 3×3×3 (input channels) + 1 (bias) = 28
Total parameters = 28 × 32 filters = 896
Compare to fully connected:
64×64×3 × 32 = 393,216 parameters!
CNN wins: 896 vs 393,216 (440× fewer parameters)
3.8 Hands-On: Building a CNN for Image Classification
Problem: Classify images into 10 categories (CIFAR-10 dataset)
Architecture Design:
Input: 32×32×3 (RGB images). All convolutions use "same" padding, so only pooling changes the spatial size.
Block 1:
Conv2D(32 filters, 3×3) → ReLU → BatchNorm
Conv2D(32 filters, 3×3) → ReLU → BatchNorm
MaxPool(2×2)
Output: 16×16×32
Block 2:
Conv2D(64 filters, 3×3) → ReLU → BatchNorm
Conv2D(64 filters, 3×3) → ReLU → BatchNorm
MaxPool(2×2)
Output: 8×8×64
Block 3:
Conv2D(128 filters, 3×3) → ReLU → BatchNorm
Conv2D(128 filters, 3×3) → ReLU → BatchNorm
MaxPool(2×2)
Output: 4×4×128
Flatten: 4×4×128 = 2048
Dense(256) → ReLU → Dropout(0.5)
Dense(10) → Softmax
Training Strategy:
1. Data Augmentation: Random flips, rotations, crops
2. Optimizer: Adam with learning rate = 0.001
3. Batch Size: 64
4. Epochs: 50-100
5. Early Stopping: Stop if validation accuracy doesn't improve for 10 epochs
Expected Results (rough figures; actual numbers depend on augmentation, initialization, and training length):
Training Accuracy: ~90%
Validation Accuracy: ~75-80%
Common mistakes: Confusing similar categories (cat/dog, truck/automobile)
UNIT IV: RECURRENT NEURAL NETWORK (9 Hours)
4.1 Why RNNs? Understanding Sequential Data
The Sequential Data Problem:
Many real-world problems involve sequences:
Text: "The cat sat on the ___" (next word depends on previous words)
Speech: Audio signal over time
Stock Prices: Today's price influenced by past prices
Weather: Tomorrow's weather depends on recent patterns
Video: Sequence of frames
Why CNNs and MLPs Fail:
1. Fixed Input Size: Can't handle variable-length sequences
2. No Memory: Each prediction is independent
3. No Temporal Order: Position information is lost
Example Problem:
Sentiment Analysis: "The movie was not good"
An MLP sees: {movie, was, not, good}
Might predict: Positive (because "good" is positive)
An RNN understands: "not good" = negative
Correctly predicts: Negative
The Core Idea:
RNNs maintain a hidden state that acts as memory, carrying information from previous time steps.
RNN Cell Structure:
At each time step t:
Input: x_t (current input)
Previous hidden state: h_(t-1)
Compute:
h_t = tanh(W_hh × h_(t-1) + W_xh × x_t + b_h)
y_t = W_hy × h_t + b_y
Output: y_t
New hidden state: h_t (passed to next time step)
Visual Flow:
Time:    t=0     t=1     t=2     t=3
Input:   x_0     x_1     x_2     x_3
          ↓       ↓       ↓       ↓
Hidden:  h_0  →  h_1  →  h_2  →  h_3
          ↓       ↓       ↓       ↓
Output:  y_0     y_1     y_2     y_3
(Only the hidden states are chained from one time step to the next; each input x_t feeds its own step.)
Key Insight: The same weights (W_hh, W_xh, W_hy) are used at every time step! This is called parameter
sharing.
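A single RNN step is just two matrix-vector products, a bias, and a tanh. The sketch below implements the hidden-state update in plain Python. All weights and inputs are hypothetical small values, and the output projection y_t is left out to keep it short.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """h_t = tanh(W_hh h_(t-1) + W_xh x_t + b_h)."""
    pre_activation = [a + b + c for a, b, c in
                      zip(matvec(W_hh, h_prev), matvec(W_xh, x_t), b_h)]
    return [math.tanh(p) for p in pre_activation]

# Hypothetical weights: 2 input features, 3 hidden units.
W_xh = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.2]]
W_hh = [[0.1, 0.0, 0.2], [0.0, 0.3, -0.1], [0.2, -0.2, 0.1]]
b_h = [0.0, 0.1, -0.1]

sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h = [0.0, 0.0, 0.0]  # initial hidden state
states = []
for x_t in sequence:
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)  # same weights at every time step
    states.append(h)

for t, state in enumerate(states):
    print(t, [round(v, 3) for v in state])
```

The loop is the whole trick: the weights never change between steps, only the hidden state does.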
4.3 Mathematical Example: Character-Level RNN
Task: Predict the next character in "HELLO"
Vocabulary: {H, E, L, O} → One-hot encoded
H = [1, 0, 0, 0]
E = [0, 1, 0, 0]
L = [0, 0, 1, 0]
O = [0, 0, 0, 1]
RNN Parameters (simplified):
Hidden size: 3
W_xh: 3×4 matrix (input to hidden), so that W_xh × x_t is a 3-vector
W_hh: 3×3 matrix (hidden to hidden)
W_hy: 4×3 matrix (hidden to output), so that W_hy × h_t is a 4-vector
Forward Pass Example:
Time step 0: Input = 'H'
x_0 = [1, 0, 0, 0]
h_-1 = [0, 0, 0] (initialized to zeros)
h_0 = tanh(W_hh × [0,0,0] + W_xh × [1,0,0,0] + b_h)
= tanh([0.2, -0.1, 0.3]) = [0.197, -0.100, 0.291]
y_0 = softmax(W_hy × h_0 + b_y)
= [0.1, 0.6, 0.2, 0.1]
Prediction: 'E' (highest probability)
Actual next character: 'E' ✓
Time step 1: Input = 'E'
x_1 = [0, 1, 0, 0]
h_0 = [0.197, -0.100, 0.291] (from previous step)
h_1 = tanh(W_hh × h_0 + W_xh × x_1 + b_h)
= [0.15, 0.22, -0.08]
y_1 = softmax(W_hy × h_1 + b_y)
= [0.05, 0.15, 0.75, 0.05]
Prediction: 'L' ✓
The hidden state h_0 carries information about 'H' forward, allowing the network to understand context!
4.4 Comparing RNNs and Standard Neural Networks
RNN vs MLP:
RNN vs CNN:
Aspect | CNN | RNN
Structure | Spatial hierarchy | Temporal sequence
Translation | Invariant | Position matters
Best for | Images | Text, time series
Parallelization | Easy | Difficult
4.5 Types of RNN Architectures
1. One-to-One (Standard Neural Network):
Single input → Single output
Example: Image classification (one image → one label)
2. One-to-Many:
Single input → Sequence output
Example: Image captioning (image → sentence)
Image → [RNN] → "A" → "cat" → "sitting" → "on" → "mat"
3. Many-to-One:
Sequence input → Single output
Example: Sentiment analysis (sentence → positive/negative)
"This" → "movie" → "is" → "great" → [RNN] → Positive
4. Many-to-Many (Same length):
Sequence input → Sequence output (aligned)
Example: Video frame labeling (frame 1 label, frame 2 label, ...)
5. Many-to-Many (Different lengths):
Sequence input → Sequence output (not aligned)
Example: Machine translation (English sentence → French sentence)
"Hello world" → [Encoder RNN] → [Decoder RNN] → "Bonjour monde"
4.6 Components of an RNN Cell
Weights in RNN:
1. W_xh (Input-to-Hidden): Transforms input to hidden state space
2. W_hh (Hidden-to-Hidden): Maintains temporal dependencies
3. W_hy (Hidden-to-Output): Produces output from hidden state
Activation Functions:
Hidden State: Typically tanh (range: -1 to 1)
Zero-centered helps with gradient flow
Output: Depends on task
Classification: Softmax
Regression: Linear
Binary: Sigmoid
Bias Terms:
b_h: Added to hidden state computation
b_y: Added to output computation
4.7 Training RNNs: Backpropagation Through Time (BPTT)
The Challenge:
RNNs have temporal dependencies, so we need to backpropagate through all time steps.
BPTT Algorithm:
1. Unroll the RNN through time
2. Forward pass through all time steps
3. Compute loss at relevant time steps
4. Backward pass from final to initial time step
5. Accumulate gradients across all time steps
6. Update weights using accumulated gradients
Mathematical Flow:
Forward: t=0 → t=1 → t=2 → ... → t=T
Backward: t=T → ... → t=2 → t=1 → t=0
At each time step going backward:
∂L/∂h_t = ∂L/∂h_(t+1) × ∂h_(t+1)/∂h_t + ∂L/∂y_t × ∂y_t/∂h_t
Gradient Accumulation:
∂L/∂W_hh = Σ (∂L/∂h_t × ∂h_t/∂W_hh) for all t
This sum considers how W_hh affects loss at ALL time steps
4.8 The Vanishing and Exploding Gradient Problem
The Core Issue:
When backpropagating through many time steps, gradients can become extremely small (vanish) or extremely
large (explode).
Why This Happens:
Vanishing Gradient:
∂h_t/∂h_0 = ∂h_t/∂h_(t-1) × ∂h_(t-1)/∂h_(t-2) × ... × ∂h_1/∂h_0
If each term < 1, the product approaches 0 exponentially:
0.5 × 0.5 × 0.5 × ... (50 times) ≈ 0.0000000000000009
Result: Network can't learn long-term dependencies
Exploding Gradient:
If each term > 1, the product grows exponentially:
1.5 × 1.5 × 1.5 × ... (50 times) ≈ 637,621,500 (about 6.4 × 10⁸)
Result: Weights update by huge amounts, training becomes unstable
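Both effects are easy to see numerically. The sketch below multiplies a constant factor 50 times, standing in for 50 chained ∂h_t/∂h_(t-1) terms. Real terms are matrices that vary per step, so this is only an illustration.

```python
def repeated_product(factor, steps):
    """Multiply `factor` by itself `steps` times."""
    product = 1.0
    for _ in range(steps):
        product *= factor
    return product

vanishing = repeated_product(0.5, 50)
exploding = repeated_product(1.5, 50)
print(f"0.5^50 = {vanishing:.2e}")
print(f"1.5^50 = {exploding:.2e}")
```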
Real-World Impact:
Consider: "The cat, which was given to me by my friend who lives in France, was very cute"
Task: Predict the main verb (the second "was")
Problem: The subject "cat" is 13 words earlier. Vanilla RNNs struggle with this distance!
Solutions:
1. Gradient Clipping (for exploding):
if ||gradient|| > threshold:
    gradient = gradient × (threshold / ||gradient||)
2. Better Architectures: LSTM and GRU (next section!)
3. Better Initialization: Xavier/He initialization
4. Better Activation Functions: ReLU instead of tanh (sometimes)
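Gradient clipping from solution 1 takes only a few lines. This is a minimal sketch on a plain list of gradient values; the function name and the example numbers are illustrative.

```python
import math

def clip_by_norm(gradient, threshold):
    """Rescale the gradient so its L2 norm is at most `threshold`."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm > threshold:
        scale = threshold / norm
        return [g * scale for g in gradient]
    return list(gradient)

clipped = clip_by_norm([3.0, 4.0], threshold=1.0)   # norm 5 -> rescaled to norm 1
untouched = clip_by_norm([0.3, 0.4], threshold=1.0) # norm 0.5 -> left alone
print(clipped, untouched)
```

Clipping changes the length of the gradient but not its direction, so the update still points downhill.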
4.9 Hands-On Example: Sentiment Analysis with RNN
Problem: Classify movie reviews as positive or negative
Example Reviews:
"This movie was absolutely fantastic!" → Positive
"Waste of time and money" → Negative
Architecture:
Input: Sequence of word embeddings
↓
Embedding Layer (convert words to vectors)
each word → 128-dimensional vector
↓
RNN Layer (64 hidden units)
processes sequence, maintains context
↓
Take final hidden state h_T
↓
Dense Layer (32 neurons) → ReLU
↓
Output (1 neuron) → Sigmoid
output > 0.5 = Positive
output ≤ 0.5 = Negative
Training Process:
Example: "Great movie"
Step 1: Word to index
"Great" → 342, "movie" → 89
Step 2: Embedding
342 → [0.23, -0.45, ..., 0.12] (128 dims)
89 → [0.67, 0.34, ..., -0.23] (128 dims)
Step 3: RNN Processing
t=0: Input="Great" embedding, h_0=zeros
→ h_1 = [0.45, -0.23, ..., 0.67]
t=1: Input="movie" embedding, h_1 from previous
→ h_2 = [0.78, 0.34, ..., -0.12]
Step 4: Classification
Dense(h_2) → [0.34, -0.23, ...]
Sigmoid → 0.87 → Positive ✓
UNIT V: LONG SHORT-TERM MEMORY (LSTM) (9 Hours)
5.1 Limitations of Vanilla RNNs
The Long-Term Dependency Problem:
Vanilla RNNs struggle to connect information over long sequences due to vanishing gradients.
Example Problems:
1. Language Modeling:
"I grew up in France... [50 words later]... I speak fluent _____"
Vanilla RNN: Likely forgets "France" by this point
Correct answer: "French"
2. Question Answering:
Context: "John was born in 1990. He studied medicine. [Many sentences]"
Question: "When was John born?"
RNN needs to remember "1990" through all the context
Why This Matters:
Most real-world sequences have long-term dependencies. We need a better memory mechanism!
5.2 LSTM Architecture: Solving the Memory Problem
The Key Innovation:
LSTMs introduce a cell state that runs through the entire sequence with minimal modifications. Think of it as a
"memory highway" that allows information to flow unchanged.
LSTM Components:
1. Cell State (C_t): The long-term memory
2. Hidden State (h_t): The short-term memory (output)
3. Three Gates: Control information flow
5.3 The Three Gates of LSTM
1. Forget Gate (f_t):
Decides what information to throw away from the cell state.
f_t = σ(W_f × [h_(t-1), x_t] + b_f)
Output: Values between 0 and 1
- 0 = "completely forget this"
- 1 = "completely keep this"
Example:
Context: "The cat was hungry. The dog..."
When processing "dog", forget gate:
- Forgets subject "cat" (output ≈ 0.1)
- Prepares for new subject "dog"
2. Input Gate (i_t):
Decides what new information to store in the cell state.
i_t = σ(W_i × [h_(t-1), x_t] + b_i)
C̃_t = tanh(W_C × [h_(t-1), x_t] + b_C)
i_t: How much to update (0 to 1)
C̃_t: Candidate values to add (-1 to 1)
Example:
Processing "dog":
- Input gate opens (i_t ≈ 0.9)
- Candidate: Information about "dog" being new subject
- Adds this to cell state
3. Output Gate (o_t):
Decides what to output based on cell state.
o_t = σ(W_o × [h_(t-1), x_t] + b_o)
h_t = o_t * tanh(C_t)
o_t: What parts to output
h_t: Final hidden state (filtered cell state)
5.4 LSTM Forward Pass: Complete Mathematical Example
Given:
Previous hidden state: h_(t-1) = [0.5, -0.3]
Previous cell state: C_(t-1) = [0.8, 0.4]
Current input: x_t = [1.0, 0.0]
(Using simplified 2D vectors for clarity)
Step 1: Forget Gate
f_t = σ(W_f × [h_(t-1), x_t] + b_f)
= σ(W_f × [0.5, -0.3, 1.0, 0.0] + b_f)
= σ([0.4, 0.7])
= [0.60, 0.67]
Interpretation: Keep 60% of first memory, 67% of second memory
Step 2: Input Gate + Candidate
i_t = σ(W_i × [0.5, -0.3, 1.0, 0.0] + b_i)
= [0.75, 0.82]
C̃_t = tanh(W_C × [0.5, -0.3, 1.0, 0.0] + b_C)
= [0.45, -0.23]
Interpretation: Add 75% of first candidate, 82% of second candidate
Step 3: Update Cell State
C_t = f_t * C_(t-1) + i_t * C̃_t
= [0.60, 0.67] * [0.8, 0.4] + [0.75, 0.82] * [0.45, -0.23]
= [0.48, 0.268] + [0.338, -0.189]
= [0.818, 0.079]
Interpretation: New long-term memory combines old and new info
Step 4: Output Gate + Hidden State
o_t = σ(W_o × [0.5, -0.3, 1.0, 0.0] + b_o)
= [0.72, 0.68]
h_t = o_t * tanh(C_t)
= [0.72, 0.68] * tanh([0.818, 0.079])
= [0.72, 0.68] * [0.674, 0.079]
= [0.485, 0.054]
Interpretation: Output filtered version of cell state
Result:
New cell state C_t carries long-term memory
New hidden state h_t is passed to next layer/time step
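Steps 3 and 4 can be reproduced in code. Since the example does not give the weight matrices, this sketch takes the gate activations (f_t, i_t, C̃_t, o_t) from the steps above as inputs and only performs the state update. The function and argument names are illustrative.

```python
import math

def lstm_state_update(f_t, i_t, c_candidate, o_t, c_prev):
    """Given gate activations, compute the new cell and hidden states.

    C_t = f_t * C_(t-1) + i_t * C~_t   (element-wise)
    h_t = o_t * tanh(C_t)              (element-wise)
    """
    c_t = [f * c + i * g
           for f, c, i, g in zip(f_t, c_prev, i_t, c_candidate)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return c_t, h_t

c_t, h_t = lstm_state_update(
    f_t=[0.60, 0.67],
    i_t=[0.75, 0.82],
    c_candidate=[0.45, -0.23],
    o_t=[0.72, 0.68],
    c_prev=[0.8, 0.4],
)
print([round(v, 3) for v in c_t])
print([round(v, 3) for v in h_t])
```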
5.5 How LSTM Solves Vanishing Gradient
The Cell State Highway:
C_t = f_t * C_(t-1) + i_t * C̃_t
Gradient flow:
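∂C_t/∂C_(t-1) ≈ f_t (treating the gates as constants)
Key insight: The cell state is updated by ADDITION, so along this path the gradient is scaled only by the
forget gate, not by a weight matrix and a tanh derivative at every step!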
Comparison:
Vanilla RNN:
∂h_t/∂h_0 = ∂h_t/∂h_(t-1) × ∂h_(t-1)/∂h_(t-2) × ... × ∂h_1/∂h_0
Multiple multiplications → vanishing gradient
LSTM:
∂C_t/∂C_0 = f_t × f_(t-1) × ... × f_1
But if forget gates ≈ 1 (keep information):
∂C_t/∂C_0 ≈ 1 × 1 × ... × 1 = 1
Gradient flows nearly unchanged through time!
Real-World Impact:
In practice, LSTMs can often learn dependencies spanning 100+ time steps, while vanilla RNNs tend to struggle
beyond roughly 10-15 steps.
5.6 Comparing RNN and LSTM
Architecture Comparison:
Feature | Vanilla RNN | LSTM
Memory | Single hidden state | Cell state + hidden state
Gates | None | 3 (forget, input, output)
Parameters | ~N² | ~4N²
Long-term memory | Poor | Excellent
Training speed | Fast | Slower
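The ~N² versus ~4N² row can be made concrete. In the sketch below, each "block" is one weight matrix acting on [h_(t-1), x_t] plus a bias: a vanilla RNN has one block, an LSTM has four (forget, input, candidate, output). The sizes N = 256 and M = 128 are hypothetical, and the output layer is not counted.

```python
def recurrent_parameter_count(hidden_size, input_size, num_blocks):
    """Parameters in `num_blocks` weight sets of shape hidden x (hidden + input), plus biases."""
    per_block = hidden_size * (hidden_size + input_size) + hidden_size
    return num_blocks * per_block

N, M = 256, 128  # hypothetical hidden size and input size
rnn_params = recurrent_parameter_count(N, M, num_blocks=1)
lstm_params = recurrent_parameter_count(N, M, num_blocks=4)
print(rnn_params, lstm_params, lstm_params / rnn_params)
```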
When to Use Each:
Vanilla RNN:
Short sequences (< 30 time steps)
Real-time applications needing speed
Simple patterns
Example: Next character prediction in short words
LSTM:
Long sequences (> 30 time steps)
Complex dependencies
Better performance needed
Example: Document classification, machine translation
5.7 LSTM Variants
1. Peephole Connections:
Allow gates to look at the cell state directly.
f_t = σ(W_f × [C_(t-1), h_(t-1), x_t] + b_f)
Cell state directly influences gate decisions
2. Coupled Forget and Input Gates:
Simplification: When we forget, we must input something new, and vice versa.
f_t = σ(...)
i_t = 1 - f_t
Fewer parameters, sometimes works just as well
5.8 GRU (Gated Recurrent Unit): LSTM's Simpler Cousin
Key Differences from LSTM:
1. No separate cell state (only hidden state)
2. Two gates instead of three:
Reset gate (r_t)
Update gate (z_t)
3. Fewer parameters: ~3N² instead of 4N²
GRU Equations:
Reset Gate:
r_t = σ(W_r × [h_(t-1), x_t])
Update Gate:
z_t = σ(W_z × [h_(t-1), x_t])
Candidate Hidden State:
h̃_t = tanh(W × [r_t * h_(t-1), x_t])
Final Hidden State:
h_t = (1 - z_t) * h_(t-1) + z_t * h̃_t
Update Gate Intuition:
z_t = 0: Completely ignore new input, keep old hidden state
z_t = 1: Completely forget old state, use new candidate
z_t = 0.5: Balance between old and new
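The four GRU equations translate directly into code. This is a minimal sketch with hypothetical weights (hidden size 2, input size 1, no biases, matching the equations above). W_z is set to all zeros so that z_t = 0.5 and the new state is an even blend of old state and candidate.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def gru_step(x_t, h_prev, W_r, W_z, W_h):
    """One GRU time step; returns (h_t, z_t, candidate)."""
    concat = h_prev + x_t                      # [h_(t-1), x_t] as one list
    r_t = [sigmoid(v) for v in matvec(W_r, concat)]
    z_t = [sigmoid(v) for v in matvec(W_z, concat)]
    reset_concat = [r * h for r, h in zip(r_t, h_prev)] + x_t
    candidate = [math.tanh(v) for v in matvec(W_h, reset_concat)]
    h_t = [(1 - z) * h + z * c
           for z, h, c in zip(z_t, h_prev, candidate)]
    return h_t, z_t, candidate

# Hypothetical weights: each matrix is hidden x (hidden + input) = 2 x 3.
W_r = [[0.1, -0.2, 0.3], [0.0, 0.4, -0.1]]
W_z = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # zeros -> z_t = sigmoid(0) = 0.5
W_h = [[0.5, -0.3, 0.8], [0.2, 0.1, -0.4]]

h_prev = [0.2, -0.1]
h_t, z_t, candidate = gru_step([1.0], h_prev, W_r, W_z, W_h)
print(z_t, [round(v, 3) for v in h_t])
```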
LSTM vs GRU:
Aspect | LSTM | GRU
Parameters | More | Fewer (about 25% less)
Training speed | Slower | Faster
Memory capacity | Better | Good
Performance | Slight edge on complex tasks | Similar on most tasks
Practical Advice: Try GRU first (faster training). If performance isn't good enough, try LSTM.
5.9 Components Breakdown: Forget, Input, Output Gates
Forget Gate Deep Dive:
Purpose: Selectively forget irrelevant information
Example - Subject Tracking:
Sentence: "The cat sat on the mat. The dog..."
At "cat": Cell state stores [subject=cat, number=singular, ...]
At "dog":
- Forget gate nearly closes for "subject" (f_t ≈ 0.1), erasing most of "cat"
- New subject "dog" can replace "cat"
- Forget gate stays open for "number=singular" (f_t ≈ 0.9), so it is kept
Input Gate Deep Dive:
Purpose: Decide what new information to add
Example - Sentiment Accumulation:
Review: "The movie had great acting but terrible plot"
At "great": i_t high, adds positive sentiment
At "terrible": i_t high, adds negative sentiment
Final: Balanced sentiment (mixed review)
Output Gate Deep Dive:
Purpose: Filter what information to output
Example - Next Word Prediction:
Input: "The capital of France is"
Cell state contains: [country=France, topic=geography, ...]
Output gate: Filters to output only "capital city" feature
Prediction: "Paris"
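To see how the three gates work together, here is one LSTM time step written in plain Python. It is a sketch under assumptions: it uses the standard formulation, with the output taken as h_t = o_t * tanh(C_t), stores weights as lists of rows, and keeps them in a dictionary whose key names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(W, v, b):
    """Compute W × v + b for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + b_k for row, b_k in zip(W, b)]

def lstm_step(h_prev, c_prev, x, p):
    concat = h_prev + x                                           # [h_(t-1), x_t]
    f = [sigmoid(a) for a in affine(p["W_f"], concat, p["b_f"])]  # forget gate
    i = [sigmoid(a) for a in affine(p["W_i"], concat, p["b_i"])]  # input gate
    o = [sigmoid(a) for a in affine(p["W_o"], concat, p["b_o"])]  # output gate
    c_cand = [math.tanh(a) for a in affine(p["W_c"], concat, p["b_c"])]
    # Forget part of the old cell state, then add the gated candidate.
    c = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c_prev, i, c_cand)]
    # The output gate filters what the cell state exposes as hidden state.
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
    return h, c

# Assumed toy cell: 1 hidden unit, 1 input, all weights and biases zero.
params = {name: [[0.0, 0.0]] for name in ("W_f", "W_i", "W_o", "W_c")}
params.update({name: [0.0] for name in ("b_f", "b_i", "b_o", "b_c")})
h, c = lstm_step([0.0], [2.0], [1.0], params)
print(h, c)
```

With zero weights every gate is 0.5 and the candidate is 0, so the cell state drops from 2.0 to 1.0 and only half of tanh(1.0) is exposed as output.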
5.10 Hands-On: Text Generation with LSTM
Problem: Train LSTM to generate Shakespeare-style text
Architecture:
Input: Sequence of characters
↓
Embedding Layer: Character → Vector (128 dim)
↓
LSTM Layer 1: 256 units
↓
LSTM Layer 2: 256 units (stacked for more capacity)
↓
Dense Layer: vocab_size neurons (one per character)
↓
Softmax: Probability distribution over next character
Training Example:
Input sequence: "To be or not to" (15 characters)
Target: Next character is " " (space)
Process character by character:
'T' → h_1, C_1
'o' → h_2, C_2
' ' → h_3, C_3
...
'o' → h_15, C_15
At final 'o':
Output = softmax(Dense(h_15))
= [p('a')=0.02, p(' ')=0.65, p('b')=0.05, ...]
Loss = -log(p(' ')) = -log(0.65) ≈ 0.43 (natural log)
Backpropagate through time, update weights
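The loss in the training example can be reproduced in a few lines. In this sketch the three listed probabilities come from the example, and the remaining mass (0.28) is lumped into a single hypothetical "other" entry so that the distribution sums to 1.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target_index):
    """Negative natural log of the probability given to the correct class."""
    return -math.log(probs[target_index])

# Order: 'a', ' ', 'b', other (the last entry is an assumption).
probs = [0.02, 0.65, 0.05, 0.28]
loss = cross_entropy(probs, target_index=1)
print(f"Loss = {loss:.2f}")
```

The more probability the model puts on the correct character, the closer the loss gets to 0.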
Generation Process:
Seed text: "To be"
Step 1: Input "To be" → Model outputs distribution
Sample character: ' ' (space)
Step 2: Input "To be " → Model outputs distribution
Sample character: 'o'
Step 3: Input "To be o" → Model outputs distribution
Sample character: 'r'
Continue until desired length...
Result: "To be or not to be, that is the question..."
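The generation loop itself is short. The sketch below assumes the trained network is available as a function that maps the text so far to a dictionary of next-character probabilities. Here `toy_model` is a hypothetical stand-in that always continues the quote, not a real LSTM.

```python
import random

def generate(seed, next_char_probs, length, rng):
    """Repeatedly sample one character and append it to the text."""
    text = seed
    for _ in range(length):
        probs = next_char_probs(text)            # dict: character -> probability
        chars = list(probs)
        weights = [probs[ch] for ch in chars]
        text += rng.choices(chars, weights=weights, k=1)[0]
    return text

def toy_model(text):
    # Hypothetical stand-in for a trained LSTM: puts all probability
    # on the next character of a fixed quote.
    quote = "To be or not to be"
    next_char = quote[len(text)] if len(text) < len(quote) else "."
    return {next_char: 1.0}

print(generate("To be", toy_model, 13, random.Random(0)))
```

With a real model the distribution would spread over many characters, so each run could produce different text.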
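Temperature Sampling:
Control randomness of generation by dividing the logits by a temperature T before the softmax:
p_new = softmax(logits / T)
High temperature (T=1.5): Flatter distribution, more random and creative
Low temperature (T=0.5): Sharper distribution, more conservative and predictable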
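A small sketch of temperature scaling, assuming the model's raw scores (logits) are available as a list. The logit values are made up for illustration.

```python
import math

def softmax_with_temperature(logits, T):
    """Compute softmax(logits / T)."""
    scaled = [v / T for v in logits]
    m = max(scaled)                      # shift by the max for numerical stability
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]                 # illustrative scores for 3 characters
for T in (0.5, 1.0, 1.5):
    probs = softmax_with_temperature(logits, T)
    print(T, [round(p, 3) for p in probs])
```

Lowering T pushes more probability onto the top-scoring character, while raising T spreads probability toward the less likely ones.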
5.11 Applications of LSTMs
1. Machine Translation:
English → [Encoder LSTM] → Context Vector → [Decoder LSTM] → French
"Hello world" → [0.34, -0.23, ...] → "Bonjour le monde"
2. Speech Recognition:
Audio waveform → [Feature extraction] → [LSTM] → Text transcription
Handles variable-length audio naturally
3. Video Classification:
Frame 1, Frame 2, ..., Frame N → [LSTM] → Action label
Understands temporal patterns in video
4. Music Generation:
Seed melody → [LSTM] → Next note → Feed back → Next note → ...
Learns rhythm, harmony, and structure
5. Anomaly Detection:
Normal sequence: t1, t2, t3, ... → [LSTM predicts] → t4
If actual t4 very different from prediction → Anomaly!
Used in: Fraud detection, equipment monitoring, cybersecurity
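The anomaly detection idea (item 5) can be sketched as a comparison between each actual value and a prediction made from the values before it. In this sketch `mean_predictor` is a hypothetical stand-in for a trained LSTM, and the window size, threshold, and sensor readings are all assumed values.

```python
def find_anomalies(series, predict_next, threshold, window=3):
    """Return the indices whose value differs too much from the prediction."""
    anomalies = []
    for t in range(window, len(series)):
        predicted = predict_next(series[t - window:t])
        if abs(series[t] - predicted) > threshold:
            anomalies.append(t)
    return anomalies

def mean_predictor(recent_values):
    # Hypothetical stand-in for an LSTM: predict the mean of recent values.
    return sum(recent_values) / len(recent_values)

readings = [10.0, 10.2, 9.9, 10.1, 10.0, 25.0, 10.1, 9.8]
print(find_anomalies(readings, mean_predictor, threshold=6.0))
```

The threshold controls the trade-off: a lower value catches more anomalies but also raises more false alarms.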
Summary of All 5 Units
Unit I: Foundation
Deep learning basics and motivation
Perceptrons and basic neural networks
Activation functions and training concepts
Unit II: Training Deep Networks
Forward and backward propagation
Optimization algorithms
Multi-layer perceptrons and practical training
Unit III: Spatial Data (Images)
Convolutional neural networks
Filters, pooling, and feature hierarchies
CNN architectures for computer vision
Unit IV: Sequential Data (Basic)
Recurrent neural networks
Handling variable-length sequences
Temporal dependencies
Unit V: Sequential Data (Advanced)
LSTM architecture and gates
Solving vanishing gradients
Advanced sequence modeling
Key Formulas Reference
Perceptron:
output = step(Σ(w_i × x_i) + b)
Sigmoid:
σ(x) = 1 / (1 + e^(-x))
ReLU:
ReLU(x) = max(0, x)
Gradient Descent:
w_new = w_old - η × (∂L/∂w)
Convolution Output Size:
Output = floor((Input - Filter + 2×Padding) / Stride) + 1
LSTM Forget Gate:
f_t = σ(W_f × [h_(t-1), x_t] + b_f)
LSTM Cell State Update:
C_t = f_t * C_(t-1) + i_t * C̃_t
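For calculation practice, several of the formulas above translate directly into Python. This is a sketch with hypothetical function names; the convolution size uses integer (floor) division, which matters when the stride does not divide evenly.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    return max(0.0, x)

def gradient_step(w_old, grad, learning_rate):
    """One gradient descent update: w_new = w_old - η × (∂L/∂w)."""
    return w_old - learning_rate * grad

def conv_output_size(input_size, filter_size, padding=0, stride=1):
    """Output width (or height) of a convolution along one dimension."""
    return (input_size - filter_size + 2 * padding) // stride + 1

print(sigmoid(0.0))                       # midpoint of the sigmoid
print(relu(-3.0), relu(2.5))
print(gradient_step(1.0, 0.5, 0.1))
print(conv_output_size(32, 5))            # 32x32 input, 5x5 filter
print(conv_output_size(224, 7, padding=3, stride=2))
```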
Exam Preparation Tips
Theory Questions (80%):
1. Understand concepts deeply: Don't just memorize, understand WHY
2. Draw diagrams: Network architectures, data flow, gate mechanisms
3. Compare and contrast: RNN vs LSTM, CNN vs MLP, different activation functions
4. Real-world applications: Be able to explain where each architecture shines
5. Limitations: Know what each model cannot do well
Math Calculations (20%):
1. Practice forward pass: Given inputs and weights, compute outputs
2. Parameter counting: Calculate number of parameters in layers
3. Output size calculations: For convolutions and pooling
4. Simple backprop: Compute gradients for toy examples
5. Probability calculations: Softmax, cross-entropy loss
Common Exam Topics:
Explain vanishing gradient problem and LSTM solution
Design CNN architecture for given problem
Calculate convolution output dimensions
Compare different activation functions
Trace data flow through RNN time steps
Explain role of each LSTM gate with examples
Good luck with your exam! 🎓