
Advanced AI - Complete Course Notes

The document outlines a comprehensive course on Advanced Artificial Intelligence, focusing on deep learning and neural networks over 45 hours. It covers foundational concepts such as perceptrons, neural network architecture, activation functions, training processes, and backpropagation. The course also discusses the advantages and disadvantages of deep learning, along with practical examples and applications across various industries.

Uploaded by

jeyajanaani08
Copyright
© © All Rights Reserved

Advanced Artificial Intelligence - Complete Course Notes

Course Code: EI6523803


Total Hours: 45 Hours (9 hours per unit)

UNIT I: INTRODUCTION TO DEEP LEARNING & PERCEPTRONS (9 Hours)


1.1 Understanding Deep Learning
What is Deep Learning?

Deep Learning is a subset of machine learning that uses artificial neural networks with multiple layers (hence
"deep") to progressively extract higher-level features from raw input. Unlike traditional machine learning where
you manually engineer features, deep learning automatically learns hierarchical representations of data.

Real-World Example: When you upload a photo to Facebook and it automatically tags your friends, that's deep
learning at work. The network learns to recognize faces through multiple layers - first detecting edges, then
facial features like eyes and nose, then complete faces, and finally identifying specific individuals.

The Need for Deep Learning:

Traditional machine learning algorithms often plateau as you feed them more data. Deep learning models,
given enough capacity and compute, tend to keep improving as the amount of training data grows. This
makes them well suited to:

Image and video recognition

Natural language processing

Speech recognition

Autonomous vehicles

Medical diagnosis

Applications Across Industries:

1. Healthcare: Detecting diseases from X-rays and MRI scans, with accuracy that in some studies matches or
exceeds that of human radiologists

2. Finance: Fraud detection, algorithmic trading, credit scoring

3. Retail: Recommendation systems (Netflix, Amazon), inventory management

4. Automotive: Self-driving cars using computer vision and sensor fusion

5. Entertainment: Deepfakes, content generation, game AI

1.2 Neural Networks: The Foundation


Biological Inspiration:
The artificial neural network mimics how neurons in your brain work. Each biological neuron:

Receives signals through dendrites

Processes the signal in the cell body

Sends output through the axon if the signal is strong enough

Similarly, an artificial neuron:

Receives weighted inputs

Sums them up and adds a bias

Applies an activation function

Outputs the result

Mathematical Representation:

For a single neuron:

z = (w₁ × x₁) + (w₂ × x₂) + ... + (wₙ × xₙ) + b


output = activation_function(z)

Where:

x₁, x₂, ..., xₙ are inputs

w₁, w₂, ..., wₙ are weights

b is bias

z is the weighted sum

Example Calculation:

Let's say you're predicting if a student will pass an exam based on:

Hours studied (x₁ = 5)

Previous test score (x₂ = 75)

With weights w₁ = 0.3, w₂ = 0.02, and bias b = -2:

z = (0.3 × 5) + (0.02 × 75) + (-2)


z = 1.5 + 1.5 - 2 = 1.0
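
The calculation above can be reproduced in a few lines of Python. This is a minimal sketch: the function name neuron_output is hypothetical, and the sigmoid activation is an assumption made for illustration, since the example stops at the weighted sum.

```python
import math

def neuron_output(inputs, weights, bias):
    """Return the weighted sum z and a sigmoid activation of z."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    activation = 1.0 / (1.0 + math.exp(-z))  # sigmoid, chosen only for illustration
    return z, activation

# Values from the student example: hours studied = 5, previous score = 75
z, a = neuron_output([5, 75], [0.3, 0.02], -2)
print(f"z = {z:.2f}, sigmoid(z) = {a:.3f}")
```

With a sigmoid, z = 1.0 maps to roughly 0.73, which could be read as a 73% estimated chance of passing.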

1.3 Perceptron: The Simplest Neural Network


Concepts of Perceptron:
A perceptron is the fundamental building block of neural networks, invented by Frank Rosenblatt in 1958. It's a
binary classifier that makes decisions by weighing evidence.

How Perceptron Works:

1. Input Layer: Receives features

2. Weights: Each input has an associated weight showing its importance

3. Summation: Calculates weighted sum

4. Activation: Applies a step function (outputs 0 or 1)

Bias in Neural Networks:

Bias is like the y-intercept in a linear equation. It allows the activation function to be shifted left or right, which
is critical for fitting the data better.

Think of it this way: If you're deciding whether to go to a party:

Weights determine how much you care about each factor (friends going, location, time)

Bias represents your general enthusiasm for parties (some people are naturally more inclined to go)

Perceptron Learning Algorithm:

The perceptron learns through a simple update rule:

For each training example (x, y):

    prediction = step_function(w · x + b)
    if prediction ≠ y:
        w = w + η × (y - prediction) × x
        b = b + η × (y - prediction)

Where η (eta) is the learning rate (typically 0.01 to 0.3)

Practical Example:

Let's build a perceptron to classify if it's a good day for ice cream:

Input 1: Temperature (normalized 0-1)

Input 2: Is it sunny? (1=yes, 0=no)

Initial weights: w₁ = 0.5, w₂ = 0.3, bias = -0.4

Training Example 1: Temp=0.8, Sunny=1, Label=1 (Good day)

z = (0.5 × 0.8) + (0.3 × 1) - 0.4 = 0.4 + 0.3 - 0.4 = 0.3


prediction = step(0.3) = 1 ✓ (Correct!)

Training Example 2: Temp=0.2, Sunny=0, Label=0 (Bad day)

z = (0.5 × 0.2) + (0.3 × 0) - 0.4 = 0.1 - 0.4 = -0.3


prediction = step(-0.3) = 0 ✓ (Correct!)
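
The update rule and the two ice cream examples can be combined into a small training loop. This is a sketch under stated assumptions: the function names, the learning rate of 0.1 and the epoch count are illustrative choices, not values from the course.

```python
def step(z):
    """Step activation: 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0

def predict(weights, bias, x):
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return step(z)

def train_perceptron(data, weights, bias, lr=0.1, epochs=10):
    """Apply the perceptron update rule; returns the learned weights and bias."""
    weights = list(weights)
    for _ in range(epochs):
        for x, y in data:
            error = y - predict(weights, bias, x)
            if error != 0:  # update only on a misclassification
                weights = [w + lr * error * xi for w, xi in zip(weights, x)]
                bias += lr * error
    return weights, bias

# The two examples above: (temperature, sunny) -> label
data = [((0.8, 1), 1), ((0.2, 0), 0)]
weights, bias = train_perceptron(data, [0.5, 0.3], -0.4)
print(weights, bias)
```

Both examples are already classified correctly by the initial weights, so the loop leaves them unchanged. A misclassified example would nudge the weights toward the correct side.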

1.4 Advantages and Disadvantages of Deep Learning


Advantages:

1. Automatic Feature Learning: No need for manual feature engineering

2. Scalability: Performance improves with more data

3. Versatility: Works across multiple domains (vision, speech, text)

4. Transfer Learning: Models trained on one task can be adapted to others

5. End-to-End Learning: Can learn directly from raw data to output

Disadvantages:

1. Data Hungry: Requires massive amounts of labeled data

2. Computational Cost: Needs powerful GPUs for training

3. Black Box Nature: Hard to interpret why a model made a specific decision

4. Overfitting Risk: Can memorize training data instead of learning patterns

5. Long Training Time: Complex models can take days or weeks to train

1.5 Logic Gates and Simple Networks


Implementing AND, OR, NAND, XOR with Perceptrons:

AND Gate:

Inputs: x₁, x₂ ∈ {0, 1}


Weights: w₁ = 1, w₂ = 1
Bias: b = -1.5

Truth Table:
x₁ | x₂ | z = x₁ + x₂ - 1.5 | output
0  | 0  | -1.5              | 0
0  | 1  | -0.5              | 0
1  | 0  | -0.5              | 0
1  | 1  | 0.5               | 1

OR Gate:
Weights: w₁ = 1, w₂ = 1
Bias: b = -0.5

The threshold is lower, so any 1 input triggers output

NAND Gate:

Weights: w₁ = -1, w₂ = -1
Bias: b = 1.5

Negative weights invert the AND gate

XOR Problem - The Perceptron's Limitation:

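XOR (exclusive OR) cannot be solved by a single perceptron because it's not linearly separable. Marvin Minsky
and Seymour Papert analyzed this limitation in their 1969 book Perceptrons, which is widely credited with
contributing to the decline in neural network research often called the first "AI Winter."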

XOR Truth Table:

x₁ | x₂ | output
0  | 0  | 0
0  | 1  | 1
1  | 0  | 1
1  | 1  | 0

You cannot draw a single straight line to separate the 1s from the 0s on a 2D plot.

Solution: Multi-layer networks! By combining multiple perceptrons, we can solve XOR:

First layer: Compute OR and NAND of the two inputs

Second layer: Combine the two results with an AND gate, since XOR(x₁, x₂) = AND(OR(x₁, x₂), NAND(x₁, x₂))

This discovery led to the development of multi-layer perceptrons and backpropagation.
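
The gate weights listed in this section are enough to build XOR from three perceptrons. The sketch below uses the composition AND(OR(x₁, x₂), NAND(x₁, x₂)). The function names are illustrative, and the step activation fires when z ≥ 0, as defined earlier.

```python
def perceptron(x1, x2, w1, w2, b):
    """Single perceptron with a step activation (fires when z >= 0)."""
    return 1 if w1 * x1 + w2 * x2 + b >= 0 else 0

# Gates using the weights and biases listed in this section
def AND(x1, x2):
    return perceptron(x1, x2, 1, 1, -1.5)

def OR(x1, x2):
    return perceptron(x1, x2, 1, 1, -0.5)

def NAND(x1, x2):
    return perceptron(x1, x2, -1, -1, 1.5)

def XOR(x1, x2):
    """Two-layer network: OR and NAND in the hidden layer, AND in the output layer."""
    return AND(OR(x1, x2), NAND(x1, x2))

for a in (0, 1):
    for b in (0, 1):
        print(a, b, XOR(a, b))
```

The hidden layer maps the four input points to a space where a single line separates them, which is what the output perceptron needs.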

UNIT II: INTRODUCTION TO NEURAL NETWORKS & TRAINING (9 Hours)


2.1 Neural Network Architecture
From Perceptron to Neural Network:

A neural network is simply multiple perceptrons organized in layers:

1. Input Layer: Receives raw data

2. Hidden Layers: Perform intermediate computations (this is where the "deep" in deep learning comes from)

3. Output Layer: Produces final predictions

Why Multiple Layers Matter:

Each layer learns increasingly abstract representations:

Layer 1: In image recognition, might detect edges and curves

Layer 2: Combines edges into shapes and textures

Layer 3: Recognizes parts (eyes, wheels, windows)

Layer 4: Identifies complete objects (faces, cars, buildings)

2.2 Activation Functions


Activation functions introduce non-linearity into neural networks. Without them, no matter how many layers
you stack, the network would just be a linear function!

Step Function (Used in Perceptron):

f(x) = 1 if x ≥ 0
f(x) = 0 if x < 0

Problem: The gradient is zero everywhere except at x = 0, where it is undefined, so gradient descent gets no learning signal

Sigmoid Function:

σ(x) = 1 / (1 + e^(-x))

Range: (0, 1)

Use Case: Binary classification, output probabilities

Problem: Vanishing gradient for extreme values

Gradient: σ'(x) = σ(x) × (1 - σ(x))

Example:

If x = 0: σ(0) = 1/(1+1) = 0.5


If x = 2: σ(2) = 1/(1+e^(-2)) ≈ 0.88
If x = -2: σ(-2) = 1/(1+e^2) ≈ 0.12

Tanh (Hyperbolic Tangent):

tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))

Range: (-1, 1)
Advantage: Zero-centered, making learning easier

Problem: Still suffers from vanishing gradient

ReLU (Rectified Linear Unit):

ReLU(x) = max(0, x)

Range: [0, ∞)

Advantages:
Simple and fast to compute

Doesn't saturate for positive values

Sparse activation (many neurons output zero)

Problem: Dying ReLU (neurons can get stuck at zero)

Most popular: ReLU and its variants are the default choice for hidden layers in most modern deep networks

Example:

ReLU(-3) = 0
ReLU(0) = 0
ReLU(5) = 5
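
The activation functions covered so far are easy to compare numerically. A minimal sketch using only the standard library (the helper names are illustrative):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

def relu(x):
    return max(0.0, x)

for x in (-2.0, 0.0, 2.0):
    print(f"x={x:+.1f}  sigmoid={sigmoid(x):.2f}  tanh={math.tanh(x):.2f}  relu={relu(x):.1f}")

# The sigmoid gradient peaks at 0.25 (at x = 0) and shrinks quickly for large |x|
print(sigmoid_grad(0.0), sigmoid_grad(10.0))
```

The last line illustrates the vanishing gradient issue: at x = 10 the sigmoid gradient is already below 0.0001.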

Softmax (Output Layer for Multi-class):

softmax(xi) = e^(xi) / Σ(e^(xj))

Converts logits into probabilities that sum to 1.

Example: Three-class classification

Logits: [2.0, 1.0, 0.1]

e^2.0 = 7.39
e^1.0 = 2.72
e^0.1 = 1.11
Sum = 11.22

Probabilities:
Class 1: 7.39/11.22 = 0.659 (65.9%)
Class 2: 2.72/11.22 = 0.242 (24.2%)
Class 3: 1.11/11.22 = 0.099 (9.9%)
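
The same three-class example in code. This sketch subtracts the largest logit before exponentiating, a common numerical-stability trick that is not part of the formula above but does not change the result.

```python
import math

def softmax(logits):
    """Numerically stable softmax: subtract the max logit before exponentiating."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]
```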
The Training Problem:

Training a neural network means finding the best weights and biases that minimize prediction errors. This is an
optimization problem.

Loss Functions:

Mean Squared Error (MSE) - Regression:

MSE = (1/n) Σ(yi - ŷi)²

Where:
yi = actual value
ŷi = predicted value
n = number of samples

Cross-Entropy Loss - Classification:

L = -Σ yi × log(ŷi)

For binary classification:


L = -[y × log(ŷ) + (1-y) × log(1-ŷ)]
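
Both loss functions can be written directly from their formulas. In this sketch the eps clamp is an added safeguard against log(0), and the sample values are made up for illustration.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error over paired lists of targets and predictions."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

def binary_cross_entropy(y, p, eps=1e-12):
    """Loss for one example; eps keeps log() away from 0."""
    p = min(max(p, eps), 1 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

print(mse([3.0, 5.0], [2.5, 5.5]))   # 0.25
print(binary_cross_entropy(1, 0.9))  # small loss: confident and correct
print(binary_cross_entropy(1, 0.1))  # large loss: confident and wrong
```

Cross-entropy punishes confident wrong answers much harder than MSE would, which is one reason it is preferred for classification.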

Gradient Descent Algorithm:

Imagine you're hiking down a mountain in fog. You can only see a few feet around you. How do you reach the
bottom? Take steps in the steepest downward direction!

Update Rule:

w_new = w_old - η × (∂L/∂w)

Where:
η = learning rate (step size)
∂L/∂w = gradient (slope of loss function)

Learning Rate Importance:

Too small: Training is painfully slow

Too large: You might overshoot the minimum and diverge

Just right: Typically 0.001 to 0.1

Example Calculation:

Let's minimize f(x) = x² starting at x = 10, learning rate η = 0.1


Iteration 1:
f'(10) = 2×10 = 20
x_new = 10 - 0.1×20 = 8

Iteration 2:
f'(8) = 2×8 = 16
x_new = 8 - 0.1×16 = 6.4

Iteration 3:
f'(6.4) = 12.8
x_new = 6.4 - 0.1×12.8 = 5.12

... continues until x ≈ 0
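
The iterations above follow a fixed pattern and can be generated by a short loop. A minimal sketch (the function name and the step count of 50 are illustrative):

```python
def gradient_descent(grad, x0, lr, steps):
    """Return the list of iterates x0, x1, ..., x_steps."""
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] - lr * grad(xs[-1]))
    return xs

# f(x) = x^2 has gradient f'(x) = 2x
xs = gradient_descent(lambda x: 2 * x, x0=10.0, lr=0.1, steps=50)
print([round(x, 2) for x in xs[:4]])  # [10.0, 8.0, 6.4, 5.12]
print(xs[-1])                         # very close to 0 after 50 steps
```

With η = 0.1 each step multiplies x by 0.8, so the iterates shrink geometrically toward the minimum at 0.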

Variants of Gradient Descent:

1. Batch Gradient Descent: Uses entire dataset for each update (slow but stable)

2. Stochastic Gradient Descent (SGD): Uses one sample at a time (fast but noisy)

3. Mini-Batch Gradient Descent: Uses small batches (best of both worlds) - typically 32, 64, or 128 samples

2.4 Backpropagation: The Heart of Neural Network Training


The Chain Rule Connection:

Backpropagation applies the chain rule of calculus to compute gradients efficiently. It answers: "How much did
each weight contribute to the final error?"

Forward Pass:

1. Input data flows through the network

2. Each layer computes activations

3. Final layer produces predictions

4. Compute loss

Backward Pass:

1. Start from the loss

2. Compute gradient of loss with respect to output

3. Propagate gradients backward through each layer

4. Update weights using these gradients

Mathematical Flow:

For a simple network: Input → Hidden → Output


Forward:
h = σ(W1 × x + b1) [hidden layer]
y = σ(W2 × h + b2) [output layer]
L = (y - target)² [loss]

Backward:
∂L/∂W2 = ∂L/∂y × ∂y/∂W2
∂L/∂W1 = ∂L/∂y × ∂y/∂h × ∂h/∂W1

Concrete Example:

Network: 2 inputs → 2 hidden → 1 output

Given:

Input: [0.5, 0.8]

Target: 0.9

Weights randomly initialized

Forward Pass:

h1 = σ(0.5×w11 + 0.8×w12) = 0.6


h2 = σ(0.5×w21 + 0.8×w22) = 0.7
output = σ(0.6×w31 + 0.7×w32) = 0.5

Loss = (0.9 - 0.5)² = 0.16

Backward Pass:

Calculate how output error affects each weight


Update: w_new = w_old - η × gradient

The beauty is that this scales to millions of parameters!
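
To make the backward pass concrete, here is a sketch of one full training step for a 2 → 2 → 1 sigmoid network with squared-error loss. The starting weights, zero biases and learning rate are hypothetical, since the example above does not specify them.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Return hidden activations h and the scalar output y."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    y = sigmoid(sum(w * hi for w, hi in zip(W2, h)) + b2)
    return h, y

def backward(x, target, W2, h, y):
    """Chain rule for L = (y - target)^2 with sigmoid units."""
    dz2 = 2 * (y - target) * y * (1 - y)                     # dL/dz at the output
    dW2 = [dz2 * hi for hi in h]
    dz1 = [dz2 * w * hi * (1 - hi) for w, hi in zip(W2, h)]  # dL/dz at each hidden unit
    dW1 = [[d * xi for xi in x] for d in dz1]
    return dW1, dz1, dW2, dz2                                # bias gradients equal dz1, dz2

# Hypothetical starting values (not from the example above)
x, target, lr = [0.5, 0.8], 0.9, 0.5
W1, b1 = [[0.1, 0.4], [-0.3, 0.2]], [0.0, 0.0]
W2, b2 = [0.3, -0.2], 0.0

h, y = forward(x, W1, b1, W2, b2)
loss_before = (y - target) ** 2
dW1, db1, dW2, db2 = backward(x, target, W2, h, y)

# One gradient descent step: w_new = w_old - lr * gradient
W1 = [[w - lr * g for w, g in zip(row, grow)] for row, grow in zip(W1, dW1)]
b1 = [b - lr * g for b, g in zip(b1, db1)]
W2 = [w - lr * g for w, g in zip(W2, dW2)]
b2 -= lr * db2

_, y_new = forward(x, W1, b1, W2, b2)
loss_after = (y_new - target) ** 2
print(loss_before, loss_after)
```

The loss after the update is lower than before, which is the behaviour the update rule is meant to produce for a sufficiently small learning rate.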

2.5 Multi-Layer Perceptron (MLP) Architecture


Structure:

An MLP consists of:

Input layer: One neuron per feature

Hidden layers: Typically 1-5 layers with 10-1000 neurons each

Output layer: Depends on task (1 for regression, K for K-class classification)

Fully Connected (Dense) Layers:


Every neuron in one layer connects to every neuron in the next layer.

Example Architecture for MNIST Digit Recognition:

Input: 784 neurons (28×28 pixel image)


Hidden 1: 128 neurons with ReLU
Hidden 2: 64 neurons with ReLU
Output: 10 neurons with Softmax (digits 0-9)

Total parameters = (784×128 + 128) + (128×64 + 64) + (64×10 + 10) = 109,386
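
The parameter count of a stack of dense layers follows directly from the layer sizes. A small sketch (the helper name is illustrative):

```python
def dense_params(layer_sizes):
    """Weights plus biases for a stack of fully connected layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# MNIST architecture: 784 -> 128 -> 64 -> 10
total = dense_params([784, 128, 64, 10])
print(total)  # 100480 + 8256 + 650 = 109386
```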

When to Use MLPs:

Tabular data (structured data with rows and columns)

When you don't need spatial/sequential structure

Quick baseline models

Limitations:

Doesn't preserve spatial structure (bad for images)

Can't handle variable-length inputs (bad for text)

Requires fixed-size inputs

2.6 Hands-On Training Process


Complete Training Loop:

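python
# Pseudocode for training a neural network.
# The names batch.inputs, batch.labels, optimizer.step and model.parameters are
# illustrative placeholders; the original identifiers were lost in extraction.

for epoch in range(num_epochs):

    for batch in training_data:
        # 1. Forward Pass
        predictions = model(batch.inputs)

        # 2. Compute Loss
        loss = loss_function(predictions, batch.labels)

        # 3. Backward Pass
        gradients = backpropagate(loss)

        # 4. Update Weights
        optimizer.step(model.parameters, gradients)

    # 5. Validation (once per epoch; the printed train loss is that of the last batch)
    val_loss = evaluate(model, validation_data)
    print(f"Epoch {epoch}: Train Loss={loss}, Val Loss={val_loss}")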

Key Training Concepts:

Epochs: One complete pass through the entire training dataset

Batch Size: Number of samples processed before updating weights

Overfitting vs Underfitting:

Underfitting: Model too simple, high training and validation error

Good Fit: Low training and validation error

Overfitting: Model memorizes training data, low training error but high validation error

Prevention Techniques:

1. Dropout: Randomly disable neurons during training

2. L2 Regularization: Add penalty for large weights

3. Early Stopping: Stop training when validation loss stops improving

4. Data Augmentation: Create variations of training data

UNIT III: CONVOLUTIONAL NEURAL NETWORK (9 Hours)


3.1 Why CNNs? The Problem with Regular Neural Networks for Images
The Dimensionality Challenge:
Consider a small 28×28 grayscale image:

Using MLP: 784 input neurons

Add color (RGB): 28×28×3 = 2,352 neurons

High-resolution image (1000×1000×3): 3,000,000 input neurons!

For a 1000-neuron hidden layer:

Parameters = 3,000,000 × 1,000 = 3 billion weights

This is just the first layer!

Problems with MLPs for Images:

1. Too many parameters → Overfitting, memory issues

2. No spatial awareness → Doesn't understand that nearby pixels are related

3. Not translation invariant → Can't recognize an object if it moves slightly

CNN Solution:

CNNs exploit three key ideas:

1. Local Connectivity: Neurons only connect to a small region

2. Parameter Sharing: Use the same filter across the entire image

3. Spatial Hierarchy: Build complex features from simple ones

3.2 Convolutional Layers: The Core Building Block


How Convolution Works:

Think of convolution as sliding a small window (filter/kernel) across the image, performing element-wise
multiplication and summation.

Example: 3×3 Filter on 5×5 Image


Result: Feature Map showing edges in the image
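
A sliding-window convolution fits in a few lines of plain Python. In this sketch the 5×5 image and the vertical-edge filter are hypothetical examples, and the operation is the cross-correlation used in CNN layers (the kernel is not flipped).

```python
def conv2d(image, kernel):
    """Valid (no padding), stride-1 cross-correlation on square inputs."""
    k = len(kernel)
    out_size = len(image) - k + 1
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(k) for dj in range(k))
             for j in range(out_size)]
            for i in range(out_size)]

# Hypothetical 5x5 image: dark on the left, bright on the right
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hypothetical vertical-edge filter (Prewitt-style)
kernel = [[-1, 0, 1]] * 3

feature_map = conv2d(image, kernel)
for row in feature_map:
    print(row)  # each row is [3, 3, 0]
```

The feature map responds strongly where the window straddles the dark-to-bright boundary and is zero over the uniform bright region.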

Multiple Filters:
Modern CNNs use many filters per layer:

First layer: 32-64 filters learning basic patterns (edges, colors)

Middle layers: 128-512 filters learning textures and patterns

Deep layers: 512+ filters learning complex objects

Calculation Example:

Input: 32×32×3 (color image)


Filter: 5×5×3 with 64 filters
Output: 28×28×64

Why 28×28?

Output size = (Input - Filter + 2×Padding) / Stride + 1


= (32 - 5 + 0) / 1 + 1
= 28

Parameters in this layer:

Each filter: 5×5×3 = 75 weights + 1 bias = 76 parameters


Total: 76 × 64 filters = 4,864 parameters

Compare to fully connected: 32×32×3 × output_size = millions!
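
The output-size formula and the parameter count can be wrapped in two helpers. This is a sketch with illustrative function names. Integer division is used so that strides that do not divide evenly are rounded down.

```python
def conv_output_size(input_size, filter_size, padding=0, stride=1):
    """Spatial output size; integer division drops positions where the filter does not fit."""
    return (input_size - filter_size + 2 * padding) // stride + 1

def conv_params(filter_size, in_channels, num_filters):
    """Each filter has filter_size^2 * in_channels weights plus one bias."""
    return (filter_size * filter_size * in_channels + 1) * num_filters

print(conv_output_size(32, 5))            # 28
print(conv_params(5, 3, 64))              # 4864
print(conv_output_size(32, 3, stride=2))  # 15
```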


3.3 Padding: Preserving Spatial Dimensions

The Shrinking Problem:

Without padding, each convolution reduces image size:

32×32 → 28×28 → 24×24 → ...

After a few layers, you're left with tiny feature maps!

Types of Padding:

Valid Padding (No Padding):

Output size decreases

Edge pixels used less frequently

Same Padding:

Add zeros around the border

Output size = Input size (with stride=1)

Every pixel is considered equally

Example:

Original 5×5 image with padding=1:

0000000
0111000
0011100
0001110
0001100
0011000
0000000

Now you can apply 3×3 filter and maintain 5×5 output!
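
Zero padding is simple to reproduce. The sketch below pads the 5×5 image from the example (the inner part of the grid shown above) and checks the resulting output size for a 3×3 filter. The helper name zero_pad is illustrative.

```python
def zero_pad(image, p):
    """Surround a 2D list with p rows and columns of zeros."""
    width = len(image[0]) + 2 * p
    top = [[0] * width for _ in range(p)]
    middle = [[0] * p + list(row) + [0] * p for row in image]
    bottom = [[0] * width for _ in range(p)]
    return top + middle + bottom

image = [[1, 1, 1, 0, 0],
         [0, 1, 1, 1, 0],
         [0, 0, 1, 1, 1],
         [0, 0, 1, 1, 0],
         [0, 1, 1, 0, 0]]

padded = zero_pad(image, 1)
for row in padded:
    print("".join(str(v) for v in row))

# A 3x3 filter on the 7x7 padded image visits (7 - 3) + 1 = 5 positions per axis
output_size = len(padded) - 3 + 1
print(output_size)  # 5
```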

3.4 Stride: Controlling Output Size


Stride = Step Size:

Stride = 1: Slide filter one pixel at a time (default)

Stride = 2: Skip every other position

Stride = 3: Skip two positions

Impact on Output:
Input: 32×32
Filter: 3×3
Padding: 0

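Stride=1: Output = ⌊(32-3)/1⌋ + 1 = 30×30

Stride=2: Output = ⌊(32-3)/2⌋ + 1 = 14 + 1 = 15×15
Stride=3: Output = ⌊(32-3)/3⌋ + 1 = 9 + 1 = 10×10

(When the division is not exact, round down: the filter stops at the last position where it fits entirely inside the input.)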

Trade-off:

Larger stride: Smaller output, fewer computations, might miss details

Smaller stride: Larger output, more computations, captures fine details

Common Practice: Use stride=1 in convolutional layers, use pooling layers for downsampling

3.5 Pooling Operations: Dimension Reduction


Purpose:

1. Reduce spatial dimensions (downsampling)

2. Reduce computational load

3. Make features more robust to small translations

4. Prevent overfitting

Max Pooling (Most Common):

Take the maximum value in each region.

Example: 2×2 Max Pooling with Stride=2

Input (4×4):        Output (2×2):

1 3 2 4
2 1 8 3      →      3 8
5 6 1 2             6 7
4 2 7 1

Each 2×2 region becomes a single value (the max)

Intuition: "Is this feature present anywhere in this region?"

Average Pooling:

Takes the average instead of maximum. Less common but useful for:

Global context

Smoother downsampling
Retaining all information (not just the maximum)

Example: 2×2 Average Pooling

Input:

1 3 2 4
2 1 8 3
5 6 1 2
4 2 7 1

Output:

(1+3+2+1)/4 = 1.75    (2+4+8+3)/4 = 4.25
(5+6+4+2)/4 = 4.25    (1+2+7+1)/4 = 2.75
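
Both pooling variants differ only in the function applied to each window. A minimal sketch on the 4×4 input used above, assuming a square input whose side is divisible by the window size:

```python
def pool2d(image, size, reduce_fn):
    """Non-overlapping pooling (stride equal to the window size)."""
    n = len(image)
    return [[reduce_fn([image[i + di][j + dj]
                        for di in range(size) for dj in range(size)])
             for j in range(0, n, size)]
            for i in range(0, n, size)]

image = [[1, 3, 2, 4],
         [2, 1, 8, 3],
         [5, 6, 1, 2],
         [4, 2, 7, 1]]

max_pooled = pool2d(image, 2, max)
avg_pooled = pool2d(image, 2, lambda values: sum(values) / len(values))
print(max_pooled)  # [[3, 8], [6, 7]]
print(avg_pooled)  # [[1.75, 4.25], [4.25, 2.75]]
```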

Global Average Pooling:

Average the entire feature map into a single value per channel. Used before the final classification layer.

3.6 Complete CNN Architecture


Typical CNN Structure:

Input Image (224×224×3)



[CONV + ReLU] → [CONV + ReLU] → [POOL]
↓ (Learn low-level features)
[CONV + ReLU] → [CONV + ReLU] → [POOL]
↓ (Learn mid-level features)
[CONV + ReLU] → [CONV + ReLU] → [POOL]
↓ (Learn high-level features)
[FLATTEN]

[Fully Connected + ReLU] → [Fully Connected + ReLU]

[Output Layer + Softmax]

Example: LeNet-5 (Classic Architecture)

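Designed by Yann LeCun for handwritten digit recognition. The layer listing here is a modernized variant: the original 1998 network used tanh-style activations and average-pooling (subsampling) layers rather than ReLU and max pooling: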


Input: 32×32×1 (grayscale)

Conv1: 6 filters (5×5) → 28×28×6 → ReLU
Pool1: Max pooling (2×2) → 14×14×6

Conv2: 16 filters (5×5) → 10×10×16 → ReLU
Pool2: Max pooling (2×2) → 5×5×16

Flatten: 5×5×16 = 400 neurons
FC1: 120 neurons → ReLU
FC2: 84 neurons → ReLU
Output: 10 neurons (digits 0-9) → Softmax
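
The shapes in this listing can be traced programmatically, which also gives a parameter count. The sketch assumes every Conv2 filter sees all 6 input channels (the original paper used a sparser connection scheme) and that pooling layers have no trainable parameters.

```python
def conv_out(size, k, stride=1, pad=0):
    return (size - k + 2 * pad) // stride + 1

size, channels, params = 32, 1, 0
for name, k, filters in [("Conv1", 5, 6), ("Conv2", 5, 16)]:
    params += (k * k * channels + 1) * filters   # weights + one bias per filter
    size, channels = conv_out(size, k), filters
    print(f"{name}: {size}x{size}x{channels}")
    size = conv_out(size, 2, stride=2)           # 2x2 pooling with stride 2
    print(f"Pool:  {size}x{size}x{channels}")

flat = size * size * channels
layer_sizes = [flat, 120, 84, 10]
params += sum(a * b + b for a, b in zip(layer_sizes, layer_sizes[1:]))
print("Flatten:", flat)
print("Total parameters:", params)
```

Under these assumptions the network has 61,706 parameters, most of them in the first fully connected layer.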

Modern CNN Architecture Principles:

1. Deeper is better (to a point): 20-200+ layers in modern networks

2. Small filters work best: Mostly 3×3, sometimes 5×5

3. Increase channels as you go deeper: 64 → 128 → 256 → 512

4. Reduce spatial size gradually: Use pooling or strided convolutions

5. Add skip connections: Helps training very deep networks (ResNet)

3.7 Famous CNN Architectures


AlexNet (2012):

8 layers, 60 million parameters

Won ImageNet competition by huge margin

Popularized ReLU and Dropout

VGGNet (2014):

16-19 layers

Uses only 3×3 filters

Simple and uniform architecture

138 million parameters

ResNet (2015):

Up to 152 layers!

Introduces skip connections

Eases the vanishing-gradient and optimization problems of very deep networks


Calculation Example: Parameters in Conv Layer

Input: 64×64×3
Conv Layer: 32 filters of size 3×3

Parameters per filter = 3×3×3 (input channels) + 1 (bias) = 28


Total parameters = 28 × 32 filters = 896

Compare to fully connected:


64×64×3 × 32 = 393,216 parameters!

CNN wins: 896 vs 393,216 (about 439× fewer parameters)

3.8 Hands-On: Building a CNN for Image Classification


Problem: Classify images into 10 categories (CIFAR-10 dataset)

Architecture Design:

Input: 32×32×3 (RGB images). All convolutions use "same" padding, so only the pooling layers change the spatial size.

Block 1:
Conv2D(32 filters, 3×3) → ReLU → BatchNorm
Conv2D(32 filters, 3×3) → ReLU → BatchNorm
MaxPool(2×2)
Output: 16×16×32

Block 2:
Conv2D(64 filters, 3×3) → ReLU → BatchNorm
Conv2D(64 filters, 3×3) → ReLU → BatchNorm
MaxPool(2×2)
Output: 8×8×64

Block 3:
Conv2D(128 filters, 3×3) → ReLU → BatchNorm
Conv2D(128 filters, 3×3) → ReLU → BatchNorm
MaxPool(2×2)
Output: 4×4×128

Flatten: 4×4×128 = 2048


Dense(256) → ReLU → Dropout(0.5)
Dense(10) → Softmax

Training Strategy:

1. Data Augmentation: Random flips, rotations, crops


2. Optimizer: Adam with learning rate = 0.001

3. Batch Size: 64

4. Epochs: 50-100

5. Early Stopping: Stop if validation accuracy doesn't improve for 10 epochs

Expected Results:

Training Accuracy: ~90%

Validation Accuracy: ~75-80%

Common mistakes: Confusing similar categories (cat/dog, truck/automobile)

UNIT IV: RECURRENT NEURAL NETWORK (9 Hours)


4.1 Why RNNs? Understanding Sequential Data
The Sequential Data Problem:
Many real-world problems involve sequences:

Text: "The cat sat on the ___" (next word depends on previous words)

Speech: Audio signal over time

Stock Prices: Today's price influenced by past prices

Weather: Tomorrow's weather depends on recent patterns

Video: Sequence of frames

Why CNNs and MLPs Fail:

1. Fixed Input Size: Can't handle variable-length sequences

2. No Memory: Each prediction is independent

3. No Temporal Order: Position information is lost

Example Problem:

Sentiment Analysis: "The movie was not good"

An MLP sees: {movie, was, not, good}


Might predict: Positive (because "good" is positive)

An RNN understands: "not good" = negative


Correctly predicts: Negative

The Core Idea:

RNNs maintain a hidden state that acts as memory, carrying information from previous time steps.

RNN Cell Structure:

At each time step t:


Input: x_t (current input)
Previous hidden state: h_(t-1)

Compute:
h_t = tanh(W_hh × h_(t-1) + W_xh × x_t + b_h)
y_t = W_hy × h_t + b_y

Output: y_t
New hidden state: h_t (passed to next time step)

Visual Flow:

Time:     t=0      t=1      t=2      t=3

Input:    x_0      x_1      x_2      x_3
           ↓        ↓        ↓        ↓
Hidden:   h_0  →   h_1  →   h_2  →   h_3
           ↓        ↓        ↓        ↓
Output:   y_0      y_1      y_2      y_3

Key Insight: The same weights (W_hh, W_xh, W_hy) are used at every time step! This is called parameter
sharing.
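
The cell equations translate almost line for line into code. In this sketch the weight values and the sizes (2 inputs, 3 hidden units, 2 outputs) are hypothetical, and matrices are stored as lists of rows so that W × v is a matrix-vector product.

```python
import math

def matvec(M, v):
    """Matrix-vector product for a matrix given as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
    """One time step: returns (h_t, y_t) following the cell equations."""
    pre = [a + b + c for a, b, c in zip(matvec(W_hh, h_prev), matvec(W_xh, x_t), b_h)]
    h_t = [math.tanh(p) for p in pre]
    y_t = [a + b for a, b in zip(matvec(W_hy, h_t), b_y)]
    return h_t, y_t

# Hypothetical small weights
W_xh = [[0.1, 0.2], [-0.3, 0.4], [0.5, -0.1]]                    # 3x2
W_hh = [[0.1, 0.0, 0.2], [0.0, 0.1, -0.2], [0.3, 0.0, 0.1]]      # 3x3
W_hy = [[0.2, -0.1, 0.4], [0.3, 0.5, -0.2]]                      # 2x3
b_h, b_y = [0.0] * 3, [0.0] * 2

h = [0.0] * 3  # initial hidden state
for x_t in ([1, 0], [0, 1], [1, 1]):  # the same weights are reused at every step
    h, y = rnn_step(x_t, h, W_xh, W_hh, W_hy, b_h, b_y)
    print([round(v, 3) for v in h], [round(v, 3) for v in y])
```

Note that the loop never creates new weights: only the hidden state h changes from one step to the next.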

4.3 Mathematical Example: Character-Level RNN


Task: Predict the next character in "HELLO"

Vocabulary: {H, E, L, O} → One-hot encoded

H = [1, 0, 0, 0]
E = [0, 1, 0, 0]
L = [0, 0, 1, 0]
O = [0, 0, 0, 1]

RNN Parameters (simplified):


Hidden size: 3
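W_xh: 3×4 matrix (input to hidden)
W_hh: 3×3 matrix (hidden to hidden)
W_hy: 4×3 matrix (hidden to output)
(Shapes follow the convention h_t = tanh(W_hh × h_(t-1) + W_xh × x_t + b_h), with inputs as column vectors.)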

Forward Pass Example:

Time step 0: Input = 'H'

x_0 = [1, 0, 0, 0]
h_-1 = [0, 0, 0] (initialized to zeros)

h_0 = tanh(W_hh × [0,0,0] + W_xh × [1,0,0,0] + b_h)


= tanh([0.2, -0.1, 0.3]) = [0.197, -0.099, 0.291]

y_0 = softmax(W_hy × h_0 + b_y)


= [0.1, 0.6, 0.2, 0.1]

Prediction: 'E' (highest probability)


Actual next character: 'E' ✓

Time step 1: Input = 'E'

x_1 = [0, 1, 0, 0]
h_0 = [0.197, -0.099, 0.291] (from previous step)

h_1 = tanh(W_hh × h_0 + W_xh × x_1 + b_h)


= [0.15, 0.22, -0.08]

y_1 = softmax(W_hy × h_1 + b_y)


= [0.05, 0.15, 0.75, 0.05]

Prediction: 'L' ✓

The hidden state h_0 carries information about 'H' forward, allowing the network to understand context!

4.4 Comparing RNNs and Standard Neural Networks


RNN vs MLP:
RNN vs CNN:

Aspect          | CNN               | RNN
Structure       | Spatial hierarchy | Temporal sequence
Translation     | Invariant         | Position matters
Best for        | Images            | Text, time series
Parallelization | Easy              | Difficult

4.5 Types of RNN Architectures


1. One-to-One (Standard Neural Network):

Single input → Single output


Example: Image classification (one image → one label)

2. One-to-Many:

Single input → Sequence output


Example: Image captioning (image → sentence)
Image → [RNN] → "A" → "cat" → "sitting" → "on" → "mat"

3. Many-to-One:

Sequence input → Single output


Example: Sentiment analysis (sentence → positive/negative)
"This" → "movie" → "is" → "great" → [RNN] → Positive

4. Many-to-Many (Same length):

Sequence input → Sequence output (aligned)


Example: Video frame labeling (frame 1 label, frame 2 label, ...)

5. Many-to-Many (Different lengths):


Sequence input → Sequence output (not aligned)
Example: Machine translation (English sentence → French sentence)
"Hello world" → [Encoder RNN] → [Decoder RNN] → "Bonjour monde"

4.6 Components of an RNN Cell


Weights in RNN:

1. W_xh (Input-to-Hidden): Transforms input to hidden state space

2. W_hh (Hidden-to-Hidden): Maintains temporal dependencies

3. W_hy (Hidden-to-Output): Produces output from hidden state

Activation Functions:

Hidden State: Typically tanh (range: -1 to 1)


Zero-centered helps with gradient flow

Output: Depends on task


Classification: Softmax

Regression: Linear

Binary: Sigmoid

Bias Terms:

b_h: Added to hidden state computation

b_y: Added to output computation

4.7 Training RNNs: Backpropagation Through Time (BPTT)


The Challenge:

RNNs have temporal dependencies, so we need to backpropagate through all time steps.

BPTT Algorithm:

1. Unroll the RNN through time

2. Forward pass through all time steps

3. Compute loss at relevant time steps

4. Backward pass from final to initial time step

5. Accumulate gradients across all time steps

6. Update weights using accumulated gradients

Mathematical Flow:
Forward: t=0 → t=1 → t=2 → ... → t=T
Backward: t=T → ... → t=2 → t=1 → t=0

At each time step going backward:


∂L/∂h_t = ∂L/∂h_(t+1) × ∂h_(t+1)/∂h_t + ∂L/∂y_t × ∂y_t/∂h_t

Gradient Accumulation:

∂L/∂W_hh = Σ (∂L/∂h_t × ∂h_t/∂W_hh) for all t

This sum considers how W_hh affects loss at ALL time steps

4.8 The Vanishing and Exploding Gradient Problem


The Core Issue:

When backpropagating through many time steps, gradients can become extremely small (vanish) or extremely
large (explode).

Why This Happens:

Vanishing Gradient:

∂h_t/∂h_0 = ∂h_t/∂h_(t-1) × ∂h_(t-1)/∂h_(t-2) × ... × ∂h_1/∂h_0

If each term < 1, the product approaches 0 exponentially:


0.5 × 0.5 × 0.5 × ... (50 times) ≈ 0.0000000000000009

Result: Network can't learn long-term dependencies

Exploding Gradient:

If each term > 1, the product grows exponentially:


1.5 × 1.5 × 1.5 × ... (50 times) ≈ 637,621,500 (about 6.4 × 10⁸)

Result: Weights update by huge amounts, training becomes unstable
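
The effect of repeated multiplication is easy to check numerically. A tiny sketch, treating each per-step gradient factor as a single scalar (a simplification of the Jacobian products in a real RNN):

```python
# Product of 50 identical per-step factors
products = {factor: factor ** 50 for factor in (0.5, 1.0, 1.5)}
for factor, value in products.items():
    print(f"{factor}^50 = {value:.3e}")
```

A factor of 0.5 collapses to about 9 × 10⁻¹⁶, while a factor of 1.5 grows to several hundred million. Only factors close to 1 keep the gradient usable over long sequences.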

Real-World Impact:

Consider: "The cat, which was given to me by my friend who lives in France, was very cute"

Task: Predict the main verb (the second "was")

Problem: The subject "cat" is 13 words earlier. Vanilla RNNs struggle with this distance!

Solutions:
1. Gradient Clipping (for exploding):

if ||gradient|| > threshold:


gradient = gradient × (threshold / ||gradient||)

2. Better Architectures: LSTM and GRU (next section!)

3. Better Initialization: Xavier/He initialization

4. Better Activation Functions: ReLU instead of tanh (sometimes)
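
Of the solutions listed, gradient clipping by norm is the simplest to implement. A minimal sketch for a gradient stored as a flat list (the function name and threshold are illustrative):

```python
import math

def clip_by_norm(gradient, threshold):
    """Rescale the gradient so its L2 norm does not exceed threshold."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm > threshold:
        return [g * threshold / norm for g in gradient]
    return list(gradient)

print(clip_by_norm([3.0, 4.0], 1.0))  # norm 5 is rescaled to norm 1
print(clip_by_norm([0.3, 0.4], 1.0))  # norm 0.5 is left unchanged
```

Clipping preserves the direction of the gradient and only limits its length, so a single exploding step cannot throw the weights far off.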

4.9 Hands-On Example: Sentiment Analysis with RNN


Problem: Classify movie reviews as positive or negative

Example Reviews:

"This movie was absolutely fantastic!" → Positive

"Waste of time and money" → Negative

Architecture:

Input: Sequence of word embeddings



Embedding Layer (convert words to vectors)
each word → 128-dimensional vector

RNN Layer (64 hidden units)
processes sequence, maintains context

Take final hidden state h_T

Dense Layer (32 neurons) → ReLU

Output (1 neuron) → Sigmoid
output > 0.5 = Positive
output ≤ 0.5 = Negative

Training Process:
Example: "Great movie"

Step 1: Word to index


"Great" → 342, "movie" → 89

Step 2: Embedding
342 → [0.23, -0.45, ..., 0.12] (128 dims)
89 → [0.67, 0.34, ..., -0.23] (128 dims)

Step 3: RNN Processing


t=0: Input="Great" embedding, h_0=zeros
→ h_1 = [0.45, -0.23, ..., 0.67]

t=1: Input="movie" embedding, h_1 from previous


→ h_2 = [0.78, 0.34, ..., -0.12]

Step 4: Classification
Dense(h_2) → [0.34, -0.23, ...]
Sigmoid → 0.87 → Positive ✓

UNIT V: LONG SHORT-TERM MEMORY (LSTM) (9 Hours)


5.1 Limitations of Vanilla RNNs
The Long-Term Dependency Problem:

Vanilla RNNs struggle to connect information over long sequences due to vanishing gradients.

Example Problems:

1. Language Modeling:

"I grew up in France... [50 words later]... I speak fluent _____"

Vanilla RNN: Likely forgets "France" by this point


Correct answer: "French"

2. Question Answering:

Context: "John was born in 1990. He studied medicine. [Many sentences]"


Question: "When was John born?"

RNN needs to remember "1990" through all the context

Why This Matters:


Most real-world sequences have long-term dependencies. We need a better memory mechanism!

5.2 LSTM Architecture: Solving the Memory Problem


The Key Innovation:

LSTMs introduce a cell state that runs through the entire sequence with minimal modifications. Think of it as a
"memory highway" that allows information to flow unchanged.

LSTM Components:

1. Cell State (C_t): The long-term memory

2. Hidden State (h_t): The short-term memory (output)

3. Three Gates: Control information flow

5.3 The Three Gates of LSTM


1. Forget Gate (f_t):

Decides what information to throw away from the cell state.

f_t = σ(W_f × [h_(t-1), x_t] + b_f)

Output: Values between 0 and 1


- 0 = "completely forget this"
- 1 = "completely keep this"

Example:

Context: "The cat was hungry. The dog..."

When processing "dog", forget gate:


- Forgets subject "cat" (output ≈ 0.1)
- Prepares for new subject "dog"

2. Input Gate (i_t):

Decides what new information to store in the cell state.

i_t = σ(W_i × [h_(t-1), x_t] + b_i)


C̃_t = tanh(W_C × [h_(t-1), x_t] + b_C)

i_t: How much to update (0 to 1)


C̃_t: Candidate values to add (-1 to 1)

Example:
Processing "dog":
- Input gate opens (i_t ≈ 0.9)
- Candidate: Information about "dog" being new subject
- Adds this to cell state

3. Output Gate (o_t):

Decides what to output based on cell state.

o_t = σ(W_o × [h_(t-1), x_t] + b_o)


h_t = o_t * tanh(C_t)

o_t: What parts to output


h_t: Final hidden state (filtered cell state)
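
The three gates and the cell update can be combined into one step function. This sketch follows the equations of this section, including the standard cell update C_t = f_t * C_(t-1) + i_t * C̃_t. The weights are hypothetical and, for brevity, the same matrix is reused for every gate.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, v, b):
    """Compute W × v + b for a matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step. params maps 'f', 'i', 'c', 'o' to (W, b) pairs
    that act on the concatenation [h_(t-1), x_t]."""
    concat = list(h_prev) + list(x_t)

    def gate(name, activation):
        W, b = params[name]
        return [activation(z) for z in affine(W, concat, b)]

    f_t = gate("f", sigmoid)        # forget gate
    i_t = gate("i", sigmoid)        # input gate
    c_tilde = gate("c", math.tanh)  # candidate values
    o_t = gate("o", sigmoid)        # output gate
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t

# Hypothetical weights: hidden size 2, input size 2, so each W is 2x4
W = [[0.1, 0.2, 0.3, 0.4], [-0.1, 0.0, 0.2, -0.3]]
params = {name: (W, [0.0, 0.0]) for name in "fico"}

h_t, c_t = lstm_step([1.0, 0.0], [0.5, -0.3], [0.8, 0.4], params)
print(h_t, c_t)
```

In a trained network each gate has its own weights; sharing one matrix here only keeps the example short.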

5.4 LSTM Forward Pass: Complete Mathematical Example


Given:

Previous hidden state: h_(t-1) = [0.5, -0.3]

Previous cell state: C_(t-1) = [0.8, 0.4]

Current input: x_t = [1.0, 0.0]

(Using simplified 2D vectors for clarity)

Step 1: Forget Gate

f_t = σ(W_f × [h_(t-1), x_t] + b_f)


= σ(W_f × [0.5, -0.3, 1.0, 0.0] + b_f)
= σ([0.4, 0.7])
= [0.60, 0.67]

Interpretation: Keep 60% of first memory, 67% of second memory

Step 2: Input Gate + Candidate

i_t = σ(W_i × [0.5, -0.3, 1.0, 0.0] + b_i)


= [0.75, 0.82]

C̃_t = tanh(W_C × [0.5, -0.3, 1.0, 0.0] + b_C)


= [0.45, -0.23]

Interpretation: Add 75% of first candidate, 82% of second candidate


Step 3: Update Cell State

C_t = f_t * C_(t-1) + i_t * C̃_t


= [0.60, 0.67] * [0.8, 0.4] + [0.75, 0.82] * [0.45, -0.23]
= [0.48, 0.268] + [0.338, -0.189]
= [0.818, 0.079]

Interpretation: New long-term memory combines old and new info

Step 4: Output Gate + Hidden State

o_t = σ(W_o × [0.5, -0.3, 1.0, 0.0] + b_o)


= [0.72, 0.68]

h_t = o_t * tanh(C_t)


= [0.72, 0.68] * tanh([0.818, 0.079])
= [0.72, 0.68] * [0.674, 0.079]
= [0.485, 0.054]

Interpretation: Output filtered version of cell state

Result:

New cell state C_t carries long-term memory

New hidden state h_t is passed to next layer/time step
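
The arithmetic in Steps 3 and 4 can be checked with a short script. It takes the gate activations from the worked example as given (the weight matrices are not specified there) and recomputes the cell and hidden states.

```python
import math

# Gate activations and states taken from the worked example
f_t, i_t, o_t = [0.60, 0.67], [0.75, 0.82], [0.72, 0.68]
c_tilde = [0.45, -0.23]
c_prev = [0.8, 0.4]

# Step 3: element-wise cell state update
c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]
# Step 4: hidden state is the output gate times tanh of the new cell state
h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]

print([round(v, 4) for v in c_t])
print([round(v, 4) for v in h_t])
```

Small differences in the last digit can appear depending on whether intermediate values are rounded before being combined.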

5.5 How LSTM Solves Vanishing Gradient


The Cell State Highway:

C_t = f_t * C_(t-1) + i_t * C̃_t

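Gradient flow along the cell state (direct path, treating the gates as constants):
∂C_t/∂C_(t-1) = f_t

Key insight: The cell state is updated by ADDITION. The old state is only scaled element-wise by the forget gate; it is not pushed through a weight matrix and a squashing activation at every step.

Comparison:

Vanilla RNN:

∂h_t/∂h_0 = ∂h_t/∂h_(t-1) × ∂h_(t-1)/∂h_(t-2) × ... × ∂h_1/∂h_0

Each factor contains the recurrent weight matrix and a tanh derivative (≤ 1), so the product usually shrinks toward zero → vanishing gradient

LSTM:
∂C_t/∂C_0 ≈ f_t × f_(t-1) × ... × f_1

This is still a product, but the network learns the factors. If forget gates ≈ 1 (keep information):

∂C_t/∂C_0 ≈ 1 × 1 × ... × 1 = 1

Gradient flows almost unchanged through time!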

Real-World Impact:

As a rule of thumb, LSTMs can learn dependencies spanning 100+ time steps, while vanilla RNNs often struggle beyond roughly 10-15 steps.
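A quick numeric illustration of why the size of the per-step factor matters. The factors below (0.8 for a vanilla RNN step, 0.99 for a forget gate) are hypothetical values chosen for illustration, not measurements from a trained network.

```python
import math

def path_gradient(factors):
    """Product of per-step gradient factors along one path through time."""
    return math.prod(factors)

steps = 100
rnn_factors = [0.8] * steps    # assumed per-step factor for a vanilla RNN
lstm_factors = [0.99] * steps  # assumed forget-gate value close to 1

print("Vanilla RNN, 100 steps:", path_gradient(rnn_factors))
print("LSTM cell path, 100 steps:", path_gradient(lstm_factors))
```

Under these assumptions the first product is on the order of 1e-10, while the second stays above 0.3, so a useful learning signal still reaches the early time steps.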

5.6 Comparing RNN and LSTM


Architecture Comparison:

| Feature | Vanilla RNN | LSTM |
|---|---|---|
| Memory | Single hidden state | Cell state + hidden state |
| Gates | None | 3 (forget, input, output) |
| Parameters | ~N² | ~4N² |
| Long-term memory | Poor | Excellent |
| Training speed | Fast | Slower |
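The ~N² and ~4N² figures count only the recurrent weights for hidden size N. A fuller count also includes the input weights and biases. The sketch below assumes one bias vector per block and uses a hypothetical helper name, `recurrent_layer_params`.

```python
def recurrent_layer_params(input_size, hidden_size, n_blocks):
    """Weights on [h_(t-1), x_t] plus one bias vector, repeated n_blocks times.

    n_blocks = 1 for a vanilla RNN, 4 for an LSTM
    (forget gate, input gate, candidate, output gate).
    """
    per_block = hidden_size * (hidden_size + input_size) + hidden_size
    return n_blocks * per_block

N = 256  # hidden size; input size is set equal to N for this comparison
rnn_params = recurrent_layer_params(N, N, n_blocks=1)
lstm_params = recurrent_layer_params(N, N, n_blocks=4)

print("Vanilla RNN:", rnn_params)
print("LSTM:", lstm_params)
print("Ratio:", lstm_params / rnn_params)
```

The ratio is exactly 4 under these assumptions, which matches the table's ~N² versus ~4N² comparison.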

When to Use Each:

Vanilla RNN:

Short sequences (< 30 time steps)

Real-time applications needing speed

Simple patterns

Example: Next character prediction in short words

LSTM:

Long sequences (> 30 time steps)

Complex dependencies

Better performance needed

Example: Document classification, machine translation

5.7 LSTM Variants


1. Peephole Connections:

Allow gates to look at the cell state directly.


f_t = σ(W_f × [C_(t-1), h_(t-1), x_t] + b_f)

Cell state directly influences gate decisions

2. Coupled Forget and Input Gates:

Simplification: The forget and input decisions are tied together. New information is written only where old information is forgotten, and vice versa.

f_t = σ(...)
i_t = 1 - f_t

Fewer parameters, sometimes works just as well

5.8 GRU (Gated Recurrent Unit): LSTM's Simpler Cousin


Key Differences from LSTM:

1. No separate cell state (only hidden state)

2. Two gates instead of three:


Reset gate (r_t)

Update gate (z_t)

3. Fewer parameters: ~3N² instead of 4N²

GRU Equations:

Reset Gate:
r_t = σ(W_r × [h_(t-1), x_t])

Update Gate:
z_t = σ(W_z × [h_(t-1), x_t])

Candidate Hidden State:


h̃_t = tanh(W × [r_t * h_(t-1), x_t])

Final Hidden State:


h_t = (1 - z_t) * h_(t-1) + z_t * h̃_t
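The four GRU equations translate almost line by line into code. This is a minimal pure-Python sketch under stated assumptions: biases are omitted (as in the equations above), the function names are illustrative, and the all-zero weight matrices are placeholders rather than trained values.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    """Return W·v, where W is a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def gru_step(x_t, h_prev, W_r, W_z, W_h):
    """One GRU time step; returns the new hidden state h_t."""
    concat = h_prev + x_t  # [h_(t-1), x_t]
    r_t = [sigmoid(a) for a in matvec(W_r, concat)]  # reset gate
    z_t = [sigmoid(a) for a in matvec(W_z, concat)]  # update gate
    # Candidate uses the reset-gated previous state: [r_t * h_(t-1), x_t]
    gated = [r * h for r, h in zip(r_t, h_prev)] + x_t
    h_hat = [math.tanh(a) for a in matvec(W_h, gated)]
    # Interpolate between the old state and the candidate
    return [(1 - z) * h + z * g for z, h, g in zip(z_t, h_prev, h_hat)]

# Placeholder weights (assumption): hidden size 2, input size 2, all zeros
zero_W = [[0.0] * 4 for _ in range(2)]
h_t = gru_step([1.0, 0.0], [0.5, -0.3], zero_W, zero_W, zero_W)
print(h_t)
```

With zero weights both gates equal 0.5 and the candidate is 0, so the new hidden state is half of the old one. This makes the interpolation role of z_t easy to see.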

Update Gate Intuition:

z_t = 0: Completely ignore new input, keep old hidden state

z_t = 1: Completely forget old state, use new candidate

z_t = 0.5: Balance between old and new

LSTM vs GRU:

| Aspect | LSTM | GRU |
|---|---|---|
| Parameters | More | Fewer (about 25% fewer) |
| Training speed | Slower | Faster |
| Memory capacity | Better | Good |
| Performance | Slight edge on complex tasks | Similar on most tasks |

Practical Advice: Try GRU first (faster training). If performance isn't good enough, try LSTM.

5.9 Components Breakdown: Forget, Input, Output Gates


Forget Gate Deep Dive:

Purpose: Selectively forget irrelevant information

Example - Subject Tracking:

Sentence: "The cat sat on the mat. The dog..."

At "cat": Cell state stores [subject=cat, number=singular, ...]


At "dog":
- Forget gate closes for "subject" (f_t ≈ 0.1), erasing "cat"
- New subject "dog" can replace "cat" (written by the input gate)
- Keeps "number=singular" (f_t ≈ 0.9)

Input Gate Deep Dive:

Purpose: Decide what new information to add

Example - Sentiment Accumulation:

Review: "The movie had great acting but terrible plot"

At "great": i_t high, adds positive sentiment


At "terrible": i_t high, adds negative sentiment
Final: Balanced sentiment (mixed review)

Output Gate Deep Dive:

Purpose: Filter what information to output

Example - Next Word Prediction:


Input: "The capital of France is"

Cell state contains: [country=France, topic=geography, ...]


Output gate: Filters to output only "capital city" feature
Prediction: "Paris"

5.10 Hands-On: Text Generation with LSTM


Problem: Train LSTM to generate Shakespeare-style text

Architecture:

Input: Sequence of characters



Embedding Layer: Character → Vector (128 dim)

LSTM Layer 1: 256 units

LSTM Layer 2: 256 units (stacked for more capacity)

Dense Layer: vocab_size neurons (one per character)

Softmax: Probability distribution over next character
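To get a feel for the size of this architecture, the parameters can be counted layer by layer. The sketch below assumes a vocabulary of 65 characters (a hypothetical value; the notes do not fix `vocab_size`) and one bias vector per LSTM gate. Some frameworks store two bias vectors per gate, which gives a slightly higher count.

```python
def embedding_params(vocab_size, embed_dim):
    return vocab_size * embed_dim

def lstm_params(input_size, hidden_size):
    # 4 blocks (forget, input, candidate, output), each with
    # weights on [h_(t-1), x_t] and one bias vector
    return 4 * (hidden_size * (hidden_size + input_size) + hidden_size)

def dense_params(input_size, output_size):
    return input_size * output_size + output_size

vocab_size = 65  # assumed for illustration

layers = {
    "embedding": embedding_params(vocab_size, 128),
    "lstm_1": lstm_params(128, 256),
    "lstm_2": lstm_params(256, 256),
    "dense": dense_params(256, vocab_size),
}
total = sum(layers.values())

for name, count in layers.items():
    print(f"{name}: {count}")
print("total:", total)
```

Under these assumptions the two LSTM layers hold the large majority of the parameters. The softmax itself adds none.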

Training Example:

Input sequence: "To be or not to"

Target: Next character is " " (space)

Process character by character:


'T' → h_1, C_1
'o' → h_2, C_2
' ' → h_3, C_3
...
'o' → h_15, C_15

At final 'o' (the 15th character):
Output = softmax(Dense(h_15))
= [p('a')=0.02, p(' ')=0.65, p('b')=0.05, ...]

Loss = -log(p(' ')) = -log(0.65) = 0.43

Backpropagate through time, update weights
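The loss in this training example is the cross-entropy (negative natural log) of the probability assigned to the correct character. The sketch below recomputes the value from the notes and then shows the same calculation from raw scores. The three-character vocabulary and the logits are hypothetical values for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target_index):
    """Negative log-probability of the correct class."""
    return -math.log(probs[target_index])

# Value from the notes: p(' ') = 0.65
loss_from_notes = -math.log(0.65)
print("Loss for p = 0.65:", round(loss_from_notes, 2))

# Hypothetical logits for a toy vocabulary ['a', ' ', 'b']; target is ' '
probs = softmax([0.0, 3.0, 1.0])
loss = cross_entropy(probs, target_index=1)
print("Toy probabilities:", probs)
print("Toy loss:", loss)
```

The loss approaches 0 as the probability of the correct character approaches 1, and grows without bound as that probability approaches 0.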

Generation Process:
Seed text: "To be"

Step 1: Input "To be" → Model outputs distribution


Sample character: ' ' (space)

Step 2: Input "To be " → Model outputs distribution


Sample character: 'o'

Step 3: Input "To be o" → Model outputs distribution


Sample character: 'r'

Continue until desired length...


Result: "To be or not to be, that is the question..."
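The generation loop can be sketched independently of the network. In the code below, `toy_model` is a hypothetical stand-in for a trained LSTM: it simply continues a fixed quote with probability 1, so the loop's behaviour is easy to follow. A real model would return a full distribution over the vocabulary.

```python
import random

def generate(seed, next_char_probs, length, rng):
    """Autoregressive generation: sample a character, append it, repeat."""
    text = seed
    for _ in range(length):
        probs = next_char_probs(text)  # dict: character -> probability
        chars = list(probs)
        weights = [probs[c] for c in chars]
        text += rng.choices(chars, weights=weights, k=1)[0]
    return text

QUOTE = "To be or not to be"

def toy_model(text):
    """Stand-in for a trained network (assumption): continue QUOTE, then pad."""
    next_char = QUOTE[len(text)] if len(text) < len(QUOTE) else " "
    return {next_char: 1.0}

result = generate("To be", toy_model, length=10, rng=random.Random(0))
print(result)
```

Each new character is fed back as part of the input for the next step, which is the same feedback loop shown in Steps 1 to 3.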

Temperature Sampling:

Control randomness of generation by dividing the logits by a temperature T before the softmax:

p_new = softmax(logits / T)

High temperature (T=1.5): More random, creative

Low temperature (T=0.5): More conservative, predictable
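The effect of the temperature is easiest to see on a small example. The sketch below applies p_new = softmax(logits / T) to hypothetical logits; the function name and the logit values are assumptions for illustration.

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """softmax(logits / T), computed in a numerically stable way."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical scores for three characters

for T in (0.5, 1.0, 1.5):
    probs = softmax_with_temperature(logits, T)
    print(f"T={T}:", [round(p, 3) for p in probs])
```

Lower T concentrates probability on the highest-scoring character. Higher T flattens the distribution, so less likely characters are sampled more often.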

5.11 Applications of LSTMs


1. Machine Translation:

English → [Encoder LSTM] → Context Vector → [Decoder LSTM] → French

"Hello world" → [0.34, -0.23, ...] → "Bonjour monde"

2. Speech Recognition:

Audio waveform → [Feature extraction] → [LSTM] → Text transcription

Handles variable-length audio naturally

3. Video Classification:

Frame 1, Frame 2, ..., Frame N → [LSTM] → Action label

Understands temporal patterns in video

4. Music Generation:
Seed melody → [LSTM] → Next note → Feed back → Next note → ...

Learns rhythm, harmony, and structure

5. Anomaly Detection:

Normal sequence: t1, t2, t3, ... → [LSTM predicts] → t4


If actual t4 very different from prediction → Anomaly!

Used in: Fraud detection, equipment monitoring, cybersecurity
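The prediction-error idea can be sketched without a trained network. In the code below, `mean_predictor` is a hypothetical stand-in for the LSTM's forecast, and the readings and threshold are made-up values for illustration.

```python
def find_anomalies(series, predict_next, threshold):
    """Return indices where the actual value is far from the prediction."""
    anomalies = []
    for t in range(1, len(series)):
        predicted = predict_next(series[:t])  # forecast from the history so far
        if abs(series[t] - predicted) > threshold:
            anomalies.append(t)
    return anomalies

def mean_predictor(history):
    """Stand-in for an LSTM forecast (assumption): mean of past values."""
    return sum(history) / len(history)

readings = [10.0, 10.2, 10.1, 10.3, 25.0, 10.2]  # hypothetical sensor data
flagged = find_anomalies(readings, mean_predictor, threshold=5.0)
print(flagged)  # → [4]
```

Only the spike at index 4 exceeds the threshold. In practice the threshold is usually chosen from the distribution of prediction errors on normal data.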

Summary of All 5 Units


Unit I: Foundation
Deep learning basics and motivation

Perceptrons and basic neural networks

Activation functions and training concepts

Unit II: Training Deep Networks


Forward and backward propagation

Optimization algorithms

Multi-layer perceptrons and practical training

Unit III: Spatial Data (Images)

Convolutional neural networks

Filters, pooling, and feature hierarchies

CNN architectures for computer vision

Unit IV: Sequential Data (Basic)


Recurrent neural networks

Handling variable-length sequences

Temporal dependencies

Unit V: Sequential Data (Advanced)


LSTM architecture and gates

Solving vanishing gradients


Advanced sequence modeling

Key Formulas Reference


Perceptron:

output = step(Σ(w_i × x_i) + b)

Sigmoid:

σ(x) = 1 / (1 + e^(-x))

ReLU:

ReLU(x) = max(0, x)

Gradient Descent:

w_new = w_old - η × (∂L/∂w)

Convolution Output Size:

Output = floor((Input - Filter + 2×Padding) / Stride) + 1

LSTM Forget Gate:

f_t = σ(W_f × [h_(t-1), x_t] + b_f)

LSTM Cell State Update:

C_t = f_t * C_(t-1) + i_t * C̃_t
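Several of the formulas above are short enough to practise as code. This is a minimal sketch for self-checking hand calculations. The function names are illustrative, the step function is assumed to output 1 when its input is ≥ 0, and the convolution size uses integer (floor) division for cases where the stride does not divide evenly.

```python
import math

def perceptron(x, w, b):
    """step(Σ(w_i × x_i) + b), with step(z) = 1 if z >= 0 else 0 (assumed)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if z >= 0 else 0

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    return max(0.0, x)

def gradient_step(w_old, grad, lr):
    """w_new = w_old - η × (∂L/∂w)"""
    return w_old - lr * grad

def conv_output_size(input_size, filter_size, padding, stride):
    return (input_size - filter_size + 2 * padding) // stride + 1

# Perceptron acting as an AND gate (weights chosen by hand)
print(perceptron([1, 1], w=[1, 1], b=-1.5), perceptron([1, 0], w=[1, 1], b=-1.5))

print(sigmoid(0.0), relu(-2.0), relu(3.0))

# One gradient descent step on L(w) = (w - 3)^2 from w = 1, η = 0.1
print(gradient_step(1.0, grad=2 * (1.0 - 3.0), lr=0.1))

# 32×32 input, 5×5 filter, no padding, stride 1
print(conv_output_size(32, 5, 0, 1))
# 224×224 input, 7×7 filter, padding 3, stride 2
print(conv_output_size(224, 7, 3, 2))
```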

Exam Preparation Tips


Theory Questions (80%):

1. Understand concepts deeply: Don't just memorize, understand WHY

2. Draw diagrams: Network architectures, data flow, gate mechanisms

3. Compare and contrast: RNN vs LSTM, CNN vs MLP, different activation functions
4. Real-world applications: Be able to explain where each architecture shines

5. Limitations: Know what each model cannot do well

Math Calculations (20%):

1. Practice forward pass: Given inputs and weights, compute outputs

2. Parameter counting: Calculate number of parameters in layers

3. Output size calculations: For convolutions and pooling

4. Simple backprop: Compute gradients for toy examples

5. Probability calculations: Softmax, cross-entropy loss

Common Exam Topics:

Explain vanishing gradient problem and LSTM solution

Design CNN architecture for given problem

Calculate convolution output dimensions

Compare different activation functions

Trace data flow through RNN time steps

Explain role of each LSTM gate with examples

Good luck with your exam! 🎓
