Introduction to Sequence Models

Sequence models are deep learning models designed for sequential data where the order of elements is crucial, applied in various fields like speech recognition and machine translation. They can handle different types of sequence problems, including cases where both input and output are sequences or only one of them is. Recurrent Neural Networks (RNNs) are introduced as a solution to the limitations of standard neural networks in processing sequential data, utilizing shared parameters and contextual memory.

Uploaded by

thompsonjude1123
Sequence Models – Lecture 1: Introduction

1.1 What Are Sequence Models?

• Definition:
Sequence models are a class of deep learning models designed to handle data that
comes in sequences — where the order of elements matters.
Unlike traditional models that treat inputs as independent (like classifying an
image), sequence models capture dependencies across time or position.

• Why They Matter:

o Many real-world problems involve sequential data.

o Examples:

▪ Speech Recognition – Input: audio waveform over time; Output: sequence of words.

▪ Music Generation – Input: could be empty or a starting note; Output: sequence of notes.

▪ Sentiment Classification – Input: sequence of words in a review; Output: rating (positive/negative).

▪ DNA Sequence Analysis – Input: sequence of bases (A, C, G, T); Output: classification of function/region.

▪ Machine Translation – Input: sentence in one language; Output: sentence in another.

▪ Video Activity Recognition – Input: sequence of video frames; Output: recognized activity.

▪ Named Entity Recognition (NER) – Input: sentence; Output: tags marking names, locations, organizations, etc.

1.2 Types of Sequence Problems

• Both Input and Output are Sequences

o Example: Machine translation (French sentence → English sentence).

o Important: Input and output lengths can differ.

• Only Input is a Sequence

o Example: Sentiment classification (movie review text → star rating).

o Input: sequence of words.

o Output: a single value/label.

• Only Output is a Sequence

o Example: Music generation (no input, or just a starting note → sequence of notes).

1.3 Characteristics of Sequence Data

• Variable Lengths:

o Input (Tx) and Output (Ty) can have different lengths depending on the
example.

o Ex: French sentence may be 10 words; English translation may be 13 words.

• Dependency Across Positions:

o An output at one position can depend on inputs at other positions, both earlier and later.

o Example: Deciding whether “Teddy” in a sentence refers to a person or a toy requires knowing later words.

• Supervised Learning Framework:

o Training data usually given as pairs (X, Y) where:

▪ X = input sequence

▪ Y = output sequence or label.

1.4 Notation for Sequence Problems

• Input Sequence (X):

o Denoted as x<1>, x<2>, …, x<Tx>

o Tx = length of input sequence.


o Example: Sentence → words → represented as tokens.

• Output Sequence (Y):

o Denoted as y<1>, y<2>, …, y<Ty>

o Ty = length of output sequence.

• Example (NER Task):

o Input: "Harry Potter invented a new spell"

o Output: [Name, Name, Other, Other, Other, Other]


(Each word tagged as “part of a name” or not).

• Training Set Notation:

o x^(i) = i-th training example’s input sequence.

o y^(i) = i-th training example’s output sequence.

o Different examples may have different lengths, so Tx^(i) and Ty^(i) vary.

1.5 Word Representation

• Vocabulary/Dictionary:

o Build a vocabulary of the most common words.

o Example: Size = 10,000 words.

o Each word assigned an index.

o Example indices:

▪ a → 1

▪ Aaron → 2

▪ and → 367

▪ Harry → 4075

▪ Potter → 6830

▪ Zulu → 10,000

• One-Hot Encoding:
o Represent words as large vectors with mostly zeros, and a 1 at the index of
the word.

o Example:

▪ “Harry” → [0, 0, 0, …, 1 (at 4075), …, 0]

▪ “Potter” → [0, …, 1 (at 6830), …, 0]

o Each vector length = size of vocabulary.

• Unknown Words (<UNK>):

o If a word isn’t in vocabulary → replace with special <UNK> token.

o Prevents model from failing on rare/new words.
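The vocabulary lookup, one-hot encoding, and <UNK> fallback described above can be sketched as follows. This is a minimal illustration with a five-word toy vocabulary; the helper names `build_vocab` and `one_hot` are hypothetical, and reserving index 0 for <UNK> is an assumption of this sketch (the notes index words from 1).

```python
def build_vocab(words):
    """Map each distinct word to an index; index 0 is reserved for <UNK> (an assumption)."""
    vocab = {"<UNK>": 0}
    for w in sorted(set(words)):
        vocab[w] = len(vocab)
    return vocab

def one_hot(word, vocab):
    """Return a list of zeros with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[vocab.get(word, vocab["<UNK>"])] = 1
    return vec

vocab = build_vocab(["a", "and", "harry", "potter", "spell"])
harry_vec = one_hot("harry", vocab)        # 1 at the index assigned to "harry"
unknown_vec = one_hot("hermione", vocab)   # not in the vocabulary, so it maps to <UNK>
```

Each vector has the same length as the vocabulary, which is why a 10,000-word vocabulary produces 10,000-dimensional inputs.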

1.6 Challenges with Standard Neural Networks

• Why not just feed sequences into a regular feed-forward NN?

1. Variable Input/Output Lengths:

▪ Sentences differ in length. Padding to a max length is inefficient.

2. No Shared Knowledge Across Positions:

▪ If the network learns “Harry at position 1 = Name,” it doesn’t automatically generalize to “Harry at position 5.”

▪ Similar to why we needed Convolutions in CNNs for spatial generalization.

Sequence Models – Lecture 2: Recurrent Neural Networks (RNNs)

2.1 Motivation for RNNs

• From Lecture 1 we saw:

o Standard feed-forward networks are not ideal for sequential data because:

1. Inputs/outputs can vary in length.


2. Learned patterns don’t generalize across positions.

3. The number of parameters explodes (large weight matrices for one-hot vectors).

• Recurrent Neural Networks (RNNs) solve this by:

o Sharing parameters across time steps.

o Passing information from one step to the next (a kind of "memory").

o Making predictions step by step, updating as the sequence unfolds.

2.2 Structure of a Basic RNN

• Imagine reading a sentence word by word:

o At time step 1: input x<1> → hidden state a<1> → output ŷ<1>.

o At time step 2: input x<2> + hidden state from step 1 → hidden state a<2> →
output ŷ<2>.

o Repeat until the end of the sequence.

• Diagram (described in words):

o Draw a row of boxes from left to right (time steps).

o Each box contains:

▪ Input arrow going in (x<t>).

▪ Output arrow going out (ŷ<t>).

▪ A curved arrow from the previous hidden state feeding into the box.

o At time step 0, the hidden state is initialized to all zeros (a<0> = 0).

2.3 Forward Propagation Equations

• Hidden state update:

• a<t> = g(Waa * a<t-1> + Wax * x<t> + ba)

where:

o a<t> = hidden state at time t


o g = activation function (commonly tanh, sometimes ReLU)

o Waa = weights for previous hidden state

o Wax = weights for current input

o ba = bias

• Output prediction:

• ŷ<t> = g'(Wya * a<t> + by)

where:

o Wya = weights from hidden state to output

o g' = activation function for output (sigmoid for binary, softmax for multi-class)

o by = bias for output

• Initialization:

• a<0> = 0
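The forward equations above can be sketched in plain Python. This is a minimal illustration, assuming tanh for g and softmax for g'; the helper names (`matvec`, `rnn_step`, `rnn_forward`) and all weight values are made up for the example.

```python
import math

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    exps = [math.exp(s - m) for s in z]
    total = sum(exps)
    return [e / total for e in exps]

def rnn_step(a_prev, x, Waa, Wax, Wya, ba, by):
    """a<t> = tanh(Waa a<t-1> + Wax x<t> + ba); y_hat<t> = softmax(Wya a<t> + by)."""
    pre = [p + q + b for p, q, b in zip(matvec(Waa, a_prev), matvec(Wax, x), ba)]
    a = [math.tanh(s) for s in pre]
    y_hat = softmax([s + b for s, b in zip(matvec(Wya, a), by)])
    return a, y_hat

def rnn_forward(xs, Waa, Wax, Wya, ba, by):
    """Apply the same parameters at every time step, starting from a<0> = 0."""
    a = [0.0] * len(ba)
    outputs = []
    for x in xs:
        a, y_hat = rnn_step(a, x, Waa, Wax, Wya, ba, by)
        outputs.append(y_hat)
    return a, outputs

# Toy sizes (hypothetical): 2 hidden units, 3-dimensional one-hot inputs, 2 output classes.
Waa = [[0.5, -0.1], [0.2, 0.3]]
Wax = [[0.1, 0.4, -0.3], [-0.2, 0.1, 0.5]]
Wya = [[1.0, -1.0], [-1.0, 1.0]]
ba, by = [0.0, 0.0], [0.0, 0.0]
xs = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
a_final, y_hats = rnn_forward(xs, Waa, Wax, Wya, ba, by)
```

The loop body never changes between steps, which is the parameter sharing the notes describe: only the hidden state `a` carries information forward.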

2.4 Simplified Notation

To reduce complexity:

• Place Waa and Wax side by side to form one big weight matrix Wa = [Waa | Wax].

• Stack a<t-1> on top of x<t> to form one long vector [a<t-1>, x<t>].

• Then:

• a<t> = g(Wa * [a<t-1>, x<t>] + ba)

• ŷ<t> = g'(Wy * a<t> + by)
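As a quick check that the simplified notation computes the same thing, the sketch below builds Wa by joining each row of Waa with the matching row of Wax and compares the two forms. The numbers are arbitrary and the `matvec` helper is hypothetical.

```python
def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

Waa = [[0.5, -0.1], [0.2, 0.3]]
Wax = [[0.1, 0.4, -0.3], [-0.2, 0.1, 0.5]]
a_prev = [0.3, -0.7]
x = [0.0, 1.0, 0.0]

# Each row of Wa is a row of Waa followed by the same row of Wax.
Wa = [row_a + row_x for row_a, row_x in zip(Waa, Wax)]
stacked = a_prev + x  # the vector [a<t-1>, x<t>]

separate = [p + q for p, q in zip(matvec(Waa, a_prev), matvec(Wax, x))]
combined = matvec(Wa, stacked)
```

Both lists agree up to floating-point rounding, so the single-matrix form is only a change of bookkeeping.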

2.5 Diagram of RNNs

• Unrolled Representation (used in this course):

o Imagine the same RNN cell repeated across time, stretched left to right.

o Inputs (x<1>, x<2>, …) enter each cell.

o Hidden state (a<t>) passes horizontally from one cell to the next.
o Outputs (ŷ<1>, ŷ<2>, …) come out vertically from each cell.

• Folded Representation (common in textbooks):

o One single box with a looping arrow back into itself (showing recurrence).

o Harder to interpret but compact.

o Mentally: “unroll” this box into multiple steps like above.

2.6 Key Characteristics of RNNs

• Parameter Sharing:

o The same weights (Wax, Waa, Wya) are used across all time steps.

o Greatly reduces number of parameters.

o Helps generalize: if the network learns that “Harry” in one position likely
signals a name, it applies everywhere.

• Contextual Memory:

o Each prediction uses information from current input + past context.

o Example: when predicting ŷ<3>, the network uses x<3> and also information
flowing from x<1>, x<2>.

• Limitation:

o Standard RNNs only look backward in time (past inputs).

o They cannot use future inputs to help current predictions.

o Example: deciding if “Teddy” is a person or a teddy bear requires knowing later words (“Roosevelt” vs. “bears”).

o This leads to the idea of Bidirectional RNNs (BRNNs) (covered later).

2.7 Example Walkthrough

Named Entity Recognition Example (NER):

• Sentence: "Harry Potter and Hermione invented a new spell"

• Process:
1. Convert words into one-hot vectors.

2. Pass word by word into RNN.

3. At each step, hidden state updates with new information.

4. Output at each step: 1 if part of a name, 0 otherwise.

• Flow (in words):

o At x<1> (“Harry”) → hidden state updates from a<0> = 0 → output ŷ<1> = likely 1.

o At x<2> (“Potter”) → hidden state carries “Harry” → stronger prediction that “Potter” = name.

o At x<3> (“and”) → hidden state updates, prediction ŷ<3> = 0.

o And so on.

Sequence Models – Lecture 3: Backpropagation Through Time (BPTT)

3.1 Forward Propagation Recap

Before diving into backpropagation, recall the forward pass of an RNN:

• At each time step t:

• a<t> = g(Waa * a<t-1> + Wax * x<t> + ba)

• ŷ<t> = g'(Wya * a<t> + by)

o a<t> = hidden state at time step t

o ŷ<t> = prediction at time t

o Parameters: Waa, Wax, Wya, ba, by (shared across all time steps)

• Loss Function (per step):


If target output is y<t>:

• L<t> = Loss(y<t>, ŷ<t>)

Typically cross-entropy loss for classification tasks.


• Total Loss (sequence):

• L = Σ (from t=1 to T) L<t>
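A small numeric illustration of the per-step and total loss, assuming cross-entropy and a two-class output (0 = Other, 1 = Name). The predicted probabilities are invented for the example.

```python
import math

def cross_entropy(y_true_index, y_hat):
    """Loss for one step: negative log of the probability given to the true class."""
    return -math.log(y_hat[y_true_index])

# Hypothetical predictions for a 3-step sequence.
y_hats = [[0.2, 0.8], [0.3, 0.7], [0.9, 0.1]]
targets = [1, 1, 0]

step_losses = [cross_entropy(y, y_hat) for y, y_hat in zip(targets, y_hats)]
total_loss = sum(step_losses)  # L = sum over t of L<t>
```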

3.2 The Idea of Backpropagation Through Time

• Training requires computing gradients of the total loss L with respect to the
parameters (Waa, Wax, Wya, ba, by).

• Unlike a simple feed-forward NN, RNNs reuse parameters across time steps.

• Therefore, errors must be propagated backward through each time step.

This unfolding of the computation graph backward in time gives the name:
Backpropagation Through Time (BPTT).

3.3 Computation Graph for an RNN

Forward graph (described in words):

• Inputs (x<1>, x<2>, …) enter nodes one after another.

• Hidden states (a<t>) pass horizontally to the next step.

• Outputs (ŷ<t>) branch vertically from each hidden state.

• Loss nodes L<t> attach to each output.

• Total loss L = sum of all L<t>.

Backward graph:

• Reverse arrows:

o Errors flow back from outputs ŷ<t> to hidden states a<t>.

o Errors from later time steps flow backward to earlier time steps via recurrent connections.

o Parameters (Waa, Wax, Wya) accumulate gradient contributions from all time steps.

3.4 Gradient Flow in BPTT


• To update the parameters, sum the contributions from every time step:

o For Wya:

o ∂L/∂Wya = Σ (over t) (∂L<t>/∂ŷ<t>) * (∂ŷ<t>/∂Wya)

o For Waa and Wax:

▪ Trickier, because a<t> depends recursively on all previous hidden states.

▪ Gradient must “flow back” across time.

• Key point:

o The error at time T influences not just a<T>, but also earlier a<T-1>, a<T-2>, …
through the recurrent chain.
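To make the backward flow concrete, here is a sketch of BPTT for a deliberately tiny RNN with a single scalar hidden unit. It simplifies the setup in the notes: the prediction is taken to be the hidden state itself and the loss is squared error rather than cross-entropy, both assumptions made only to keep the example short. The result is compared against a finite-difference estimate.

```python
import math

def forward(w, u, xs):
    """Scalar RNN: a<t> = tanh(w * a<t-1> + u * x<t>), with a<0> = 0."""
    a = [0.0]
    for x in xs:
        a.append(math.tanh(w * a[-1] + u * x))
    return a

def loss(w, u, xs, ys):
    """Total squared-error loss summed over time steps (y_hat<t> = a<t> here)."""
    a = forward(w, u, xs)
    return sum(0.5 * (a[t + 1] - ys[t]) ** 2 for t in range(len(xs)))

def bptt_grad_w(w, u, xs, ys):
    """dL/dw computed by walking from the last time step back to the first."""
    a = forward(w, u, xs)
    grad_w = 0.0
    da_next = 0.0  # gradient arriving from step t+1 through the recurrent link
    for t in range(len(xs), 0, -1):
        da = (a[t] - ys[t - 1]) + da_next   # local loss term plus signal from later steps
        dz = da * (1.0 - a[t] ** 2)         # back through tanh
        grad_w += dz * a[t - 1]             # this step's contribution to the shared weight
        da_next = dz * w                    # pass the signal on to a<t-1>
    return grad_w

xs, ys = [1.0, 0.5, -0.3, 0.8], [0.2, 0.1, -0.1, 0.4]
w, u = 0.7, 0.9
analytic = bptt_grad_w(w, u, xs, ys)
eps = 1e-6
numeric = (loss(w + eps, u, xs, ys) - loss(w - eps, u, xs, ys)) / (2 * eps)
```

The line `da_next = dz * w` is the recurrent chain: every later error is multiplied by `w` and a tanh derivative on its way to earlier steps.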

3.5 Vanishing and Exploding Gradients

When gradients are propagated back through many time steps:

• Vanishing Gradient Problem:

o Gradients shrink exponentially.

o Early steps receive almost no update signal.

o Result: model struggles with long-term dependencies (e.g., remembering subject-verb agreement far apart in a sentence).

• Exploding Gradient Problem:

o Gradients grow exponentially.

o Parameters blow up → training becomes unstable (NaNs).

• Solutions:

o Exploding gradients: apply Gradient Clipping → cap gradient values at a maximum threshold.

o Vanishing gradients: requires architectural fixes → GRUs and LSTMs (covered later).
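The sketch below first shows how a repeated factor shrinks or blows up over 50 steps, then illustrates two clipping variants. Clipping by value matches the description above; clipping by overall norm is a common alternative added here for comparison, not something the notes specify. The function names are hypothetical.

```python
def clip_by_value(grads, threshold):
    """Cap each gradient component to the range [-threshold, threshold]."""
    return [max(-threshold, min(threshold, g)) for g in grads]

def clip_by_norm(grads, max_norm):
    """Rescale the whole gradient vector if its L2 norm exceeds max_norm."""
    norm = sum(g * g for g in grads) ** 0.5
    if norm <= max_norm:
        return list(grads)
    return [g * max_norm / norm for g in grads]

# Repeated multiplication by the same factor mimics the recurrent chain over 50 steps.
shrinking = 0.5 ** 50   # factor below 1: the signal vanishes
growing = 1.5 ** 50     # factor above 1: the signal explodes

clipped_values = clip_by_value([0.3, -12.0, 7.5], threshold=5.0)
clipped_norm = clip_by_norm([3.0, 4.0], max_norm=1.0)
```

Clipping by norm keeps the direction of the gradient and only shortens it, while clipping by value can change the direction.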
3.6 Example: Named Entity Recognition (NER)

Let’s visualize the gradient flow for the Harry Potter example:

• Forward:

o Input sentence: Harry Potter and Hermione invented …

o RNN computes outputs ŷ<t> at each word.

• Backward:

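o If prediction for “Harry” was wrong, error ∂L<1>/∂ŷ<1> flows back:

▪ To a<1>, and from there into the shared parameters.

▪ Errors at later words (∂L<2>/∂ŷ<2>, ∂L<3>/∂ŷ<3>, …) also flow back through a<1>, since a<2>, a<3>, … depend on a<1>.

This explains why an early hidden state affects later predictions in the forward pass, and why errors at later steps shape how early steps are updated during training.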

3.7 Why “Through Time”?

• In forward pass:

o We move left → right (increasing time).

• In backprop:

o We move right → left (decreasing time).

• This reverse flow across time steps inspired the cool name: Backpropagation
Through Time (BPTT).

Think of it like: “carrying error signals back in time”.

Sequence Models – Lecture 4: RNN Architectures

4.1 Motivation

• Not all sequence problems look the same.


• Sometimes:

o Input is a single value (e.g., image).

o Output is a sequence (e.g., caption).

• Other times:

o Input is a sequence (e.g., sentence).

o Output is a single value (e.g., sentiment).

• Or even: both input and output are sequences of different lengths.

That’s why we study different architectures of RNNs.

4.2 One-to-One (Standard NN)

• Structure:

o Input → Feed-forward NN → Output.

• Diagram (in words):

o A single box with input arrow (x) entering, and output arrow (y) leaving.

• Example:

o Image classification (input = image, output = label).

• Note:

o This is not sequential, but forms the baseline for comparison.

4.3 One-to-Many

• Structure:

o Single input produces a sequence of outputs.

• Diagram (in words):

o One input arrow (x) goes into an RNN cell.

o That cell unrolls into a chain of hidden states, producing outputs (y<1>, y<2>,
…).
• Example:

o Music generation (input = genre/starting note, output = sequence of notes).

o Image captioning (input = image, output = sequence of words in a caption).

4.4 Many-to-One

• Structure:

o Sequence of inputs → Single output.

• Diagram (in words):

o Inputs (x<1>, x<2>, …, x<Tx>) enter RNN cells sequentially.

o Hidden states pass information along.

o Only the final hidden state produces an output y.

• Example:

o Sentiment analysis (input = review sentence, output = star rating).

o Fraud detection (input = transaction sequence, output = fraud/not fraud).

4.5 Many-to-Many (Same Length)

• Structure:

o Sequence of inputs → Sequence of outputs (lengths match).

• Diagram (in words):

o Inputs (x<1> … x<T>) enter RNN cells one by one.

o Each cell produces its own output (ŷ<1> … ŷ<T>).

• Example:

o Named Entity Recognition (NER):

▪ Input: "Harry Potter invented …"

▪ Output: [Name, Name, Other, …]

o Part-of-speech tagging.
4.6 Many-to-Many (Different Lengths) – Encoder–Decoder

• Structure:

o Input sequence (any length) → Encoded into a fixed-length context vector.

o Another RNN (decoder) takes the context vector and produces an output
sequence (any length).

• Diagram (in words):

o Left half: RNN cells consume input sequence, hidden state at the end
becomes context vector.

o Right half: Decoder RNN takes context vector, generates output sequence
step by step.

• Example:

o Machine Translation:

▪ Input: French sentence (length Tx).

▪ Output: English sentence (length Ty).

o Speech-to-text.

4.7 Summary of Architectures

Type | Input | Output | Example
One-to-One | Single | Single | Image classification
One-to-Many | Single | Sequence | Music generation, captioning
Many-to-One | Sequence | Single | Sentiment analysis
Many-to-Many (same length) | Sequence | Sequence (same length) | NER, POS tagging
Many-to-Many (different length) | Sequence | Sequence (different length) | Translation, speech-to-text
Sequence Models – Lecture 5: Language Models and Sequence Generation

5.1 What is a Language Model (LM)?

• Definition:
A language model assigns probabilities to sequences of words.

o Formally: Given a sequence y<1>, y<2>, …, y<T>, the model estimates

o P(y<1>, y<2>, …, y<T>)

• Why important?

o They capture grammar, meaning, and common usage patterns in a language.

o Foundation for NLP tasks:

▪ Speech recognition (pick most probable transcription).

▪ Machine translation.

▪ Text generation (chatbots, autocomplete, story generation).

5.2 Chain Rule of Probability

• To compute probability of an entire sentence, use the chain rule:

• P(y<1>, y<2>, …, y<T>)

• = P(y<1>) * P(y<2>|y<1>) * P(y<3>|y<1>, y<2>) * … * P(y<T>|y<1>, …, y<T-1>)

• Example:
Sentence = "Cats sit on mats"

• P("Cats sit on mats") =

• P("Cats") *

• P("sit" | "Cats") *

• P("on" | "Cats sit") *

• P("mats" | "Cats sit on")


A good LM should assign higher probabilities to grammatically and semantically correct
sentences.
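The chain-rule product for "Cats sit on mats" can be worked through numerically. The conditional probabilities below are invented purely for illustration; a trained model would supply them.

```python
import math

# Hypothetical conditional probabilities; the numbers are made up.
conditionals = [
    ("Cats", 0.01),   # P("Cats")
    ("sit", 0.20),    # P("sit" | "Cats")
    ("on", 0.50),     # P("on" | "Cats sit")
    ("mats", 0.10),   # P("mats" | "Cats sit on")
]

sentence_prob = 1.0
log_prob = 0.0
for _, p in conditionals:
    sentence_prob *= p          # direct product of the conditionals
    log_prob += math.log(p)     # same quantity accumulated in log space
```

Summing logs gives the same answer as multiplying probabilities and stays numerically safe for long sentences, where the raw product becomes extremely small.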

5.3 RNN as a Language Model

• How RNN fits in:

o At each time step, the RNN takes the previous word and hidden state, then
outputs a probability distribution over the next word.

o Notation:

o ŷ<t> = softmax(Wya * a<t> + by)

o Here ŷ<t> is a vector of probabilities across the vocabulary.

• Diagram (in words):

o Imagine a chain of RNN cells.

o Input: "Cats" → hidden state → softmax → predicts next word distribution.

o Then "sit" is fed as the next input, continuing the process.

5.4 Training a Language Model

• Dataset:

o A large text corpus (e.g., Wikipedia, novels).

o Sentences broken into tokens (words or subwords).

• Supervised setup:

o Input at time t: previous word(s).

o Target at time t: actual next word.

• Loss function:

o Cross-entropy between predicted distribution and true word at each step.

• L = - Σ (over t) log P(y<t>| y<1>, …, y<t-1>)


5.5 Sequence Generation with an RNN

• Goal: generate new text by sampling from the learned probability distribution.

• Steps:

1. Start with a start token (<SOS>).

2. Feed into RNN → get probability distribution over vocabulary.

3. Sample a word (e.g., “The”).

4. Feed sampled word as next input.

5. Repeat until <EOS> (end-of-sentence) or max length.

• Diagram (in words):

o RNN cell outputs "The" → loops back in as input → outputs "cat" → loops again
→ outputs "slept" → … until stop.
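The generation loop can be sketched with a stand-in for the trained model. The `NEXT_WORD` table below is hypothetical: a real RNN would condition on its hidden state, whereas this toy only looks at the previous word. The point is the feed-back structure of the loop.

```python
import random

# Hypothetical next-word table standing in for a trained RNN's softmax output.
NEXT_WORD = {
    "<SOS>": {"The": 0.7, "A": 0.3},
    "The": {"cat": 0.6, "dog": 0.4},
    "A": {"cat": 0.5, "dog": 0.5},
    "cat": {"slept": 0.8, "<EOS>": 0.2},
    "dog": {"slept": 0.5, "<EOS>": 0.5},
    "slept": {"<EOS>": 1.0},
}

def generate(rng, max_len=10):
    """Sample one word at a time, feeding each sampled word back in as the next input."""
    word, sentence = "<SOS>", []
    for _ in range(max_len):
        dist = NEXT_WORD[word]
        word = rng.choices(list(dist), weights=list(dist.values()))[0]
        if word == "<EOS>":
            break
        sentence.append(word)
    return sentence

sentence = generate(random.Random(0))  # fixed seed only so the run is repeatable
```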

5.6 Sampling Strategies

How do we choose the next word from the probability distribution?

1. Greedy Sampling:

o Always pick the word with the highest probability.

o Fast, but often repetitive (“The cat the cat the cat…”).

2. Random Sampling:

o Sample randomly according to the distribution.

o Produces more variety but may generate nonsense.

3. Temperature Sampling:

o Adjusts how “sharp” or “flat” the distribution is.

o Formula:

o P_temp(i) = exp(logits(i)/T) / Σ exp(logits(j)/T)

▪ T = temperature hyperparameter.

▪ If T < 1: makes distribution sharper → more deterministic.


▪ If T > 1: makes distribution flatter → more random.

5.7 Perplexity: Evaluating a Language Model

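• Definition:
Perplexity measures how well a probability model predicts a sequence.

o For a sequence of length T, with the cross-entropy averaged per token and measured in bits (log base 2):

o Perplexity = 2^(Cross-Entropy Loss)

o If the loss uses natural logarithms instead, Perplexity = e^(Cross-Entropy Loss).

• Interpretation:

o Lower perplexity = better model.

o Intuition: “How surprised is the model by the test set?”

▪ Perplexity = 1 → perfect prediction.

▪ Perplexity = Vocabulary size → model is guessing uniformly at random.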
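The two reference points above can be checked with a short calculation. This sketch uses natural logarithms with exp, which gives the same number as base-2 logarithms with 2^(...). The function name and the vocabulary size of 10,000 are illustrative choices.

```python
import math

def perplexity(true_word_probs):
    """exp of the average negative log-probability the model gave to each true word."""
    avg_nll = -sum(math.log(p) for p in true_word_probs) / len(true_word_probs)
    return math.exp(avg_nll)

perfect = perplexity([1.0, 1.0, 1.0])   # model is certain and right every time
V = 10000
uniform = perplexity([1.0 / V] * 5)     # model guesses uniformly over V words
```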

5.8 Example Walkthrough

Sentence Prediction:

• Input so far: "The dog chased the"

• RNN outputs distribution over next word:

o "cat" → 0.60

o "ball" → 0.25

o "sun" → 0.05

o <UNK> → 0.10

• With greedy sampling: Next word = "cat".

• With higher temperature random sampling: "ball" might be chosen sometimes, making text more varied.
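Using the distribution from this walkthrough, the difference between greedy choice and random sampling can be sketched as follows. The seed and sample count are arbitrary.

```python
import random

dist = {"cat": 0.60, "ball": 0.25, "sun": 0.05, "<UNK>": 0.10}

# Greedy: always take the most probable word.
greedy_choice = max(dist, key=dist.get)

# Random sampling: draw words in proportion to their probabilities.
rng = random.Random(0)  # fixed seed only so the run is repeatable
samples = rng.choices(list(dist), weights=list(dist.values()), k=1000)
counts = {w: samples.count(w) for w in dist}
```

Greedy returns the same word every time, while the sampled counts roughly track the probabilities, so less likely words such as "ball" still appear.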
Sequence Models – Lecture 6: Sampling Methods and Beam Search

6.1 The Problem with Greedy Sampling

• Greedy sampling: always choose the word with the highest probability at each step.

• Issue:

o Produces locally optimal choices, but may miss globally better sentences.

o Often repetitive or bland (e.g., “the cat is the cat is the cat…”).

• Example:

o Input: "I love"

o Greedy choices:

▪ Step 1: "you" (highest prob).

▪ Step 2: "so" (highest prob).

▪ Step 3: "much" (highest prob).

o Sentence: "I love you so much"

o This is fine, but sometimes greedy search locks into poor early choices.

6.2 Random Sampling

• Method: Instead of picking the most probable word, sample from the distribution.

• Effect: Adds variety, but can generate nonsense if probabilities are flat.

• Problem: No guarantee of quality or coherence.

6.3 Temperature Sampling

• Adjusts the “creativity” of the model by scaling logits before softmax.

• Formula:

• P_temp(i) = exp(logits(i)/T) / Σ exp(logits(j)/T)


o T < 1 → sharper distribution, more deterministic.

o T > 1 → flatter distribution, more random.

• Example:

o Logits: [2, 1, 0] → softmax ≈ [0.67, 0.24, 0.09]

o With T = 0.5 → sharper: [0.87, 0.12, 0.02]

o With T = 2.0 → flatter: [0.51, 0.31, 0.19]

Useful for controlling diversity in text generation.
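The temperature formula can be evaluated directly on the logits [2, 1, 0]. This is a minimal sketch; the function name is hypothetical and T is assumed to be positive.

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Divide logits by T before the softmax; T must be positive."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
p_base = softmax_with_temperature(logits, T=1.0)   # roughly [0.67, 0.24, 0.09]
p_sharp = softmax_with_temperature(logits, T=0.5)  # roughly [0.87, 0.12, 0.02]
p_flat = softmax_with_temperature(logits, T=2.0)   # roughly [0.51, 0.31, 0.19]
```

Lowering T moves probability mass toward the top word, and raising T spreads it out, while the ranking of the words never changes.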

6.4 Beam Search – Motivation

• Problem: Greedy search may miss better sequences.

• Solution: Beam search explores multiple candidate sequences in parallel.

6.5 How Beam Search Works

• Beam width (B): number of sequences kept at each step.

• Algorithm:

1. Start with <SOS> token.

2. At step 1:

▪ Model outputs probabilities for all words.

▪ Keep top B candidates.

3. At step 2:

▪ Expand each candidate by one more word.

▪ Keep top B total candidates.

4. Repeat until <EOS> or max length.

• Diagram (in words):

o Imagine a branching tree of possible next words.

o Greedy search → follow only the top branch.


o Beam search → follow the top B branches at every step.
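The algorithm above can be sketched over a toy model. The `BIGRAMS` table is hypothetical and stands in for the decoder RNN (it conditions only on the previous word). Two design choices here are assumptions of this sketch: finished sequences are set aside rather than kept in the beam, and scores are raw sums of log-probabilities with no length normalization.

```python
import math

# Hypothetical next-word table; a real decoder would query the RNN instead.
BIGRAMS = {
    "<SOS>": {"I": 0.6, "It": 0.3, "He": 0.1},
    "I": {"am": 0.7, "student": 0.3},
    "It": {"is": 0.8, "was": 0.2},
    "He": {"is": 0.7, "studies": 0.3},
    "am": {"a": 0.9, "<EOS>": 0.1},
    "is": {"a": 0.5, "<EOS>": 0.5},
    "was": {"<EOS>": 1.0},
    "studies": {"<EOS>": 1.0},
    "a": {"student": 1.0},
    "student": {"<EOS>": 1.0},
}

def beam_search(beam_width, max_len=6):
    """Keep the beam_width highest-scoring partial sequences at every step."""
    beams = [(0.0, ["<SOS>"])]   # (sum of log-probs, tokens so far)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, tokens in beams:
            for word, p in BIGRAMS[tokens[-1]].items():
                candidates.append((score + math.log(p), tokens + [word]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for cand in candidates[:beam_width]:
            if cand[1][-1] == "<EOS>":
                finished.append(cand)   # complete sentence: set it aside
            else:
                beams.append(cand)      # still growing: expand next step
        if not beams:
            break
    return max(finished, key=lambda c: c[0]) if finished else None

best_score, best_tokens = beam_search(beam_width=3)
```

Setting `beam_width=1` reduces this to greedy search, since only the single best continuation survives each step.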

6.6 Scoring in Beam Search

• To decide which sequences survive, compute total probability:

• P(y<1>, …, y<T>) = Π P(y<t> | y<1>, …, y<t-1>)

• In practice, use log probabilities to avoid underflow:

• score = Σ log P(y<t> | history)

• Length normalization:

o Without adjustment, shorter sequences tend to score higher, because every extra word adds another negative log-probability to the sum.

o Fix: Normalize by sequence length:

o normalized_score = (1/T^α) * Σ log P(y<t>)

▪ α (usually between 0.6–1.0) controls penalty strength.
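A small numeric comparison shows what the normalization changes. The token probabilities are invented, and α = 0.7 is just one value from the range mentioned above.

```python
import math

def normalized_score(token_probs, alpha=0.7):
    """Sum of log-probs divided by T**alpha, where T is the number of tokens."""
    T = len(token_probs)
    return sum(math.log(p) for p in token_probs) / (T ** alpha)

short = [0.5, 0.4]              # 2 tokens, made-up probabilities
long = [0.6, 0.6, 0.6, 0.6]     # 4 tokens, each individually more likely

raw_short = sum(math.log(p) for p in short)
raw_long = sum(math.log(p) for p in long)
norm_short = normalized_score(short)
norm_long = normalized_score(long)
```

On raw scores the short sequence wins simply because it has fewer terms; after dividing by T^α the longer sequence, whose individual words are more likely, comes out ahead.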

6.7 Example: Translation with Beam Width = 3

Sentence: French "Je suis étudiant" → English.

1. Step 1:

o Possible outputs:

▪ "I" (0.6), "It" (0.3), "He" (0.1).

o Keep top 3: "I", "It", "He".

2. Step 2:

o Expand "I":

▪ "am" (0.5), "study" (0.2), "student" (0.1).

o Expand "It":

▪ "is" (0.6), "was" (0.2).

o Expand "He":
▪ "is" (0.5), "studies" (0.3).

o Now compute cumulative log probs and keep top 3 candidates.

3. Step 3:

o Continue expansion until <EOS>.

o Best result: "I am a student" chosen over "I student" because of higher
normalized probability.
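Step 2 of this example can be worked out explicitly. The sketch below combines the step-1 and step-2 probabilities listed above into cumulative log-probabilities and keeps the top 3; the variable names are hypothetical.

```python
import math

first = {"I": 0.6, "It": 0.3, "He": 0.1}
second = {
    "I": {"am": 0.5, "study": 0.2, "student": 0.1},
    "It": {"is": 0.6, "was": 0.2},
    "He": {"is": 0.5, "studies": 0.3},
}

candidates = []
for w1, p1 in first.items():
    for w2, p2 in second[w1].items():
        candidates.append((math.log(p1) + math.log(p2), (w1, w2)))

# Highest cumulative log-probability first; keep beam width = 3.
top3 = [words for _, words in sorted(candidates, reverse=True)[:3]]
```

The survivors are "I am" (0.30), "It is" (0.18), and "I study" (0.12). Every continuation of "He" is dropped at this step.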

6.8 Greedy vs Beam Search

Method | Pros | Cons
Greedy Search | Fast, simple | Misses better sequences
Random Sampling | More diverse | Can generate nonsense
Temperature | Tunable creativity | Still random, not globally optimal
Beam Search | Balances quality & efficiency | Slower, requires beam width tuning

Sequence Models – Lecture 7: Practical Tips for RNNs

7.1 Handling Large Vocabularies

• Problem:

o Vocabulary sizes in natural language can be huge (50k–1M words).

o Softmax at each step requires computing probabilities for all words → very
expensive.

• Solutions:

1. Restrict Vocabulary:

▪ Keep only the N most frequent words (e.g., top 10,000).


▪ Replace all others with <UNK> (“unknown token”).

2. Subword Units (Byte-Pair Encoding, WordPiece):

▪ Break rare words into smaller chunks.

▪ Example:

▪ "unhappiness" → ["un", "happiness"]

▪ "happiness" → ["happi", "ness"]

▪ Keeps vocabulary small while still handling rare words.

3. Character-level Models:

▪ Instead of words, use characters as tokens.

▪ Pros: handles any word.

▪ Cons: sequences become longer, harder to train.

7.2 Dealing with Unknown Words

• Unknown token (<UNK>):

o Placeholder for out-of-vocabulary (OOV) words.

• Limitations:

o Model loses meaning if <UNK> replaces important words (e.g., "I am studying
<UNK>").

• Better approach:

o Subword modeling (so "TransformerXL" might split into "Transformer" + "XL" instead of <UNK>).

7.3 Softmax Efficiency Tricks

• Why needed: computing full softmax for vocab size V is O(V) per step.

• Tricks:

1. Hierarchical Softmax:

▪ Organize vocabulary in a tree structure.


▪ Predict path to word instead of full distribution.

▪ Reduces cost from O(V) → O(log V).

2. Sampling-based methods (e.g., Noise Contrastive Estimation):

▪ Train model to distinguish true word vs sampled “noise” words.

▪ Avoids computing full softmax during training.

7.4 Exploding and Vanishing Gradients (Practical Fixes)

• Exploding gradients:

o When gradient values grow too large.

o Fix: gradient clipping → cap values at a threshold.

• Vanishing gradients:

o Gradients shrink across many time steps.

o Fix: use advanced architectures (GRU, LSTM) instead of vanilla RNN.

7.5 Choosing Sequence Lengths

• Challenge:

o Real-world sentences can be very long.

o Training RNNs on entire long documents is inefficient.

• Strategy:

o Truncate sequences to manageable length (e.g., 20–50 tokens).

o Process longer texts in chunks/sliding windows.

o Example:

▪ Long review: "The movie was great but too long. Acting was fine,
music was excellent …"

▪ Split into smaller training chunks of ~30 tokens each.


7.6 Mini-batching for Sequences

• Why: GPUs process batches more efficiently.

• Problem: Sentences have variable lengths.

• Solution:

o Padding: Add <PAD> tokens to shorter sentences so all in batch have same
length.

o Masking: Ignore <PAD> tokens during loss computation.
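Padding and masking can be sketched on sequences of word indices. Using index 0 for <PAD> and averaging the loss over real tokens are assumptions of this sketch, as are the helper names.

```python
PAD = 0  # hypothetical index reserved for the <PAD> token

def pad_batch(sequences):
    """Right-pad every sequence to the longest length and build a matching 0/1 mask."""
    max_len = max(len(s) for s in sequences)
    padded = [s + [PAD] * (max_len - len(s)) for s in sequences]
    mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in sequences]
    return padded, mask

def masked_mean_loss(step_losses, mask):
    """Average per-token losses over real tokens only; padded positions contribute nothing."""
    total = sum(l * m for row_l, row_m in zip(step_losses, mask)
                for l, m in zip(row_l, row_m))
    count = sum(sum(row) for row in mask)
    return total / count

batch = [[5, 8, 2], [7], [4, 9]]
padded, mask = pad_batch(batch)
# Made-up per-token losses; 9.9 marks junk values at padded positions.
losses = [[1.0, 2.0, 3.0], [4.0, 9.9, 9.9], [5.0, 6.0, 9.9]]
mean_loss = masked_mean_loss(losses, mask)
```

The junk values at padded positions never reach the average, which is the whole purpose of the mask.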

7.7 Shuffling and Bucketing

• Naïve batching:

o Randomly shuffle sentences into batches.

o Problem: large padding waste if one very long sentence is grouped with very
short ones.

• Bucketing:

o Group sentences of similar length together.

o Reduces padding, improves training efficiency.
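The padding saved by bucketing can be counted on a toy example. Sorting by length and slicing into consecutive batches is one simple way to bucket, chosen here for illustration; the function names are hypothetical.

```python
def bucket_batches(sentences, batch_size):
    """Sort by length, then cut into consecutive batches so each holds similar lengths."""
    ordered = sorted(sentences, key=len)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]

def padding_needed(batch):
    """Number of <PAD> tokens required to bring every sentence to the batch's longest length."""
    longest = max(len(s) for s in batch)
    return sum(longest - len(s) for s in batch)

# Token lists of lengths 2, 9, 3, 10 (contents are placeholders).
sentences = [["w"] * n for n in (2, 9, 3, 10)]
naive = [sentences[0:2], sentences[2:4]]           # batches in arrival order
bucketed = bucket_batches(sentences, batch_size=2)

naive_padding = sum(padding_needed(b) for b in naive)
bucketed_padding = sum(padding_needed(b) for b in bucketed)
```

In this example the naive grouping needs 14 padding tokens and the bucketed grouping needs 2.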

7.8 Preprocessing Pipeline (Summary)

1. Collect dataset (raw text).

2. Tokenize (words/subwords/characters).

3. Build vocabulary (keep top N words, handle <UNK>).

4. Convert sentences → sequences of indices.

5. Apply padding + masking.

6. Organize into mini-batches (possibly using bucketing).


Sequence Models – Lecture 8: Gated Recurrent Unit (GRU)

8.1 Motivation for GRUs

• Problem with vanilla RNNs:

o They struggle to capture long-term dependencies.

o Gradients tend to vanish as they’re propagated through many time steps.

o Example:

▪ Sentence: “The cat … that lived in the house … was cute.”

▪ To predict “was”, the model needs to remember “cat” from much earlier.

▪ Vanilla RNN often forgets.

• Solution: Add gates to control what information to keep and what to forget.

8.2 GRU Architecture Overview

• GRU = RNN with gating mechanisms.

• Each hidden state update is controlled by two gates:

1. Update Gate (z) → decides how much of the past information to carry
forward.

2. Reset Gate (r) → decides how much of the past to forget when computing the
new candidate.

This makes the GRU capable of learning long-term memory.

8.3 GRU Equations (Step-by-Step)

For input x<t> at time t, previous hidden state h<t-1>:

1. Update Gate:

z<t> = σ(Wz · [h<t-1>, x<t>] + bz)

o σ = sigmoid (output between 0 and 1).

o If z ≈ 1 → keep old state.

o If z ≈ 0 → replace with new information.

2. Reset Gate:

r<t> = σ(Wr · [h<t-1>, x<t>] + br)

o Controls how much of the past h<t-1> influences the new candidate.

3. Candidate Hidden State:

h̃<t> = tanh(W · [r<t> * h<t-1>, x<t>] + b)

o Uses reset gate to decide which past parts to ignore before computing new content.

4. Final Hidden State:

h<t> = z<t> * h<t-1> + (1 - z<t>) * h̃<t>

o If z<t> is close to 1 → carry forward old state (h<t-1>).

o If z<t> is close to 0 → replace with candidate (h̃<t>).

o Smooth combination = memory control.
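The four equations above can be sketched as a single update function. The helper names and all weights are hypothetical; the update-gate biases are set to extreme values on purpose so that the two limiting cases (z near 1 and z near 0) are easy to see.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def affine(W, v, b):
    """Compute W v + b for a matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

def gru_step(h_prev, x, Wz, bz, Wr, br, W, b):
    """One GRU update following the four equations above."""
    hx = h_prev + x                                        # [h<t-1>, x<t>]
    z = [sigmoid(v) for v in affine(Wz, hx, bz)]           # update gate
    r = [sigmoid(v) for v in affine(Wr, hx, br)]           # reset gate
    gated = [ri * hi for ri, hi in zip(r, h_prev)] + x     # [r<t> * h<t-1>, x<t>]
    h_cand = [math.tanh(v) for v in affine(W, gated, b)]   # candidate state
    return [zi * hp + (1.0 - zi) * hc for zi, hp, hc in zip(z, h_prev, h_cand)]

# Toy sizes (hypothetical): 2 hidden units, 2 inputs, so each weight matrix is 2 x 4.
zeros = [[0.0] * 4 for _ in range(2)]
W = [[0.5, -0.5, 1.0, 0.0], [0.0, 0.5, 0.0, 1.0]]
h_prev, x = [0.9, -0.4], [1.0, 0.0]

# A large positive update-gate bias pushes z toward 1, so the old state is kept.
h_keep = gru_step(h_prev, x, zeros, [20.0, 20.0], zeros, [0.0, 0.0], W, [0.0, 0.0])
# A large negative bias pushes z toward 0, so the candidate replaces the old state.
h_replace = gru_step(h_prev, x, zeros, [-20.0, -20.0], zeros, [0.0, 0.0], W, [0.0, 0.0])
```

With z near 1 the state passes through almost unchanged, which is how a GRU can hold information across many steps.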

8.4 Intuition Behind GRU

• Update Gate (z):

o Like a leaky memory switch: decides “should I remember or overwrite?”.

• Reset Gate (r):

o Like a forget switch: decides “should I ignore past context when building
new memory?”.

• Together: GRU can keep relevant old information alive for many time steps.

8.5 Diagram (in words)

• Picture a flowchart:

o Input x<t> and previous hidden h<t-1> enter two “gate boxes” (z and r).
o Reset gate r controls what portion of h<t-1> passes into candidate
calculation.

o Update gate z blends between old state and candidate to form h<t>.

8.6 Example: Sentiment Analysis

Sentence: “The movie was not good.”

• At the word “not”:

o Reset gate ignores irrelevant earlier words.

o Candidate hidden state emphasizes “not”.

o Update gate ensures this negation strongly influences final hidden state.

• This allows GRU to correctly predict negative sentiment, where vanilla RNN might
forget “not”.

8.7 Advantages of GRU

• Mitigates the vanishing gradient problem much better than a vanilla RNN.

• Requires fewer parameters than LSTM (simpler design).

• Good performance on many NLP tasks (translation, speech recognition, sentiment analysis).

• Usually faster to train than LSTM.

8.8 GRU vs RNN

Feature | Vanilla RNN | GRU
Memory | Short-term only | Long-term possible
Vanishing Gradients | Severe | Mitigated with gates
Complexity | Simple | Moderate
Performance | Weak on long texts | Stronger on long dependencies


Sequence Models – Lecture 9: Long Short-Term Memory (LSTM)

9.1 Motivation for LSTMs

• Vanilla RNNs → suffer from vanishing/exploding gradients → can’t capture long-term dependencies.

• GRUs → use two gates (reset & update) to handle memory.

• But: GRUs may still be too simple for some complex dependencies.

LSTMs (proposed by Hochreiter & Schmidhuber, 1997, well before GRUs) use a separate memory cell and more gates for finer control.

9.2 LSTM Core Idea

• Maintain two types of memory:

1. Cell state (c<t>) → acts like a “conveyor belt” for long-term memory.

2. Hidden state (h<t>) → short-term working memory for immediate use.

• Gates decide what information to forget, update, and output.

9.3 LSTM Gates

Each gate is a sigmoid layer that outputs values between 0 and 1.

• 0 → “block everything”

• 1 → “let everything pass”

1. Forget Gate (f)

f<t> = σ(Wf · [h<t-1>, x<t>] + bf)

• Decides what fraction of the old cell state to forget.

• Example: In sentence “The cat … was cute”, the forget gate can drop irrelevant
earlier words.

2. Input Gate (i) and Candidate (c̃)

i<t> = σ(Wi · [h<t-1>, x<t>] + bi)

c̃<t> = tanh(Wc · [h<t-1>, x<t>] + bc)

• i<t> = controls how much new information to write.

• c̃<t> = candidate values that could be added to the cell state.

3. Cell State Update

c<t> = f<t> * c<t-1> + i<t> * c̃<t>

• Old memory is partially forgotten (f<t>).

• New candidate information is added (i<t> * c̃<t>).

4. Output Gate (o)

o<t> = σ(Wo · [h<t-1>, x<t>] + bo)

• Decides what part of the cell state contributes to hidden state.

5. Final Hidden State

h<t> = o<t> * tanh(c<t>)

• Hidden state is based on filtered (output gate) and transformed cell state.
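The gate and state equations of this section can be sketched as one update function. The parameter dictionary, helper names, and values are hypothetical. The biases are chosen so the forget gate stays near 1 and the input gate near 0, which lets the cell state ride through 100 steps almost untouched.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def affine(W, v, b):
    """Compute W v + b for a matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

def lstm_step(h_prev, c_prev, x, params):
    """One LSTM update: gates, cell-state update, then hidden state."""
    hx = h_prev + x                                                    # [h<t-1>, x<t>]
    f = [sigmoid(v) for v in affine(params["Wf"], hx, params["bf"])]   # forget gate
    i = [sigmoid(v) for v in affine(params["Wi"], hx, params["bi"])]   # input gate
    c_cand = [math.tanh(v) for v in affine(params["Wc"], hx, params["bc"])]
    o = [sigmoid(v) for v in affine(params["Wo"], hx, params["bo"])]   # output gate
    c = [fi * cp + ii * cc for fi, cp, ii, cc in zip(f, c_prev, i, c_cand)]
    h = [oi * math.tanh(ci) for oi, ci in zip(o, c)]
    return h, c

# Toy setup (hypothetical): 2 hidden units, 1 input, all weights zero so only biases act.
zeros = [[0.0] * 3 for _ in range(2)]
params = {
    "Wf": zeros, "bf": [20.0, 20.0],    # forget gate near 1: keep the old cell state
    "Wi": zeros, "bi": [-20.0, -20.0],  # input gate near 0: write nothing new
    "Wc": zeros, "bc": [0.0, 0.0],
    "Wo": zeros, "bo": [20.0, 20.0],    # output gate near 1: expose the cell state
}
h, c = [0.0, 0.0], [0.8, -0.3]
for _ in range(100):
    h, c = lstm_step(h, c, [1.0], params)
```

Because the cell state is updated by elementwise scaling and addition rather than a full matrix multiply followed by tanh, it can carry a value across many steps when the gates allow it.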

9.4 Intuition Behind the LSTM Flow

• Cell state c<t> = long-term conveyor belt → information can flow across many
steps with little change.

• Gates act like valves:

o Forget gate → decides what to erase.

o Input gate → decides what to add.

o Output gate → decides what to reveal.

Unlike GRU, LSTM separates what to remember and what to output.

9.5 Diagram (in words)

• Imagine a horizontal highway (cell state).


• Along the highway:

o Forget gate removes unneeded baggage.

o Input gate drops in new information.

o Output gate lets part of the memory show up as hidden state.

9.6 Example: Predicting Sentiment

Sentence: “The movie was not good.”

• At “not”:

o Input gate writes strong negative signal to cell state.

o Forget gate may keep earlier context but downplay irrelevant words.

o Output gate makes sure the hidden state reflects the negativity.

• Result: LSTM outputs “negative sentiment” more reliably than a vanilla RNN.

9.7 LSTM Variants

• Peephole LSTM:

o Gates also look directly at the cell state.

• Coupled Input-Forget Gate:

o Sometimes f = 1 - i to reduce parameters.

• Bi-directional LSTMs (BiLSTM):

o Process sequence both forward and backward → captures past + future context.

• Stacked LSTMs:

o Multiple LSTM layers stacked → deeper model with richer representations.

9.8 GRU vs LSTM Comparison


Feature     | GRU                     | LSTM
Gates       | 2 (update, reset)       | 3 (forget, input, output)
Cell state  | No explicit cell state  | Yes, explicit c<t>
Parameters  | Fewer (faster to train) | More (slower, heavier)
Performance | Good, simpler tasks     | Better on complex tasks
Popularity  | Used often              | Still widely used in NLP

9.9 Advantages of LSTMs

• Much better at capturing long-range dependencies.

• Can selectively remember important context.

• Powerhouse of NLP for years (before Transformers).

• Used in:

o Machine translation

o Speech recognition

o Text summarization

o Image captioning
Sequence Models – Lecture 10: Bidirectional RNNs (BiRNNs)

10.1 Motivation

• Standard RNNs (including GRUs/LSTMs) process input sequentially left → right.

• Problem:

o At time step t, the model only knows the past context (tokens before t).

o But in many tasks, future context is just as important.

• Example:

o Sentence: “I went to the bank to deposit money.”

o The word “bank” is ambiguous until you see the future word “money”.

o A standard RNN (left-to-right) might guess wrong.

Solution: Bi-directional RNNs — process the sequence both directions and combine.

10.2 Architecture

• Two RNNs are trained:

1. Forward RNN: processes sequence from left → right.

2. Backward RNN: processes sequence from right → left.

• At each time step t, the forward and backward hidden states are concatenated:

h<t> = [ h_forward<t> ; h_backward<t> ]

• This way, each h<t> encodes past + future context.
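
A minimal NumPy sketch of this concatenation, using vanilla tanh RNN cells. The helper names (rnn_pass, birnn), the sizes, and the random weights are assumptions for illustration; a practical BiRNN would usually use LSTM or GRU cells.

```python
import numpy as np

def rnn_pass(xs, Wx, Wh, b):
    """Run a vanilla tanh RNN over xs (T, n_x); return all hidden states (T, n_h)."""
    h = np.zeros(Wh.shape[0])
    states = []
    for x_t in xs:
        h = np.tanh(Wx @ x_t + Wh @ h + b)
        states.append(h)
    return np.stack(states)

def birnn(xs, fwd, bwd):
    h_fwd = rnn_pass(xs, *fwd)                # left -> right
    h_bwd = rnn_pass(xs[::-1], *bwd)[::-1]    # right -> left, flipped back to time order
    return np.concatenate([h_fwd, h_bwd], axis=1)   # h<t> = [h_forward<t> ; h_backward<t>]

def make_params(rng, n_x, n_h):
    # Hypothetical random weights: (Wx, Wh, b)
    return (rng.normal(scale=0.5, size=(n_h, n_x)),
            rng.normal(scale=0.5, size=(n_h, n_h)),
            np.zeros(n_h))

rng = np.random.default_rng(0)
T, n_x, n_h = 6, 3, 4
fwd, bwd = make_params(rng, n_x, n_h), make_params(rng, n_x, n_h)
xs = rng.normal(size=(T, n_x))
H = birnn(xs, fwd, bwd)   # shape (T, 2 * n_h)
```

Changing the last input leaves the forward half of earlier rows untouched but changes their backward half, which is exactly the "future context" the backward RNN contributes.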

10.3 Example

Sentence: “I read the book.”

• Forward RNN (left → right):

o Builds meaning step by step:


▪ “I” → subject,

▪ “I read” → action,

▪ “I read the” → incomplete,

▪ “I read the book” → full.

• Backward RNN (right → left):

o Starts from the end:

▪ “book” → object,

▪ “the book” → noun phrase,

▪ “read the book” → action phrase,

▪ “I read the book” → full.

• Combined hidden state = rich representation with both perspectives.

10.4 Applications of BiRNNs

1. Named Entity Recognition (NER)

o Example: In “I met John Smith in Paris”,

▪ Future words (“Smith”) help recognize “John” as a person.

2. Part-of-Speech (POS) Tagging

o Example: Word “flies” → verb in “time flies quickly” vs noun in “fruit flies like bananas”.

o Surrounding context, including the words that follow, disambiguates.

3. Speech Recognition

o Need full audio sequence (past + future) to interpret correctly.

4. Machine Translation

o Bidirectional encoder helps capture both left and right context before
translating.

10.5 Limitations of BiRNNs


• Not usable in real-time streaming tasks (e.g., live speech recognition, predictive
text).

o Because you must wait for the entire sequence to be available before using
the backward RNN.

• Higher computational cost than unidirectional RNNs.

10.6 BiRNN Variants

• Bi-GRU: uses GRUs instead of vanilla RNN cells.

• Bi-LSTM: the most common → combines LSTM’s long-term memory with bidirectionality.

10.7 Comparison: Uni vs Bi-directional RNN

Feature       | Uni-directional RNN | Bi-directional RNN
Context used  | Past only           | Past + future
Real-time use | Yes                 | No (needs full sequence)
Accuracy      | Moderate            | Higher (esp. in NLP tasks)
Complexity    | Lower               | Higher (double parameters)


Sequence Models – Lecture 11: Deep RNNs

11.1 Motivation

• Neural networks in computer vision (CNNs) → deeper models = better representations.

• Same idea applies to RNNs: stacking multiple recurrent layers lets the model:

o Capture low-level features in lower layers.

o Capture high-level features in higher layers.

Example:

• In speech recognition:

o Lower layers detect raw phonemes/sounds.

o Higher layers capture words, grammar, and meaning.

11.2 Architecture of a Deep RNN

• Instead of one hidden state sequence, we stack multiple RNN layers:

Single-layer RNN:

x<t> → RNN → h<t> → y<t>

Multi-layer (Deep) RNN:

x<t> → RNN1 → h1<t> → RNN2 → h2<t> → … → RNNk → hk<t> → y<t>

• Each layer passes its hidden state sequence to the next layer.

• At the top: final output predictions are made.
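
The stacking can be sketched in a few lines of NumPy: each layer consumes the full hidden-state sequence of the layer below. The function names, the choice of tanh cells, and the sizes are assumptions for illustration only.

```python
import numpy as np

def rnn_layer(xs, Wx, Wh, b):
    """One recurrent layer: maps a sequence (T, in_dim) to hidden states (T, n_h)."""
    h = np.zeros(Wh.shape[0])
    out = []
    for x_t in xs:
        h = np.tanh(Wx @ x_t + Wh @ h + b)
        out.append(h)
    return np.stack(out)

def deep_rnn(xs, layers):
    """x<t> -> RNN1 -> h1<t> -> RNN2 -> h2<t> -> ... -> hk<t>"""
    seq = xs
    for Wx, Wh, b in layers:
        seq = rnn_layer(seq, Wx, Wh, b)   # output sequence feeds the next layer
    return seq

# Hypothetical sizes: k = 3 layers, each with n_h = 4 hidden units.
rng = np.random.default_rng(0)
T, n_x, n_h, k = 5, 3, 4, 3
layers, in_dim = [], n_x
for _ in range(k):
    layers.append((rng.normal(scale=0.5, size=(n_h, in_dim)),
                   rng.normal(scale=0.5, size=(n_h, n_h)),
                   np.zeros(n_h)))
    in_dim = n_h   # layers above the first read n_h-dimensional hidden states
top = deep_rnn(rng.normal(size=(T, n_x)), layers)   # hk<t> for every t
```

Only the first layer sees the raw inputs, so its input weight matrix has a different shape from those of the layers above it.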

11.3 Example (2-layer LSTM)

Sentence: “The dog barked loudly.”

1. Layer 1 (bottom LSTM):

o Focuses on local word-level dependencies.


o Hidden states capture phrase structure like “dog barked”.

2. Layer 2 (top LSTM):

o Builds on layer 1 outputs.

o Captures broader meaning: subject (“dog”), action (“barked”), modifier (“loudly”).

11.4 Why Go Deep?

• Advantages:

1. Better representations → deeper abstractions.

2. Ability to handle more complex patterns (syntax, semantics, context).

3. Improved accuracy in NLP tasks (translation, speech, captioning).

• Drawbacks:

1. Training gets harder (vanishing gradients even with LSTM/GRU).

2. Slower, more computationally expensive.

3. Risk of overfitting on small datasets.

11.5 Training Deep RNNs

• Practical techniques to make deep RNNs work:

1. Residual Connections / Skip Connections

▪ Pass input directly to higher layers to ease training.

2. Dropout Regularization

▪ Randomly drop units during training → prevents overfitting.

▪ Applied between RNN layers.

3. Layer Normalization

▪ Normalizes hidden states → stabilizes training.

4. Careful Initialization
▪ Initialize weights well to avoid exploding/vanishing gradients.

11.6 Applications of Deep RNNs

• Speech Recognition:

o Deep RNNs capture low-level audio features → high-level words.

• Machine Translation:

o Deep LSTM encoders/decoders → better sentence representations.

• Text Generation:

o Deeper models can generate more coherent, contextually rich text.

• Music Generation:

o Deep RNNs model both rhythm (short-term) and melody/harmony (long-term).

11.7 Shallow vs Deep RNNs

Feature              | Shallow RNN (1 layer) | Deep RNN (multi-layer)
Representation Power | Limited               | Higher (multi-level features)
Training Ease        | Easier                | Harder (needs tricks)
Speed                | Faster                | Slower
Accuracy             | Moderate              | Better on complex tasks


Sequence Models – Lecture 12: Encoder–Decoder Models (Seq2Seq)

12.1 Motivation

• Many sequence tasks involve input and output of different lengths.

o Examples:

▪ Machine Translation: English sentence → French sentence.

▪ Speech Recognition: Audio sequence → text.

▪ Text Summarization: Long document → short summary.

• Problem with vanilla RNNs:

o In the basic many-to-many setup, they emit one output per input step.

o Not flexible when output length ≠ input length.

Solution: Encoder–Decoder architecture.

12.2 Architecture Overview

• Encoder: Reads the input sequence and compresses it into a context vector
(sometimes called a “thought vector”).

• Decoder: Takes the context vector and generates the output sequence, step by
step.

Step-by-step:

1. Input sequence: x<1>, x<2>, …, x<Tx>

2. Encoder processes and outputs hidden states h<1>…h<Tx>.

3. Last hidden state h<Tx> (or combination) = context vector.

4. Decoder starts with context vector → generates y<1>, y<2>, …, y<Ty>.

12.3 Encoder (Details)

• Typically an RNN/GRU/LSTM.
• Reads the sequence token by token.

• Final hidden state summarizes the entire input.

Example:

• Input = “I love dogs”.

• Encoder produces context vector (compressed meaning).

12.4 Decoder (Details)

• Another RNN/GRU/LSTM.

• At each time step:

o Takes previous hidden state, previous output word, and context vector.

o Outputs next word probability.

Equation form (simplified):

s<t> = RNN(s<t-1>, y<t-1>, context)

y<t> = softmax(W · s<t>)

12.5 Training the Seq2Seq Model

• Teacher Forcing is commonly used:

o During training, the decoder receives the true previous word (ground truth),
not its own predicted word.

o Example:

▪ Input: “I love dogs” → “J’adore les chiens”.

▪ At step 1, decoder gets <START> → outputs “J’”.

▪ At step 2, the decoder is fed the ground-truth previous token “J’” (whatever it actually predicted at step 1) and should output “adore”.

• This speeds up learning and stabilizes training.
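
Teacher forcing amounts to shifting the target sequence by one position: the decoder input at each step is the true previous token, and the training target is the true current token. The sketch below only builds those pairs; the function name and token strings are assumptions for illustration.

```python
def teacher_forcing_pairs(target_tokens, start="<START>", end="<EOS>"):
    """Return (token fed to decoder, token it should predict) for each step."""
    decoder_inputs = [start] + target_tokens    # ground truth, shifted right
    decoder_targets = target_tokens + [end]     # what each step should output
    return list(zip(decoder_inputs, decoder_targets))

pairs = teacher_forcing_pairs(["J'", "adore", "les", "chiens"])
for step, (fed, expected) in enumerate(pairs, start=1):
    print(f"step {step}: fed {fed!r:10} -> should predict {expected!r}")
```

The model's own predictions never appear on the input side during training, which is why an early mistake cannot derail the rest of the sequence.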

12.6 Inference (Testing)


• At test time: ground truth is not available.

• The decoder must use its own predictions from previous steps as input.

• Process continues until:

o End-of-sequence token <EOS> is generated, or

o Maximum output length reached.
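
The inference loop can be sketched as greedy decoding: feed back the model's own last prediction until <EOS> or a length limit. The decoder here is a hypothetical lookup table standing in for a trained network; only the loop structure is the point.

```python
def greedy_decode(next_token_fn, context, start="<START>", end="<EOS>", max_len=20):
    """Generate tokens one at a time, feeding each prediction back in."""
    tokens = [start]
    while len(tokens) - 1 < max_len:           # stop at maximum output length
        nxt = next_token_fn(context, tokens)   # uses its own previous outputs
        if nxt == end:                         # stop at end-of-sequence token
            break
        tokens.append(nxt)
    return tokens[1:]                          # drop <START>

# Hypothetical stand-in for a trained decoder: maps last token -> next token.
TOY_TABLE = {"<START>": "Je", "Je": "suis", "suis": "étudiant", "étudiant": "<EOS>"}

def toy_next_token(context, tokens):
    return TOY_TABLE[tokens[-1]]

print(greedy_decode(toy_next_token, context=None))
```

A real decoder would pick the argmax of the softmax output at each step; beam search is a common alternative that keeps several candidate sequences instead of one.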

12.7 Example: Machine Translation

• Input: “I am a student.”

• Encoder compresses into vector.

• Decoder generates: “Je suis étudiant.”

Even though English (4 words) and French (3 words) have different lengths, the model
works.

12.8 Challenges of Basic Encoder–Decoder

1. Information Bottleneck:

o All input meaning must be squeezed into one vector.

o Long sentences = hard to summarize into a single vector.

o Example: “The book I borrowed from the library last week … was amazing” →
too much info in one vector.

2. Long-term dependencies:

o Early parts of input may be lost by the time the encoder finishes.

o Translation suffers on long sequences.

Later solution: Attention Mechanism (coming soon).

12.9 Variants of Encoder–Decoder

• Bi-directional Encoder: uses BiLSTM to encode richer context.


• Stacked Encoder–Decoder: multiple layers for deeper representation.

• Attention-based Encoder–Decoder: context vector is dynamic (not fixed) → solves bottleneck.

12.10 Applications

• Machine Translation (Google Translate, DeepL).

• Text Summarization (news → headline).

• Speech Recognition (audio → text).

• Image Captioning (image features encoded → text decoded).

• Dialogue Systems / Chatbots.

12.11 Encoder–Decoder in Action (Walkthrough)

Let’s walk through “I love dogs” → “J’adore les chiens”:

1. Encoding:

o “I” → hidden state h1.

o “love” → h2.

o “dogs” → h3.

o Final context vector = h3.

2. Decoding:

o <START> + context → “J’”.

o “J’” + context → “adore”.

o “adore” + context → “les”.

o “les” + context → “chiens”.

o End at <EOS>.

Sequence Models – Lecture 13: Attention Mechanism


13.1 Motivation

• Recall: Encoder–Decoder has a bottleneck.

o Entire input sequence → compressed into one fixed context vector.

o Works okay for short sentences.

o Fails on long or complex ones.

Example:

• Input: “The book I borrowed from the library last week, which was recommended by
my professor, turned out to be amazing.”

• Too much detail for one vector to capture → translation or summarization becomes
poor.

Attention fixes this by letting the decoder look at different parts of the input when
needed.

13.2 Core Idea

• Instead of one fixed context vector, compute a weighted sum of all encoder hidden
states.

• Weights = how relevant each input word is to the current output step.

Equation form:

context<t> = Σ (α<t, i> * h<i>)

• h<i> = hidden state of encoder at input position i.

• α<t, i> = attention weight → how much decoder step t should “pay attention” to input
i.

13.3 Attention Weights

• Attention weight is computed by comparing:

o Previous decoder state s<t-1> (the state available just before output step t)

o With each encoder hidden state h<i>


Alignment Score:

score(s<t-1>, h<i>)

• Measures how well input at position i matches output at position t.

Softmax Normalization:

α<t, i> = exp(score(s<t-1>, h<i>)) / Σ_j exp(score(s<t-1>, h<j>))

• The softmax runs over all input positions, so for each output step t the weights sum to 1 (like probabilities).

13.4 Types of Scoring Functions

• Dot Product:

• score(s, h) = s · h

• General (with weight matrix):

• score(s, h) = sᵀ W h

• Concat + Feedforward:

• score(s, h) = vᵀ tanh(W[s; h])
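
The three scoring functions can be written side by side in NumPy. The dimensions d and d_a and the random parameters W_gen, W_cat, and v are assumptions for the example; in a real model they are learned.

```python
import numpy as np

def score_dot(s, h):
    return s @ h                                      # s · h

def score_general(s, h, W):
    return s @ W @ h                                  # sᵀ W h

def score_concat(s, h, W, v):
    return v @ np.tanh(W @ np.concatenate([s, h]))    # vᵀ tanh(W[s; h])

# Hypothetical sizes and random parameters.
rng = np.random.default_rng(0)
d, d_a = 4, 5
s, h = rng.normal(size=d), rng.normal(size=d)
W_gen = rng.normal(size=(d, d))
W_cat, v = rng.normal(size=(d_a, 2 * d)), rng.normal(size=d_a)

scores = (score_dot(s, h), score_general(s, h, W_gen), score_concat(s, h, W_cat, v))
```

The dot product is the special case of the general form with W set to the identity, and it requires s and h to have the same dimension. The other two forms do not.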

13.5 Decoder with Attention

At time step t:

1. Compute attention weights α<t, i>.

2. Create context vector context<t> = Σ α<t, i> * h<i>.

3. Use context vector + decoder state to generate output word.
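
Steps 1 and 2 can be sketched with dot-product scoring. The encoder states H and decoder state s_prev below are made-up numbers chosen so the result is easy to follow.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # subtract max for numerical stability
    return e / e.sum()

def attention_context(s_prev, H):
    """s_prev: decoder state (d,). H: encoder hidden states (Tx, d)."""
    scores = H @ s_prev        # one dot-product score per input position i
    alpha = softmax(scores)    # attention weights, sum to 1
    context = alpha @ H        # weighted sum of encoder states
    return alpha, context

# Toy values: positions 0 and 2 align with s_prev, position 1 does not.
H = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
s_prev = np.array([2.0, 0.0])
alpha, context = attention_context(s_prev, H)
```

Positions 0 and 2 receive equal, larger weights than position 1, so the context vector is dominated by the encoder states that match the decoder state.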

13.6 Intuition (Word-by-Word Example)

Sentence: “I love dogs” → “J’adore les chiens”

• At output step “J’”:

o Decoder pays attention mostly to “I”.

• At output step “adore”:


o Decoder pays attention mostly to “love”.

• At output step “chiens”:

o Decoder pays attention mostly to “dogs”.

Attention dynamically shifts focus instead of relying on one compressed vector.

13.7 Visualizing Attention

• Attention can be visualized as a matrix (heatmap):

o Rows = output words.

o Columns = input words.

o Each cell = weight (α) → how much attention output word pays to input word.

Example (English → French translation heatmap):

       | I    | love | dogs
J’     | 0.9  | 0.05 | 0.05
adore  | 0.1  | 0.85 | 0.05
les    | 0.05 | 0.05 | 0.9
chiens | 0.05 | 0.05 | 0.9

13.8 Benefits of Attention

1. Solves bottleneck problem → no more single-vector compression.

2. Better long-sequence performance.

3. Interpretable → can see which input words the model attends to.

4. General-purpose → used in translation, summarization, speech, vision, etc.

13.9 Variants of Attention

• Bahdanau Attention (Additive Attention)

o Uses feedforward network to compute scores.


• Luong Attention (Multiplicative Attention)

o Uses dot-product scoring (faster).

• Self-Attention (next lecture → Transformers)

o Words attend to other words in the same sequence.

13.10 Applications

• Machine Translation: More accurate translations.

• Summarization: Focus on key sentences.

• Image Captioning: Attend to regions in image while generating words.

• Speech Recognition: Focus on relevant frames of audio.

13.11 Walkthrough Example

Sentence: “The cat sat on the mat.”

• Decoder output step: generating “chat” (French for “cat”).

• Attention focuses strongly on “cat”.

• Next step: generating “tapis” (mat).

• Attention shifts to “mat”.

The model is like a human translator — doesn’t memorize entire sentence, but looks at
relevant parts as needed.
Sequence Models – Lecture 14: Self-Attention and Transformers

14.1 Motivation

• RNNs/GRUs/LSTMs process sequences sequentially.

o Hard to parallelize (must go step by step).

o Struggle with very long-range dependencies.

• Attention (from Lecture 13) solved the bottleneck, but still used RNN
encoders/decoders.

Self-Attention eliminates RNNs entirely, allowing models to capture relationships in one step and train in parallel.
This idea led to the Transformer architecture (Vaswani et al., 2017).

14.2 What is Self-Attention?

• In normal attention: decoder attends to encoder states.

• In self-attention: each word in a sequence attends to all other words in the same
sequence.

Each word builds a representation based on its context (before and after).

14.3 Queries, Keys, and Values

The mechanism is inspired by information retrieval:

• Query (Q): what I’m looking for.

• Key (K): labels for all stored information.

• Value (V): actual content/information.

Process

1. Each input word embedding → projected into Q, K, V vectors (using learned weight
matrices).

2. Compute similarity between a word’s query and all words’ keys.


3. Use similarities (after softmax) as weights to take a weighted sum of values.

Equation:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

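• √d_k = scaling factor: dividing by the square root of the key dimension d_k keeps the dot products from growing with d_k, which would otherwise push the softmax into regions with very small gradients.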
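
A minimal NumPy sketch of the process and equation above, for a single sequence and a single head. The sizes and the random projection matrices Wq, Wk, Wv are assumptions for illustration; in a trained model they are learned.

```python
import numpy as np

def softmax_rows(Z):
    Z = Z - Z.max(axis=-1, keepdims=True)   # numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=-1, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """X: (n, d_model) word embeddings. Returns outputs (n, d_k) and weights (n, n)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                  # step 1: project into Q, K, V
    d_k = K.shape[-1]
    weights = softmax_rows(Q @ K.T / np.sqrt(d_k))    # step 2: scaled similarities
    return weights @ V, weights                       # step 3: weighted sum of values

# Hypothetical sizes: 5 words, model dimension 8, key dimension 4.
rng = np.random.default_rng(0)
n, d_model, d_k = 5, 8, 4
X = rng.normal(size=(n, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out, weights = self_attention(X, Wq, Wk, Wv)
```

Row i of weights says how much word i attends to every word in the sequence, including itself.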

14.4 Example of Self-Attention

Sentence: “The animal didn’t cross the street because it was too tired.”

• Word: “it” → should refer to “animal”, not “street”.

• Self-attention:

o Query for “it” compares with keys of all words.

o Highest similarity is with “animal”.

o Output embedding for “it” incorporates info from “animal”.

Model resolves pronouns, dependencies, etc., even across long distances.

14.5 Multi-Head Attention

• Instead of one attention function, use multiple heads in parallel.

• Each head learns to capture different types of relationships:

o Head 1: subject–verb relations.

o Head 2: coreference (it ↔ animal).

o Head 3: positional info, etc.

Equation:

MultiHead(Q,K,V) = Concat(head1, head2, …, headh) W_o
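
One common way to realize this equation is to project once to d_model dimensions and then split the result into heads. The sketch below follows that layout; the function name, sizes, and random weights are assumptions for illustration.

```python
import numpy as np

def softmax_rows(Z):
    Z = Z - Z.max(axis=-1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=-1, keepdims=True)

def multi_head_self_attention(X, Wq, Wk, Wv, Wo, n_heads):
    n, d_model = X.shape
    d_head = d_model // n_heads   # assumes d_model is divisible by n_heads

    def split_heads(M):
        # (n, d_model) -> (n_heads, n, d_head)
        return M.reshape(n, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ Wq), split_heads(X @ Wk), split_heads(X @ Wv)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)      # (n_heads, n, n)
    heads = softmax_rows(scores) @ V                         # (n_heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)    # Concat(head1, ..., headh)
    return concat @ Wo                                       # final projection W_o

# Hypothetical sizes and random weights.
rng = np.random.default_rng(0)
n, d_model, n_heads = 5, 8, 2
X = rng.normal(size=(n, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) for _ in range(4))
Y = multi_head_self_attention(X, Wq, Wk, Wv, Wo, n_heads)
```

Each head works in a smaller subspace (d_head = d_model / n_heads), so the total cost is similar to one full-width head while allowing different heads to learn different attention patterns.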

14.6 Transformer Encoder

The Transformer encoder consists of repeated blocks:

1. Multi-Head Self-Attention Layer


o Each token attends to all others.

2. Feedforward Neural Network (FFN)

o Position-wise fully connected layers.

3. Residual Connections + Layer Normalization

o Help stabilize training.

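4. Positional Encoding

o Since self-attention has no notion of order, position embeddings (sine/cosine in the original Transformer) are added to the word vectors once, at the input to the first block, rather than inside every block.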
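
A sketch of the sine/cosine position embeddings, following the formula in Vaswani et al. (2017): even dimensions use sine, odd dimensions use cosine, with wavelengths set by the constant 10000. The function name and the sizes are assumptions, and d_model is assumed to be even.

```python
import numpy as np

def sinusoidal_positions(n_positions, d_model):
    """Return a (n_positions, d_model) matrix of fixed position encodings."""
    pos = np.arange(n_positions)[:, None]             # (n_positions, 1)
    two_i = np.arange(0, d_model, 2)[None, :]         # even dimension indices 0, 2, 4, ...
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions
    return pe

pe = sinusoidal_positions(50, 16)
# In use: embeddings_with_position = word_embeddings + pe[:sequence_length]
```

Position 0 encodes as alternating 0 and 1 (sin 0 and cos 0), and every other position gets a distinct pattern that the attention layers can use to recover order.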

14.7 Transformer Decoder

• Similar to encoder, but with extra step:

1. Masked Multi-Head Self-Attention

o Prevents attending to future tokens (important for generation).

2. Encoder–Decoder Attention

o Decoder attends to encoder outputs.

3. Feedforward + Normalization

This allows autoregressive generation (one token at a time).
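
The masking in step 1 can be sketched by setting the scores of future positions to minus infinity before the softmax, so their weights become exactly zero. The function name and the all-zero toy scores are assumptions for illustration.

```python
import numpy as np

def causal_attention_weights(scores):
    """scores: (n, n) raw attention scores. Row t may only attend to columns <= t."""
    n = scores.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)   # True above the diagonal
    masked = np.where(future, -np.inf, scores)           # block future positions
    masked = masked - masked.max(axis=-1, keepdims=True)
    e = np.exp(masked)                                   # exp(-inf) = 0
    return e / e.sum(axis=-1, keepdims=True)

# With equal scores, each row spreads its weight evenly over the allowed positions.
weights = causal_attention_weights(np.zeros((4, 4)))
```

The first token can only attend to itself, while the last token attends to the whole prefix. The same mask is applied during training so that all positions can be computed in parallel without leaking future tokens.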

14.8 Benefits of Transformers

1. Parallelization

o Unlike RNNs, all words processed at once.

o Training speed massively improved.

2. Long-Range Dependencies

o Any word can directly attend to any other word.

3. Scalability

o Transformers scale well to billions of parameters.


4. State of the Art

o Almost all modern NLP (and even vision, speech) uses Transformers.

14.9 Applications

• Language Models: GPT, BERT, T5.

• Machine Translation: Google Translate (post-2017).

• Summarization & QA.

• Speech Recognition.

• Computer Vision (ViT – Vision Transformer).

• Multimodal AI (text+image, text+speech).

14.10 Example: Translation with Transformer

Input: “I love dogs.”

• Encoder: self-attention builds contextual embeddings.

• Decoder: attends to encoder while generating “J’adore les chiens.”

• Each output step is influenced by all input words + previous outputs.

14.11 Summary of Key Equations

• Scaled Dot-Product Attention:

• Attention(Q,K,V) = softmax(QKᵀ / √d_k) V

• Multi-Head Attention:

• MultiHead(Q,K,V) = Concat(head1, …, headh) W_o

• Transformer Layer =

• x → MultiHead → Add+Norm → FFN → Add+Norm


Sequence Models – Lecture 15: Transformer Applications & Extensions

15.1 Recap: Transformer Core

• Transformer = stack of encoder and/or decoder layers with:

o Multi-Head Self-Attention

o Feedforward layers

o Residual + Layer Normalization

o Positional Encoding

• Depending on which part (encoder/decoder) is used, we get different architectures.

15.2 Encoder-Only Models

These use just the encoder.

• Input sequence → contextualized embeddings → task-specific head.

Example: BERT (Bidirectional Encoder Representations from Transformers)

• Key Idea: Pre-train on massive text, then fine-tune for tasks.

• Training Objectives:

1. Masked Language Modeling (MLM):

▪ Randomly mask 15% of words, model predicts them.

▪ Example: “The cat sat on the ___.” → model fills “mat”.

2. Next Sentence Prediction (NSP):

▪ Model predicts whether two sentences follow each other.

• Strengths:

o Powerful bidirectional representations.

o Excellent for classification tasks: sentiment analysis, NER, QA.
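
The masked language modeling objective above can be sketched as a data-preparation step. This is a simplification: the function name and the "[MASK]" string are assumptions for the example, and BERT itself replaces only most of the selected tokens with the mask symbol (some are swapped for random tokens or left unchanged).

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Return (masked input, labels). Labels are None where no prediction is needed."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)   # hide the token from the model
            labels.append(tok)          # the model must recover it
        else:
            masked.append(tok)
            labels.append(None)         # position is ignored in the loss
    return masked, labels

masked, labels = mask_tokens("the cat sat on the mat".split())
```

Because the model sees words on both sides of each mask, the objective encourages the bidirectional representations listed under Strengths.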

15.3 Decoder-Only Models


These use just the decoder, with causal (masked) self-attention.

• At step t, each word only attends to previous words (no peeking ahead).

Example: GPT (Generative Pretrained Transformer)

• Key Idea: Predict the next word (language modeling).

• Training Objective:

o Autoregressive prediction:

o P(w1, w2, …, wn) = Π P(wt | w1, …, w(t-1))

• Strengths:

o Amazing text generation.

o Powers ChatGPT, Copilot, etc.

o Used for creative tasks, dialogue, code generation.
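
The autoregressive factorization above means the probability of a sequence is the product of next-token probabilities, usually accumulated as a sum of logs. The bigram table below is a hypothetical toy model, used only so the arithmetic can be checked by hand.

```python
import math

def sequence_log_prob(tokens, cond_prob):
    """log P(w1..wn) = sum over t of log P(wt | w1..w(t-1))."""
    total = 0.0
    for t, tok in enumerate(tokens):
        total += math.log(cond_prob(tok, tokens[:t]))
    return total

# Hypothetical toy model that conditions only on the previous token.
TOY = {("<s>", "the"): 0.5, ("the", "cat"): 0.2, ("cat", "sat"): 0.4}

def toy_cond_prob(tok, history):
    prev = history[-1] if history else "<s>"
    return TOY[(prev, tok)]

lp = sequence_log_prob(["the", "cat", "sat"], toy_cond_prob)
# exp(lp) equals 0.5 * 0.2 * 0.4 = 0.04
```

Training a decoder-only model maximizes exactly this quantity over the training corpus, with a Transformer in place of the lookup table.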

15.4 Encoder–Decoder Models

These use both encoder + decoder.

• Encoder → builds representation of input.

• Decoder → generates output sequence.

Example: T5 (Text-to-Text Transfer Transformer)

• Treats every NLP task as text in → text out.

• Examples:

o Translation: “Translate English to French: The book is old.” → “Le livre est
vieux.”

o Summarization: “Summarize: The book was…” → “The book was old.”

o Classification: “Is this review positive or negative? The movie was great.” →
“Positive”.

Example: BART

• Pre-trained with denoising autoencoder tasks (corrupt input → reconstruct).

• Great for summarization and text generation.


15.5 Vision Transformers (ViT)

• Transformer applied to images.

• Idea: treat an image as a sequence of patches.

o Example: 224x224 image → split into patches of 16x16 pixels → 14 x 14 = 196 tokens.

• Patches → linear embeddings + positional encodings → Transformer.

• Matches or outperforms CNNs when pre-trained on very large datasets; with less training data, CNNs often remain competitive.
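
The patch arithmetic can be checked with a short NumPy sketch that cuts an image into non-overlapping 16x16-pixel patches and flattens each one. The function name and the 3-channel layout (height, width, channels) are assumptions for illustration.

```python
import numpy as np

def patchify(image, patch=16):
    """image: (H, W, C). Returns (num_patches, patch * patch * C)."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0, "image size must be a multiple of patch"
    gh, gw = H // patch, W // patch               # patch grid, e.g. 14 x 14
    x = image.reshape(gh, patch, gw, patch, C)
    x = x.transpose(0, 2, 1, 3, 4)                # (gh, gw, patch, patch, C)
    return x.reshape(gh * gw, patch * patch * C)  # one flattened row per patch

image = np.random.default_rng(0).normal(size=(224, 224, 3))
tokens = patchify(image)   # shape (196, 768)
```

Each 768-dimensional row would then go through a learned linear layer and receive a positional encoding before entering the Transformer.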

15.6 Multimodal Transformers

• Combine different input types (text, image, audio).

• Examples:

o CLIP (Contrastive Language-Image Pretraining): Aligns images with captions.

o DALL·E: Generates images from text prompts.

o Whisper: Speech → text (transcription).

15.7 Extensions of Transformers

1. Efficient Transformers (deal with long sequences):

o Sparse Attention, Linformer, Longformer.

o Reduce O(n²) cost of self-attention.

2. Pretraining + Fine-tuning Paradigm:

o Pretrain on huge corpora → fine-tune on task-specific data.

3. Prompting & Instruction Tuning:

o Instead of fine-tuning, steer model via text prompts.

o Instruction-tuned models (like ChatGPT) trained to follow human intent.


15.8 Why Transformers Dominate

• Scalable: parallel training across GPUs.

• Versatile: text, images, speech, protein folding (AlphaFold).

• Generalizable: one architecture across domains.

• Interpretability: attention weights offer some insight into reasoning.

15.9 Summary Table

Model Type      | Example      | Architecture Used   | Task Focus
Encoder-only    | BERT         | Transformer Encoder | Understanding (classification, QA)
Decoder-only    | GPT          | Transformer Decoder | Generation (dialogue, writing, code)
Encoder-Decoder | T5, BART     | Full Transformer    | Seq2Seq (translation, summarization)
Vision-based    | ViT          | Transformer Encoder | Computer Vision
Multimodal      | CLIP, DALL·E | Mixed               | Cross-domain (text+image, speech+text)

Sequence Models – Lecture 16: Training Large Transformers

16.1 Motivation

• Training Transformers is computationally expensive:

o Billions of parameters.

o Massive datasets (terabytes of text).

o Long training times (weeks/months on supercomputers).

• To make them work in practice, researchers use:

1. Optimization tricks

2. Hardware strategies

3. Scaling principles

16.2 Optimization Tricks

(a) Weight Initialization

• Poor initialization → unstable training.

• Transformers typically use Xavier/Glorot or He initialization.

• Helps maintain stable variance in activations across layers.

(b) Learning Rate Scheduling

• Transformers often use a warmup schedule:

o Start with small learning rate (avoid divergence).

o Gradually increase → then decay.

Equation (common in “Attention is All You Need”):

lr = d_model^(-0.5) * min(step^-0.5, step * warmup^-1.5)

• d_model = embedding dimension.

• warmup = number of steps to ramp up.
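
The schedule can be written directly from the equation. The default values d_model = 512 and warmup = 4000 are the ones reported for the base model in “Attention is All You Need”; the function name is an assumption.

```python
def transformer_lr(step, d_model=512, warmup=4000):
    """Learning rate: linear warmup for `warmup` steps, then inverse-sqrt decay."""
    step = max(step, 1)   # avoid 0 ** -0.5 at the very first step
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

for step in (1, 1000, 4000, 16000):
    print(step, transformer_lr(step))
```

The two terms inside min are equal exactly at step = warmup, which is where the learning rate peaks. Before that point it grows linearly, and afterwards it decays proportionally to 1/√step.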

(c) Gradient Clipping


• Avoids exploding gradients by capping gradient norm.
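
A minimal sketch of clipping by global norm: if the combined norm of all gradients exceeds a threshold, every gradient is rescaled by the same factor so the direction is preserved. The function name and the toy gradients are assumptions for illustration.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm."""
    total = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))   # never scale up
    return [g * scale for g in grads], total

# Toy gradients with joint norm sqrt(3^2 + 4^2) = 5.
grads = [np.array([3.0, 0.0]), np.array([[0.0, 4.0]])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
```

Gradients whose norm is already below the threshold pass through unchanged.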

(d) Dropout + Regularization

• Dropout in attention weights and feedforward layers prevents overfitting.

16.3 Training Data

• Large models trained on web-scale datasets:

o Books, Wikipedia, news, social media, code.

• Data preprocessing:

o Tokenization (often subword units like Byte Pair Encoding or SentencePiece).

o Filtering noisy or duplicated text.

• Quality matters: “Garbage in → garbage out.”

16.4 Hardware and Parallelization

(a) Data Parallelism

• Dataset split across GPUs.

• Each GPU processes a mini-batch → gradients averaged.

(b) Model Parallelism

• Model layers/parameters split across GPUs.

• Useful when model is too large for one GPU.

(c) Pipeline Parallelism

• Layers divided into stages, each stage on a different GPU.

• Forward pass flows like an assembly line.

(d) Mixed Precision Training

• Run most operations in float16 or bfloat16 (half precision) while keeping a float32 copy of the weights for the updates.

• Reduces memory + speeds up training.

• Dynamic loss scaling ensures stability.


16.5 Scaling Laws

Researchers (OpenAI, DeepMind, etc.) found empirical scaling laws:

• Performance improves predictably as we scale:

o Model size (# parameters) ↑

o Dataset size ↑

o Compute power ↑

• But returns diminish unless all three are balanced.

Rule of thumb:
Bigger models need more data and compute to reach full potential.

16.6 Checkpointing & Distributed Training

• Checkpointing: Save model states periodically (to resume after crashes).

• Distributed Optimizers:

o Adam / AdamW commonly used.

o Optimizers need to sync across GPUs → communication overhead.

16.7 Challenges in Training Large Transformers

1. Compute Cost

o Training GPT-3 reportedly cost millions of dollars in compute.

2. Energy & Carbon Footprint

o Large models consume significant electricity.

3. Overfitting on Small Data

o Risk if fine-tuning on niche tasks without regularization.

4. Catastrophic Forgetting

o Model may forget previous knowledge if fine-tuned improperly.


16.8 Example: GPT-3 Training Setup

• 175 billion parameters.

• Trained on ~300 billion tokens, sampled from a corpus of roughly 500 billion tokens.

• Mixture of datasets: Common Crawl, WebText2, books, Wikipedia.

• Required thousands of GPUs for weeks.

• Used data parallelism + model parallelism + mixed precision.

16.9 Future Directions

• More Efficient Transformers: Reduce O(n²) attention cost (Longformer, Performer).

• Low-Rank Adaptation (LoRA): Fine-tune large models efficiently by freezing the original weights and training only small low-rank update matrices.

• Quantization: Compress models to 8-bit or 4-bit weights to run on smaller devices.

• Distillation: Train smaller student models from large teacher models.
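
As an illustration of the LoRA idea from the list above, the sketch below wraps a frozen weight matrix W with a trainable low-rank update B·A. The class name, rank, scaling constant alpha, and sizes are assumptions for the example, and no training loop is shown.

```python
import numpy as np

class LoRALinear:
    """y = W x + (alpha / rank) * B (A x), with W frozen and only A, B trainable."""

    def __init__(self, W, rank=2, alpha=4.0, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = W.shape
        self.W = W                                           # frozen pretrained weights
        self.A = rng.normal(scale=0.01, size=(rank, d_in))   # trainable, small random init
        self.B = np.zeros((d_out, rank))                     # trainable, zero init
        self.scale = alpha / rank

    def __call__(self, x):
        return self.W @ x + self.scale * (self.B @ (self.A @ x))

    def trainable_params(self):
        return self.A.size + self.B.size

# Hypothetical 64 x 64 layer: 4096 frozen weights, 256 trainable ones.
W = np.random.default_rng(1).normal(size=(64, 64))
layer = LoRALinear(W, rank=2)
x = np.ones(64)
y = layer(x)
```

Because B starts at zero, the wrapped layer initially behaves exactly like the original one, and fine-tuning only has to learn the small correction.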


Sequence Models – Lecture 17: Ethical and Societal Considerations

17.1 Motivation

• Transformers and large language models (LLMs) are extremely powerful.

• With great power comes real-world impact:

o Biases

o Safety concerns

o Misuse

o Environmental costs

This lecture asks: “Not just can we build them, but should we build them this way?”

17.2 Bias in Models

• Models learn from data → and data reflects human biases.

Examples

• Gender bias: “doctor” → male association; “nurse” → female.

• Racial bias: certain names associated with negative sentiment.

• Social stereotypes amplified by autocomplete or summarization.

Because models learn statistical patterns from their training data, they reproduce and sometimes amplify those biases.

17.3 Fairness

• Fairness = ensuring models don’t discriminate across groups.

Challenges:

1. Hard to define fairness universally.

2. Fairness in one metric may harm another.

3. Training data rarely balanced.


Approaches:

• Balanced datasets.

• Post-processing (debiasing embeddings).

• Human oversight.

17.4 Safety Concerns

• Large models can generate:

o Toxic or offensive language.

o Misinformation / hallucinations.

o Instructions for harmful activities.

Examples:

• A chatbot giving medical advice that is wrong.

• A code model suggesting insecure code snippets.

17.5 Misuse Potential

• LLMs can be weaponized for:

o Fake news at scale.

o Phishing emails that adapt to targets.

o Deepfake generation (paired with vision models).

• Threat: democratization of harmful tools.

17.6 Privacy Concerns

• Training data often scraped from the web.

• Risk of models memorizing personal info (emails, phone numbers, etc.).

• Privacy-preserving techniques:

o Differential privacy.
o Federated learning.

o Data filtering before training.

17.7 Environmental Impact

• Training massive models = huge energy cost.

• Example: training GPT-3 is estimated to have used on the order of a thousand megawatt-hours of electricity.

• Carbon footprint concern: data centers, GPU farms.

• Push for efficient models (quantization, distillation).

17.8 Regulation and Governance

• Calls for regulation on:

o Data usage (consent, copyright).

o Responsible deployment (misinformation controls).

o Transparency (what data was used? how is it processed?).

• Legislation and guidelines from governments and organizations (EU AI Act, IEEE standards, etc.).

17.9 Human-in-the-Loop

• One solution: keep humans in decision-making.

• Example: AI generates medical advice → doctor reviews before giving to patient.

• Improves reliability + accountability.

17.10 Summary

Ethical challenges with LLMs:

1. Bias

2. Fairness

3. Safety
4. Misuse

5. Privacy

6. Environmental costs

It’s not just about making models bigger, but also making them responsible and safe.

Sequence Models – Lecture 18: Advanced Architectures & Variants

18.1 Motivation

• Standard Transformers are powerful but come with limitations:

1. Quadratic Complexity: self-attention is O(n²) in sequence length.

2. Memory Cost: long sequences (like books, DNA, videos) are hard to process.

3. Compute Hungry: billions of parameters require huge infrastructure.

Variants are developed to make Transformers faster, lighter, and more specialized.

18.2 Efficient Transformers (Long Sequence Models)

(a) Sparse Attention

• Instead of attending to all tokens, attend only to a subset.

• Longformer:

o Local attention (sliding window) + occasional global tokens.

o Reduces cost to O(n).
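
The local-plus-global pattern can be sketched as a boolean mask over token pairs. This only illustrates which pairs are allowed; the function name, window size, and choice of global token are assumptions, and an efficient implementation would never build the full n x n matrix.

```python
import numpy as np

def sliding_window_mask(n, window, global_idx=()):
    """allowed[i, j] is True if token i may attend to token j."""
    idx = np.arange(n)
    allowed = np.abs(idx[:, None] - idx[None, :]) <= window   # local band
    for g in global_idx:
        allowed[g, :] = True   # a global token attends to everything
        allowed[:, g] = True   # and everything attends to it
    return allowed

local_only = sliding_window_mask(n=8, window=1)
mask = sliding_window_mask(n=8, window=1, global_idx=(0,))
```

With a fixed window, each token attends to a constant number of neighbours, so the number of allowed pairs grows linearly with n instead of quadratically.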

(b) Linformer

• Projects keys/values along the sequence-length axis down to a fixed, smaller length k.

• Attention becomes approximate but much faster (O(n)).
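
A rough sketch of the Linformer idea: keys and values of length n are compressed to length k before attention, so the score matrix is n x k instead of n x n. The projection matrices E and F are random here purely for illustration (they are learned in the actual model), and the sizes are assumptions.

```python
import numpy as np

def softmax_rows(Z):
    Z = Z - Z.max(axis=-1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=-1, keepdims=True)

def linformer_attention(Q, K, V, E, F):
    """Q, K, V: (n, d). E, F: (k, n) projections along the sequence axis."""
    K_proj = E @ K                                                # (k, d)
    V_proj = F @ V                                                # (k, d)
    weights = softmax_rows(Q @ K_proj.T / np.sqrt(Q.shape[-1]))   # (n, k), not (n, n)
    return weights @ V_proj, weights

# Hypothetical sizes: 128 tokens compressed to 8 summary positions.
rng = np.random.default_rng(0)
n, d, k = 128, 16, 8
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
E, F = (rng.normal(size=(k, n)) / np.sqrt(n) for _ in range(2))
out, weights = linformer_attention(Q, K, V, E, F)
```

For fixed k, both memory and compute scale linearly in n, at the price of attending to k mixed "summary" positions rather than to individual tokens.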

(c) Performer

• Uses kernel tricks to approximate attention.

• Replaces softmax(QKᵀ) with fast linear operations.

(d) Reformer
• Combines locality-sensitive hashing with reversible layers.

• Memory-efficient, good for very long sequences.

18.3 Hybrid Models

• Mix Transformer ideas with other architectures.

(a) RNN + Attention Hybrids

• Sometimes reintroduce recurrence for efficiency.

• Example: Transformer-XL → adds recurrence across segments to capture very long-term dependencies.

(b) Convolution + Attention

• ConvS2S + Transformer → CNNs for local features, attention for global context.

• Example: Image captioning using CNN encoder + Transformer decoder.

18.4 Lightweight Transformers

• Goal: run on smaller devices (phones, IoT).

Examples

1. DistilBERT

o Compressed BERT (40% smaller, 60% faster).

o Uses knowledge distillation.

2. ALBERT

o Parameter-sharing across layers.

o Embedding factorization.

3. TinyBERT / MobileBERT

o Optimized for mobile deployment.

These make Transformers usable in real-time apps.


18.5 Multimodal Variants

• Vision Transformers (ViT) → handle images as patch sequences.

• CLIP → aligns text and image embeddings.

• Audio Transformers → process speech and music.

• Multimodal Transformers → combine vision, text, speech.

18.6 Domain-Specific Transformers

• Adapt Transformer to specific data types:

o Protein Folding (AlphaFold): sequences of amino acids.

o Genomics Transformers: DNA sequence analysis.

o Code Models (Codex, CodeT5): trained on programming languages.

18.7 Memory-Augmented Transformers

• Add external memory modules to improve recall.

• Example: Compressive Transformer: stores compressed versions of old hidden states for long-term memory.

• Useful for tasks like summarizing entire books.

18.8 Adaptive Computation

• Instead of fixed depth, allow variable compute per token.

• Universal Transformer: same block applied multiple times with recurrence → more
flexibility.

• Dynamic Halting: different tokens may stop at different depths.

18.9 Summary Table of Variants


Variant             | Key Idea                           | Example Use Case
Longformer          | Sparse local/global attention      | Long documents
Linformer           | Low-rank projection                | Fast training
Performer           | Kernelized linear attention        | Very long seqs
Reformer            | Hashing + reversible layers        | Efficient memory
Transformer-XL      | Recurrence across segments         | Long-term deps
DistilBERT / ALBERT | Lightweight versions               | Mobile/NLP
ViT, CLIP           | Vision (ViT), vision + text (CLIP) | Images, multimodal
Compressive Trans.  | Memory-augmented                   | Long stories
Universal Trans.    | Adaptive depth                     | Flexible reasoning

18.10 Big Picture

• Standard Transformers = general-purpose.

• Variants = specialized for:

o Long sequences (Longformer, Reformer)

o Efficiency (DistilBERT, ALBERT)

o Multimodal data (CLIP, ViT)

o Domain tasks (protein, code)

This adaptability explains why Transformers dominate not only NLP, but also vision,
biology, and multimodal AI.
Sequence Models – Lecture 19: Applications Beyond NLP

19.1 Motivation

• Sequence models were born in NLP, but sequences exist everywhere:

o Images (sequence of patches).

o Audio (sequence of waveform samples or spectrogram frames).

o Music (sequence of notes).

o Video (sequence of frames).

o Biological data (DNA/proteins).

The same principles (capturing dependencies across time/space) apply across fields.

19.2 Computer Vision

(a) Vision Transformers (ViT)

• Treat an image as a sequence of patches.

• Each patch is like a “token”.

• Self-attention models global relationships directly.

• Matches or outperforms CNNs when pretrained on large-scale datasets; with limited
data, CNNs often remain the stronger choice.
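To make "image as a sequence of patches" concrete, here is a minimal sketch that cuts a small grayscale image into non-overlapping patches and flattens each one. The function name image_to_patches is hypothetical; a real ViT would follow this step with a learned linear projection and positional embeddings.

```python
def image_to_patches(image, patch_size):
    """Split an H x W image (list of rows) into flattened, non-overlapping
    patches, returned in row-major order (left to right, top to bottom)."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch_size x patch_size square into a single vector ("token").
            patch = [image[top + r][left + c]
                     for r in range(patch_size)
                     for c in range(patch_size)]
            patches.append(patch)
    return patches


# A 4x4 image with pixel values 0..15 becomes a sequence of four 2x2 patches.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
tokens = image_to_patches(image, patch_size=2)
print(len(tokens))  # 4
print(tokens[0])    # [0, 1, 4, 5]
```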

(b) DETR (DEtection TRansformer)

• End-to-end object detection model.

• Uses encoder-decoder Transformer to predict bounding boxes + classes.

• Simpler than CNN pipelines (no anchors, NMS).

(c) CLIP

• Joint vision-language model.

• Learns to align images with text descriptions.

• Enables zero-shot classification (“Which caption best matches this picture?”).
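Zero-shot classification with an aligned image-text space can be sketched as a nearest-caption lookup. The embeddings below are made-up toy vectors and zero_shot_classify is a hypothetical helper; in practice both kinds of vectors would come from the trained image and text encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(image_embedding, caption_embeddings):
    """Return the caption whose embedding is most similar to the image embedding."""
    scores = {caption: cosine(image_embedding, emb)
              for caption, emb in caption_embeddings.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Toy embeddings (assumed, not produced by a real model).
captions = {
    "a photo of a dog": [1.0, 0.0, 0.0],
    "a photo of a cat": [0.0, 1.0, 0.0],
}
label, scores = zero_shot_classify([0.9, 0.1, 0.0], captions)
print(label)  # a photo of a dog
```

No classifier is trained for the label set: adding a new class only requires writing a new caption and embedding it.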


19.3 Speech and Audio

(a) Speech Recognition

• Traditional pipeline = acoustic model + language model.

• Transformers replace both → direct speech-to-text.

• Example: Whisper (OpenAI).

(b) Speech Synthesis

• Sequence model maps text → audio waveforms.

• Example: Tacotron 2 (text → mel spectrogram) + WaveNet vocoder (spectrogram → waveform).

(c) Audio Classification

• Transformers capture temporal dependencies in spectrograms.

• Used in music tagging, sound event detection.

19.4 Music Generation

• Music is inherently sequential (notes, chords, timing).

• Music Transformer: uses relative positional encoding → models rhythm + long-range
melody.

• Can generate convincing piano compositions.

Other models:

• Jukebox (OpenAI): generates raw audio with lyrics + instrumentation.

• MuseNet: composes multi-instrumental pieces.

19.5 Reinforcement Learning

• Environments = sequence of states, actions, rewards.

(a) Decision Transformer

• Formulates RL as a sequence modeling problem.

• Input: past states + actions + returns-to-go (the total reward still to be collected
from each step onward).

• Output: next action.

• No need for explicit Q-learning.
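The input format above can be sketched as follows. The helper names returns_to_go and build_sequence are hypothetical; the sketch only shows how a recorded trajectory is turned into an interleaved token sequence and leaves out the Transformer itself.

```python
def returns_to_go(rewards):
    """For each time step, sum the rewards from that step to the end of the episode."""
    rtg, running = [], 0.0
    for r in reversed(rewards):
        running += r
        rtg.append(running)
    return rtg[::-1]


def build_sequence(states, actions, rewards):
    """Interleave (return-to-go, state, action) triples into one flat sequence."""
    tokens = []
    for g, s, a in zip(returns_to_go(rewards), states, actions):
        tokens.extend([("return", g), ("state", s), ("action", a)])
    return tokens


rewards = [1.0, 0.0, 2.0]
print(returns_to_go(rewards))  # [3.0, 2.0, 2.0]
print(build_sequence(["s0", "s1", "s2"], ["left", "left", "right"], rewards)[:3])
```

At test time the first return-to-go is set to the desired total reward, and the model is asked to predict the action that follows each state.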

(b) Trajectory Transformers

• Model trajectories (the states, actions, and rewards along an agent's path) as sequences.

• Used for planning in games, robotics.

Shows how Transformers generalize beyond “language” into decision-making.

19.6 Multimodal Applications

• Real-world AI often involves multiple modalities.

Examples:

1. Image + Text

o CLIP: aligns them in shared space.

o DALL·E / Stable Diffusion: text → image generation.

2. Speech + Text

o Whisper: speech → text transcription.

o SpeechT5: text ↔ speech (recognition + synthesis).

3. Video + Text

o Video Transformers: sequence of frames → captions.

o Applications in video retrieval, summarization.

19.7 Scientific Applications

• Protein Folding (AlphaFold)

o Amino acid sequences → 3D structure prediction.

o Revolutionized biology.

• Genomics

o DNA sequences modeled like text.


o Helps identify mutations, predict gene expression.

• Chemistry & Drug Discovery

o Molecules represented as SMILES strings (sequence of characters).

o Transformers generate new candidate molecules.
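As an illustration of treating molecules as text, the sketch below tokenizes a SMILES string and maps tokens to integer ids. The regular expression is a simplified assumption (it handles bracket atoms, two-digit ring closures, and the two-letter halogens Cl and Br, but not every SMILES feature), and tokenize_smiles is a hypothetical helper.

```python
import re

# Simplified token pattern: two-letter halogens, bracket atoms, %NN ring closures,
# then any single character (atoms, bonds, branches, ring digits).
SMILES_TOKEN = re.compile(r"Cl|Br|\[[^\]]+\]|%\d{2}|.")


def tokenize_smiles(smiles):
    """Split a SMILES string into a list of tokens."""
    return SMILES_TOKEN.findall(smiles)


def encode(tokens, vocab):
    """Map tokens to integer ids using a token -> id dictionary."""
    return [vocab[t] for t in tokens]


tokens = tokenize_smiles("CC(=O)Cl")  # acetyl chloride
vocab = {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}
print(tokens)                 # ['C', 'C', '(', '=', 'O', ')', 'Cl']
print(encode(tokens, vocab))  # [3, 3, 0, 2, 5, 1, 4]
```

Keeping "Cl" as one token matters: split character by character, it would be read as a carbon followed by an unknown symbol.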

19.8 Finance & Time Series

• Stock prices, sensor data, weather forecasts = sequences.

• Transformers used for:

o Predicting future values.

o Anomaly detection in time series.

• Example: forecasting electricity demand, detecting fraud.
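Before any sequence model can forecast a time series, the series is usually cut into (history, target) training pairs. A minimal sketch, assuming a fixed-length window and a single-value target; make_windows and the demand numbers are hypothetical.

```python
def make_windows(series, window, horizon=1):
    """Turn a 1-D series into (inputs, target) pairs.

    Each input is `window` consecutive values; the target is the value
    `horizon` steps after the end of that window.
    """
    pairs = []
    for start in range(len(series) - window - horizon + 1):
        inputs = series[start:start + window]
        target = series[start + window + horizon - 1]
        pairs.append((inputs, target))
    return pairs


# Made-up hourly electricity demand readings.
demand = [10, 12, 13, 15, 18]
for inputs, target in make_windows(demand, window=3):
    print(inputs, "->", target)
# [10, 12, 13] -> 15
# [12, 13, 15] -> 18
```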

19.9 Robotics

• Robot sensors → sequential input (vision + motor readings).

• Sequence models help with:

o Planning.

o Control policies.

o Predicting next best action.

19.10 Big Picture

• Sequence models are general-purpose pattern recognizers.

• They unify:

o Language (text).

o Vision (images, video).

o Audio (speech, music).

o Science (genomics, proteins).


o Decision-making (RL, robotics).

The same Transformer block, fed with modality-specific input embeddings, works across all
of these domains, which is why it is often described as a universal architecture.

Sequence Models – Lecture 20: Future Directions & Research Frontiers

20.1 Motivation

• Transformers dominate AI today.

• But research is still evolving rapidly, exploring efficiency, reasoning, multimodality,
and alignment.

• This lecture asks: Where do we go from here?

20.2 Efficiency & Scaling

• Current large models are massive (billions → trillions of parameters).

• Future work focuses on getting more with less:

(a) Efficient Architectures

• Reduce quadratic attention cost.

• Approaches: Sparse attention, Performer, Linformer, low-rank factorizations.
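To see why the quadratic cost matters, the sketch below counts how many query-key pairs are scored under full attention versus a sliding-window (local) pattern. The function names and the window size are illustrative assumptions, not values from any particular model.

```python
def full_attention_pairs(n):
    """Every token attends to every token: n * n scores."""
    return n * n


def local_attention_pairs(n, window):
    """Each token attends only to tokens within `window` positions on either side."""
    total = 0
    for i in range(n):
        lo = max(0, i - window)
        hi = min(n - 1, i + window)
        total += hi - lo + 1
    return total


for n in (512, 4096, 32768):
    full = full_attention_pairs(n)
    local = local_attention_pairs(n, window=256)
    print(f"n={n:6d}  full={full:13,d}  local={local:11,d}  ratio={local / full:.3f}")
```

The full count grows with n squared, while the local count grows roughly as n times the window width, so the gap widens as sequences get longer.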

(b) Model Compression

• Distillation: train smaller “student” from large “teacher.”

• Quantization: store weights at 8-bit/4-bit precision.

• Pruning: remove redundant neurons/heads.
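Quantization can be illustrated with a simple symmetric 8-bit scheme: one scale factor per weight list, integers in the range -127 to 127. This is a sketch under that assumption; the helper names are hypothetical, and production schemes typically use per-channel or per-group scales.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using a single shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0  # avoid dividing by zero
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.0, 1.27]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q)  # [50, -127, 0, 127]
print(max(abs(w - r) for w, r in zip(weights, restored)))  # small rounding error
```

Each weight now takes one byte instead of four (for 32-bit floats), at the price of a rounding error of at most half the scale.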

(c) Scaling Laws Revisited

• Smarter scaling — balance model size, dataset size, compute.

• More emphasis on data quality over size.

20.3 Long-Term Reasoning


• Current models excel at local patterns, but struggle with multi-step reasoning.

• Research directions:

o Neuro-symbolic hybrids: combine Transformers with symbolic logic.

o Memory-augmented models: add external knowledge bases or long-term memory.

o Chain-of-thought prompting: make reasoning steps explicit.

20.4 Multimodal Integration

• Real-world problems are multimodal.

• Future = unified models handling text, image, audio, video, and action jointly.

Examples:

• Text ↔ Image ↔ Video → DALL·E, Stable Diffusion, Flamingo.

• Text ↔ Speech → conversational AI.

• Text ↔ 3D/AR/VR → generative design and simulation.

Goal: a single model for all modalities (foundation models).

20.5 Personalization & Adaptation

• Models that adapt to individuals:

o Personalized assistants.

o Adaptive education tutors.

• Techniques:

o Few-shot learning.

o Parameter-efficient fine-tuning (LoRA, adapters).

o Federated learning (training without centralizing data).
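For the parameter-efficient fine-tuning item, here is a small sketch of the low-rank idea behind LoRA: the pretrained weight W stays frozen and only two thin matrices A and B are trained. The function names, the alpha / rank scaling convention, and the layer sizes are assumptions chosen for illustration.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha=1.0):
    """Compute W x + (alpha / rank) * B (A x).

    W: frozen d_out x d_in weights; A: rank x d_in; B: d_out x rank.
    """
    rank = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / rank) * u for b, u in zip(base, update)]


def lora_param_counts(d_out, d_in, rank):
    """Trainable parameters: full fine-tuning vs. low-rank adapters."""
    return d_out * d_in, rank * (d_in + d_out)


W = [[1.0, 0.0], [0.0, 1.0]]
print(lora_forward(W, A=[[1.0, 1.0]], B=[[0.0], [0.0]], x=[1.0, 2.0]))  # [1.0, 2.0]
print(lora_forward(W, A=[[1.0, 1.0]], B=[[1.0], [2.0]], x=[1.0, 2.0]))  # [4.0, 8.0]
print(lora_param_counts(4096, 4096, rank=8))  # (16777216, 65536)
```

With B initialized to zeros the adapted layer starts out identical to the pretrained one, and for a 4096 x 4096 layer at rank 8 the adapter trains well under 1% of the parameters.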

20.6 Alignment & Safety


• Growing concern: how to align AI with human values.

• Directions:

o RLHF (Reinforcement Learning from Human Feedback).

o Constitutional AI → the model critiques and revises its own outputs, guided by a
written set of principles.

o Guardrails for safe generation.

• Research in interpretability: understanding what attention heads actually “do.”

20.7 Democratization & Accessibility

• Large models currently concentrated in a few labs.

• Push toward:

o Open-source alternatives (LLaMA, Mistral, Falcon).

o Running models on edge devices (phones, laptops).

o Cloud APIs making AI widely usable.

Big question: Who controls powerful AI?

20.8 Beyond Transformers

• Are Transformers the final architecture? Maybe not.

• Research areas:

o State Space Models (SSMs) → linear-time sequence modeling.

o Neural ODEs → continuous-time models.

o Graph Neural Networks (GNNs) → for relational reasoning.

o Potential post-Transformer architectures combining different paradigms.
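The "linear-time" claim for state space models comes from their recurrent form, which touches each input once. A scalar sketch, assuming the discrete recurrence h_t = a * h_(t-1) + b * x_t with output y_t = c * h_t; the function name ssm_scan and the parameter values are illustrative, and real SSMs use matrices for a, b, and c.

```python
def ssm_scan(inputs, a, b, c, h0=0.0):
    """Run a scalar linear state space model over a sequence in O(n) time.

    h_t = a * h_(t-1) + b * x_t   (state update)
    y_t = c * h_t                 (readout)
    """
    h, outputs = h0, []
    for x in inputs:
        h = a * h + b * x
        outputs.append(c * h)
    return outputs


# An impulse at t = 0 decays geometrically: the state is a fading memory of past inputs.
print(ssm_scan([1.0, 0.0, 0.0], a=0.5, b=1.0, c=2.0))  # [2.0, 1.0, 0.5]
```

Unlike self-attention, nothing here compares every position with every other position, so doubling the sequence length only doubles the work.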

20.9 Interdisciplinary Impact

• Sequence models reshaping multiple fields:

o Biology (drug discovery, protein folding).


o Physics (simulation, equation solving).

o Medicine (diagnosis, treatment recommendation).

o Social sciences (behavior modeling, economics).

AI becoming a scientific tool, not just a tech product.

20.10 Philosophical & Societal Questions

• If models reach human-level intelligence, what then?

• Issues to consider:

o AI and jobs (automation).

o Creativity (AI art, music).

o Responsibility (who is accountable for AI’s decisions?).

o Ethics (bias, fairness, safety).

20.11 Big Picture

• Transformers = today’s dominant paradigm.

• Future = balance of:

1. Efficiency (smaller, faster models).

2. Reasoning (multi-step, logical).

3. Multimodality (all data types).

4. Alignment (safe, ethical, human-centered).

5. Openness (democratized, accessible).

The story of sequence models is still being written.
