
Generative Models For Text

The document provides a comprehensive guide on generative models for text, focusing on language models and their evolution from n-gram models to transformer architectures. It details the core components of language models, including tokenization, embeddings, and attention mechanisms, while explaining how transformers improve efficiency and capture long-range dependencies. Additionally, it discusses various text generation strategies and their applications in natural language processing tasks.

Uploaded by

Peddapuli siva
Copyright
© All Rights Reserved

Generative Models for Text: A Comprehensive Guide
1. Language Models Basics
Introduction to Language Models
A language model is a statistical model, in modern practice usually a neural network, that learns to
estimate the probability of a text sequence and uses that knowledge to generate or understand language.
Language models form the foundation of modern natural language processing (NLP) and generative AI systems.

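Core Concept: A language model estimates the probability $P(x_1, \dots, x_T)$ of a sequence, which is often
factorized as a product of next-token probabilities: $P(x_1, \dots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_1, \dots, x_{t-1})$,
with each factor representing the probability of the next word given all previous words[1].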

Historical Evolution

N-gram Models: Early approaches that estimated word probabilities based on
fixed-size windows (e.g., trigrams)
Neural Language Models: Introduction of neural networks using embeddings and
hidden layers
RNN-based Models: Recurrent Neural Networks that could capture sequential
dependencies
Transformer Models: Modern architecture using self-attention mechanisms
(introduced 2017), enabling parallel processing and better long-range dependency
capture[1][2]
How Language Models Work
The fundamental training objective is next-token prediction (causal language modeling).
Given a sequence of tokens $x_1, \dots, x_{t-1}$, the model learns to predict $x_t$. This is
optimized using cross-entropy loss through gradient descent. The mathematical formulation:

$$\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log P_\theta(x_t \mid x_1, \dots, x_{t-1})$$

Where the model learns to maximize the probability of each token given its context.
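
As a rough illustration of the cross-entropy objective, the sketch below computes the negative log-likelihood of the correct next tokens. The distributions in `step_probs` are made-up numbers rather than real model outputs, and the loss is averaged over steps instead of summed, which is a common convention but an assumption here.

```python
import math

def next_token_loss(step_probs, targets):
    """Average negative log-likelihood of the target tokens.

    step_probs[t] is the predicted distribution (token -> probability) at
    step t; targets[t] is the token that actually came next.
    """
    nll = [-math.log(dist[tok]) for dist, tok in zip(step_probs, targets)]
    return sum(nll) / len(nll)

# Hypothetical predicted distributions for two prediction steps.
step_probs = [
    {"cat": 0.5, "dog": 0.3, "mat": 0.2},
    {"cat": 0.1, "dog": 0.1, "mat": 0.8},
]
targets = ["cat", "mat"]
loss = next_token_loss(step_probs, targets)
print(round(loss, 4))
```

The loss shrinks toward zero as the model puts more probability on the correct tokens, which is what gradient descent pushes it to do.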

Applications
Machine translation
Text summarization
Question answering
Code generation
Dialogue systems
Content creation
2. Building Blocks of Language Models
Core Components
A typical modern language model consists of the following building blocks[2][3]:

2.1 Tokenization
Definition: Breaking down raw text into smaller units called tokens that the model
can process.
Word-level Tokenization: Each space-separated word becomes a token
Subword Tokenization: Breaking words into smaller meaningful units
Byte Pair Encoding (BPE): Iteratively merging frequent byte pairs
WordPiece: Used by BERT, vocabulary size of 30,000 tokens
SentencePiece: Language-agnostic approach
Example: "The cat sat on the mat" becomes ["The", "cat", "sat", "on", "the", "mat"]
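
To make the BPE idea concrete, here is a minimal sketch of a single merge step on a toy corpus. The word frequencies and the helper names `most_frequent_pair` and `merge_pair` are invented for illustration; it works on characters rather than raw bytes, and production tokenizers add many details (special tokens, pre-tokenization) that are omitted.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words; return the most common one."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus: word (as a tuple of characters) -> frequency.
words = {tuple("hug"): 10, tuple("pug"): 5, tuple("pun"): 12,
         tuple("bun"): 4, tuple("hugs"): 5}
pair = most_frequent_pair(words)
words_after = merge_pair(words, pair)
print(pair, words_after)
```

Repeating these two steps until the vocabulary reaches a target size yields the merge table that a BPE tokenizer applies to new text.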
2.2 Embeddings
Definition: Converting discrete tokens into continuous vector representations.
Token Embeddings: Dense vectors representing semantic meaning of each token
Learnable Parameters: Embeddings are learned during training to capture semantic relationships
Dimensionality: Typically 768-1024 dimensions for BERT-sized models (larger models use more)
Position Embeddings: Special embeddings that encode the position of each token in
the sequence
Segment Embeddings: Additional embeddings for distinguishing different text
segments
The embedding layer transforms a token index into a high-dimensional vector that
captures semantic information:

$$\mathbf{e}_t = E[x_t], \quad E \in \mathbb{R}^{|V| \times d_{\text{model}}}$$

2.3 Encoder/Decoder Layers
These are the core computational units that process embeddings:
Attention Layers: Compute relationships between tokens
Feed-Forward Layers: Non-linear transformations applied independently to
each position
Layer Normalization: Stabilizes training by normalizing input

Residual Connections: Skip connections that prevent gradient vanishing

2.4 Output Layer
Softmax Layer: Produces probability distribution over vocabulary
Linear Projection: Maps hidden states to vocabulary size
Output shape: (batch_size, sequence_length, vocab_size)
The final probability for token prediction:

$$P(x_t \mid x_1, \dots, x_{t-1}) = \text{softmax}(W_o h_t)$$

Where $W_o$ is the output projection matrix and $h_t$ is the hidden state.


3. Transformer Architecture
Overview
The Transformer architecture, introduced by Vaswani et al. (2017), revolutionized NLP by replacing
recurrent connections with self-attention mechanisms. This allows parallel processing of entire
sequences, dramatically improving training efficiency and enabling better capture of long-range
dependencies[1][2][3].

Why Transformers?
Advantages over RNNs/CNNs:

✓ Parallel processing of sequences (unlike sequential RNNs)
✓ Efficient handling of long-range dependencies
✓ Scalable to larger datasets and model sizes
✓ Better gradient flow during training
✓ Enables multi-head attention mechanisms

Transformer Architecture Overview


[ENCODER-DECODER STRUCTURE]
Figure 1: High-level Transformer architecture with encoder and decoder stacks

The Transformer consists of two main components:

3.1 Core Components


Encoder Stack:

Multiple identical layers (6 in the original Transformer; typically 12-24 in later models)
Each layer contains self-attention and feed-forward sublayers
Processes input sequence bidirectionally
Outputs contextual representations

Decoder Stack:

Multiple identical layers matching encoder depth
Contains masked self-attention, encoder-decoder attention, and feed-forward layers
Generates output sequence token-by-token
Uses previous tokens and encoder output

3.2 Positional Encoding

Problem: Transformers process sequences in parallel, so they don't inherently understand token order.

Solution: Add positional encodings to input embeddings to inject sequence order information.

Fixed Positional Encoding (Original Transformer): For position $pos$ and dimension index $i$:

$$PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$

Learned Positional Encodings: Some models learn position embeddings directly from data (e.g.,
BERT uses absolute position embeddings).

Benefit: Different frequencies allow the model to attend to both local context (high frequencies) and
global patterns (low frequencies).
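
The fixed encoding can be sketched in a few lines, assuming the sine/cosine formulation of the original Transformer with base 10000: even dimensions use sine, odd dimensions use cosine, and each pair shares one frequency. The function name `positional_encoding` is chosen for illustration only.

```python
import math

def positional_encoding(num_positions, d_model):
    """Return a num_positions x d_model table of sinusoidal encodings."""
    table = []
    for pos in range(num_positions):
        row = []
        for j in range(d_model):
            i = j // 2  # index of the sine/cosine pair this dimension belongs to
            angle = pos / (10000 ** (2 * i / d_model))
            row.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = positional_encoding(num_positions=4, d_model=8)
print(pe[1][:4])
```

Each row is added element-wise to the token embedding at that position before the first layer.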

3.3 Layer Normalization and Residual Connections


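Layer Normalization (LayerNorm):

$$\text{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

Where $\mu$ and $\sigma$ are computed per sample across features, and $\gamma$, $\beta$ are learned scale and shift parameters.

Residual Connections:

$$\text{Output} = x + \text{Sublayer}(x)$$

These connections skip layers and help preserve information, preventing gradient vanishing and
improving convergence[3].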
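
A minimal sketch of both operations on a single feature vector follows. The `eps` value and the default scale and shift (ones and zeros) are assumptions, and real implementations operate on batched tensors rather than Python lists.

```python
import math

def layer_norm(x, gamma=None, beta=None, eps=1e-5):
    """Normalize one feature vector to zero mean and unit variance, then scale and shift."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    gamma = gamma or [1.0] * n
    beta = beta or [0.0] * n
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gamma, beta)]

def residual(x, sublayer):
    """Add the sublayer's output back onto its input (skip connection)."""
    return [a + b for a, b in zip(x, sublayer(x))]

x = [1.0, 2.0, 3.0, 4.0]
normed = layer_norm(x)
out = residual(x, layer_norm)  # layer_norm stands in for any sublayer here
print(normed, out)
```

Because the input is added back unchanged, gradients always have a direct path through the sum, which is why deep stacks remain trainable.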

3.4 Feed-Forward Networks


Position-wise Feed-Forward Network (FFN):

Applied independently to each position (shown here with ReLU):

$$\text{FFN}(x) = \max(0,\, xW_1 + b_1)\,W_2 + b_2$$

Hidden Dimension: Typically 4× the embedding dimension
Activation: ReLU or GeLU (Gaussian Error Linear Unit)
Purpose: Adds non-linearity and representational capacity
Full Transformer Layer
A complete Transformer encoder layer:

Input ──→ LayerNorm ──→ Multi-Head Attention ──→ Add (Residual)
      ──→ LayerNorm ──→ Feed-Forward ──→ Add (Residual) ──→ Output

Stacking multiple layers allows the model to build increasingly abstract representations.

4. Encoder and Decoder Architectures

The Encoder-Decoder Framework


The Transformer uses an encoder-decoder architecture where:[3]

Encoder: Processes input and creates contextual representations


Decoder: Generates output using encoder representations and previously generated tokens

4.1 Encoder Architecture


Function: Transform input sequence into rich contextual representations.

Structure:

Stack of identical layers (typically 6-12 layers for smaller models, 24+ for large ones)
Each layer has self-attention and feed-forward sublayers
No masking on attention (can see entire input)

Key Characteristics:

Bidirectional attention (each token sees all other tokens)
Processes entire input simultaneously
Produces output representation for each input token

Encoder Process Flow:

1. Input Embedding: Tokenize text and convert to embeddings


2. Add Positional Encodings: Inject position information
3. Self-Attention Layers: Each token attends to all other tokens
4. Feed-Forward Layers: Non-linear transformations
5. Layer Normalization: Stabilize and normalize
6. Output: Contextual representation for each input token

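Mathematical Representation:

$$H^{(0)} = \text{Embed}(X) + PE, \qquad H^{(l)} = \text{EncoderLayer}\big(H^{(l-1)}\big), \quad l = 1, \dots, L$$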

4.2 Decoder Architecture


Function: Generate output sequence token-by-token using encoder outputs and previously generated tokens.

Structure:

Stack of identical layers (same depth as encoder)
Each layer has three sublayers:
1. Masked self-attention (attends only to previous tokens)
2. Encoder-decoder attention (attends to encoder output)
3. Feed-forward network

Key Characteristics:

Masked Self-Attention: Prevents attending to future tokens (autoregressive property)
Cross-Attention: Attends to encoder outputs to incorporate input context
Sequential Generation: Generates one token at a time

Decoder Process Flow:

1. Input Embedding: Embed previously generated tokens
2. Add Positional Encodings: Add position information
3. Masked Self-Attention: Attend only to previous tokens
4. Encoder-Decoder Attention: Focus on relevant input parts
5. Feed-Forward Layer: Non-linear transformation
6. Output Projection: Predict next token
7. Softmax: Convert to probability distribution

4.3 Encoder-Only vs Decoder-Only vs Encoder-Decoder

Architecture Type   Attention Pattern        Best For                             Examples
Encoder-Only        Bidirectional            Text understanding, classification   BERT, RoBERTa
Decoder-Only        Autoregressive           Text generation                      GPT-2, GPT-3, GPT-4
Encoder-Decoder     Bidirectional + Causal   Translation, summarization           T5, BART, mBART

Table 1: Comparison of Transformer architecture variants


5. Attention Mechanisms
Self-Attention: The Core of Transformers
Definition: A mechanism where each element in a sequence attends to all other elements, learning their
relative importance[2][3].

Why Self-Attention?

Captures long-range dependencies efficiently
Allows parallel processing of sequences
Provides interpretability through attention weights
More efficient than RNNs for long sequences

5.1 Scaled Dot-Product Attention


The fundamental attention operation:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

Components:
Query (Q): What is this token looking for?
Computed as: $Q = XW^Q$
Shape: $(n, d_k)$, where $n$ is the sequence length
Key (K): What can other tokens offer?
Computed as: $K = XW^K$
Shape: $(n, d_k)$
Value (V): The actual content of tokens
Computed as: $V = XW^V$
Shape: $(n, d_v)$
Scaling Factor $1/\sqrt{d_k}$: Prevents dot products from becoming too large
Empirically found that dividing by $\sqrt{d_k}$ improves training stability

Attention Process Step-by-Step:

1. Compute attention scores: $S = QK^\top$
2. Scale scores: $S' = S / \sqrt{d_k}$
3. Apply softmax: $A = \text{softmax}(S')$ (row-wise)
4. Weight values: $\text{Output} = AV$

Intuition: Softmax converts scores to probabilities (0-1), and the model learns which tokens to
focus on. The output is a weighted sum of values, where weights indicate importance[3].
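
The four steps (score, scale, softmax, weight) can be sketched in plain Python as below. The tiny `Q`, `K`, `V` matrices are arbitrary illustrative values, not learned projections, and nested lists stand in for the tensors a real framework would use.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention; returns (output, attention weights)."""
    d_k = len(K[0])
    weights = []
    for q in Q:
        # Steps 1-2: dot product of the query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        # Step 3: turn the scores into a probability distribution.
        weights.append(softmax(scores))
    # Step 4: each output row is a weighted sum of the value vectors.
    output = [[sum(w * v[j] for w, v in zip(row, V)) for j in range(len(V[0]))]
              for row in weights]
    return output, weights

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
output, weights = attention(Q, K, V)
print(weights, output)
```

Each query matches its own key best, so the first output row is pulled toward the first value vector and the second toward the second.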

5.2 Multi-Head Attention


Motivation: Different heads can learn different types of relationships (syntax, semantics, coreference,
etc.).

How It Works:

Instead of computing attention once, compute it $h$ times (usually 8-16 heads) in parallel:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)\,W^O$$

Where each head computes:

$$\text{head}_i = \text{Attention}(QW_i^Q,\, KW_i^K,\, VW_i^V)$$

Benefits:

✓ Different heads capture different representation subspaces
✓ Richer expressiveness without massive increase in computation
✓ Each head operates on $d_k = d_{\text{model}} / h$ dimensions
✓ Parallel computation enables efficiency

Example Architecture:

For a 768-dimensional model with 12 heads: each head operates on 64-dimensional vectors
Total computation cost is similar to single-head attention on full dimension

5.3 Encoder Self-Attention


Characteristic: Bidirectional - each token can attend to all other tokens.

Attention Mask: No masking applied; the full attention matrix is computed.

Token positions: [1 2 3 4 5]
Can attend to: [✓ ✓ ✓ ✓ ✓]

Use Case: Understanding full context for classification, entity recognition, question answering.

5.4 Decoder Masked Self-Attention


Characteristic: Autoregressive - each token can only attend to itself and previous tokens.

Attention Mask: Set future positions to negative infinity before softmax.

Token positions: [1 2 3 4 5]
Can attend to:

Token 1: [✓ ✗ ✗ ✗ ✗]
Token 2: [✓ ✓ ✗ ✗ ✗]
Token 3: [✓ ✓ ✓ ✗ ✗]
Token 4: [✓ ✓ ✓ ✓ ✗]
Token 5: [✓ ✓ ✓ ✓ ✓]

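Implementation:

$$\text{MaskedAttention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)V$$

Where $M$ is a mask matrix with $M_{ij} = -\infty$ if $j > i$ (future positions) and $0$ otherwise.

Benefit: Prevents "cheating" during training and maintains the autoregressive property during
generation[3].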
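
A small sketch of the masking trick: adding negative infinity to future positions before softmax drives their weights to exactly zero. The all-zero score matrix is a placeholder standing in for real scaled dot products, so every visible position ends up with equal weight.

```python
import math

NEG_INF = float("-inf")

def causal_mask(n):
    """M[i][j] is 0 when position j is visible from position i (j <= i), else -inf."""
    return [[0.0 if j <= i else NEG_INF for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    """Add the mask to the scores, then apply a row-wise softmax."""
    out = []
    for s_row, m_row in zip(scores, mask):
        shifted = [s + m for s, m in zip(s_row, m_row)]
        top = max(shifted)  # finite, because the diagonal is never masked
        exps = [math.exp(v - top) for v in shifted]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

scores = [[0.0] * 4 for _ in range(4)]  # placeholder attention scores
weights = masked_softmax(scores, causal_mask(4))
for row in weights:
    print(row)
```

The printed rows reproduce the triangular pattern shown in the diagram: the first token attends only to itself, the last to every position.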

5.5 Encoder-Decoder Cross-Attention


Function: Allows decoder to focus on relevant parts of encoder output.

Mechanism:

Query: From decoder (what am I generating?)
Key/Value: From encoder (what context is available?)

Process:

Decoder positions attend to all encoder positions
Allows selective focus on input context
Helps with alignment in translation and summarization
6. Generation of Text
Text Generation Process
Text generation in Transformer models follows an autoregressive decoding approach where tokens
are generated sequentially, one at a time[1][2].

6.1 Autoregressive Generation Pipeline


1. Encode input sequence: Pass input through encoder to get context representations
2. Initialize decoder: Provide start-of-sequence token (e.g., [START] or <s>)
3. Generate iteratively:
• Run decoder with current sequence
• Get probability distribution over vocabulary
• Select next token using decoding strategy
• Append to sequence
4. Stop when: End-of-sequence token generated or max length reached

Mathematical Formula:

$$P(y_1, \dots, y_T \mid x) = \prod_{t=1}^{T} P(y_t \mid y_1, \dots, y_{t-1}, x)$$

6.2 Decoding Strategies


Different strategies for selecting the next token from the probability distribution:

Greedy Decoding
Algorithm: Always select the token with highest probability.
Advantages:

✓ Fast and deterministic
✓ Simple to implement
✓ Reproducible

Disadvantages:

✗ Often produces repetitive text
✗ Can get stuck in local optima
✗ Less diverse outputs

Beam Search
Algorithm: Keep track of top-k hypotheses and expand based on probabilities.

Process:

Maintain $k$ (beam width) best partial sequences
For each sequence, generate top-k next tokens
Prune to keep only top-k sequences
Repeat until all sequences reach end token

Parameters:

Beam Width: Typically 3-5 for general tasks, up to 10 for high-quality generation
Length Penalty: Penalize very short or very long sequences
Advantages:

✓ Better quality than greedy
✓ More diverse outputs
✓ Controlled exploration-exploitation balance

Disadvantages:

✗ Slower than greedy (requires maintaining multiple hypotheses)
✗ Can still be repetitive at high beam widths
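
The sketch below shows beam search on a hand-built toy distribution, where a beam of width 2 finds a more probable sequence than greedy decoding (a beam of width 1). `toy_model` and its probabilities are invented for illustration, and the sketch omits the length penalty mentioned above.

```python
import math

def beam_search(next_probs, beam_width, max_len, eos="<eos>"):
    """Return (sequence, log-probability) of the best hypothesis found.

    next_probs(seq) is a hypothetical callback returning {token: probability}
    for the next step, given the tuple of tokens generated so far.
    """
    beams = [((), 0.0)]  # (sequence, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == eos:
                candidates.append((seq, score))  # finished: carry over unchanged
                continue
            for tok, p in next_probs(seq).items():
                candidates.append((seq + (tok,), score + math.log(p)))
        # Prune: keep only the beam_width highest-scoring hypotheses.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(seq[-1] == eos for seq, _ in beams):
            break
    return beams[0]

def toy_model(seq):
    """Made-up next-token distributions for demonstration only."""
    if not seq:
        return {"a": 0.6, "b": 0.4}
    if seq == ("a",):
        return {"x": 0.55, "y": 0.45}
    if seq == ("b",):
        return {"z": 0.9, "w": 0.1}
    return {"<eos>": 1.0}

greedy_seq, greedy_lp = beam_search(toy_model, beam_width=1, max_len=3)
beam_seq, beam_lp = beam_search(toy_model, beam_width=2, max_len=3)
print(greedy_seq, math.exp(greedy_lp))
print(beam_seq, math.exp(beam_lp))
```

Greedy commits to "a" because it is the likeliest first token, while the wider beam keeps "b" alive long enough to discover that "b z" has the higher overall probability.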

Top-K Sampling
Algorithm: Sample from top-k most likely tokens.

Process:
1. Compute probability for each token
2. Sort probabilities in descending order
3. Keep only top-k tokens
4. Renormalize probabilities among top-k
5. Sample from this restricted distribution

Parameter: $k$ (typically 50)

Advantages:

✓ Prevents sampling of very unlikely tokens
✓ More diverse and natural than greedy
✓ Controllable randomness

Disadvantages:
✗ Less stable than deterministic methods
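
The five steps translate almost directly into code. In this sketch the probability table is made up, `top_k_sample` is an illustrative name, and a seeded `random.Random` is used only to make the demonstration repeatable.

```python
import random

def top_k_sample(probs, k, rng=random):
    """Sample one token from the k most likely entries of probs (token -> probability)."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    tokens = [tok for tok, _ in top]
    weights = [p / total for _, p in top]  # renormalize among the top-k
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical next-token distribution.
probs = {"very": 0.35, "bright": 0.25, "uncertain": 0.15, "dark": 0.15, "purple": 0.10}
rng = random.Random(0)
samples = [top_k_sample(probs, k=2, rng=rng) for _ in range(200)]
print(sorted(set(samples)))
```

With k = 2 only "very" and "bright" can ever be drawn; setting k = 1 reduces the procedure to greedy decoding.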

Nucleus Sampling (Top-P Sampling)


Algorithm: Sample from smallest set of tokens whose cumulative probability exceeds threshold $p$.

Process:

1. Sort tokens by probability (descending)
2. Accumulate probabilities until exceeding threshold
3. Renormalize remaining probabilities
4. Sample from this dynamic set

Parameter: $p$ (typically 0.9)

Advantages:

✓ Adaptive set size based on probability distribution
✓ More natural diversity than fixed top-k
✓ Removes "probability tails" automatically
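
A minimal sketch of nucleus sampling follows. The distribution is invented, the helper names `nucleus` and `top_p_sample` are illustrative, and the cutoff uses "reaches or exceeds p" as the stopping rule, which is one reasonable reading of the threshold.

```python
import random

def nucleus(probs, p):
    """Renormalized smallest set of top tokens whose cumulative probability reaches p."""
    kept, cumulative = [], 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}

def top_p_sample(probs, p, rng=random):
    """Draw one token from the nucleus."""
    dist = nucleus(probs, p)
    return rng.choices(list(dist), weights=list(dist.values()), k=1)[0]

# Hypothetical next-token distribution.
probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
print(nucleus(probs, 0.75))
print(top_p_sample(probs, 0.75, random.Random(0)))
```

Unlike top-k, the size of the candidate set adapts: a peaked distribution keeps one or two tokens, while a flat one keeps many.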

Common Generation Parameters:

Parameter     Default   Effect
temperature   1.0       Higher = more random; Lower = more deterministic
top_k         50        Limits to top-k tokens
top_p         0.9       Nucleus sampling threshold
max_length    100       Maximum generation length
num_beams     1         1 = greedy; >1 = beam search

Table 2: Common text generation hyperparameters
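
The effect of the temperature parameter can be seen by dividing the logits by the temperature before softmax. The logits here are arbitrary illustrative values.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature (temperature must be positive)."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)
    exps = [math.exp(z - top) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, 0.5)  # lower temperature: more peaked
base = softmax_with_temperature(logits, 1.0)
flat = softmax_with_temperature(logits, 2.0)   # higher temperature: flatter
print(sharp, base, flat)
```

Lower temperatures concentrate probability on the top token and approach greedy decoding, while higher temperatures spread probability more evenly.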

6.3 Practical Generation Example


Input: "The future of AI is"

Decoding process:

1. Encode prompt
2. Start with [START] token
3. Generate token 1: options {very (0.35), bright (0.25), uncertain (0.15), ...}
Greedy: select "very"
Beam: consider "very", "bright", "uncertain"
4. Generate token 2 with context [START] very: options {bright (0.40), important (0.20),
...}
Continue expanding...
5. Stop when [END] token or max length reached

Result: "The future of AI is very bright and will transform society..."


7. BERT: Bidirectional Encoder Representations from Transformers
Overview
BERT (Bidirectional Encoder Representations from Transformers), released by Google in October 2018,
revolutionized NLP by introducing a bidirectional pre-training approach. It dramatically improved
the state-of-the-art for language understanding tasks[4].

Key Innovation
Previous Approaches: Models like GPT used unidirectional (left-to-right) pre-training.

BERT's Innovation: Trains bidirectionally using masked language modeling, allowing the model to see
context from both directions:

Example: "The cat sat on the [MASK]"

Left context: "The cat sat on the"


Right context: "" (no tokens to the right)
BERT sees both: Can use "the", "cat", "sat", "on" to predict "[MASK]"

This bidirectional nature gives BERT superior understanding capabilities[4].


7.1 BERT Architecture
Type: Encoder-only Transformer

Architecture Notation: (L = number of layers, H = hidden size)

Common variants:

Model        Layers   Hidden Size   Parameters
BERT-Tiny    2        128           4M
BERT-Base    12       768           110M
BERT-Large   24       1024          340M

Table 3: BERT model size variants

Core Layers:
1. Embedding Layer:
• Token Embeddings: WordPiece vocabulary of 30,000 tokens
• Position Embeddings: Absolute position encodings
• Segment Embeddings: Distinguish first and second text segments
• Layer Normalization: Normalize combined embeddings
2. Encoder Stack:
• 12 or 24 Transformer encoder layers (depending on variant)
• Each layer: Multi-head self-attention + Feed-forward
• Bidirectional attention (no masking)
3. Task Head:
• Output layer for pre-training tasks
• Removed for downstream tasks and replaced with task-specific head

7.2 BERT Pre-training


BERT is pre-trained on two objectives simultaneously[4]:

Masked Language Modeling (MLM)


Task: Predict randomly masked tokens given surrounding context.

Process:

1. Randomly select 15% of tokens
2. For each selected token:
80% of the time: replace with [MASK] token
10% of the time: replace with random token
10% of the time: keep unchanged

Reason for Randomness: Reduces the mismatch between pre-training and fine-tuning. [MASK] tokens never
appear in downstream inputs, so during training the selected tokens should sometimes be left unmasked.

Example:
Original: "The cat sat on the mat"
Masked: "The [MASK] sat on the [MASK]"
Task: Predict "cat" and "mat" from context

Benefit: Model learns bidirectional context since it must predict from both directions.
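
The 80/10/10 corruption rule can be sketched as follows. This version selects each token independently with probability 0.15 rather than choosing exactly 15% of positions, and the vocabulary and function name are illustrative assumptions.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, rng=random):
    """Return (corrupted tokens, labels); labels are None where nothing is predicted."""
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() >= mask_prob:
            corrupted.append(tok)   # not selected: left untouched, no loss here
            labels.append(None)
            continue
        labels.append(tok)          # selected: the model must recover the original
        roll = rng.random()
        if roll < 0.8:
            corrupted.append("[MASK]")           # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
        else:
            corrupted.append(tok)                # 10%: keep unchanged
    return corrupted, labels

tokens = ["the", "cat", "sat", "on", "the", "mat"] * 50
vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
corrupted, labels = mask_tokens(tokens, vocab, rng=random.Random(0))
print(sum(label is not None for label in labels), "of", len(tokens), "tokens selected")
```

The loss is computed only at positions where the label is not None, including the positions that were kept unchanged.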

Next Sentence Prediction (NSP)


Task: Given two sentences, predict if they appear consecutively in the corpus.

Process:

1. Take two sentences (50% consecutive, 50% random)
2. Format as: [CLS] sentence_1 [SEP] sentence_2 [SEP]
3. Predict [IsNext] or [NotNext]

Special Tokens:

[CLS]: Classification token, always first token
[SEP]: Separator token between sentences
[MASK]: Masked token placeholder

Example:
IsNext: "[CLS] The cat sat on the mat [SEP] It was comfortable [SEP]"
NotNext: "[CLS] The cat sat on the mat [SEP] How do magnets work? [SEP]"

Training Details:

Dataset: BookCorpus (800M words) + English Wikipedia (2,500M words)
Time: 4 days on 4 Cloud TPUs for BERT-Base, cost ~$500 USD

7.3 BERT Fine-tuning


After pre-training, BERT can be fine-tuned for specific tasks with minimal data.

Fine-tuning Tasks:

Task Category             Examples
Text Classification       Sentiment analysis, topic classification
Token Classification      Named Entity Recognition (NER)
Sentence Classification   Textual entailment, paraphrase detection
Extractive QA             SQuAD - finding answer spans in passages
Generative QA             Generating full answer text
Semantic Similarity       Measuring similarity between sentences

Table 4: Common BERT fine-tuning tasks

Fine-tuning Process:
1. Load pre-trained BERT weights
2. Add task-specific output layer
3. Fine-tune on task-specific dataset (typically 1 hour on 1 TPU for BERT)
4. Achieve state-of-the-art performance with modest data

Performance: BERT achieved state-of-the-art on multiple benchmarks[4]:

GLUE (General Language Understanding Evaluation)
SQuAD (Stanford Question Answering Dataset)
SWAG (Situations With Adversarial Generations)

7.4 BERT Strengths and Limitations


Strengths:

✓ Bidirectional context understanding
✓ Strong performance on understanding tasks
✓ Efficient fine-tuning
✓ Excellent for classification and extraction

Limitations:

✗ Cannot generate text (encoder-only)
✗ Difficult to adapt for open-ended generation
✗ Requires special techniques for text generation (e.g., masking predictions)

7.5 BERT Variants


The BERT architecture inspired many variants addressing specific needs[4]:

RoBERTa (2019): Improved training procedure, larger batches, better hyperparameters
DistilBERT (2019): Smaller model (66M parameters) retaining about 97% of BERT's performance
XLM-RoBERTa (2019): Multilingual variant supporting 100+ languages
ALBERT (2019): Parameter-efficient with shared layers and factorized embeddings
ELECTRA (2020): Discriminator-based pre-training replacing MLM
DeBERTa (2020): Disentangled attention separating position and content

8. GPT Models: Generative Pre-trained Transformers


Overview
GPT (Generative Pre-trained Transformer) models, developed by OpenAI, represent the decoder-only
Transformer architecture designed specifically for text generation. They use autoregressive
pre-training and have achieved remarkable capabilities in few-shot learning and open-ended
generation[2][3].

Key Principle
Prediction Paradigm: Train to predict the next token given all previous tokens:

$$P(x_t \mid x_1, \dots, x_{t-1})$$

This simple objective leads to remarkably capable models.


8.1 GPT Architecture
Type: Decoder-only Transformer

Characteristics:

Uses masked self-attention (causal masking)
Can only attend to previous tokens
Designed for autoregressive text generation

Why Decoder-Only?

Simplicity: Single attention pattern (masked attention)
Efficiency: Can cache previous tokens for faster generation
Capability: Autoregressive nature aligns with generation task

8.2 GPT Evolution

Model     Year   Parameters   Key Advancement
GPT       2018   117M         Introduced unsupervised pre-training
GPT-2     2019   1.5B         Showed strong generative abilities
GPT-3     2020   175B         Few-shot and zero-shot capabilities
GPT-4     2023   --           Improved reasoning, multimodal abilities
GPT-4.5   2025   --           Bridge between GPT-4 and GPT-5
GPT-4.1   2025   --           Better coding, long context (1M tokens)
GPT-5     2025   --           Dynamic routing, reasoning improvements

Table 5: Evolution of GPT models


8.3 GPT Pre-training and Fine-tuning
Two-Phase Training:

Phase 1: Unsupervised Pre-training

Train on vast amounts of unlabeled text data
Objective: Predict next token
Learn general language patterns and factual knowledge
Cost: Hundreds of GPU-days to months for large models

Phase 2: Instruction Tuning (Supervised Fine-tuning)

Fine-tune on high-quality instruction-response pairs
Use reinforcement learning from human feedback (RLHF)
Learn to follow instructions
Align outputs with human preferences

This two-phase approach enables models to understand diverse tasks while maintaining safety and
usefulness[2].

8.4 GPT Capabilities


Modern GPT models demonstrate impressive capabilities[2]:

Capability            Description                        Example
Text Generation       Creating coherent, creative text   Writing essays, stories
Summarization         Condensing long documents          Executive summaries
Translation           Converting between languages       English to Spanish
Question Answering    Providing factual answers          Trivia questions
Code Generation       Writing or completing code         Function implementation
Reasoning             Step-by-step problem solving       Math problems
Few-Shot Learning     Learning from examples in prompt   Classification from 2-3 examples
Multimodal (GPT-4V)   Understanding images               Image captioning, visual QA

Table 6: Capabilities of modern GPT models

8.5 Advantages and Limitations


Advantages:

✓ Excellent at text generation tasks
✓ Strong few-shot learning capabilities
✓ Versatile across diverse tasks
✓ Impressive reasoning abilities (especially newer models)
✓ Scalability: Performance improves with scale

Limitations:

✗ Can generate false information (hallucination)
✗ Computationally expensive (requires massive resources)
✗ Can inherit biases from training data
✗ Difficult to interpret reasoning
✗ Fine-tuning less efficient than encoder models (like BERT)

8.6 BERT vs GPT Comparison

Aspect                   BERT                        GPT
Architecture             Encoder-only                Decoder-only
Attention                Bidirectional               Autoregressive (masked)
Pre-training             MLM + NSP                   Next-token prediction
Best for                 Understanding               Generation
Text Generation          Difficult                   Natural and strong
Few-Shot Learning        Limited                     Excellent
Classification           Excellent                   Possible but less efficient
Fine-tuning Efficiency   High                        Lower
Model Size               Smaller (340M max public)   Larger (175B+)

Table 7: BERT vs GPT comparison


9. Autoencoding and Regression Models
Autoencoding in NLP
Definition: Autoencoding models learn to reconstruct corrupted or masked input, developing robust
latent representations in the process.

9.1 Masked Autoencoding Approach


Concept: Corrupt input and train model to recover original.

Methods:

BERT-style Masking
Mask random tokens and predict them:

Original: "The quick brown fox jumps"
Masked: "The [MASK] brown fox [MASK]"
Task: Predict "quick" and "jumps"

BART-style Denoising
Replace spans of text with mask tokens:

Original: "The quick brown fox jumps over the lazy dog"
Masked: "The [MASK] jumps over the lazy dog"
Corrupted span: "quick brown fox"
Task: Predict entire corrupted span

T5-style Prefix Modeling

• Corrupt input by replacing random spans with <X> tokens
• Add prefix indicating which span is corrupted
• Example: "corrupt_X: The [X] jumps over the lazy dog"
• Decoder must generate: "quick brown fox"

9.2 Benefits of Autoencoding

✓ Bidirectional context learning
✓ Robust representations
✓ Excellent for understanding tasks
✓ Can be adapted for generation (with techniques like prefix generation)

9.3 Regression Models in NLP


While less common than classification, regression models address continuous-valued predictions in NLP.

Applications:

• Sentiment Scores: Predict sentiment on 1-5 scale instead of binary classification
• Readability Scores: Predict text complexity/grade level
• Semantic Similarity: Regression between sentence pairs (0-1 similarity)
• Machine Translation Quality: Predict quality score of translations
• Toxicity Scores: Continuous toxicity level of text
• Comprehension Level: Estimate required reading level

Architecture:

For regression tasks, the standard approach is:

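1. Use encoder model (like BERT) to get contextual representations
2. Add regression head (linear layer outputting continuous values)
3. Train with regression loss (e.g., Mean Squared Error)

$$\hat{y} = W h + b$$

Where $h \in \mathbb{R}^{H}$ is the encoder's pooled representation, $W \in \mathbb{R}^{1 \times H}$, and output is a scalar.

Loss Function:

$$\mathcal{L}_{\text{MSE}} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$

Or for more robust learning, use Mean Absolute Error (MAE):

$$\mathcal{L}_{\text{MAE}} = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i|$$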
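
As a small worked example, the sketch below implements a linear regression head and both losses in plain Python. The vectors and targets are made-up numbers, and in practice `h` would come from a pre-trained encoder.

```python
def regression_head(h, w, b):
    """Linear layer mapping a pooled encoder vector h to one scalar."""
    return sum(hi * wi for hi, wi in zip(h, w)) + b

def mse(y_true, y_pred):
    """Mean Squared Error: penalizes large errors quadratically."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean Absolute Error: grows linearly, so it is less sensitive to outliers."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

score = regression_head(h=[1.0, 2.0], w=[0.5, -0.25], b=0.1)
y_true = [4.0, 2.0, 5.0]
y_pred = [3.0, 2.0, 1.0]
print(score, mse(y_true, y_pred), mae(y_true, y_pred))
```

The single large error (5.0 versus 1.0) contributes 16 to the squared-error sum but only 4 to the absolute-error sum, which illustrates why MAE is the more robust choice when outliers are present.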


10. ChatGPT
Overview
ChatGPT is a conversational AI system built on top of GPT-style Transformer decoders, fine-tuned with
reinforcement learning from human feedback (RLHF) for helpful, safe, and aligned dialogue[5].

10.1 ChatGPT Architecture


Base: GPT-3.5 or GPT-4 core model

Additional Training Layers:

1. Supervised Fine-tuning (SFT):


• High-quality instruction-response pairs
• Model learns to follow instructions
• Improve coherence and relevance
2. Reinforcement Learning from Human Feedback (RLHF):
• Humans rate model outputs
• Train reward model to predict human preferences
• Fine-tune using PPO (Proximal Policy Optimization)
• Align with human values and preferences
3. Safety Training:
• Additional fine-tuning to avoid harmful outputs
• Mitigation of biases and misinformation
• Safety-focused reinforcement learning

10.2 Key Capabilities

Capability           Examples
Question Answering   Answering factual, creative, and complex questions
Explanations         Explaining concepts at various complexity levels
Code Generation      Writing code, debugging, algorithm explanation
Writing              Essays, stories, emails, creative writing
Translation          Translating between languages
Summarization        Condensing long texts into summaries
Analysis             Analyzing arguments, documents, problems
Brainstorming        Generating ideas and creative solutions
Conversation         Engaging in multi-turn dialogue
Reasoning            Step-by-step problem solving

Table 8: ChatGPT primary capabilities


10.3 Training Advantages
Instruction Tuning Benefits:

Models respond better to natural language instructions
Fewer examples needed for task specification
More aligned with human intent
Improved safety and helpfulness

RLHF Benefits:

Outputs directly optimized for human preferences
Better alignment with user satisfaction
Reduced harmful outputs
More nuanced responses

10.4 Conversational Context


ChatGPT maintains conversation history through:

Context Window: The model's maximum input length determines how much history it can use.

Earlier versions: 4K tokens (~3000 words)
Recent versions: 128K tokens (~95,000 words)
Latest models: Up to 1M tokens (GPT-4.1)

Context Usage:
System: "You are a helpful AI assistant."
User: "What is machine learning?"
Assistant: "[Explanation of ML]"
User: "How does that relate to deep learning?"
Assistant: "[Uses previous context about ML to explain relation]"

The full conversation is passed as context for each generation.

10.5 Multimodal ChatGPT


Modern ChatGPT versions (GPT-4 Vision) can process images:

Capabilities:

Image captioning: Describe what's in images
Visual question answering: Answer questions about images
Document analysis: Extract text from scanned documents
Diagram understanding: Interpret charts, graphs, diagrams
Code visualization: Understand code structure from screenshots

Input Format:
User: [Image of handwritten equation] "What is this equation?"
ChatGPT: "[Recognizes equation and explains it]"
10.6 Limitations and Considerations
Limitations:

✗ Knowledge cutoff: Training data has a cutoff date
✗ Hallucination: Can generate plausible-sounding false information
✗ Reasoning limitations: Struggles with complex multi-step logic
✗ Current events: Cannot access real-time information
✗ Bias: May reflect biases in training data
✗ Token limits: Cannot process extremely long documents

Ethical Considerations:

Potential for misuse (misinformation, deception)
Authorship questions for generated content
Job displacement in certain fields
Environmental cost (energy consumption for training and inference)
Privacy concerns with data usage

10.7 ChatGPT Variants and Evolution

Version            Release      Improvements
ChatGPT-3.5        Nov 2022     Original, accessible, ~175B parameters
ChatGPT-4          Mar 2023     Improved reasoning, multimodal (vision)
ChatGPT-4 Turbo    Nov 2023     Longer context (128K), improved knowledge
ChatGPT-4o         May 2024     Optimized, faster, better capabilities
ChatGPT-o3         Late 2024    Advanced reasoning, mathematical abilities

Table 9: ChatGPT versions and evolution

10.8 ChatGPT vs BERT vs GPT

Aspect               BERT              GPT                   ChatGPT
Base Architecture    Encoder           Decoder               Decoder (GPT-based)
Purpose              Understanding     Generation            Conversation
Fine-tuning          Task-specific     Instruction tuning    RLHF + SFT
Best Use             Classification    Generation            Dialogue
Alignment            None (general)    None (general)        Optimized for human preferences
Safety Training      Minimal           Minimal               Extensive

Table 10: Comparison of BERT, GPT, and ChatGPT

Summary and Key Takeaways


Evolution of Language Models
1. Traditional Models (n-grams, count-based): Limited context and capacity
2. Neural Language Models (RNN, LSTM): Better context, but sequential bottleneck
3. Transformer Era (2017 onwards):
• Self-attention enabling parallel processing
• Bidirectional models (BERT) for understanding
• Autoregressive models (GPT) for generation
• Encoder-decoder for sequence-to-sequence tasks
4. Scale (2018-2025): Scaling laws showing better performance with more data/parameters
5. Alignment (2022 onwards): RLHF and instruction tuning for safety and usefulness

Architectural Paradigms
Three main approaches, each suited for different tasks:

Architecture       Characteristics                   Best For
Encoder-Only       Bidirectional attention           Understanding, classification
Decoder-Only       Autoregressive attention          Generation, few-shot learning
Encoder-Decoder    Bidirectional + Autoregressive    Translation, summarization

Table 11: Summary of transformer architectures

Key Components
Understanding these components is essential for working with modern language models:

Embeddings: Represent discrete tokens as continuous vectors
Positional Encoding: Inject order information
Self-Attention: Learn relationships between tokens
Multi-Head Attention: Capture diverse relationship types
Feed-Forward Networks: Add non-linearity
Layer Normalization: Stabilize training
Residual Connections: Improve gradient flow
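
To make three of these components concrete, here is a minimal sketch in plain Python of scaled dot-product self-attention followed by a residual connection and layer normalization. It is a simplification under stated assumptions: there are no learned query/key/value projections, a single head, no feed-forward network, and no learned scale or shift in the normalization. The function names are illustrative only. The optional `causal` flag applies the decoder-style mask by setting scores for future positions to negative infinity before the softmax.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values, causal=False):
    """Scaled dot-product attention; each row is one token vector."""
    d_k = len(keys[0])
    outputs = []
    for i, q in enumerate(queries):
        scores = []
        for j, k in enumerate(keys):
            if causal and j > i:
                # Block attention to future tokens (autoregressive mask).
                scores.append(float("-inf"))
            else:
                scores.append(sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k))
        weights = softmax(scores)
        # Each output is a weighted average of the value vectors.
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return outputs


def layer_norm(vector, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance."""
    mean = sum(vector) / len(vector)
    var = sum((x - mean) ** 2 for x in vector) / len(vector)
    return [(x - mean) / math.sqrt(var + eps) for x in vector]


def attention_sublayer(x, causal=False):
    """Self-attention, then residual connection, then layer normalization."""
    attended = self_attention(x, x, x, causal=causal)
    return [
        layer_norm([a + b for a, b in zip(row, att)])
        for row, att in zip(x, attended)
    ]


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
masked = self_attention(tokens, tokens, tokens, causal=True)
normalized = attention_sublayer(tokens, causal=True)
```

With the causal mask, the first token can only attend to itself, so its output equals its own value vector; later tokens mix information from all earlier positions.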
Text Generation Techniques
Choose decoding strategies based on use case:

Greedy: Fast, simple, but repetitive
Beam Search: Better quality, slower
Top-K Sampling: Diverse, stochastic
Nucleus Sampling: Adaptive diversity
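
The difference between three of these strategies can be sketched on a single next-token distribution. The example below is illustrative only: the vocabulary and probabilities are made up, the helper names are hypothetical, and beam search is omitted because it needs a model that scores whole sequences. Greedy picks the top token, top-K keeps a fixed number of candidates, and nucleus (top-p) sampling keeps however many candidates are needed to reach a cumulative probability threshold, which is why its diversity adapts to the shape of the distribution.

```python
import random


def greedy(probs):
    """Pick the single most probable token."""
    return max(probs, key=probs.get)


def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    kept = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


def nucleus_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    kept, cumulative = [], 0.0
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


def sample(probs, rng=random):
    """Draw one token according to the given probabilities."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


# Hypothetical next-token distribution, for illustration only.
next_token_probs = {"cat": 0.5, "dog": 0.3, "car": 0.15, "sky": 0.05}

greedy_choice = greedy(next_token_probs)
top_k_candidates = top_k_filter(next_token_probs, k=2)
nucleus_candidates = nucleus_filter(next_token_probs, p=0.9)
sampled_token = sample(nucleus_candidates)
```

On a sharply peaked distribution the nucleus shrinks to one or two tokens, while on a flat one it grows; top-K always keeps exactly K candidates regardless.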

Current Trends (2025)


1. Larger Scale: Models with 100B+ parameters
2. Longer Context: From 4K to 1M+ tokens
3. Multimodal: Integration of text, images, audio
4. Reasoning: Improved step-by-step reasoning capabilities
5. Efficiency: Smaller models with better performance
6. Safety: Stronger alignment and safety measures

Practical Applications
Content Generation: Articles, emails, creative writing
Data Analysis: Summarization, information extraction
Customer Service: Chatbots, Q&A systems
Code: Generation, completion, debugging
Education: Tutoring, explanation, question answering
Healthcare: Report generation, literature review
Research: Paper summarization, hypothesis generation

