
U1 Merged

The document outlines a course on Generative AI and its applications, focusing on NLP basics and word embeddings, led by Dr. Arti Arya. It covers key concepts such as Generative AI models, including GANs and Transformers, and their evolution into Large Language Models (LLMs) like ChatGPT. Additionally, it discusses the processes of pretraining and finetuning LLMs, as well as the importance of tokenization and text normalization in NLP.

Uploaded by

Aathil

Generative AI and Its Applications

Unit 1

(UE22CS342BA9)
NLP Basics and Word Embeddings

Course Instructor: Dr. Arti Arya


Dept. of CSE, PES University, Bangalore
Generative AI and Its Applications
Acknowledgement

The slides are prepared from various resources from universities abroad and in
India. Some material is also taken from reliable internet resources used throughout this
course, and some slides are incorporated from the NLP course. The slides were compiled by Gen AI
TA Sai Yashwanth and Dr. Pooja Agarwal, with a few inputs from Dr. Arti Arya.
Generative AI and Its Applications
Course Content
GenerativeAI and Its Applications
Evaluation Policy
GenerativeAI and Its Applications
Introduction to GenAI

• Generative AI represents one of the most significant technological breakthroughs of our
time. Since the release of ChatGPT, we've witnessed an unprecedented transformation in
how we interact with technology.

• This revolutionary advancement builds upon decades of research, enabling machines to not
just analyze but create original content across multiple domains – from writing and coding to
creating art and music.

• At its core, Generative AI is about creation. Unlike traditional ML systems that excel at specific
tasks, generative models can produce entirely new content.

• These systems learn patterns in their training data and use them to
generate original outputs. Examples: Stable Diffusion, GANs, and Transformer models.

• This unit focuses on Transformer models.


GenerativeAI and Its Applications
Introduction to GenAI

• Generative AI models are trained on massive datasets using techniques like unsupervised
learning or self-supervised learning. Many of them, such as autoregressive language models,
learn to predict the next element in a sequence
(e.g., the next word in a sentence or the next pixel in an image). Over time, they capture
the structure and nuances of the data, enabling them to generate
outputs that are not mere reproductions but new creations.

For example:

● GANs use a "generator" and a "discriminator" in a competitive setup to create realistic
outputs.
● Diffusion models iteratively refine noisy data to produce high-quality images.
● Transformers leverage attention mechanisms to focus on relevant parts of the input,
making them highly effective for tasks like text generation and translation.
GenerativeAI and Its Applications
Introduction to GenAI

• While Generative AI is a remarkable achievement, it is still a subset of Artificial Intelligence
and far from Artificial General Intelligence (AGI), the "super" AI that movies portray as
destroying humanity.

Artificial General Intelligence (AGI)

● Human-Like Intelligence: AGI refers to machines that can perform
any intellectual task a human can, with the ability to reason,
learn, and adapt across diverse domains.
● Understanding and Reasoning: AGI would possess a
deeper understanding of the world, enabling it to make
decisions based on logic, ethics, and context.
● Autonomy: Unlike GenAI, AGI would be capable of
setting its own goals and pursuing them independently.
References:
Book: Build a Large Language Model (From Scratch)
GenerativeAI and Its Applications
Introduction to GenAI : Basic Model
• Generative Adversarial Networks (GANs):
Introduced by Ian Goodfellow and colleagues in 2014,
GANs consist of two neural networks, a
generator and a discriminator, that work
in tandem to create new data that
resembles a training dataset.

• The two networks compete against each other to
generate new data.
• One network generates new data, while the
other network tries to determine if the
generated data is part of the original
dataset. GANs are popular because they
can create realistic fake data, such as
images, videos, and audio.
References:
[Link]
learning/gan/gan_structure
Generative AI and Its Applications
Introduction to GenAI : Basic Model
• Transformer Models: Transformers, such as OpenAI's GPT
(Generative Pre-trained Transformer) series, have
revolutionized text generation.

• These models use self-attention mechanisms to process and
generate sequences of text, enabling them to produce
coherent and contextually relevant outputs.

• The original GPT architecture was outlined in the research
paper titled "Improving Language Understanding by
Generative Pre-training," published in 2018 by Alec Radford
and his colleagues at OpenAI.
[Link]

References:
[Link]
GenerativeAI and Its Applications
Introduction to GenAI : Basic Model

• Diffusion models are a class of generative models that learn to
generate data by iteratively denoising random noise
into meaningful outputs, such as images or audio. They work by
reversing a diffusion process: data is gradually corrupted into
noise, and the model learns to reconstruct the original data step by step.

Notable Examples:
•Stable Diffusion: A widely-used diffusion model for text-to-image generation.

•DALL·E 2: Combines diffusion models with transformers for text-to-image tasks.
Generative AI and its Applications

LLM basics and Evolution


GenerativeAI and Its Applications
LLM Basic and Evolution
• Large Language Models (LLMs) stand as the cornerstone of modern
Generative AI.

• A large language model is a deep neural network designed to understand,
generate, and respond to human-like text. They are trained on massive
amounts of text data, sometimes encompassing large portions of the entire
publicly available text on the internet.

• "Large" refers to both the model's size in terms of parameters and
the immense dataset on which it's trained.
References:
Book: Build a Large Language Model (From Scratch)
GenerativeAI and Its Applications
LLM Basic and Evolution
• ChatGPT is an interface built on top of OpenAI's LLMs, such as GPT-4o, o1, and o3.

• LLM interfaces enable natural language
communication between users and AI
systems.

• This screenshot shows ChatGPT
writing a poem according to a
user's specifications.

References:
Book: Build a Large Language Model (From Scratch)
GenerativeAI and Its Applications
LLM Basic and Evolution
A simple timeline of developments that led to LLMs as we see them today:

1997: Introduction of Long Short-Term Memory (LSTM) networks
2010: Stanford's CoreNLP suite for NLP tasks such as sentiment analysis
2017: Introduction of the Transformer architecture (in the popular paper "Attention Is All You Need")
2018: GPT-1 (117M parameters)
2019: GPT-2 (1.5B parameters)
2020: GPT-3 (175B parameters)
2022: ChatGPT released to the general public
2023 to mid-2024: GPT-4 (2023) and then GPT-4o (2024) introduced; the multimodal LLM era begins
Late 2024: OpenAI o1 and o3 announced, and open-source LLMs compete with SOTA models
Generative AI and Its Applications
LLM Basic and Evolution
• Multimodal LLMs
Multimodal Large Language Models represent a significant evolution in AI, extending beyond text to
understand and process multiple forms of input - images, audio, video, and text simultaneously. These
models bring us closer to human-like perception and understanding of the world.
GPT-4 (Multimodal Version) – OpenAI
Flamingo – DeepMind
ImageBind – Meta
PaLM-E – Google
CLIP – OpenAI
Stable Diffusion XL (SDXL) - Stability AI
Picasso - NVIDIA
Generative AI and Its Applications
LLM Basic and Evolution
• Small LLMs
The landscape of AI is witnessing a significant shift towards smaller, more efficient Large
Language Models (LLMs) designed specifically for edge devices.
Examples:
Alpaca – Stanford University
DistilGPT-2 – Hugging Face
OPT (Smaller Models) – Meta
Mistral 7B – Mistral AI
GPT-Neo – EleutherAI
GPT-J – EleutherAI
GPT-2 (Smaller Variants) – OpenAI
LLaMA (Smaller Versions) – Meta
Flan-T5 – Google
LoRA (Low-Rank Adaptation) – a parameter-efficient finetuning technique with various implementations, rather than a model itself

These models are designed to run efficiently on smaller hardware or to be fine-tuned for specific tasks
while maintaining good performance.

References:
Sebastian Raschka blog on multimodal LLMs
GenerativeAI and Its Applications
LLM Basic and Evolution
Local LLMs and Open Source LLMs:
● Local LLMs run on personal hardware without cloud dependency
● Open-source models allow customization and transparency
● Growing ecosystem of community-driven development
Ollama is a popular open-source framework that simplifies running and managing
various LLMs locally on your machine
Some basic examples:
LLMs: GPT family, Gemini family, Claude family, and many more
Multimodal LLMs: LLaVA, Llama 3.2 Vision
Small LLMs: Llama 3.2 1B and 3B, and many more

References:
Sebastian Raschka blog on multimodal LLMs
GenerativeAI and Its Applications
LLM Basic and Evolution
How to build an LLM?

The Two-Stage Process of Building an LLM

The process of creating an LLM involves two main stages: pretraining and finetuning.

1. Pretraining

● Definition:
○ Pretraining is the initial phase where the model is trained on a large, diverse corpus of
text data (referred to as "raw text").
○ The goal is to develop a broad understanding of language by predicting the next word in
a sequence (next-word prediction).
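
The next-word prediction objective described above can be illustrated with a minimal sketch. The function name `next_word_pairs`, the whitespace tokenization, and the `context_size` parameter are assumptions made for illustration; real LLMs work on subword tokens and much longer contexts.

```python
def next_word_pairs(text, context_size=3):
    """Build (context, target) training pairs for next-word prediction."""
    words = text.split()  # naive whitespace tokenization, for illustration only
    pairs = []
    for i in range(1, len(words)):
        # The model sees up to `context_size` previous words ...
        context = words[max(0, i - context_size):i]
        # ... and is trained to predict the word that follows them.
        pairs.append((context, words[i]))
    return pairs


pairs = next_word_pairs("LLMs learn to predict one word at a time")
for context, target in pairs[:3]:
    print(context, "->", target)
```

No labels are written by hand here: the "label" for each context is simply the next word of the raw text, which is why pretraining can use unlabeled corpora.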

References:
Book: Build a Large Language Model (From Scratch)
Generative AI and Its Applications
LLM Basic and Evolution
● Dataset:
○ The dataset used for pretraining is typically massive and diverse, containing billions
of tokens from books, articles, websites, and other text sources.
○ Filtering may be applied to remove irrelevant or low-quality data (e.g., formatting
characters, unknown languages).
● Output:
○ The result of pretraining is a base model or foundation model that has a general
understanding of language.
○ Example: GPT-3, which can perform text completion and has limited few-shot
learning capabilities.

References:
Book: Build a Large Language Model (From Scratch)
GenerativeAI and Its Applications
LLM Basic and Evolution
2. Finetuning

● Definition:
○ Finetuning is the process of refining the pretrained model on a smaller, labeled
dataset that is specific to a particular task or domain.
○ This step adapts the general-purpose model to perform well on specific tasks.
● Types of Finetuning:
○ Instruction-Finetuning:
■ The labeled dataset consists of instruction-answer pairs.
■ Example: Training the model to translate text by providing queries and
their corresponding translations.
○ Classification Finetuning:
■ The labeled dataset consists of text and associated class labels.
■ Example: Training the model to classify emails as spam or non-spam.
● Output:
○ A specialized LLM that is optimized for a specific task or domain.

References:
Book: Build a Large Language Model (From Scratch)
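
To make the two kinds of labeled data concrete, here is a small sketch of what individual records might look like. The field names and the `### Instruction` prompt template are hypothetical choices for illustration, not a format prescribed by the course or by any particular library.

```python
# One instruction-finetuning record: an instruction, an input, and the desired answer.
instruction_example = {
    "instruction": "Translate the following text into German.",
    "input": "Good morning",
    "output": "Guten Morgen",
}

# One classification-finetuning record: a text and its class label.
classification_example = {
    "text": "You won a free prize, click now!",
    "label": "spam",
}


def format_instruction(example):
    """Join an instruction record into one training string (hypothetical template)."""
    return (
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Input:\n{example['input']}\n\n"
        f"### Response:\n{example['output']}"
    )


print(format_instruction(instruction_example))
```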
GenerativeAI and Its Applications
LLM Basic and Evolution

Why Pretraining and Finetuning?

● Pretraining:
○ Provides the model with a
broad understanding of
language, enabling it to
generalize across tasks.
● Finetuning:
○ Adapts the model to specific
tasks or domains, improving
its performance in those
areas.

References:
Book: Build a Large Language Model (From Scratch)
GenerativeAI and Its Applications
LLM Basic and Evolution

LLMs can also be used for advanced search, for example Perplexity.

References:
Check out this video
Generative AI and its Applications

NLP Basics and Word Embeddings


GenerativeAI and Its Applications
Content and Function Words

We may divide parts of speech into two major groups:

Content Words Function Words

•Content words are words that have meaning.
•They are words we would look up in a dictionary,
such as "lamp," "computer," "drove."
•New content words are constantly added to the English
language; old content words constantly leave the language as
they become obsolete.
•Therefore, we refer to content words as an "open" class.

Nouns, verbs, adjectives, and adverbs are content parts of
speech.
GenerativeAI and Its Applications
Content and Function Words

•Function words are words that exist to explain or create
grammatical or structural relationships into which the content
words may fit.

•Words like "of," "the," and "to" have little meaning on
their own.
•They are much fewer in number and generally do not change as
English adds and omits content words.
•Therefore, we refer to function words as a "closed" class.

Pronouns, prepositions, conjunctions, determiners, qualifiers/intensifiers,
and interrogatives are some function parts of speech.
GenerativeAI and Its Applications
Types vs Token

▪ How about this sentence?
▪ They picnicked by the pool, then lay back on the grass
and looked at the stars.
▪ 18 tokens (again counting punctuation)
▪ But we might also note that “the” is used 3 times, so there
are only 16 unique types (as opposed to tokens).
▪ Going forward, we’ll have occasion to focus on
counting both types and tokens of both words and N-grams.
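
The counts above can be checked with a short sketch. The regular expression used here (runs of word characters, or single punctuation marks) is an assumed, simplified tokenizer, and types are counted case-sensitively.

```python
import re

sentence = (
    "They picnicked by the pool, then lay back on the grass "
    "and looked at the stars."
)

# Tokens: every word and every punctuation mark, in order of occurrence.
tokens = re.findall(r"\w+|[^\w\s]", sentence)

# Types: the distinct tokens, each counted once.
types = set(tokens)

print(len(tokens), "tokens;", len(types), "types")
```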
GenerativeAI and Its Applications
Tokenization
• Tokenization is the process of breaking up the sequence of characters in a text by locating the
word boundaries, the points where one word ends and another
begins.

• The words thus identified are frequently referred to as tokens.

• In written languages (like Chinese, Japanese, and Thai) where no word
boundaries are explicitly marked in the writing system, tokenization is
also known as word segmentation, and this term is frequently used
synonymously with tokenization.

• In addition to word segmentation, sentence segmentation is a crucial
first step in text processing.
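
As a rough sketch of these two steps for English text, the following splits a text into sentences and then into tokens. Both rules are deliberately naive assumptions: the sentence splitter breaks after `.`, `!` or `?` followed by whitespace, so abbreviations such as "Dr." would be split incorrectly, and languages without explicit word boundaries need dedicated segmenters.

```python
import re


def split_sentences(text):
    """Naive sentence segmentation: split after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Naive word tokenization: words and punctuation marks become separate tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)


text = "What a great day! We went to the park. Did you see the stars?"
for sentence in split_sentences(text):
    print(tokenize(sentence))
```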
GenerativeAI and Its Applications
Text Normalization

• Once tokenization of the text is done, is it necessary to distinguish great, Great, and GREAT?
(Examples: "What a great day" and "Great work".)

• Sentence-initial capitalization may be irrelevant to the classification
task.

• Also, the complete elimination of case distinctions will result in a
smaller vocabulary, and thus smaller feature vectors.
GenerativeAI and Its Applications
Text Normalization

• Case conversion is a type of text normalization, which refers to
string transformations that bring text into a standard form.

• Other forms of normalization include the standardization of
numbers (e.g., 1,000 to 1000) and dates (e.g., August 11, 2015 to
2015/08/11).
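
A minimal sketch of these three normalizations using only the Python standard library is given below. The function names are illustrative, and the date parser assumes an English locale and exactly the "Month day, year" format shown on the slide.

```python
import re
from datetime import datetime


def normalize_case(text):
    """Case conversion: map great, Great and GREAT to the same string."""
    return text.lower()


def normalize_number(text):
    """Remove thousands separators, e.g. 1,000 -> 1000."""
    return re.sub(r"(?<=\d),(?=\d{3})", "", text)


def normalize_date(text):
    """Convert 'August 11, 2015' to '2015/08/11' (assumes English month names)."""
    return datetime.strptime(text, "%B %d, %Y").strftime("%Y/%m/%d")


print(normalize_case("GREAT work"))
print(normalize_number("1,000"))
print(normalize_date("August 11, 2015"))
```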
GENERATIVE AI AND ITS APPLICATIONS

Types of Ambiguities in Words


GenerativeAI and Its Applications
Ambiguity

• Sarah gave a bath to her dog
wearing a pink t-shirt.
(Is Sarah wearing the pink t-shirt, or is her dog?)

• I have never tasted a cake quite
like that one before!
(Is the cake good or bad?)
GenerativeAI and Its Applications
Types of Ambiguities

Ambiguities
• Lexical
• Syntactic
• Semantic
• Anaphoric
GenerativeAI and Its Applications
Lexical Ambiguity

Lexical Ambiguity:

• Related to words.
• This type of ambiguity arises when a word can have multiple
interpretations.
• For instance, in English, the word “back” can be a noun (back stage),
an adjective (back door), or an adverb (back away).

• Words have multiple meanings.

"I saw a bat."

bat = flying mammal / wooden club / sports equipment?
saw = past tense of "see" / present tense of "saw" (to cut with a saw)
GenerativeAI and Its Applications
Syntactic Ambiguities
Syntactic Ambiguity:

In syntactic ambiguity, a sentence can be parsed into more than one syntactic
structure, and one of these structural representations has to be selected.

E.g.,
“I heard his cell phone ring in my office.”

The prepositional phrase “in my office” can be parsed in a way

• that modifies the noun (in this interpretation, "in my office" describes the
location of the cell phone), or

• that modifies the verb (here, "in my office" describes where
you were when you heard the cell phone ring).
GenerativeAI and Its Applications
Syntactic Ambiguity
• Another Example:
• "Mary ate a salad with spinach from California for lunch
on Tuesday."

• Different meanings
• Mary ate a salad that contained spinach, and the spinach was
sourced from California. She had this salad for lunch on Tuesday.
• Mary ate a salad, and she also had spinach from California as a
side dish. She had this meal for lunch on Tuesday.
GenerativeAI and Its Applications
Syntactic Ambiguity
• Another Example:
• "Mary ate a salad with spinach from California for lunch
on Tuesday."
• "with spinach" can attach to "salad" or "ate",
• "from California" can attach to "spinach", "salad", or "ate",
• "for lunch" can attach to "California", "spinach", "salad", or
"ate",
• and "on Tuesday" can attach to "lunch", "California", "spinach",
"salad" or "ate".
• Not every combination of these attachments is valid, but there are nonetheless 42 possible different parse trees
for this sentence.
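
The figure of 42 can be reproduced with a small counting sketch. It rests on an assumption that the slide does not state explicitly: attachments may not cross, so once a phrase attaches to an earlier head, every head nested more deeply than that one is closed to later phrases. The function name and its parameters are illustrative.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_attachments(pps_left, open_sites):
    """Count non-crossing ways to attach the remaining prepositional phrases.

    `open_sites` is the number of heads a new phrase can still attach to
    (initially 2: the verb "ate" and the noun "salad").
    """
    if pps_left == 0:
        return 1
    total = 0
    # Attaching to the j-th open site (counted from the outermost head) closes
    # all deeper sites; the phrase's own noun then becomes a new open site.
    for j in range(1, open_sites + 1):
        total += count_attachments(pps_left - 1, j + 1)
    return total


# Four prepositional phrases: with spinach, from California, for lunch, on Tuesday.
print(count_attachments(4, 2))
```

Multiplying the listed options directly (2 × 3 × 4 × 5) would give 120 combinations; the non-crossing assumption is what reduces this to 42.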
GenerativeAI and Its Applications
Semantic Ambiguity
Semantic Ambiguity:
Related to the interpretation of a sentence, i.e.,
how you interpret the meaning of the entire sentence.
• E.g.,
• "I heard his cell phone ring in my office" can be interpreted as “I was
physically present in the office” or as “the cell phone was in the
office”.
• "Lucy owns a parrot that is larger than a cat": "a parrot" is existentially quantified, while
"a cat" is either universally quantified or means "typical cats".

Another Example
• "The dog is chasing the cat." vs. "The dog has been domesticated for
10,000 years."
• In the first sentence, "The dog" refers to a particular dog;
• In the second, it refers to the species "dog".
GenerativeAI and Its Applications
Anaphoric Ambiguity

• Anaphoric ambiguity
• A phrase or word refers to something previously mentioned, but
there is more than one possible referent, so a machine cannot easily
resolve it.
• E.g.,
• "Margaret invited Susan for a visit, and she gave her a good lunch."
(she = Margaret; her = Susan)
• "Margaret invited Susan for a visit, but she told her she had to go to
work" (she = Susan; her = Margaret.)
• "On the train to Boston, George chatted with another passenger. The
man turned out to be a professional hockey player."
• (The man = another passenger).
GenerativeAI and Its Applications
Metonymy Ambiguity

Metonymy:

Often considered the most difficult type of ambiguity, metonymy deals with
phrases in which the literal meaning is different from the
figurative meaning.

E.g.,
Samsung is screaming for new management.

Here "screaming" doesn’t literally mean yelling, and "Samsung" stands for the people associated with the company.
The sentence means that Samsung is looking for better management.


GenerativeAI and Its Applications
Examples

• Find the different types of ambiguities in the following
sentence.
“Elsa tried to reach her aunt on the phone, but she didn't
answer”.
1. Lexical ambiguity:
➢ The word "tried" means "attempted", not "held court proceedings" or "tested" (as
in "Elsa tried the lemonade").
➢ The word "reach" means "establish communication", not "physically arrive at" (as
in "the boat reached the shore").
2. Syntactic ambiguity:
➢ The phrase "on the phone" attaches to "reach" and thus means "using the phone";
it does not attach to "aunt", which would mean "her aunt who was physically on top of the
phone" (compare "her aunt in Seattle").
3. Anaphoric ambiguity:
➢ "she" means the aunt, not Elsa.
GenerativeAI and Its Applications
Examples

• Find the different types of ambiguities in the following
sentence.
“John saw the man with the telescope”.
1. Syntactic ambiguity:
• In this case, "with the telescope" can modify either the verb "saw" or the noun "the man," leading
to different interpretations of who has the telescope.
2. Semantic ambiguity:
• John used a telescope to see the man, or
• the man had a telescope.
These different interpretations arise from the relationship between "John," "the
man," and "the telescope," leading to multiple possible meanings of the sentence.
GENERATIVE AI AND ITS APPLICATIONS

Word Embeddings
Generative AI and Its Applications
One-Hot Encodings

• Word vectors are vectors of weights: each word is defined by a vector in some
n-dimensional space.
• In a simple 1-of-N (or one-hot) encoding, every element in the vector
is associated with a word in the vocabulary.
• With this encoding, amongst all the dimensions, only one dimension is 1 and all
other dimensions are 0, and this dimension corresponds to the index of the word.
• Each word in the vocabulary is represented as a sparse vector
with a single "1" corresponding to its index in the vocabulary
and "0s" elsewhere.

Hotel [0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
Motel [0 0 0 0 0 0 0 0 1 0 0 0 0 0 0]

• Sim(hotel, motel) = 0, so computing semantic similarity is difficult.
• To train embeddings, words are often converted to one-hot encodings, which
are then passed through an embedding layer in the neural network to learn
dense vector representations.
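
The zero similarity between "hotel" and "motel" can be seen in a minimal sketch. The five-word vocabulary and the helper names are assumptions chosen for illustration.

```python
vocab = ["a", "hotel", "motel", "room", "stay"]  # toy vocabulary (assumed)


def one_hot(word, vocab):
    """Return a vector with a single 1 at the word's index in the vocabulary."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec


def dot(u, v):
    """Dot product, used here as a simple similarity score."""
    return sum(a * b for a, b in zip(u, v))


hotel = one_hot("hotel", vocab)
motel = one_hot("motel", vocab)
print(hotel, motel, dot(hotel, motel))
```

Any two different words give a dot product of 0, so one-hot vectors say nothing about how related two words are.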
GenerativeAI and Its Applications
Limitations of One-Hot Encodings

• With this type of encoding, no useful comparison can be made
between vectors other than equality testing. This kind of embedding cannot
provide semantic similarity or any other meaningful similarity.
• This motivates the distributional representation:

A word $w_i$ in the corpus is given a distributional representation by an
embedding $w_i \in \mathbb{R}^d$, i.e., a d-dimensional vector, which is mostly learnt.

Motivation=

GenerativeAI and Its Applications
Distributional Representation of words
• Dense vectors are good at capturing synonyms and semantic similarity.

• Word embeddings (word vectors) are d-dimensional, where d may vary
from 50 to 1000. These d dimensions don’t have a very clear interpretation.

• Dense vectors work better for almost all NLP tasks.

• Consider King, Queen, Man, Woman, Child as 5 words in a vocabulary. Queen
can be represented by the one-hot encoding 0 1 0 0 0 and King by 1 0 0 0 0.

• Let’s consider the 5 dimensions as Royalty, Masculine, Feminine, Age, and Height, and define
some words based on them:

            King   Queen  Man
Royalty     0.92   0.95   0.01
Masculine   0.98   0.04   0.98
Feminine    0.02   0.98   0.03
Age         0.5    0.4    0.5
Height      0.6    0.4    0.4

• The dense vectors are also capable of capturing synonymy.
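
Unlike one-hot vectors, these dense vectors can be compared meaningfully. The sketch below computes cosine similarity between the three example words, reading the five numbers for each word in the order Royalty, Masculine, Feminine, Age, Height; that row order is an assumption based on how the values are listed.

```python
import math

# Dense vectors from the example: [Royalty, Masculine, Feminine, Age, Height]
embeddings = {
    "king":  [0.92, 0.98, 0.02, 0.5, 0.6],
    "queen": [0.95, 0.04, 0.98, 0.4, 0.4],
    "man":   [0.01, 0.98, 0.03, 0.5, 0.4],
}


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


for a, b in [("king", "man"), ("king", "queen"), ("queen", "man")]:
    print(a, b, round(cosine(embeddings[a], embeddings[b]), 3))
```

With these toy values, "king" comes out closer to "man" (shared Masculine dimension) and to "queen" (shared Royalty dimension) than "queen" is to "man".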
Generative AI and Its Applications
Input Embedding

• We can think of a word embedding layer as a type of
lookup table that grabs a learned vector representation
of every word.
• Neural networks learn from numbers.
• Therefore, every word is mapped to a continuous-valued
vector.

• What should be the characteristics of a numerical
representation of textual data?
It should have semantic meaning.
It should provide an input representation that is
informative.
It should have considerable impact on overall model
performance.
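
The lookup-table view of an embedding layer can be sketched in a few lines. Everything here is a toy assumption: the four-word vocabulary, the dimension of 4, and the random initial values, which in a real model would be adjusted during training.

```python
import random

vocab = {"the": 0, "dog": 1, "barked": 2, "loudly": 3}  # word -> row index (toy)
embedding_dim = 4

# The embedding table has one row per vocabulary word. Values start out random;
# training would update them so that the rows become meaningful.
rng = random.Random(0)
embedding_table = [
    [rng.uniform(-1, 1) for _ in range(embedding_dim)] for _ in vocab
]


def embed(tokens):
    """Look up the vector (table row) for each token, in sequence order."""
    return [embedding_table[vocab[token]] for token in tokens]


vectors = embed(["the", "dog", "barked"])
print(len(vectors), "vectors of size", len(vectors[0]))
```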
Generative AI and Its Applications
Word Embeddings

• What are Word Embeddings?


Word embeddings are numerical representations of text that convert words into vectors
that machines can understand and process. Think of them as translating human language
into a mathematical language that AI models can work with.

References:
Book: Build a Large Language Model (From Scratch)
Generative AI and Its Applications
Word Embeddings

• Why Do We Need Them?


➢ Neural networks can't process raw text directly (Input is generally
vectors)
➢ Text needs to be converted to continuous-valued vectors (Array
of Numbers)
➢ Enables mathematical operations required for neural networks
➢ Captures semantic relationships between words
➢ A good embedding should preserve the meaning of the token
and adapt well to the context in which it is used

References:
Book: Build a Large Language Model (From Scratch)
GenerativeAI and Its Applications
Word Embeddings

Why do we need Embeddings?

Vector Representation
● Words are represented as dense vectors in a continuous space
(Basically array of numbers)
● Similar words cluster together in this space.
● Easy for neural networks to perform operations.
● Example: "king" - "man" + "woman" ≈ "queen"
● Dimensions (features) can range from 2 to thousands
○ GPT-2 (small): 768 dimensions (in GPT-2, each token (word
or subword) in the input text is represented as a vector of
size 768 in the model's internal representation; this is the
size of the embedding space in the transformer model.)
○ GPT-3 (large): 12,288 dimensions
References:
Book: Build a Large Language Model (From Scratch)
GenerativeAI and Its Applications
Word Embeddings

Why do we need Embeddings?

Contextual Understanding

● Words appearing in similar contexts have similar


meanings
● Embeddings capture semantic relationships
● Example: "bird", "wing", "fly" cluster together
● Different from simple one-hot encoding

References:
Book: Build a Large Language Model (From Scratch)
GenerativeAI and Its Applications
Word Embeddings

• Definition
• Real-valued and sub-symbolic representations of words as dense numeric
vectors.
• Distributed representation of word meanings (not based on counting
word frequencies).
• Usually learned with neural networks.
• Specific dimensions of the resulting vectors cannot be directly mapped to
a symbolic representation.
• Learned by models that predict between a center word and its context words
(predict models).
• Key elements of deep learning models.
GenerativeAI and Its Applications
Neural Word embeddings

Framework for learning word embeddings:

✓Takes words from a very large corpus of text as input (unsupervised)
✓Learns a vector representation for each word to predict the relation between
every word and its context
✓Fully connected feed-forward neural network with one hidden layer
Two main algorithms:
✓Skip-gram: predicts the context, taking the center/target word as input.
Example: For the sentence "The dog barked loudly," given the word "dog," it
tries to predict "The" and "barked."
✓Continuous Bag of Words (CBOW): predicts the center/target/focus word from the
given context (sum of the surrounding words' vectors).
Example: For the same sentence, given the context words "The" and
"barked," it tries to predict the word "dog."

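
The difference between the two algorithms shows up in how training examples are built from a sentence. The sketch below generates them for the example sentence with a window of one word on each side; the function names and the `window` parameter are illustrative assumptions, and no neural network is trained here.

```python
def context_of(tokens, i, window):
    """Words within `window` positions of index i, excluding the word itself."""
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    return [tokens[j] for j in range(lo, hi) if j != i]


def skipgram_pairs(tokens, window=1):
    """Skip-gram: one (center word -> context word) pair per neighbor."""
    return [
        (center, ctx)
        for i, center in enumerate(tokens)
        for ctx in context_of(tokens, i, window)
    ]


def cbow_pairs(tokens, window=1):
    """CBOW: one (list of context words -> center word) example per position."""
    return [(context_of(tokens, i, window), center) for i, center in enumerate(tokens)]


tokens = ["The", "dog", "barked", "loudly"]
print(skipgram_pairs(tokens))
print(cbow_pairs(tokens))
```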
GenerativeAI and Its Applications
Word2vec
• Word2vec is a machine learning technique for generating dense, distributed vector
representations (embeddings) of words, such that words with similar meanings
are positioned close to one another in the vector space. It was introduced by researchers at Google in 2013.

• Word2vec is not a single algorithm but a combination of two techniques:

• CBOW (Continuous Bag of Words) and the Skip-gram model.

• Neither of these is a deep neural network; they are shallow neural networks
that map word(s) to a target variable, which is also a word (or words).

• Both of these techniques learn weights that act as word vector representations.
GenerativeAI and Its Applications
Word2vec
▪ In contrast to language models (which predict the next word),
embedding models consider the history (previous words) and the
future (following words) of a center word.

▪ The number of words considered is called the window size (standard
size = 5).

▪ Importance of Window Size

"Australian scientists discover stars with telescopes."
context window size = 2, center word = "stars"

Note: different meaning of "stars" with and without "telescopes" in the context

Example from: Levy, O., & Goldberg, Y. (2014). Dependency-based word embeddings. In Proceedings of the 52nd
Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) (Vol. 2, pp. 302-308).
GenerativeAI and Its Applications
Word2vec

• Instead of counting how often each word w occurs near "apricot":
• Train a classifier on a binary prediction task:
• Is w likely to show up near "apricot"?
• We don’t actually care about this task
• But we'll take the learned classifier weights as the word
embeddings
• Big idea: self-supervision:
• A word c (context word) that occurs near apricot in the corpus
acts as the "correct answer" to the question "how likely is
word c to show up near apricot?"
• This self-supervision avoids any need for hand-labeled
supervision.
• No need for human labels
❖ Self-supervision is a machine learning paradigm where a model learns from data
that does not require manual labeling.
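
The following sketch shows how labeled examples for such a binary classifier can be produced from raw text without human labels. Words that really occur near the target get label 1; for the label 0 examples it samples random other words from the vocabulary, which is an assumption borrowed from the usual negative-sampling setup and is not spelled out on the slide. The sentence, function name, and parameters are illustrative.

```python
import random


def training_examples(tokens, target, window=2, negatives_per_positive=2, seed=0):
    """Build (target, word, label) triples: 1 = seen near target, 0 = sampled noise."""
    rng = random.Random(seed)
    vocab = sorted(set(tokens))
    examples = []
    for i, word in enumerate(tokens):
        if word != target:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        # Assumed negative sampling: noise words are drawn from outside the context.
        candidates = [w for w in vocab if w != target and w not in context]
        for c in context:
            examples.append((target, c, 1))  # the corpus itself provides the label
            for noise in rng.sample(candidates, negatives_per_positive):
                examples.append((target, noise, 0))
    return examples


tokens = "lemon a tablespoon of apricot jam a pinch of salt".split()
examples = training_examples(tokens, "apricot")
for example in examples:
    print(example)
```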
GenerativeAI and Its Applications
Word2vec vs advanced Neural embeddings

• Despite capturing context during training, the resulting embeddings are
static in Word2vec.
• Each word has a single embedding, regardless of different contexts it might
appear in.
• Advanced Neural Embeddings:

• Models like BERT and GPT generate different embeddings for the same word
depending on its context within a sentence.
• This allows them to handle polysemy (words with multiple meanings) more
effectively.

Generative AI and Its Applications
Parts of Speech Tagging (PoS)
GenerativeAI and Its Applications
POS (parts of speech)

▪ Parts of speech (POS) are useful because they reveal a lot about a
word and its neighbors.
▪ Knowing a word's POS (noun, verb, …) tells us about
▪ likely neighboring words (nouns are often preceded by determiners and adjectives,
verbs by nouns) and
▪ syntactic structure (nouns are generally part of noun phrases), making POS
tagging a key aspect of parsing.
▪ Parts of speech are useful features
▪ for labeling named entities (NER) like people or organizations in information
extraction, or
▪ for coreference resolution (the task of finding all expressions that refer to the
same entity in a text)
▪ for sentiment analysis, question answering, and word sense disambiguation.
GenerativeAI and Its Applications
POS (parts of speech) examples

1) N     noun          chair, bandwidth, pacing
2) V     verb          study, debate, munch
3) ADJ   adjective     purple, tall, ridiculous
4) ADV   adverb        unfortunately, slowly
5) P     preposition   of, by, to
6) PRO   pronoun       I, me, mine
7) DET   determiner    the, a, that, those
8) Conj  conjunction   and, or
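
To show what tagged output looks like, here is a toy tagger that simply looks each word up in a small dictionary built from the examples above. This is only an illustration of the tag set: a lookup table cannot handle words with several possible tags (for example, "study" as a noun), which is why practical taggers use context. The lexicon, the `UNK` tag for unknown words, and the function name are assumptions.

```python
# Toy lexicon built from the example words in the table above.
LEXICON = {
    "chair": "N", "bandwidth": "N", "pacing": "N",
    "study": "V", "debate": "V", "munch": "V",
    "purple": "ADJ", "tall": "ADJ", "ridiculous": "ADJ",
    "unfortunately": "ADV", "slowly": "ADV",
    "of": "P", "by": "P", "to": "P",
    "i": "PRO", "me": "PRO", "mine": "PRO",
    "the": "DET", "a": "DET", "that": "DET", "those": "DET",
    "and": "Conj", "or": "Conj",
}


def tag(tokens):
    """Assign each token the tag stored in the lexicon, or UNK if it is missing."""
    return [(token, LEXICON.get(token.lower(), "UNK")) for token in tokens]


print(tag("I study by the purple chair".split()))
```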

GenerativeAI and Its Applications
POS (parts of speech)

▪ Parts of speech can be divided into two broad categories:
▪ Closed class type
▪ Open class type (verbs, adverbs, nouns, and adjectives)
▪ Closed classes are those with relatively fixed membership: new words are rarely added.
Examples include prepositions (new prepositions are rarely coined) and auxiliary verbs.
▪ Closed class words are generally function words like of, it, and, or you, which
occur frequently.
▪ Nouns and verbs are open classes: new nouns and verbs like ‘iPhone’ or
‘to fax’ are continually being created or borrowed.
▪ E.g., doomscroll, ChatGPT, zooming, meme, selfie; the list is endless.
GenerativeAI and Its Applications
POS (parts of speech)

▪ The important closed classes in English include:


▪ Prepositions: on, under, over, near, by, at, from, to, with
▪ Particles: up, down, on, off, in, out, at, by
▪ Determiners: a, an, the, this/that, these/those, its, our, their
▪ Conjunctions: and, but, or, as, if, when
▪ Pronouns: she, who, I, others
▪ Auxiliary verbs: can, may, should, is, are, do, have, ...
▪ Numerals: one, two, three, first, second, third, …

POS (parts of speech)

Open class (lexical) words
▪ Nouns
  ▪ Proper: IBM, Italy
  ▪ Common: cat / cats, snow
▪ Verbs
  ▪ Main: see, registered
▪ Adjectives: old, older, oldest
▪ Adverbs: slowly
▪ Numbers: 122,312, one
▪ … more

Closed class (functional) words
▪ Modals: can, had
▪ Determiners: the, some
▪ Prepositions: to, with
▪ Conjunctions: and, or
▪ Particles: off, up
▪ Pronouns: he, its
▪ Interjections: Ow, Eh
▪ … more

Different Corpora and Tree Banks

• A corpus is a large, structured set of machine-readable texts that have been
produced in a natural communicative setting.
• Its plural is corpora.
• Corpora can be derived in different ways: from text that was originally electronic,
from transcripts of spoken language, from optical character recognition, etc.
• Corpora are mainly used for statistical linguistic analysis and hypothesis
testing.
• E.g., the Brown Corpus, also known as the Brown University Standard
Corpus of Present-Day American English, is a collection of text samples from a wide
range of sources, compiled in the 1960s. It was one of the first major text corpora
created for linguistic research and has been widely used in the field of computational
linguistics.
• British National Corpus (100 million words representing British English), etc.
Different Corpora and Tree Banks

The Brown Corpus is a million-word collection of samples from 500
written texts from different genres (newspapers, novels,
non-fiction, academic writing, etc.).

Different Corpora and Tree Banks
Treebanks
• A treebank is a linguistic resource that consists of a large collection of sentences
annotated with syntactic or semantic structure, i.e. a linguistically parsed text corpus.

• These annotations typically represent the grammatical structure of sentences in the
form of parse trees, which show the hierarchical relationships between words and
phrases.

• Geoffrey Leech coined the term ‘treebank’, reflecting that the most common way
of representing the grammatical analysis is by means of a tree structure.

• Treebanks are generally built on top of a corpus that has already been
annotated with part-of-speech tags.
POS (parts of speech)
▪ What are POS tags?
POS tags are also known as word classes, morphological classes, or lexical tags.
▪ The number of tags differs across systems, corpora, and languages:
▪ Penn Treebank (Wall Street Journal newswire): 45 tags
▪ Brown corpus (mixed genres such as fiction, biographies, etc.): 87 tags
▪ Lancaster UCREL C5: 61 tags
▪ Lancaster C7: 145 tags

▪ POS tags can be of varying granularity.

▪ Morphologically complex languages have very different tagsets from English or Dutch.
(Morphologically complex languages are those that have a rich system of word formation, often
involving extensive use of prefixes, suffixes, infixes, and other morphological processes to convey
grammatical relationships and meanings. Examples of such languages include Finnish, Turkish, and
Arabic.)
POS tagging

▪ Part-of-speech tagging is the process of assigning a part-of-speech
marker to each word in an input text.

▪ The input to a tagging algorithm is a sequence of (tokenized) words
and a tagset, and the output is a sequence of tags, one per token.

POS tagging

▪ Tagging is a disambiguation task: words are ambiguous (they have more
than one possible POS), and the goal is to find the correct tag for the
situation.
▪ For example,
▪ book can be a verb (book that flight) or a noun (give me that book).
▪ that can be a determiner (Does that flight serve dinner?) or a
complementizer (I thought that your flight was earlier).
▪ The goal of POS tagging is to resolve these ambiguities, choosing the
proper tag for the context.

Why is POS tagging hard?

• Ambiguity
1. “Plants/N need light and water.”
“Each one plant/V one.”
2. “Flies like a flower”
▪ Flies: noun or verb?
▪ like: preposition, adverb, conjunction, noun, or verb?
▪ a: article, noun, or preposition?
▪ flower: noun or verb?

Why is POS tagging hard?

• Words often have more than one POS


• For example, the word back in following sentences:
1. The back door = JJ(Adjective)
2. On my back = NN(Noun singular)
3. Win the voters back = RB(Adverb)
4. Promised to back the bill = VB(Verb base form)
• The POS tagging problem is to determine the POS tag for a
particular instance of a word.

3 approaches for POS tagging

1. Rule-based tagging
• The ENGTWOL tagger (Voutilainen, 1995) is a rule-based tagger
based on a two-stage architecture.

2. Stochastic (Probability/Frequency) tagging


• Most frequent tag algorithm
• HMM (Hidden Markov Model) tagging

3. Transformation-based tagging
• Brill tagger

Rule-based tagging (two-stage architecture)
▪ Stage 1:
▪ Start with a dictionary that lists the possible tags for each word.
▪ Assign all possible tags to each word from the dictionary, exploiting
morphological/orthographic rules.
▪ Stage 2:
▪ Write rules by hand to selectively remove tags.
▪ Disambiguation is done by analyzing the linguistic features of the word,
its preceding word, its following word and other aspects.
▪ For example, if the preceding word is an article then the word in
question must be a noun. This information is coded in the form of rules.
▪ This leaves the correct tag for each word.
Example of a rule: NP → Det (Adj*) N
For example: the clever student
Rule-based tagging (two stage - architecture)
Example 1:
Rule based: start with a dictionary
• she: PRP (personal pronoun)
• promised: VBN (past participle), VBD (past tense verb)
• to: TO
• back: VB, JJ, RB (adverb), NN
• the: DT
• bill: NN, VB

Rule-based tagging (two stage - architecture)
Example 1: Write rules to eliminate tags

R1: Eliminate VBN if VBD is an option when VBN|VBD follows “<start> PRP”

Word:          She   promised   to   back             the   bill
After R1:      PRP   VBD        TO   VB, JJ, RB, NN   DT    NN, VB
Correct tags:  PRP   VBD        TO   VB               DT    NN
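The two stages above can be sketched in a few lines of Python. This is a minimal illustration, not the ENGTWOL implementation: the `LEXICON` dictionary and the function names are assumptions made up for this example, and only rule R1 is coded.

```python
# Stage 1: a hypothetical mini-lexicon listing every possible tag per word.
LEXICON = {
    "she": ["PRP"],
    "promised": ["VBN", "VBD"],
    "to": ["TO"],
    "back": ["VB", "JJ", "RB", "NN"],
    "the": ["DT"],
    "bill": ["NN", "VB"],
}


def assign_candidates(tokens):
    """Stage 1: look up all possible tags for each token."""
    return [list(LEXICON[token.lower()]) for token in tokens]


def apply_r1(candidates):
    """Stage 2, rule R1: drop VBN when VBD is also possible and the word
    directly follows a sentence-initial PRP."""
    result = [list(tags) for tags in candidates]
    if len(result) >= 2 and result[0] == ["PRP"]:
        tags = result[1]
        if "VBN" in tags and "VBD" in tags:
            tags.remove("VBN")
    return result


tokens = "She promised to back the bill".split()
candidates = assign_candidates(tokens)
after_r1 = apply_r1(candidates)
for token, tags in zip(tokens, after_r1):
    print(f"{token:10s} {tags}")
```

After R1, "promised" keeps only VBD, while "back" and "bill" are still ambiguous and would need further hand-written rules.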

2. Stochastic tagging
• Word Frequency Approach: the tag encountered most frequently with the word in the
training set is the one assigned to an instance of that word.
• Take a tagged corpus.
• Create a dictionary with each possible tag for a word.
• Count the number of times each tag occurs for that word.
• Given a new sentence, for each word pick the most frequent tag for that word
using the counts above.
• The main issue with this approach is that it may yield an inadmissible
sequence of tags.
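The word frequency approach can be sketched as below. The tiny tagged corpus is hand-made for illustration (it is not Brown corpus data), and the function names and the default tag for unknown words are assumptions.

```python
from collections import Counter, defaultdict

# Tiny hand-made tagged corpus (illustrative only, not real corpus data).
TAGGED_CORPUS = [
    [("the", "DT"), ("race", "NN"), ("is", "VBZ"), ("over", "RB")],
    [("they", "PRP"), ("race", "VB"), ("cars", "NNS")],
    [("a", "DT"), ("race", "NN"), ("for", "IN"), ("space", "NN")],
    [("she", "PRP"), ("wants", "VBZ"), ("to", "TO"), ("win", "VB")],
]


def train_most_frequent_tag(corpus):
    """Count tags per word and keep the most frequent tag for each word."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tag_counts.most_common(1)[0][0]
            for word, tag_counts in counts.items()}


def tag_sentence(tokens, model, default_tag="NN"):
    """Tag each token with its most frequent tag; unknown words get default_tag."""
    return [(token, model.get(token.lower(), default_tag)) for token in tokens]


model = train_most_frequent_tag(TAGGED_CORPUS)
tagged = tag_sentence("she wants to race".split(), model)
print(tagged)
```

Here "race" is tagged NN even after "to", because each word is tagged in isolation. This is the inadmissible tag sequence problem mentioned above.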
2. Stochastic tagging
• Tag Sequence Probabilities – N-gram approach
• Probabilities: Tagging with lexical frequencies
Sami is expected to race tomorrow
❖ Sami/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NN
❖ People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN
the/DT race/NN for/IN outer/JJ space/NN
• Problem: Assign a tag to race given its lexical frequency
• Solution: We choose the tag that has the greater likelihood
P(VB|race) = Count(race tagged VB) / Total Count(race)
P(NN|race) = Count(race tagged NN) / Total Count(race)
• In the Brown corpus, P(NN|race) = 96/98 ≈ 0.98, which is higher than
P(VB|race) = 2/98 ≈ 0.02, so lexical frequency alone tags race as NN.

Stochastic tagging

• Properties of Stochastic POS Tagging


1. This POS tagging is based on the probability of a tag occurring.
2. It requires a training corpus.
3. There is no probability for words that do not occur
in the training corpus.
4. It is evaluated on a testing corpus that is different from the training corpus.
5. In its simplest form, it chooses the most frequent
tag associated with a word in the training corpus.

3. Transformation-based tagging

▪ It is also called Brill tagging.
▪ Brill tagging is an instance of transformation-based learning
(TBL).
▪ It draws inspiration from both rule-based and stochastic
taggers, so it may be called hybrid tagging.

▪ TBL keeps linguistic knowledge in a readable form: it
transforms one state into another by using transformation rules.

Transformation-based tagging

• Basic idea:
• First tag by frequency, then revise the tags using contextual rules.
• The Brill tagger was invented and described by Eric Brill in his 1993 PhD thesis.
It can be summarized as an "error-driven transformation-based tagger".

• The Brill tagger is:
▪ a form of supervised learning, which aims to minimize error, and
▪ a transformation-based process, in the sense that a tag is assigned
to each word and changed using a set of predefined rules.

Transformation-based tagging

• In the transformation process,
• if the word is known, it first assigns the most frequent tag
(may be noun or determiner) or
• if the word is unknown, it naively assigns the tag "noun" to it.
• Applying the contextual rules over and over (as in rule-based
tagging), changing the incorrect tags, achieves quite high
accuracy.

Transformation-based tagging
• Example:
• It is expected to race tomorrow. The race for outer space.
• Tagging algorithm:
1. Tag all uses of “race” as NN (most likely tag in the Brown
corpus having approx. one million words.)
• It is expected to race/NN tomorrow
• the race/NN for outer space
2. Use a transformation rule to replace the tag NN with VB for
all uses of “race” preceded by the tag TO:
• It is expected to race/VB tomorrow
• the race/NN for outer space
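The two steps of this example can be sketched as follows. This only applies one given transformation; it does not learn rules from a corpus as the full Brill tagger does. The `MOST_FREQUENT_TAG` table and the function names are assumptions made up for this illustration.

```python
# Hypothetical most-frequent-tag table for the words in the example.
MOST_FREQUENT_TAG = {
    "it": "PRP", "is": "VBZ", "expected": "VBN", "to": "TO",
    "race": "NN", "tomorrow": "NN", "the": "DT", "for": "IN",
    "outer": "JJ", "space": "NN",
}


def initial_tagging(tokens, default_tag="NN"):
    """Step 1: assign each word its most frequent tag (NN if unknown)."""
    return [MOST_FREQUENT_TAG.get(token.lower(), default_tag) for token in tokens]


def apply_transformation(tags, from_tag, to_tag, prev_tag):
    """Step 2: change from_tag to to_tag wherever the preceding tag is prev_tag."""
    new_tags = list(tags)
    for i in range(1, len(new_tags)):
        if new_tags[i] == from_tag and new_tags[i - 1] == prev_tag:
            new_tags[i] = to_tag
    return new_tags


sentence_1 = "It is expected to race tomorrow".split()
sentence_2 = "the race for outer space".split()

tags_1 = apply_transformation(initial_tagging(sentence_1), "NN", "VB", "TO")
tags_2 = apply_transformation(initial_tagging(sentence_2), "NN", "VB", "TO")
print(list(zip(sentence_1, tags_1)))
print(list(zip(sentence_2, tags_2)))
```

Only the "race" that follows TO is retagged as VB; the "race" after "the" keeps its NN tag.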


Identification of POS Tag using HMM

[Figures: POS Tag using HMM]
Named Entity Recognition(NER)
Information Extraction (IE)

1. Information extraction is an area of NLP that deals with finding
factual information in free text.

2. In formal terms, facts are structured objects, such as database
records. Such a record may capture a real-world entity with its
attributes, or an event with its arguments or actors: who did what to
whom, where and when.

3. Information is typically sought in a particular target setting,
e.g. corporate mergers and acquisitions.

An Example Information Extraction

• Three bombs have exploded in north-eastern Nigeria, killing 25
people and wounding 12 in an attack carried out by a terrorist group.
Authorities said the bombs exploded on Sunday afternoon in the city
of Maiduguri.

Information extracted:
• TYPE: Crisis; SUBTYPE: Bombing; LOCATION: Maiduguri
• DEAD-COUNT: 25; INJURED-COUNT: 12
• PERPETRATOR: Terrorist group; WEAPONS: Bomb
• TIME: Sunday afternoon

Applications of IE

1. Provides dramatic improvements in the conversion of raw textual
information into structured data, and is increasingly being deployed in
commercial applications.

2. Can constitute a core component of many other NLP applications, e.g.
Machine Translation, Question Answering, Text Summarization, Opinion
Mining, etc.

Classic Tasks of IE

➢NER

➢Co-reference Resolution

➢Relation Extraction

➢Event Extraction

Classic IE Tasks : Named Entity Recognition

1. Addresses the problem of the identification (detection) and
classification of predefined types of named entities, such as
organizations (e.g., ‘World Health Organisation’), persons (e.g.,
‘Muammar Kaddafi’), place names (e.g., ‘the Baltic Sea’), temporal
expressions (e.g., ‘1 September 2011’), numerical and currency
expressions (e.g., ‘20 Million Euros’), etc.

2. The NER task can additionally include extracting descriptive information
from the text about the detected entities by filling a small-scale
template.
For example, in the case of persons, it may include extracting the
title, position, nationality, sex, and other attributes of the
person.

Classic IE Tasks: Co-reference Resolution

1. It requires the identification of multiple (co-referring) mentions of the
same entity in the text.
2. An entity mention can be:
• Named, in case an entity is referred to by name; e.g., ‘General
Electric’ and ‘GE’.

• Pronominal, in case an entity is referred to with a pronoun; e.g., in
‘John bought food. But he forgot to buy drinks.’, the pronoun he
refers to John. Here John is the named mention and he is its pronominal mention.

• Nominal, in case an entity is referred to with a nominal phrase; e.g.,
in ‘Microsoft revealed its earnings. The company also unveiled future
plans.’ the definite noun phrase The company is a nominal phrase
that refers to Microsoft.

• Implicit, as in the case of zero-anaphora; e.g., in ‘The Prime Minister has
visited the place of disaster. [He] flew over with a helicopter.’, the omitted
subject [He] refers to the Prime Minister.
Classic IE Tasks: Relation Extraction

• Relation Extraction is the task of detecting and classifying
predefined relationships between entities identified in text.

Example: EmployeeOf(Steve Jobs, Apple): a relation
between a person and an organisation, extracted from
‘Steve Jobs works for Apple’

LocatedIn(Smith, New York): a relation between a person
and a location, extracted from ‘Mr. Smith gave a talk at the
conference in New York’

Classic IE Tasks : Event Extraction

1. Event Extraction (EE) refers to the task of identifying events in free
text and deriving detailed and structured information about
them, ideally identifying who did what to whom, when,
where, through what methods (instruments) and why.
2. Usually, event extraction involves extraction of several
entities and relationships between them.
3. Example: The extraction of information on new joint
ventures, where the aim is to identify the partners, products,
profits and capitalization of the joint venture.

EE is considered to be the hardest of the four IE tasks.

Applications?
1. An understanding of the named entities involved in a document provides
much richer analytical frameworks and cross-referencing.
2. NER is extensively used in question answering systems, document clustering and text
analytics applications.
3. In sentiment analysis / opinion mining, one might want to know a
consumer’s sentiment/opinion toward a particular entity.
4. Named entity tagging is also central to Natural Language Understanding
tasks of building semantic representations, like extracting events and the
relationship between participants.
5. Automation of customer support: automatically tagged locations and
product names can help smoothly route customer queries to the right location
and people in a company with multiple branches and many employees.

What is Named Entity and Named Entity Recognition
• A named entity is anything that can be referred to with a
proper name: a person, a location, an organization.

• The task of named entity recognition (NER) is to find spans
of text that constitute proper names and tag the type of
the entity.

• Four entity tags are most common:
• PER (person),
• LOC (location),
• ORG (organization), or
• GPE (geo-political entity).
NER Example

Citing high fuel prices, [ORG United Airlines] said [TIME Friday] it has increased fares by [MONEY
$6] per round trip on flights to some cities also served by lower-cost carriers. [ORG American
Airlines], a unit of [ORG AMR Corp.], immediately matched the move, spokesman [PER Tim
Wagner] said. [ORG United], a unit of [ORG UAL Corp.], said the increase took effect [TIME
Thursday] and applies to most routes where it competes against discount carriers, such as
[LOC Chicago] to [LOC Dallas] and [LOC Denver] to [LOC San Francisco].

A list of generic named entity types with the kinds of entities


they refer to.

NER

• In POS tagging, there is no segmentation problem since each
word gets one tag.

• But for NER one has to find and label spans of text, and it is difficult
because of the ambiguity of segmentation; we need to decide
what’s an entity and what isn’t, and where the boundaries are.

Not a Named Entity vs. a Named Entity

• Hotel & Taj Hotel

• Flower & Rose Flower

• Beach & Kovalam Beach

• Airport & Indira Gandhi International airport

• The School & Good Shepherd School

• Prime Minister & Mr. Narendra Modi

Some problems in identifying NE
• Variation of NEs (the same entity in different forms).
• Mahatma Gandhi, Gandhi, Bapu
• Ambiguity of NE types:
• 1945 (date vs. time)
• Washington (location vs. person)
• May (person vs. month)
• Tata (person vs. organization)
• Person vs. Location
• Sir C. P. Ramaswamy was the Divan of Travancore (Per)
• Sir C. P. Ramaswamy Road is in Chennai (Loc)
• Person vs. Organization
• Anil Ambani opened Reliance Fresh (Per)
• Reliance Fresh is under Anil Ambani Group Ltd (Org)
Tagset for Named Entity

• ACE tagset
• ACE: Automatic Content Extraction
• Hierarchical
• Better known and most commonly used

• CLIA tagset
• Hierarchical, similar to ACE
• Developed for two domains, i.e. Tourism and Health

Named entity types

The named entity hierarchy is divided into three major
entity classes: name, time and numerical expressions.

NE types:
• ENAMEX: entity names (person, location, organization)
• NUMEX: numerical expressions (number type)
• TIMEX: time expressions

How to Annotate

• [Link]
• 1.1 Person
• 1.1.1 Individual
• These refer to the names of individual persons.
• Tag Structure:
<ENAMEX TYPE="PERSON" SUBTYPE_1="INDIVIDUAL"> abc </ENAMEX>
Examples:
English:
<ENAMEX TYPE="PERSON" SUBTYPE_1="INDIVIDUAL">Abdul
Kalam</ENAMEX>

Annotation continued

Family Name
• In general, a person name includes a family name.
Whenever an individual name occurs with a family name, the
part of the name that refers to the family name must be tagged
specifically with the subtag “FAMILYNAME” as shown below.
Tag Structure:
<ENAMEX TYPE="PERSON" SUBTYPE_1="INDIVIDUAL">abc
<ENAMEX TYPE="PERSON" SUBTYPE_2="FAMILYNAME">abc
</ENAMEX></ENAMEX>
Examples:
English:
<ENAMEX TYPE="PERSON" SUBTYPE_1="INDIVIDUAL">Lalu
Prasad <ENAMEX TYPE="PERSON"
SUBTYPE_2="FAMILYNAME">Yadav</ENAMEX></ENAMEX>


ACE Tagset Continued
• NUMEX
  • Distance
  • Money
  • Quantity
  • Count
• TIMEX
  • Time
  • Date
  • Day
  • Period

Tagset Counts
• First-level tags: 3
• Second-level tags: 43
• Third-level tags: 40
• Total: 86

TAGSET
• ENAMEX
  • Person: Individual, Family name, Title, Group
  • Organization: Government, Public/private company, Religious, Non-government,
    Political Party, Para military, Charitable, Association,
    GPE (Geo-political Social Entity), Media
  • Location: Place, District, City, State, Nation, Continent, Address,
    Water-bodies, Landscapes, Celestial Bodies
  • Manmade: Religious Places, Roads/Highways, Museum,
    Theme parks/Parks/Gardens, Monuments
  • Facilities: Hospitals, Institutes, Library, Hotel/Restaurants/Lodges,
    Plant/Factories, Police Station/Fire Services, Public Comfort Stations,
    Airports, Ports, Bus-Stations
  • Locomotives
  • Artifacts: Implements, Ammunition, Paintings, Sculptures, Clothes, Gems & Stones
  • Entertainment: Dance, Music, Drama/Cinema, Sports, Events/Exhibitions/Conferences
  • Cuisines
  • Animals
  • Plants
Entity name types (ENAMEX Subtypes)
1. Persons are entities limited to humans. Individual refers to the name of
an individual person. Group refers to a set of individuals.
2. Location entities are limited to geographical entities such as
names of countries, cities, continents and
landmasses, bodies of water, and geological formations.
3. Organization entities are limited to corporations, agencies, and other
groups of people defined by an established organizational structure.
4. Entertainment entities denote activities that divert and hold
human attention or interest, giving pleasure, happiness or amusement,
especially performances of some kind such as dance, music, sports and
events.
En: [Robin]PERSON is working at [HCL]ORGANIZATION, which is in
[Chennai]LOCATION
En: [Flower Exhibition]ENTERTAINMENT is held at [Hyderabad]LOCATION
Entity name types (ENAMEX Subtypes)

5. Facilities are limited to buildings and other permanent man-made
structures and real estate improvements like hospitals,
airports, colleges, libraries, etc.
6. A locomotive entity is a physical device primarily designed to
move an object from one location to another, by carrying,
pulling, or pushing the transported object.
7. Artifact entities are objects or things produced or shaped by
human craft, such as tools, weapons/ammunition, art
paintings, clothes, ornaments, medicines.

En: [Apollo Hospital]FACILITY is in [Chennai]LOCATION
En: [Bangalore Express]LOCOMOTIVE departs from [Chennai]LOCATION at
[7.30pm]TIME.
En: [Vinayaga Statue]ARTIFACT is looking beautiful.

Entity name types

8. Materials refer to the names of food items, cuisines,
chemicals and cosmetics.
9. Organisms are the names of different animal
species including birds and reptiles, viruses, bacteria, and
names of herbs, medicinal plants, shrubs, trees, fruits,
flowers, etc.
10. Diseases are names of diseases, symptoms,
diagnoses and treatments.

Numerical expressions

➢ Distance refers to distance measures such as kilometers,
centimeters, meters, acres, feet, etc.
Example: 10 cm, twenty feet, 15 hectares
➢ Money specifies different currency values such as rupee, euro,
dinar, dollar, etc.
Example: Rs. 1000, 250 Euro, $160
➢ Count denotes the number (or count) of items/articles/things, etc.
Example: 5 subjects, 12 students, 20 books
➢ Quantity: measurements like liters, tons, grams, volts, etc. come
under this category.
Example: 20 litres, 22 kg, 50 g, 100 volts
Temporal Expressions

➢ Temporal expressions are entities that refer to time, date, year, month and day.
➢ Time: expressions of time in their different forms,
including hours, minutes and seconds.
  ➢ 5 o’clock in the morning
  ➢ 9.30 a.m.
  ➢ Evening 6.30 p.m.
➢ Date: expressions of date, such as 13/12/2001, in
different forms, including month, date and year.
  ➢ August 15 1947
  ➢ 1956
  ➢ September 11
Temporal Expressions

Period: refers to expressions that express a duration of time,
time periods or time intervals.
Example
− 17th century
− 10 minutes
− 10 a.m. to 12 p.m.
− One year

Nested or embedded entity

Nested Entities: named entities that occur within
other named entities. Also called embedded entities.

En: Madurai Meenakshi Temple
En: Lalu Prasad Yadav
En: Nitoor Srinivasa Rao
En: Rajeev Gandhi Salai

Approaches for NER

1) Dictionary and Rule Based
2) Machine Learning
   • Hidden Markov Model (HMM)
   • Naïve Bayes Classifier
   • Maximum Entropy Markov Model (MEMM)
   • Conditional Random Fields (CRF)
3) Hybrid Approach (combines both rule and ML approaches)

Dictionary (Gazetteers) Look-up Approach

• Uses dictionaries (gazetteers) for identifying NEs
• Ideally, the gazetteer contains all named entities (NEs) from all domains.
• Advantages
• Very simple approach
• Gives very high precision
• Disadvantages
• Preparing an exhaustive dictionary is a tedious and
expensive process.
• The dictionary should cover the different spellings of the
same place.
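A gazetteer look-up can be sketched as a longest-match dictionary search. The `GAZETTEER` entries and the function name below are made up for illustration; a real gazetteer would be far larger.

```python
# Hypothetical mini-gazetteer: lowercased token tuples mapped to entity types.
GAZETTEER = {
    ("new", "york"): "LOC",
    ("chennai",): "LOC",
    ("madras",): "LOC",  # alternate names must be listed explicitly
    ("ibm",): "ORG",
}
MAX_LEN = max(len(key) for key in GAZETTEER)


def gazetteer_ner(tokens):
    """Scan left to right, preferring the longest gazetteer match at each position."""
    entities = []
    i = 0
    while i < len(tokens):
        match = None
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(token.lower() for token in tokens[i:i + n])
            if key in GAZETTEER:
                match = (" ".join(tokens[i:i + n]), GAZETTEER[key])
                i += n
                break
        if match:
            entities.append(match)
        else:
            i += 1
    return entities


found = gazetteer_ner("Sam joined IBM in New York".split())
missed = gazetteer_ner("He moved to Bombay".split())
print(found)
print(missed)
```

"Bombay" is missed because it is not listed, which illustrates why an exhaustive dictionary with all spellings is needed.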

Rule Based NER
• A rule-based system consists of:
• A collection of rules
• A set of policies to control the firing of multiple rules (i.e. which rule
gets triggered at a particular time)
• Create regular expressions to extract:
• Telephone numbers
• E-mail addresses
• Capitalized names, etc.
• Example: blocks of digits separated by hyphens
RegEx = (\d+\-)+\d+
• matches valid phone numbers like 900-865-1125 and 725-1234
• fails to identify numbers like 800.865.1125 and (800)865-CARE
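The behaviour of this regular expression can be checked with Python's `re` module. The variable names are illustrative, and `fullmatch` is used here so that the whole string must fit the pattern.

```python
import re

# The slide's pattern: one or more "digits-hyphen" blocks followed by digits.
PHONE_RE = re.compile(r"(\d+\-)+\d+")

samples = ["900-865-1125", "725-1234", "800.865.1125", "(800)865-CARE"]
results = {sample: bool(PHONE_RE.fullmatch(sample)) for sample in samples}

for sample, matched in results.items():
    print(f"{sample:15s} {'match' if matched else 'no match'}")
```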
Rule Based NER
• Rules to extract locations
• Capitalized word + {city, center, river} indicates location
Ex. New York city
Hudson river

• Capitalized word + {street, boulevard, avenue} indicates location


Ex. Fifth avenue
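The two location rules can be sketched as one regular expression: one or more capitalized words followed by a trigger word. The keyword list comes from the rules above; the pattern itself, the variable names and the sample sentence are assumptions for illustration.

```python
import re

KEYWORDS = ["city", "center", "river", "street", "boulevard", "avenue"]
# Accept each keyword with a lowercase or uppercase first letter, e.g. [Cc]ity.
keyword_alt = "|".join(f"[{k[0].upper()}{k[0]}]{k[1:]}" for k in KEYWORDS)
LOCATION_RE = re.compile(rf"\b(?:[A-Z][a-z]+ )+(?:{keyword_alt})\b")

text = "She walked from Fifth avenue to the Hudson river in New York city."
locations = LOCATION_RE.findall(text)
print(locations)
```

Such a rule is easy to fool: a sentence starting with "The city" would also match, which is one reason rule sets need policies and hand tuning.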

Rule Based NER

• Use context patterns


• [PERSON] earned [MONEY]
Ex. Frank earned $20
• [PERSON] joined [ORGANIZATION]
Ex. Sam joined IBM
• [PERSON],[JOBTITLE]
Ex. Mary, the teacher
• [PERSON|ORGANIZATION] fly to [LOCATION|PERSON|EVENT]
Ex. Jerry flew to Japan
Sarah flies to the party
Delta flies to Europe

Rule Based Approach pros and cons

• Advantages:
• Rich and expressive rules
• Good results
• Disadvantages:
• Requires huge experience and grammatical knowledge
• Experts to craft rules are expensive
• Highly domain specific ( not portable to a new domain)

Results for NE Detection (Evaluation of NER)
Named entity recognizers are evaluated by recall, precision, and F1 measure.

Recall is the ratio of the number of correctly labeled responses to the total
that should have been labeled;
precision is the ratio of the number of correctly labeled responses to the
total labeled;
and the F1 measure is the harmonic mean of the two.
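These three measures can be computed directly from sets of labeled spans. In this sketch an entity is a hypothetical `(start, end, type)` tuple and a prediction counts as correct only on an exact match; the gold and predicted sets are made-up data.

```python
def precision_recall_f1(gold, predicted):
    """Compute precision, recall and F1 over exact-match entity spans."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up example: 4 gold entities, 3 predictions, 2 of them exactly right.
gold = {(0, 2, "ORG"), (3, 4, "TIME"), (10, 12, "PER"), (15, 16, "LOC")}
predicted = {(0, 2, "ORG"), (3, 4, "TIME"), (10, 11, "PER")}

precision, recall, f1 = precision_recall_f1(gold, predicted)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

The third prediction has the right type but the wrong boundary, so it counts against both precision and recall. This reflects the segmentation difficulty of NER.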

Typical NER systems
1. The typical architecture for an information extraction system
begins by
segmenting,
tokenizing, and
part-of-speech tagging the text.

2. The resulting data is then searched for specific types of entity.

3. Finally, the information extraction system tries to determine
whether specific relationships hold between those
entities.

NLTK, Stanford NER or spaCy can be used for the Named
Entity Recognition task.
Text Classification
Naïve Bayes and Its Application
Naive Bayes Classifier

• Highly practical Bayesian learning method
• In some domains, its performance is comparable to neural network and decision tree learning.
• The Naive Bayes classifier applies to learning tasks where each instance x is described by a
conjunction of attribute values and where the target function f(x) can take any value from a
finite set V.
• A set of training examples of the target function is provided, and a new instance is
presented as a tuple of attribute values (a1, a2, a3, a4, ..., an).
• The learner/classifier is asked to predict the target value, or classification, of the new
instance.
• In this section, we look at the Bayesian approach to classifying a new instance Xnew.
Naive Bayes Classifier

• The Bayesian approach to classifying the new instance is to assign the most
probable target value, $V_{MAP}$, given the attribute values $(a_1, \dots, a_n)$ that describe the
instance:

$$V_{MAP} = \arg\max_{V_j \in V} P(V_j \mid a_1, a_2, \dots, a_n)$$
Naive Bayes Classifier

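• Apply Bayes’ theorem:

$$V_{MAP} = \arg\max_{V_j \in V} \frac{P(a_1, a_2, \dots, a_n \mid V_j)\, P(V_j)}{P(a_1, a_2, \dots, a_n)}$$

• The denominator does not depend on $V_j$, so using this:

$$V_{MAP} = \arg\max_{V_j \in V} P(a_1, a_2, \dots, a_n \mid V_j)\, P(V_j)$$

• It is easy to estimate $P(V_j)$: count the frequency with which each target value $V_j$
occurs in the training data.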


Naive Bayes Classifier

$$V_{MAP} = \arg\max_{V_j \in V} P(a_1, a_2, \dots, a_n \mid V_j)\, P(V_j) \qquad (1)$$

• The Naive Bayes classifier is based on the simplifying assumption that the attribute values
(i.e. a1, a2, a3, …, an) are conditionally independent given the target value.

• I.e. the assumption is that, given the target value of the instance, the probability of
observing the conjunction a1, a2, a3, …, an is just the product of the probabilities for
the individual attributes:

$$P(a_1, a_2, \dots, a_n \mid V_j) = \prod_{i=1}^{n} P(a_i \mid V_j)$$

• Using this in our $V_{MAP}$ equation (1), we get

$$V_{MAP} = \arg\max_{V_j \in V} P(V_j) \prod_{i=1}^{n} P(a_i \mid V_j)$$
Naive Bayes Classifier

• Let's note a few things:
• The number of distinct P(ai|Vj) terms that must be estimated from the
training set D is equal to the number of distinct attribute values
times the number of distinct target values.
• These terms are estimated from frequencies over the training set.
• Whenever the NB assumption of conditional independence is satisfied,
the NB classification is identical to the MAP classification.
Naïve Bayes Classifier: Zero Frequency Problem

Laplace Estimation

Special Case (M-estimate of Conditional Probability)

Example 1-Text Classification


Example (Text Classification)

➢ Consider the given data set; the task is to classify the sentence
“A very close game” as sports or not sports.

sentence                        class
A great game                    sports
The election is over            not sports
very clean match                sports
a clean but forgettable game    sports
it was a close election         not sports

➢ In this data set we do not have numbers, only text.
➢ We need to convert this text into numbers that we can use for
calculation. How?
➢ One solution is to use the frequency of words.
➢ Ignore word order and sentence construction.
➢ Treat every document as the set of words it contains.
➢ The features used in this case are the counts of words (i.e. word
frequencies).
➢ It’s a simplistic approach, but it works surprisingly well.
Example-Text Classification

➢ Now, we need to transform the probability we want to calculate
into something that can be calculated using word frequencies.

➢ Bayes' theorem for this example:

$$P(\text{sports} \mid \text{a very close game}) = \frac{P(\text{a very close game} \mid \text{sports}) \times P(\text{sports})}{P(\text{a very close game})}$$

➢ Since our classifier is only trying to find out which category
has the bigger probability, we can discard the divisor,
which is the same for both categories.
➢ We can compare
➢ P(a very close game | sports) × P(sports)
➢ with
➢ P(a very close game | not sports) × P(not sports)
Example -Text Classification
➢ The probabilities could be calculated as follows:
1. Count how many times the sentence
'A very close game' appears in the sports category.
2. Divide by the total.
3. Obtain P(a very close game | sports).

➢ PROBLEM: the sentence does not appear in the
training set
=> the probability is zero.
➢ Unless every sentence we want to classify appears in the training set,
the model won’t be able to classify it.
Example: Text Classification

➢ We therefore assume that every word in a sentence is independent of the others (this is the “naive” assumption).
➢ We no longer look at entire sentences, only at individual words.
➢ For example, the sentence “This was a funny party” is treated the same as “funny a party was this” and as “party funny this was a”.
➢ We can write this as:
  P(a very close game) = P(a) x P(very) x P(close) x P(game)
➢ This assumption is what makes the model workable. Applying it to our case:
  P(a very close game | sports) = P(a | sports) x P(very | sports) x P(close | sports) x P(game | sports)
➢ All of these individual words appear in our training set, so the calculation can be carried out.
Example: Text Classification

CALCULATING PROBABILITIES
➢ The final step is to calculate every probability and see which one turns out to be larger.
➢ First, calculate the a priori probability of each category, i.e. the share of training sentences that belong to it:
  P(sports) = 3/5 = 0.6
  P(not sports) = 2/5 = 0.4
➢ Then calculate P(game | sports): count the number of times the word “game” appears in the sports sentences and divide by the total number of words in the sports sentences. It appears twice in 11 words:
  P(game | sports) = 2/11 ≈ 0.182
➢ A problem again: the word “close” does not appear in any sports sentence, so P(close | sports) = 0, and multiplying by it would make the whole product zero.
Example: Text Classification

To resolve this we use Laplace smoothing:
➢ Add 1 to every count so that it is never zero.
➢ To balance this, add the number of possible words (the vocabulary size) to the divisor.
➢ In our case the possible words are {a, great, game, the, election, is, over, ...}, 15 words in total.

Applying smoothing (11 words in the sports sentences, 9 in the not sports sentences) we get:

WORD     P(word | sports)          P(word | not sports)
a        (2+1)/(11+15) = 3/26      (1+1)/(9+15) = 2/24
very     (1+1)/(11+15) = 2/26      (0+1)/(9+15) = 1/24
close    (0+1)/(11+15) = 1/26      (1+1)/(9+15) = 2/24
game     (2+1)/(11+15) = 3/26      (0+1)/(9+15) = 1/24
Example: Text Classification

Now decide whether the sentence belongs to the sports or the not sports class using naive Bayes and the smoothed word probabilities:

P(a | sports) x P(very | sports) x P(close | sports) x P(game | sports) x P(sports)
  = (3/26) x (2/26) x (1/26) x (3/26) x 0.6 ≈ 2.36 x 10^-5

P(a | not sports) x P(very | not sports) x P(close | not sports) x P(game | not sports) x P(not sports)
  = (2/24) x (1/24) x (2/24) x (1/24) x 0.4 ≈ 4.82 x 10^-6

The first value is larger, so the sentence “A very close game” is classified as “sports”.
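The worked example can be reproduced in a few lines of Python. This is a minimal sketch that assumes lowercased whitespace tokenization; the helper names (tokenize, score, classify) are illustrative and not part of the slides.

```python
from collections import Counter

# Training data from the slides: (sentence, class)
train = [
    ("A great game", "sports"),
    ("The election is over", "not sports"),
    ("very clean match", "sports"),
    ("a clean but forgettable game", "sports"),
    ("it was a close election", "not sports"),
]

def tokenize(text):
    # Lowercase and split on whitespace; word order is ignored (bag of words).
    return text.lower().split()

# Per-class word counts and per-class sentence counts.
word_counts = {}
doc_counts = Counter()
for sentence, label in train:
    doc_counts[label] += 1
    word_counts.setdefault(label, Counter()).update(tokenize(sentence))

# Vocabulary: every distinct word seen in training (15 words here).
vocab = {w for counts in word_counts.values() for w in counts}

def score(sentence, label):
    # P(label) times the product of Laplace-smoothed P(word | label).
    total_words = sum(word_counts[label].values())
    result = doc_counts[label] / len(train)
    for word in tokenize(sentence):
        result *= (word_counts[label][word] + 1) / (total_words + len(vocab))
    return result

def classify(sentence):
    # Pick the class with the larger score.
    return max(doc_counts, key=lambda label: score(sentence, label))

print(score("A very close game", "sports"))
print(score("A very close game", "not sports"))
print(classify("A very close game"))
```

In practice the product of many small probabilities is usually replaced by a sum of log probabilities to avoid numerical underflow on long documents.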


Advanced Techniques

➢ Removing stop words:
  common words that carry little meaning, for example: a, able, the
  a very close game =>>> very close game
➢ Stemming/lemmatization:
  words like election/elected are grouped together and counted as one word
➢ Using n-grams:
  instead of counting individual words we can count sequences of words
  example: 'clean match', 'close election'
  As n increases, more context is captured, but the complexity also increases.
➢ TF-IDF:
  term frequency–inverse document frequency is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus.
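A short sketch of two of these techniques, using the training sentences from the example. It assumes the simplest TF-IDF variant (raw term share multiplied by log(N / document frequency)); libraries typically use smoothed variants, and the function names here are illustrative.

```python
import math

docs = [
    "a great game",
    "the election is over",
    "very clean match",
    "a clean but forgettable game",
    "it was a close election",
]
tokenized = [d.split() for d in docs]

def ngrams(tokens, n):
    # All contiguous runs of n tokens, each joined with a space.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tf_idf(term, doc_tokens, all_docs_tokens):
    # Term frequency: share of the document's tokens equal to `term`.
    tf = doc_tokens.count(term) / len(doc_tokens)
    # Document frequency: number of documents that contain `term`.
    df = sum(1 for tokens in all_docs_tokens if term in tokens)
    # Inverse document frequency: log(N / df); 0.0 if the term never occurs.
    return tf * math.log(len(all_docs_tokens) / df) if df else 0.0

print(ngrams(tokenized[4], 2))
print(tf_idf("forgettable", tokenized[3], tokenized))  # rare word, higher weight
print(tf_idf("a", tokenized[3], tokenized))            # common word, lower weight
```

The rare word "forgettable" receives a higher weight than the common word "a", which is the behaviour TF-IDF is designed to produce.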
Example 2: Text Classification
Generative AI and its Applications
LLM Architecture and Other Models

Dr. Arti Arya


Department of Computer Science and
Engineering
Generative AI and its Applications
(UE22CS342BA9)

We have covered:
LLM Basics and evolution
NLP: Word Embeddings
POS and NER
Text Classification
We will now cover:
ELMo, Transformer Anatomy, GPT
BERT, RoBERTa, BART architectures
LLM Architecture
Generative AI and its Applications
(UE22CS342BA9)

ELMo

Dr. Arti Arya


Department of Computer Science and Engineering
ELMo (Embeddings from Language Models)

• ELMo (Embeddings from Language Models) represents a groundbreaking approach to word embeddings that fundamentally changed how we represent words in natural language processing.
• Released in 2018 by researchers at the Allen Institute for
Artificial Intelligence and the University of Washington,
ELMo introduced the concept of contextual word
representations, overcoming the limitations of static word
embeddings.
References:
Deep Contextualized Word Representations (Peters et al., NAACL 2018)
ELMo: Architecture and Working
Bidirectional Processing
❑ ELMo employs a two-layer bidirectional LSTM (biLSTM) network that processes text in both the forward and backward directions.
❑ It overcomes the limitations of traditional word embeddings by capturing both the complex characteristics of word use (syntax and semantics) and how these uses vary across different contexts (polysemy).

Reference: Deep Contextualized Word Representations (Peters et al., NAACL 2018)


ELMo

References:
Generalized Language Models | Lil'Log
ELMo: Architecture and Working

Bidirectional Processing
❑ The model begins with character-level tokenization,
converting words into character embeddings before
processing them through:
➢ A forward pass capturing preceding context.
➢ A backward pass capturing subsequent context.
➢ Two LSTM layers with residual connections
ELMo: Architecture and Working
Bidirectional Processing
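• Given a sequence of N tokens $(t_1, t_2, \ldots, t_N)$, a forward language model (LM) computes the probability of the sequence by modeling the probability of token $t_k$ given the history $(t_1, \ldots, t_{k-1})$:

  $$p(t_1, t_2, \ldots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_1, t_2, \ldots, t_{k-1})$$

• A backward LM is similar to a forward LM, except that it runs over the sequence in reverse, predicting the previous token given the future context:

  $$p(t_1, t_2, \ldots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_{k+1}, t_{k+2}, \ldots, t_N)$$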
ELMo: Architecture and Working
Bidirectional Processing

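• A biLM combines both a forward and a backward LM.
• The formulation jointly maximizes the log likelihood of the forward and backward directions:

  $$\sum_{k=1}^{N} \Big( \log p(t_k \mid t_1, \ldots, t_{k-1};\ \theta_x, \overrightarrow{\theta}_{LSTM}, \theta_s) + \log p(t_k \mid t_{k+1}, \ldots, t_N;\ \theta_x, \overleftarrow{\theta}_{LSTM}, \theta_s) \Big)$$

  where $\overrightarrow{\theta}_{LSTM}$ represents the parameters of the forward LSTM network and $\overleftarrow{\theta}_{LSTM}$ those of the backward LSTM network. The token-representation parameters $\theta_x$ and the softmax-layer parameters $\theta_s$ are shared by both directions.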
ELMo: Architecture and Working
• For a given word at position t, its representation is calculated as

  $$\mathrm{ELMo}^{(t)} = \gamma \sum_{k=1}^{L} \alpha_k \, \mathbf{h}_k^{(t)}$$

  where $L$ is the total number of layers, $\alpha_k$ is a learned scalar weight for layer $k$, and $\gamma$ is a learned scaling factor applied to the entire ELMo representation.
• Here $\mathbf{h}_k^{(t)} = [\overrightarrow{h}_k^{(t)} ; \overleftarrow{h}_k^{(t)}]$, where $\overrightarrow{h}_k^{(t)}$ and $\overleftarrow{h}_k^{(t)}$ are the forward and backward hidden states of the biLSTM network and $k$ is the layer number.
• The weights $\alpha_k$ are learned so that different layers contribute differently to the final word representation, depending on the specific downstream task.
  (A downstream task is a specific application or task that uses pre-trained models or representations to achieve a particular goal.)
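The weighted combination of layers can be illustrated with a tiny numeric sketch. All hidden-state values, the weights alpha and the scale gamma below are made-up numbers for one token position; in a real model the hidden states come from the biLSTM and the scalars are learned per task.

```python
# Hypothetical hidden states for one token (2 layers, size 2 per direction).
h_forward = [[0.1, 0.2], [0.3, 0.4]]
h_backward = [[0.5, 0.6], [0.7, 0.8]]

alpha = [0.25, 0.75]  # per-layer weights (learned in a real model)
gamma = 2.0           # global scale (learned in a real model)

def elmo_vector(h_forward, h_backward, alpha, gamma):
    # h_k = [forward_k ; backward_k]: concatenate the two directions per layer.
    layers = [f + b for f, b in zip(h_forward, h_backward)]
    dim = len(layers[0])
    # gamma * sum_k alpha_k * h_k, computed element by element.
    return [gamma * sum(a * layer[i] for a, layer in zip(alpha, layers))
            for i in range(dim)]

elmo_t = elmo_vector(h_forward, h_backward, alpha, gamma)
print(elmo_t)
```

Changing alpha shifts the representation toward the lower or the higher layer, which is how a downstream task can emphasise different layers.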
ELMo

Architecture and Working

Context-Aware Understanding
Dynamic Word Representations:
Unlike traditional word embedding methods like Word2Vec or GloVe, ELMo
generates different representations for the same word based on its context. For
example, the word "trust" receives distinct embeddings in phrases like:
● "I can't trust you"
● "He has a trust fund"
This context sensitivity allows ELMo to handle:
● Polysemy (multiple word meanings)
● Complex linguistic patterns
● Contextual nuances
ELMo

Training and Implementation


ELMo was trained on the 1 Billion Word Benchmark, a corpus of approximately 30 million sentences and 1 billion words. The training process involves:
Pre-training Phase
● Language modeling task predicting next/previous words
● Character-level processing for handling unknown words
● Bidirectional training for comprehensive context understanding
Fine-tuning Phase
● Task-specific adaptation
● Frozen base parameters with adjustable projection matrix
● Integration with downstream NLP tasks
ELMo: Fine-tuning Phase
• Task-specific adaptation:
•Tailor the pre-trained ELMo representations to perform well on a specific NLP task (e.g., sentiment analysis, named entity recognition).
•This involves training additional layers or parameters that are specific to the task while leveraging the pre-trained ELMo embeddings.
•Only the task-specific layers added on top of the pre-trained ELMo embeddings are trained; the ELMo model itself is not fine-tuned.
•The task-specific model (such as a neural network classifier) takes the ELMo embeddings as input features and updates only its own layers (e.g., the weights of the classifier layer).
ELMo: Fine-tuning Phase
•These representations can be easily added to existing models and significantly improve the state of the art across six challenging NLP problems, including question answering, textual entailment, and sentiment analysis (Peters et al., 2018).
ELMo: Fine-tuning Phase
Frozen base parameters with adjustable projection matrix:

● The core parameters of the pre-trained ELMo model (i.e., the weights of
the bidirectional language model) are kept fixed during the fine-tuning
phase. This ensures that the rich contextual information captured during
pre-training is preserved.
● A task-specific projection matrix is introduced, which can be adjusted
during fine-tuning. This matrix maps the ELMo embeddings to a space
that is more suitable for the specific task at hand. By adjusting this matrix,
the model can better align the pre-trained embeddings with the
requirements of the downstream task.
ELMo: Fine-tuning Phase
● Integration with downstream NLP tasks: Seamlessly incorporate the fine-tuned ELMo representations into various NLP tasks.
ELMo: An example
ELMo

ELMo's idea of deep contextual embeddings paved the way for transformer-based models such as BERT and GPT, which produce more advanced contextual embeddings.

Limitations:
1. Slower and harder to parallelize than transformer-based models like BERT and GPT.
2. Builds context from sequential forward and backward passes, unlike transformer-based models, which use self-attention to consider all tokens simultaneously.
3. ELMo's embeddings are large and computationally expensive because they combine multiple biLM layers.
Generative AI and its Applications
(UE22CS342BA9)

Transformer Anatomy

Dr. Arti Arya


Department of Computer Science and Engineering
Disadvantages of RNN

Let’s take the example of a sequence-to-sequence RNN model for machine translation (English to Hindi) with the sentence “I want to eat”:
• The final hidden state ℎ4 is called the context vector; ideally it stores the entire context of the sentence in a single vector.
• For long inputs, the hidden state cannot retain long-range dependencies, so the decoder, which is given only this hidden state, cannot translate well.
• Another problem is that the model is highly inefficient, because it has to process the words serially (it is not parallelizable).
Attention

“As the animal entered the forest, … blending it into the shadows of the towering trees.”

Here, “it” refers to “animal” earlier in the sentence.

• RNNs have a short reference window; if the input is too long, the word that “it” refers to falls outside the window.

• LSTMs and GRUs have a slightly longer reference window.


Attention

Figure: the sentence “As the animal entered the forest, … blending it into the shadows of the towering trees.” shown three times, with the reference window highlighted for:
• an RNN
• an LSTM
• attention
Attention

Here, the word “its” has a strong reference to the word “Law”.

Since we are computing attention between words of the same sentence, this is called self-attention.

If we instead compute attention between words of different sentences, it is called cross-attention.
E.g.: attention between every word of “I want to eat” and “main khana chahta hoon”.

Attention tells the model which parts of the sentence to focus on.
Attention

• For each word in the sentence, we need to compute a relational score between the word and every other word in the sentence.

• The higher the score, the more related the two words are, contextually or along some other dimension.
Attention

Relational Score (Attention Score):

• The relational score is computed as the dot product of the Query vector of one word and the Key vector of another word.

• This dot product measures the similarity between the two vectors, which translates to how much attention one word should give to the other.

• The formula for the attention score is:

  Attention Score = Q ⋅ K

  where Q is the Query vector of the current word, K is the Key vector of another word, and Q ⋅ K is the dot product of the two vectors.
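As a small illustration of the score Q ⋅ K, the sketch below compares the query of the word “it” with the keys of two other words. The three-dimensional vectors are made-up values chosen only to show the arithmetic; real Query and Key vectors are produced by learned projections.

```python
def dot(q, k):
    # Sum of element-wise products of two equal-length vectors.
    return sum(qi * ki for qi, ki in zip(q, k))

# Hypothetical Query vector for "it" and Key vectors for two candidate words.
query_it = [1.0, 0.5, 0.0]
keys = {
    "animal": [0.9, 0.4, 0.1],
    "forest": [0.1, 0.2, 0.9],
}

scores = {word: dot(query_it, key) for word, key in keys.items()}
print(scores)
```

With these numbers “animal” scores higher than “forest”, so “it” would attend more strongly to “animal”.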
Transformer Anatomy

Transformer Architecture
Transformer is a deep learning model introduced in the 2017 paper "Attention Is All You Need". It has
become the foundation for most modern large language models (LLMs) due to its efficiency and
effectiveness in handling sequential data, such as text.
Originally, it was designed for machine translation, where it translated text from one language (e.g.,
English) to another (e.g., German or French).
The Transformer consists of two main parts:
● Encoder: Processes the input text and converts it into a numerical representation (embedding).
● Decoder: Uses the encoded representation to generate the output text, one word at a time.
Transformer Anatomy

Transformer Architecture

Figure: a simplified diagram of the Transformer, taken from the book Build a Large Language Model (From Scratch).
Figure: the original architecture diagram from the paper "Attention Is All You Need".

References:
Book: Build a Large Language Model (From Scratch)

How the Transformer Works


Encoder:
● The encoder takes the input text (e.g., "This is an example")
and converts it into a series of numerical vectors.
● These vectors capture the contextual meaning
of the input text, considering the relationships between words.
● The encoder consists of multiple layers, each
using a mechanism called self-attention to focus
on the most important parts of the input text.

References:
Book: Build a Large Language Model (From Scratch)
Transformer Anatomy

How the Transformer Works


Decoder:
● The decoder takes the encoded vectors from
the encoder and generates the output text
(e.g., "Das ist ein Beispiel").
● It generates the output one word at a time,
using the context provided by the encoder and the words
it has already generated.
● Like the encoder, the decoder uses attention: masked self-attention over the previously generated output, and cross-attention over the encoder's representation of the input.

References:
Book: Build a Large Language Model (From Scratch)
Transformer Anatomy
Self-Attention Mechanism
● A key innovation of the Transformer is the self-attention mechanism.
● What it does:
○ It allows the model to weigh the importance of different words in a sequence relative to
each other.
○ For example, in the sentence "The cat sat on the mat," the word "cat" is more
important to "sat" than "mat" when determining the meaning of the sentence.
● Why it matters:
○ It enables the model to capture long-range dependencies (e.g., relationships between
words far apart in a sentence).
○ This improves the model's ability to understand context and generate coherent,
contextually relevant output.
Architecture of Transformers

• Embeddings
• Positional Encoding

Encoder Components-
• Multi-Head (Self) Attention
• Fully Connected Feed Forward Network
• Residual Connections & Layer Normalization

Decoder Components-
• Masked Multi-Head (Self) Attention
• Multi-Head (Cross) Attention

• Final Linear & SoftMax Layer


Attention
In the context of the Transformer model, attention refers to a mechanism
that determines which parts of an input sequence are most relevant for
generating the output at a specific time.

Key Ideas:
Self-Attention: Each token in a sequence focuses on all other tokens to
capture relationships and context, regardless of their position.
Query, Key, and Value: These are vector representations of input
tokens. Attention scores are computed as a function of queries and keys,
determining how much "attention" each token pays to others.
Weighted Summation: The final output is a weighted sum of the values,
based on the computed attention scores.

Attention and Attention Mechanism
Attention:
➢ Refers to the ability of a model to focus on different parts of the input sequence when
making predictions, especially in sequence-to-sequence tasks like machine translation.
➢ It determines how much weight or importance each element in the input sequence
should have when predicting each element of the output sequence.

Attention Mechanism:
➢ This is used to take care of LONG RANGE DEPENDENCIES.
➢ This is the actual computational framework or algorithm that enables attention. It is the
mathematical structure that allows the model to compute which parts of the input
sequence should be given more importance (i.e., higher weights) when making a
decision at each step.
➢ It is realized through a scaled dot-product attention, where the input tokens are
transformed into query, key, and value vectors, and the attention score is computed
using the dot product of the query and key vectors.
Encoder and Decoder
The encoder is responsible for
➢ processing the input sequence, and
➢ converting it into a set of continuous representations (or embeddings).
It consists of N identical layers, each of which has two main sub-components: a self-attention layer and a feed-forward network.
➢ Self-Attention Layer: This layer allows the encoder to consider the entire sequence of input tokens when computing its representation.
➢ Every token in the input can "attend" to all other tokens in the sequence, capturing dependencies and relationships regardless of the distance between tokens.
➢ This is done through the scaled dot-product attention mechanism.
The self-attention mechanism works by computing three vectors for each token in the sequence:
•Query (Q): represents what the current token is looking for.
•Key (K): represents what each token offers to be matched against queries.
•Value (V): represents the information that is passed on once the match has been scored.
Attention is computed by taking the dot product of the query and key vectors, scaling the result by the square root of the key dimension, and applying a softmax function to generate attention weights. These weights are used to compute a weighted sum of the value vectors, which forms the new representation for each token.
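The steps just described (dot product, scaling, softmax, weighted sum) can be sketched in plain Python. The Q, K and V rows are made-up toy values for three tokens; a real model obtains them from learned projections and uses matrix libraries rather than nested lists.

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    # Q, K, V: lists of row vectors, one row per token.
    d_k = len(K[0])
    output, weights = [], []
    for q in Q:
        # Scaled dot product of this query with every key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors gives the token's new representation.
        output.append([sum(wj * v[i] for wj, v in zip(w, V))
                       for i in range(len(V[0]))])
    return output, weights

# Toy example: 3 tokens, d_k = d_v = 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, weights = attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of weights sums to 1, so every output vector is a convex combination of the value vectors.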
Encoder and Decoder

➢ Feed-Forward Neural Network (FFN): After the self-attention layer, each token representation is passed through a position-wise feed-forward neural network.

➢ It is applied to each token independently and consists of two fully connected layers with a ReLU activation function in between.

➢ The FFN allows for more complex transformations of the token representations.

The decoder is responsible for
➢ generating the output sequence from the continuous representations produced by the encoder.

Like the encoder, the decoder consists of N identical layers, with each layer having three main sub-components:

•Masked Self-Attention Layer: In the decoder, the self-attention mechanism is "masked" to prevent attending to future tokens in the output sequence. Masking ensures that the prediction for each token depends only on the previously generated tokens, not on future ones.
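A minimal sketch of the masking step, assuming raw attention scores are already available as a square table (the numbers are made up). Scores for future positions are set to negative infinity before the softmax, so their weights become exactly zero.

```python
import math

def causal_softmax(scores):
    # scores[i][j]: raw attention score of position i toward position j.
    n = len(scores)
    weights = []
    for i in range(n):
        # Positions j > i are the "future": mask them with -inf so exp() gives 0.
        row = [scores[i][j] if j <= i else float("-inf") for j in range(n)]
        m = max(row)
        exps = [math.exp(x - m) for x in row]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

scores = [[0.5, 1.0, 2.0],
          [1.0, 1.0, 3.0],
          [0.2, 0.4, 0.6]]
masked = causal_softmax(scores)
for row in masked:
    print([round(w, 3) for w in row])
```

The first token can only attend to itself, the second to the first two tokens, and so on, which matches the left-to-right generation order of the decoder.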
Encoder and Decoder

•Encoder-Decoder Attention Layer: This layer performs attention over the encoder's
output.
• It allows each token in the decoder to focus on different parts of the input sequence
when making predictions.

•Feed-Forward Neural Network (FFN): Like in the encoder, the output of the attention
layers is passed through a position-wise feed-forward network to transform the
representations.

Each decoder layer also includes residual connections and layer normalization, which
help improve training by mitigating the vanishing gradient problem and ensuring
stable learning
Generative AI and its Applications
(UE22CS342BA9)

GPT
GPT

Generative Pretrained Transformer


As discussed earlier, GPT is based on the decoder component of the transformer architecture.

At its heart, the model processes input tokens through multiple stacked transformer layers, each containing self-attention mechanisms and feed-forward neural networks.

References:
Book: Build a Large Language Model (From Scratch)
GPT
Key Components
Token Processing
When text enters the system, it undergoes two crucial encoding steps:
● Token embeddings convert words into vector representations
● Positional encodings add sequence information to maintain word order
Self-Attention Mechanism
The self-attention mechanism is what gives GPT its contextual understanding capabilities. It processes input
through three main components:
● Query vectors
● Key vectors
● Value vectors
These vectors work together to compute attention scores, allowing the model to weigh the importance of different
words in relation to each other

References:
Blog: A Deep Dive into GPT's Transformer Architecture: Understanding
Self-Attention Mechanisms
GPT (Generative Pretrained Transformer)
GPT Architecture

Let's take a closer look at the general GPT (Generative Pretrained Transformer) architecture.
GPT models are highly capable text completion models.
They can perform a variety of tasks beyond text generation, such as:
● Spelling correction
● Text classification
● Language translation
This versatility is remarkable, considering that GPT models are trained on a relatively simple task: next-
word prediction.

References:
Book: Build a Large Language Model (From Scratch)
GPT Architecture

Next-Word Prediction Task


● The next-word prediction task is a form of self-supervised learning.
○ Self-supervised learning is a type of self-labeling where explicit labels for training data are not
required.
● Instead of manually labeling data, the structure of the data itself is used:
○ The next word in a sentence or document serves as the label the model predicts.
● Advantages of this approach:
○ Labels are created "on the fly."
○ Enables the use of massive unlabeled text datasets for training.
● This approach allows GPT models to learn from vast amounts of data, improving their performance
and generalization

References:
Book: Build a Large Language Model (From Scratch)
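How labels are created "on the fly" can be shown with a small sketch. It assumes whitespace tokenization and a fixed context window; real GPT training uses subword tokens and much longer contexts, and the function name is illustrative.

```python
def next_word_pairs(text, context_size):
    # Slide a window over the tokens; the word right after the window is the label.
    tokens = text.split()
    pairs = []
    for i in range(len(tokens) - context_size):
        pairs.append((tokens[i:i + context_size], tokens[i + context_size]))
    return pairs

for context, target in next_word_pairs("the cat sat on the mat", 3):
    print(context, "->", target)
```

No manual annotation is involved: every training pair is cut directly out of the raw text.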
GPT Architecture

GPT Architecture
● The GPT architecture is relatively simple compared to the original Transformer architecture (More on
this later).
● It consists of only the decoder part (more on this in the next section) of the Transformer model,
without the encoder.
● Key characteristics of GPT architecture:
○ It is a decoder-style model.
○ It generates text by predicting one word at a time, making it an autoregressive model.

References:
Book: Build a Large Language Model (From Scratch)
GPT Architecture

Autoregressive Models
● GPT models are considered autoregressive because:
○ They incorporate their previous outputs as
inputs for future predictions.
○ Each new word is chosen based on the
sequence of words that precedes it.
● This autoregressive nature improves the
coherence of the generated text.

References:
Book: Build a Large Language Model (From Scratch)
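The autoregressive loop itself is simple. In the sketch below a hypothetical lookup table stands in for a trained model; a real GPT conditions on the whole sequence and samples from a probability distribution over the vocabulary.

```python
# Hypothetical next-word table standing in for a trained model.
NEXT_WORD = {
    "the": "cat",
    "cat": "sat",
    "sat": "down",
}

def toy_model(tokens):
    # A real GPT looks at the whole sequence; this toy uses only the last word.
    return NEXT_WORD.get(tokens[-1], "<eos>")

def generate(prompt, max_new_tokens):
    tokens = prompt.split()
    for _ in range(max_new_tokens):
        next_token = toy_model(tokens)
        if next_token == "<eos>":
            break
        # The prediction is appended and becomes part of the next input.
        tokens.append(next_token)
    return " ".join(tokens)

print(generate("the", 5))
```

Each iteration feeds the model its own previous output, which is exactly what makes the process autoregressive.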
GPT
Processing Pipeline
Input Processing
The model first tokenizes input text into smaller units and converts them into embeddings. These embeddings then
receive positional information to maintain sequence order.
Transformer Layers
Each transformer layer contains:
● A multi-head self-attention mechanism
● A feed-forward neural network
● Layer normalization components
● Residual connections
Output Generation
The final layer produces probability distributions over the vocabulary, enabling the model to predict the most likely next
token in a sequence

References:
Blog: A Deep Dive into GPT's Transformer Architecture: Understanding
Self-Attention Mechanisms
GPT

Generative Pretrained Transformer


Training Methodology
GPT models undergo a two-phase training process:
Pre-training
During this phase, the model learns from vast amounts of text data, developing an
understanding of language patterns and relationships.
Fine-tuning
The model is then refined on specific tasks with human feedback, improving its ability to
generate contextually appropriate responses
GPT

Generative Pretrained Transformer

Figure:
Two Stage Process

References:
Book: Build a Large Language Model (From Scratch)
GPT

Generative Pretrained Transformer


Recent Developments
Modern GPT iterations have grown substantially in their capabilities.
GPT-4, o1, and o3: although their exact architectures remain private, these models demonstrate significant improvements in reasoning and contextual understanding compared to their predecessors.

Emergent Behaviour:
Emergent behavior refers to skills or abilities that a model develops naturally during training,
even though they were not directly taught or targeted. Exposure to massive amounts of
multilingual data in various contexts allows models like GPT to "learn" translation patterns
between languages. This means GPT can perform translation tasks without being explicitly
trained for them
Generative AI and its Applications
(UE22CS342BA9)

BERT
BERT

Bidirectional Encoder Representations from Transformers


BERT utilizes the encoder component of the Transformer architecture, which allows it to process text in a
unique bidirectional manner.
Unlike traditional models that read text either left-to-right or right-to-left, BERT analyzes text from both
directions simultaneously. This bidirectional approach enables BERT to grasp context more effectively and
accurately understand language nuances.
Introduced by Google AI in October 2018
BERT

Bidirectional Encoder Representations from Transformers


Example: In "The bank is by the river"
● Understands "bank" using both:
○ Previous words: "The"
○ Following words: "is by the river"
BERT

How BERT Learns Language: Two Key Training Methods


1. Masked Language Modeling (MLM)
Think of this like a sophisticated version of fill-in-the-blanks.
For example:
● Original: "The cat sits on the mat"
● Masked: "The [MASK] sits on the mat"
BERT looks at all words (before and after the mask) to figure out what word should go in the blank.
During training:
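● It selects 15% of the tokens at random as prediction targets
● Of the selected tokens, 80% are replaced with [MASK], 10% are replaced with a random token, and 10% are left unchanged, which makes the model more robust
● It learns to predict the original word by understanding the entire context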
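The masking procedure can be sketched as follows. The sketch follows the split described in the BERT paper (of the 15% selected tokens, 80% become [MASK], 10% a random token, 10% stay unchanged); it assumes whitespace tokens instead of WordPiece, and the function name is illustrative.

```python
import random

def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    masked, labels = [], []
    for token in tokens:
        if rng.random() < select_prob:
            labels.append(token)                  # the model must recover this token
            roll = rng.random()
            if roll < 0.8:
                masked.append("[MASK]")           # 80%: replace with the mask symbol
            elif roll < 0.9:
                masked.append(rng.choice(vocab))  # 10%: replace with a random word
            else:
                masked.append(token)              # 10%: keep the original word
        else:
            labels.append(None)                   # not selected: no prediction target
            masked.append(token)
    return masked, labels

tokens = "the cat sits on the mat".split()
masked, labels = mask_tokens(tokens, vocab=sorted(set(tokens)), rng=random.Random(0))
print(masked)
print(labels)
```

Only positions with a non-empty label contribute to the training loss; all other tokens serve purely as context.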
BERT

How BERT Learns Language: Two Key Training Methods


Masked Language Modeling (MLM)

References:
Book: Build a Large Language Model (From Scratch)
BERT

How BERT Learns Language: Two Key Training Methods


2. Next Sentence Prediction (NSP)
This is like teaching BERT to understand if two sentences naturally follow each other.
For instance:
● Sentence A: "I love ice cream"
● Sentence B: "It's my favorite dessert" (This naturally follows)
vs.
● Sentence B: "The car needs gas" (This doesn't follow)
BERT learns to predict whether sentence B truly follows sentence A in the original text.
BERT

Bidirectional Encoder Representations from Transformers


Architecture
BERT uses an encoder-only transformer architecture consisting
of four key modules:
● A tokenizer that converts text into integer sequences
● An embedding layer that transforms tokens into vectors
● A transformer encoder stack with self-attention
● A task head that produces probability distributions
over tokens
BERT
BERT

Bidirectional Encoder Representations from Transformers


BERT's Input Processing
BERT processes text in a special way:
● Every word gets three types of information:
1. The word itself (Token Embedding)
2. Which sentence it belongs to (Segment Embedding)
3. Where it appears in the sentence (Position Embedding)

Special markers are used:
● [CLS] at the start (like raising a flag saying "pay attention, new text coming"); its final hidden state serves as the representation of the whole input for classification tasks
● [SEP] between sentences (like a boundary marker)
● [MASK] for hidden words during training
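A sketch of how a sentence pair is laid out for BERT, assuming simple whitespace tokenization (real BERT uses WordPiece subwords) and the pair format [CLS] A [SEP] B [SEP] from the BERT paper. The function name is illustrative.

```python
def build_bert_input(sentence_a, sentence_b):
    tokens_a = sentence_a.lower().split()
    tokens_b = sentence_b.lower().split()
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and its [SEP]; segment 1 covers the rest.
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    # Position ids simply count tokens from the start of the input.
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids

tokens, segment_ids, position_ids = build_bert_input(
    "I love ice cream", "It's my favorite dessert"
)
print(tokens)
print(segment_ids)
print(position_ids)
```

Each of the three lists is mapped to an embedding, and the three embeddings are added together to form the input vector for every token.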
BERT

Bidirectional Encoder Representations from Transformers


Fine-tuning: Teaching BERT New Tasks
After BERT learns language generally, it can be taught specific tasks, like:
● Understanding if a movie review is positive or negative
● Finding answers in a text
● Identifying names and places in sentences
This is like taking a well-educated student (pre-trained BERT) and giving them specialized training for a
specific job.
Generative AI and Its Application

Transformers

Dr. Arti Arya


Department of Computer Science and Engineering
Attention

For each word in the sentence, we need to compute a relational score between the word and every other word in the sentence.
The higher the score, the more related the two words are, contextually or along some other dimension.
MACHINE LEARNING
Attention

To get a better intuition of why we’re doing this, let’s look at how information retrieval systems work, in particular Google Search.
Attention
• The database stores <key, value> pairs of documents.
• Keys can be specific characteristics or terms of a document, and the Value is the document itself.
• For example, on the previous slide, the Key is the title of the web page and the Value is the web page itself.

• When the user submits a Query, a similarity score is computed between the Query and each Key, and the top 𝑁 ranked Keys are returned.

• When we click on a link, we get the Value of that Key.

• Similarly, if we represent each word in a sentence as a vector, we can calculate a similarity score (such as cosine similarity or a dot product) between every pair of vectors and get an attention matrix.
• This attention matrix acts as a filter and can be applied to the original sentence to put more focus on certain important parts of the sentence. Reference
Attention
Let each word be represented by a vector $h_i$ of dimension $1 \times d_{model}$, where $i$ is the index of the word in the sentence.
To compute the Query, Key & Value vectors of the word, we pass the word vector through a linear layer:

$$q_i = h_i W_q, \qquad k_i = h_i W_k, \qquad v_i = h_i W_v$$

where $W_q$, $W_k$, $W_v$ are linear transformation matrices which are learned by the model.
The dimensions are:
• $W_q$ is $d_{model} \times d_q$, so $q_i$ is of dimension $1 \times d_q$.
• $W_k$ is $d_{model} \times d_k$, so $k_i$ is of dimension $1 \times d_k$.
• $W_v$ is $d_{model} \times d_v$, so $v_i$ is of dimension $1 \times d_v$.
Generally, $d_q = d_k = d_v < d_{model}$, as working in a lower dimension makes the calculations cheaper.
Attention (Similarity between Query and Key vectors)

To measure the similarity between the Query and Key vectors, we take their dot product (an unnormalized counterpart of cosine similarity).

Assuming $n$ words in the sentence:

For word $h_i$, the attention scores with respect to every word in the sentence are
$$a_i = [\, q_i \cdot k_1,\; q_i \cdot k_2,\; \dots,\; q_i \cdot k_n \,]$$

To convert the scores to probabilities, we apply SoftMax across the vector:
$$\mathrm{softmax}(a_{ij}) = \frac{e^{a_{ij}}}{\sum_{j'} e^{a_{ij'}}}$$

Before the SoftMax, we also divide the scores by $\sqrt{d_k}$, so that large dot products do not push the SoftMax into regions where the gradients become extremely small.
($d_k$ is the dimension of the Key vector.)
Attention

We can compute the values for $a_1, a_2, \dots, a_n$ in parallel if we stack all the word vectors into a single matrix and perform one matrix multiplication between the Query and Key matrices.
Let $H = [h_1; h_2; \dots; h_n]$ (one row per word), so $H$ is of dimension $n \times d_{model}$. Then
$$Q = H W_q, \qquad K = H W_k, \qquad V = H W_v$$
hence the $Q$, $K$, $V$ matrices are of dimension $n \times d_q$, $n \times d_k$, $n \times d_v$ respectively.
Attention
• Hence the attention filter is
$$A = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)$$
where the dimensions of $A$ are $n \times n$ ($n$ is the number of words in the input).
• This matrix acts as a filter and tells the model which parts of the sentence are more important than others.

• We apply this filter to the $V$ matrix to get the final attention output,
$$Z = AV = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
where the dimensions of $Z$ are $n \times d_v$.
• Since $\sqrt{d_k}$ scales the values of $QK^T$, this is called “scaled dot-product attention”.
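The steps above can be sketched in NumPy. This is a minimal illustration rather than a reference implementation: the weight matrices are random stand-ins for learned parameters, the sizes (`n = 3`, `d_model = 4`, `d_k = d_v = 2`) are assumptions chosen to match the small example used later in these slides, and the scores are scaled by the square root of `d_k` as in the original paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max first for numerical stability.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(H, W_q, W_k, W_v):
    # Project the word vectors into Query, Key and Value spaces.
    Q, K, V = H @ W_q, H @ W_k, H @ W_v
    d_k = K.shape[-1]
    # n x n attention filter: each row is a probability distribution.
    A = softmax(Q @ K.T / np.sqrt(d_k))
    # Apply the filter to the Values: n x d_v output.
    Z = A @ V
    return A, Z

rng = np.random.default_rng(0)          # random stand-ins for learned weights
n, d_model, d_k, d_v = 3, 4, 2, 2       # assumed toy sizes
H = rng.normal(size=(n, d_model))
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_v))

A, Z = scaled_dot_product_attention(H, W_q, W_k, W_v)
print(A.shape, Z.shape)
```

Each row of `A` sums to 1, so every output row of `Z` is a weighted average of the Value vectors.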
Picture Credit
Attention

• Attention matrices can be compared to convolutional filters or kernels in CNNs.

• Each filter we use in a CNN learns an abstract representation and captures different information than other filters.

• Similarly, we can have different attention matrices (multiple attention heads) with different linear transformations $W_q$, $W_k$, $W_v$ for each matrix.
Attention

• Hence, we have multiple “heads”, each of which has its own transformations ($W_{q_i}$, $W_{k_i}$, $W_{v_i}$) and computes its own attention matrix.

• In the end, we combine the attention outputs from the different heads and pass them through a final linear transformation ($W_O$).

• This is called Multi-Head Attention. Each head computes its attention independently, so the heads can be computed in parallel.
Attention

• Each head outputs an $n \times d_v$ attention output matrix.
• With $h$ heads, concatenating all of these matrices gives an $n \times (h \cdot d_v)$ matrix.
• The linear transformation $W_O$ has dimensions $(h \cdot d_v) \times d_{model}$.
• The output of the Multi-Head Attention module is an $n \times d_{model}$ matrix.

$$\mathrm{MultiHead}(H) = \mathrm{Concat}(head_1, \dots, head_h)\, W_O$$
$$\text{where } head_i = \mathrm{softmax}\!\left(\frac{Q_i K_i^T}{\sqrt{d_k}}\right) V_i$$
$$\text{and } Q_i = H W_{q_i}, \quad K_i = H W_{k_i}, \quad V_i = H W_{v_i}$$
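A rough sketch of the multi-head computation follows, mainly to make the shapes concrete. The weights are random placeholders for learned parameters, and the helper name `multi_head_attention` and the toy sizes are assumptions for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def multi_head_attention(H, heads, W_o):
    """heads is a list of (W_q, W_k, W_v) tuples, one tuple per head."""
    outputs = []
    for W_q, W_k, W_v in heads:
        Q, K, V = H @ W_q, H @ W_k, H @ W_v
        A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
        outputs.append(A @ V)                      # n x d_v per head
    concat = np.concatenate(outputs, axis=-1)      # n x (h * d_v)
    return concat @ W_o                            # n x d_model

rng = np.random.default_rng(0)
n, d_model, d_v, h = 3, 4, 2, 2                    # assumed toy sizes
H = rng.normal(size=(n, d_model))
heads = [
    tuple(rng.normal(size=(d_model, d_v)) for _ in range(3))
    for _ in range(h)
]
W_o = rng.normal(size=(h * d_v, d_model))

out = multi_head_attention(H, heads, W_o)
print(out.shape)  # (3, 4)
```

The loop over heads is written sequentially for readability; since the heads do not depend on each other, real implementations usually batch them into a single tensor operation.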

Picture Credit
Attention

Different Attentions
from the Multi-Head
Attention module.

Try out this


interactive demo-
Link

Picture Credit
Architecture of Transformers

Now that we know what attention is, let’s look at the architecture of Transformers.

Let’s take the problem of text autocompletion: given an input sentence, the task is to complete it.

For example:
Input – The dog safely
Output – crossed the road.

Our model has two parts.

• The encoder takes the input as a whole and generates context in its own representation.
• The decoder takes the context from the encoder along with the previously generated word, and generates a new word. Most decoders are autoregressive: they generate the output sequence one token at a time, using previously generated tokens as context for predicting the next token. This process continues until the entire sequence is generated.
Architecture of Transformers

• First the sentence “the dog safely” is passed to the encoder which
generates a context and gives this to the decoder.

• In the 1st run, the decoder is given a special <start> token. It calculates a probability for
each word in the model’s vocabulary and selects the word with the highest probability.

• Let’s say this word is “crossed”. Then in the 2nd run, the decoder is given “crossed”, and the process repeats until a special <end> token is generated.
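The generation loop described above can be sketched as follows. The `toy_decoder` function and its `TOY_NEXT` lookup table are hypothetical stand-ins for a trained decoder (they ignore the encoder context entirely); only the loop structure, picking the highest-probability word and feeding it back until `<end>`, reflects the procedure on the slide.

```python
import numpy as np

VOCAB = ["<start>", "<end>", "crossed", "dog", "road", "safely", "the"]

# Hypothetical stand-in for a trained decoder: maps the last generated
# token to the token that should receive the highest probability.
TOY_NEXT = {"<start>": "crossed", "crossed": "the", "the": "road", "road": "<end>"}

def toy_decoder(context, generated):
    """Return one probability per vocabulary word (toy logic, not a real model)."""
    probs = np.full(len(VOCAB), 0.01)
    probs[VOCAB.index(TOY_NEXT[generated[-1]])] = 1.0
    return probs / probs.sum()

def greedy_decode(context, max_len=10):
    generated = ["<start>"]
    while len(generated) < max_len:
        probs = toy_decoder(context, generated)
        word = VOCAB[int(np.argmax(probs))]   # select the most probable word
        if word == "<end>":
            break
        generated.append(word)                # feed the prediction back in
    return generated[1:]                      # drop the <start> token

print(greedy_decode("the dog safely"))  # ['crossed', 'the', 'road']
```

Always taking the arg-max is called greedy decoding; it is the selection rule the slide describes, although other strategies (sampling, beam search) exist.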
Architecture of Transformers

• During training, however, we need to give both the input and the expected output to the model.

• For an input “the dog safely”, let’s say the decoder generated the word “touched”.

• If we continued to give the model its own wrongly predicted word, future generated words would be conditioned on this wrong word.

• So, instead of giving the model its own predicted word, we always give the ground truth as the input to the decoder.

• This is called Teacher Forcing.


Architecture of Transformers

• Embeddings
• Positional Encoding

Encoder Components-
• Multi-Head (Self) Attention
• Fully Connected Feed Forward Network
• Residual Connections & Layer Normalization

Decoder Components-
• Masked Multi-Head (Self) Attention
• Multi-Head (Cross) Attention

• Final Linear & SoftMax Layer


Embeddings
• When we give a sentence to the transformer, we first need to split it into
tokens which are part of the transformer’s vocabulary. We then get the
corresponding vector representation (embeddings) of the token.

• The original Transformer paper uses a vocabulary of about 37,000 tokens. The dimension of the embeddings used is $d_{model} = 512$.

• For our example, we’ll use a vocabulary of only 5 words: [‘crossed’, ‘dog’, ‘road’, ‘safely’, ‘the’].

• The embedding vectors are randomly initialized at the beginning and are learnt during training.

• The dimension we’ll use is $d_{model} = 4$.


Embeddings
Initial Embeddings:

Token       Token ID   Token Embedding
‘crossed’   0          [0.62, 0.79, 0.26, 0.43]
‘dog’       1          [0.42, 0.34, 0.34, 0.49]
‘road’      2          [0.04, 0.54, 0.51, 0.86]
‘safely’    3          [0.56, 0.98, 0.86, 0.42]
‘the’       4          [0.86, 0.34, 0.49, 0.57]

Word embeddings are numerical representations of words in a continuous vector space, where each dimension captures some aspect of the word's meaning or usage. The values in the vector for the word "crossed", [0.62, 0.79, 0.26, 0.43], represent its position in this space.

So, for an input of “the dog safely”, the following input matrix $H_{n \times d_{model}}$ will be generated:

‘the’      0.86  0.34  0.49  0.57
‘dog’      0.42  0.34  0.34  0.49     = $H_{3 \times 4}$
‘safely’   0.56  0.98  0.86  0.42
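As a small sketch, the lookup from tokens to the input matrix can be written as below. Splitting on whitespace is an assumption made for this toy vocabulary; real tokenizers are more involved. The embedding values are copied from the table above.

```python
import numpy as np

vocab = ["crossed", "dog", "road", "safely", "the"]
token_to_id = {tok: i for i, tok in enumerate(vocab)}

# One row per token ID, values taken from the slide.
embeddings = np.array([
    [0.62, 0.79, 0.26, 0.43],   # crossed
    [0.42, 0.34, 0.34, 0.49],   # dog
    [0.04, 0.54, 0.51, 0.86],   # road
    [0.56, 0.98, 0.86, 0.42],   # safely
    [0.86, 0.34, 0.49, 0.57],   # the
])

# Whitespace tokenization: an assumption that only works for this toy vocabulary.
ids = [token_to_id[t] for t in "the dog safely".split()]
H = embeddings[ids]             # row lookup gives the n x d_model input matrix

print(ids)  # [4, 1, 3]
print(H)
```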
Positional Encoding

We need some way of encoding the positional data of the words in the sentence

For example: consider the two sentences-


The coach praised the player because he worked hard.
The coach because he worked hard praised the player.

In the 1st sentence, ‘he’ refers to the player.
But with only a change in the position of the words, in the 2nd sentence ‘he’ refers to the coach.

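Hence, we use the following to encode the position of the word along with the embedding:

For even dimension indices: $$PE(pos, 2i) = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$
For odd dimension indices: $$PE(pos, 2i+1) = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$
where $pos$ is the position of the word in the sentence and $i$ indexes the sine/cosine pairs of dimensions.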
Positional Encoding


So, for our example [‘the’, ‘dog’, ‘safely’]:

‘the’ is in position 0:

index     0                      1                      2                      3
H_0       0.86                   0.34                   0.49                   0.57
formula   sin(0/10000^(2·0/4))   cos(0/10000^(2·0/4))   sin(0/10000^(2·1/4))   cos(0/10000^(2·1/4))
PE        0                      1                      0                      1
Positional Encoding
‘dog’ is in position 1:

index     0                      1                      2                      3
H_1       0.42                   0.34                   0.34                   0.49
formula   sin(1/10000^(2·0/4))   cos(1/10000^(2·0/4))   sin(1/10000^(2·1/4))   cos(1/10000^(2·1/4))
PE        0.8415                 0.5403                 0.01                   0.9999

‘safely’ is in position 2:

index     0                      1                      2                      3
H_2       0.56                   0.98                   0.86                   0.42
formula   sin(2/10000^(2·0/4))   cos(2/10000^(2·0/4))   sin(2/10000^(2·1/4))   cos(2/10000^(2·1/4))
PE        0.9093                 −0.4161                0.02                   0.9998
Positional Encoding
We finally add the $PE$ to the input embeddings:

$$H_{3 \times 4} = H_{3 \times 4} + PE$$

            Embeddings                    PE
‘the’      0.86  0.34  0.49  0.57       0       1        0     1
‘dog’      0.42  0.34  0.34  0.49   +   0.8415  0.5403   0.01  0.9999
‘safely’   0.56  0.98  0.86  0.42       0.9093  −0.4161  0.02  0.9998

Result:

‘the’      0.86    1.34    0.49  1.57
‘dog’      1.2615  0.8803  0.35  1.4899     = $H_{3 \times 4}$
‘safely’   1.4693  0.5639  0.88  1.4198
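The positional encoding and the final sum can be reproduced with a few lines of NumPy. This is a sketch that assumes an even `d_model`; the function name `positional_encoding` is illustrative. The printed values should agree with the tables above up to rounding in the last decimal place.

```python
import numpy as np

def positional_encoding(n, d_model):
    # Assumes d_model is even, so dimensions come in (sin, cos) pairs.
    pos = np.arange(n)[:, None]                 # positions 0..n-1, shape (n, 1)
    i = np.arange(d_model // 2)[None, :]        # pair index, shape (1, d_model/2)
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((n, d_model))
    pe[:, 0::2] = np.sin(angles)                # even dimensions
    pe[:, 1::2] = np.cos(angles)                # odd dimensions
    return pe

H = np.array([
    [0.86, 0.34, 0.49, 0.57],   # the
    [0.42, 0.34, 0.34, 0.49],   # dog
    [0.56, 0.98, 0.86, 0.42],   # safely
])
PE = positional_encoding(3, 4)
H_pos = H + PE

print(np.round(PE, 4))
print(np.round(H_pos, 4))
```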
Encoder – Multi-Head (Self) Attention

The original paper uses 8 heads; for our example, we’ll use only 2 heads.
The original paper uses $d_q = d_k = d_v = d_{model}/h = 64$; we’ll use 2 as our dimension.

$$\mathrm{MultiHead}(H) = \mathrm{Concat}(head_1, \dots, head_h)\, W_O$$
$$\text{where } head_i = \mathrm{softmax}\!\left(\frac{Q_i K_i^T}{\sqrt{d_k}}\right) V_i$$
$$\text{and } Q_i = H W_{q_i}, \quad K_i = H W_{k_i}, \quad V_i = H W_{v_i}$$

$W_{q_0}, W_{k_0}, W_{v_0}$ and $W_{q_1}, W_{k_1}, W_{v_1}$ are initialized randomly.

Let’s compute attention for one head.


Encoder – Multi-Head (Self) Attention

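Assuming we get:

$$Q_0 = H W_{q_0} = \begin{bmatrix} 1.2922 & 1.6889 \\ 1.2541 & 2.5407 \\ 0.6617 & 2.7296 \end{bmatrix}$$

$$K_0 = H W_{k_0} = \begin{bmatrix} 0.0182 & 0.4839 \\ 0.5597 & 0.8135 \\ 0.4560 & 1.2663 \end{bmatrix}$$

$$V_0 = H W_{v_0} = \begin{bmatrix} 2.6708 & -1.1731 \\ 2.6051 & -0.5519 \\ 3.5441 & -1.0239 \end{bmatrix}$$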
Encoder – Multi-Head (Self) Attention

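$$Q_0 K_0^T = \begin{bmatrix} 1.2922 & 1.6889 \\ 1.2541 & 2.5407 \\ 0.6617 & 2.7296 \end{bmatrix} \times \begin{bmatrix} 0.0182 & 0.5597 & 0.4560 \\ 0.4839 & 0.8135 & 1.2663 \end{bmatrix} = \begin{bmatrix} 0.8408 & 2.0970 & 2.7278 \\ 1.2523 & 2.7686 & 3.7890 \\ 1.3330 & 2.5908 & 3.7582 \end{bmatrix}$$

With $d_k = 2$, each score is divided by $\sqrt{2}$:

$$\frac{Q_0 K_0^T}{\sqrt{d_k}} = \begin{bmatrix} 0.5946 & 1.4828 & 1.9289 \\ 0.8855 & 1.9577 & 2.6792 \\ 0.9425 & 1.8320 & 2.6574 \end{bmatrix}$$

$$\mathrm{softmax}\!\left(\frac{Q_0 K_0^T}{\sqrt{d_k}}\right) = \begin{bmatrix} 0.1383 & 0.3363 & 0.5253 \\ 0.1007 & 0.2941 & 0.6052 \\ 0.1112 & 0.2707 & 0.6180 \end{bmatrix}$$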
Encoder – Multi-Head (Self) Attention

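$$\mathrm{softmax}\!\left(\frac{Q_0 K_0^T}{\sqrt{d_k}}\right) V_0 = \begin{bmatrix} 0.1383 & 0.3363 & 0.5253 \\ 0.1007 & 0.2941 & 0.6052 \\ 0.1112 & 0.2707 & 0.6180 \end{bmatrix} \times \begin{bmatrix} 2.6708 & -1.1731 \\ 2.6051 & -0.5519 \\ 3.5441 & -1.0239 \end{bmatrix} = \begin{bmatrix} 3.1075 & -0.8858 \\ 3.1800 & -0.9001 \\ 3.1928 & -0.9127 \end{bmatrix}$$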
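As a check on the arithmetic, the head-0 computation can be replayed in NumPy using the $Q_0$, $K_0$, $V_0$ values given on the slides. Because those inputs are already rounded to four decimals, the result should only be expected to match the slide values to roughly three decimal places.

```python
import numpy as np

# Values copied from the slides (already rounded to 4 decimals).
Q0 = np.array([[1.2922, 1.6889], [1.2541, 2.5407], [0.6617, 2.7296]])
K0 = np.array([[0.0182, 0.4839], [0.5597, 0.8135], [0.4560, 1.2663]])
V0 = np.array([[2.6708, -1.1731], [2.6051, -0.5519], [3.5441, -1.0239]])

d_k = K0.shape[1]                                   # 2 in this example
scores = Q0 @ K0.T / np.sqrt(d_k)                   # scaled dot products, 3 x 3
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)       # row-wise softmax
head0 = weights @ V0                                # 3 x 2 attention output

print(np.round(scores, 4))
print(np.round(weights, 4))
print(np.round(head0, 4))
```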
Encoder – Multi-Head (Self) Attention

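Hence,
$$head_0 = \begin{bmatrix} 3.1075 & -0.8858 \\ 3.1800 & -0.9001 \\ 3.1928 & -0.9127 \end{bmatrix}$$

Similarly, let’s say
$$head_1 = \begin{bmatrix} 0.3869 & -4.1866 \\ 0.3944 & -4.2459 \\ 0.4358 & -4.4043 \end{bmatrix}$$

$$\therefore \mathrm{MultiHead}(H) = \mathrm{Concat}(head_0, head_1)\, W_O = \begin{bmatrix} 3.1075 & -0.8858 & 0.3869 & -4.1866 \\ 3.1800 & -0.9001 & 0.3944 & -4.2459 \\ 3.1928 & -0.9127 & 0.4358 & -4.4043 \end{bmatrix} \times W_O$$

With $W_O$ initialized randomly, this gives:

“the”      3.1521  3.9323  0.5346  −3.9235
“dog”      3.1898  3.9741  0.4894  −3.9718
“safely”   3.4267  4.1227  0.6870  −4.1507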
Encoder – Fully Connected Feed Forward Network

After the multi-head attention, the transformed $n \times d_{model}$ matrix is passed to a feed forward network. This is applied to each position ($1 \times d_{model}$ vector) separately and identically. It consists of two linear transformations with a ReLU activation in between, which is equivalent to the following:

$$\mathrm{FFN}(x) = \max(0,\; xW_1 + b_1)\, W_2 + b_2$$

The dimensionality of the input and output is $d_{model}$, and the inner layer has dimensionality $d_{ff}$. In the original paper, $d_{ff} = 2048$.

Input to the FFN:
“the”      3.1521  3.9323  0.5346  −3.9235
“dog”      3.1898  3.9741  0.4894  −3.9718
“safely”   3.4267  4.1227  0.6870  −4.1507

Output of the FFN:
“the”      −2.3621  14.1629  −7.4911  −21.0523
“dog”      −2.2789  14.1132  −7.5580  −21.4520
“safely”   −2.5964  15.0122  −7.9764  −21.3564
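A minimal sketch of the position-wise feed forward network is shown below. The weights here are random, so the output will not match the slide's numbers (the slide's $W_1$, $W_2$ are not given); the inner size `d_ff = 8` is an assumed toy value standing in for 2048.

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    # Two linear transformations with a ReLU in between.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 4, 8                      # toy sizes; the paper uses 512 and 2048
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

X = np.array([
    [3.1521, 3.9323, 0.5346, -3.9235],    # the
    [3.1898, 3.9741, 0.4894, -3.9718],    # dog
    [3.4267, 4.1227, 0.6870, -4.1507],    # safely
])
out = feed_forward(X, W1, b1, W2, b2)
print(out.shape)  # (3, 4)
```

Applying `feed_forward` to a single row gives the same result as the corresponding row of the batched call, which is what "separately and identically" means in practice.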
Encoder – Residual Connections

After many transformations, there’s a chance that the original patterns in the data are lost.

To prevent this, we have “skip connections” or “residual connections”.

This simply adds the original matrix to the transformed matrix, so that the resultant matrix has components from both of them. It also allows gradients to flow through the network more easily.
Encoder – Layer Normalization

Along with having residual connections, we also normalize the data to stabilize training and improve performance.

We perform layer normalization: each row is normalized as (value − mean) / std, using that row's own mean and standard deviation.

                                                   mean      std
“the”      −2.3621  14.1629  −7.4911  −21.0523    −4.1857   14.5534
“dog”      −2.2789  14.1132  −7.5580  −21.4520    −4.2939   14.6962
“safely”   −2.5964  15.0122  −7.9764  −21.3564    −4.2293   15.0586

After normalization:

“the”      0.1253  1.2608  −0.2271  −1.1589
“dog”      0.1371  1.2525  −0.2221  −1.1675
“safely”   0.1084  1.2778  −0.2488  −1.1374
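The normalization above can be reproduced as follows. Two points in this sketch are assumptions worth flagging: the slide's std values match the sample standard deviation (`ddof=1`), whereas layer normalization as usually implemented uses the population variance plus a small epsilon; and the learnable gain and bias of a real layer-norm are omitted. The `add_and_norm` helper shows how the residual connection and the normalization are typically combined.

```python
import numpy as np

def layer_norm(x, ddof=0, eps=0.0):
    # Normalize each row (position) with its own mean and standard deviation.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, ddof=ddof, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer_out):
    # Residual connection followed by layer normalization.
    return layer_norm(x + sublayer_out, eps=1e-6)

ffn_out = np.array([
    [-2.3621, 14.1629, -7.4911, -21.0523],   # the
    [-2.2789, 14.1132, -7.5580, -21.4520],   # dog
    [-2.5964, 15.0122, -7.9764, -21.3564],   # safely
])

# ddof=1 reproduces the mean/std columns shown on the slide.
normed = layer_norm(ffn_out, ddof=1)
print(np.round(normed, 4))
```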
Encoder – Final Architecture

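Till now, we’ve seen the operations which form one layer of the encoder.
Each layer consists of a Multi-Head (Self) Attention sub-layer and a Fully Connected Feed Forward Network, each followed by a residual connection and layer normalization.
The Encoder is composed of a stack of multiple such layers; in the original paper, it consists of 6 layers.

The purpose is to encode the input into a continuous representation with attention information.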
Decoder – Architecture
Decoder – Masked Multi-Head (Self) attention

In the decoder block, during training, we provide the expected target (ground truth), as required by Teacher Forcing. Because the whole target sequence is available at once, we use masks to hide future information so that the decoder only uses the previous and current words.

We apply the mask after scaling $QK^T$ by $\sqrt{d_k}$ and before performing SoftMax.


Decoder – Masked Multi-Head (Self) attention
Scaled scores $QK^T / \sqrt{d_k}$:

             <start>   ‘crossed’   ‘the’     ‘road’
<start>      0.8408    2.0970      2.7278    2.7686
‘crossed’    0.1084    1.2778      −0.2488   −1.1374
‘the’        1.2523    2.7686      3.7890    0.4894
‘road’       1.2608    −0.2271     −1.1589   0.6870

After masking:

             <start>   ‘crossed’   ‘the’     ‘road’
<start>      0.8408    −∞          −∞        −∞
‘crossed’    0.1084    1.2778      −∞        −∞
‘the’        1.2523    2.7686      3.7890    −∞
‘road’       1.2608    −0.2271     −1.1589   0.6870
Decoder – Masked Multi-Head (Self) attention

By placing −∞ in the places where we don’t want the model to know the attention scores, SoftMax computes them as 0, and hence future words are successfully hidden from the model.

Masked scores:

             <start>   ‘crossed’   ‘the’     ‘road’
<start>      0.8408    −∞          −∞        −∞
‘crossed’    0.1084    1.2778      −∞        −∞
‘the’        1.2523    2.7686      3.7890    −∞
‘road’       1.2608    −0.2271     −1.1589   0.6870

After SoftMax:

             <start>   ‘crossed’   ‘the’     ‘road’
<start>      1         0           0         0
‘crossed’    0.2370    0.7630      0         0
‘the’        0.0550    0.2504      0.6946    0
‘road’       0.5324    0.1202      0.0474    0.3000
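The masking step can be sketched in NumPy using the scaled scores from the slide. The upper-triangular mask built with `np.triu` is one common way to express "hide everything to the right of the diagonal"; the variable names are illustrative.

```python
import numpy as np

# Scaled scores from the slide; rows/columns: <start>, crossed, the, road.
scores = np.array([
    [0.8408,  2.0970,  2.7278,  2.7686],
    [0.1084,  1.2778, -0.2488, -1.1374],
    [1.2523,  2.7686,  3.7890,  0.4894],
    [1.2608, -0.2271, -1.1589,  0.6870],
])

n = scores.shape[0]
# True above the diagonal, i.e. at positions that lie in the future.
future = np.triu(np.ones((n, n), dtype=bool), k=1)
masked = np.where(future, -np.inf, scores)

# Row-wise softmax; exp(-inf) evaluates to 0, so future words get zero weight.
shifted = masked - masked.max(axis=1, keepdims=True)
weights = np.exp(shifted)
weights /= weights.sum(axis=1, keepdims=True)

print(np.round(weights, 4))
```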
Decoder – Multi-Head (Cross) attention

The second Multi-Head attention module in the layer is slightly different.

It takes the Keys & Values from the encoder module and the Query from the previous multi-head attention module in the layer.

Since the Query (from the decoder) is matched against Keys (from the encoder), this is called cross-attention.
Final Layer

• The Decoder is also composed of a stack of identical layers (6 in the original paper). The output of the top-most decoder layer is passed through a Linear Layer, which acts as a classifier and computes a score for each word in the model’s vocabulary.

• The last word embedding ($1 \times d_{model}$) of the last output layer is passed through a linear transformation $W_D$ of size $d_{model} \times$ (vocabulary size).

• This transformation outputs a vector which has a score for each word in the vocabulary. SoftMax is applied to the vector, and the word with the highest probability is chosen as the predicted word.

• The predicted word is then passed as input to the decoder again, until the <end> token is chosen (auto-regressive).
Final Layer
Types of Transformers

In the Transformer architecture, different variations serve specific purposes.

These can be broadly categorized as:

• Encoder-Only Transformer (BERT)
• Decoder-Only Transformer (GPT)
• Encoder + Decoder Transformer

Picture Credit
Types of Transformers – Encoder Only Transformer

These models use only the encoder part of the transformer architecture.
The encoder focuses on understanding the input sequence by attending to
all parts of it, and learning deep contextualized representations of the input.
The output from the encoder can then be used for various downstream tasks.

Use Cases:
• Text Classification: Classifying the sentiment of a sentence or detecting spam.
• Named Entity Recognition (NER): Identifying entities like names, places, or
organizations in text.
• Token Classification: Part-of-speech tagging, labeling each word/token in a sentence.
• Masked Language Modeling (MLM): Predicting missing words in a sentence.

Examples:
• BERT (Bidirectional Encoder Representations from Transformers)
Types of Transformers – Decoder Only Transformer

These models use only the decoder part of the transformer architecture.
The decoder focuses on generating sequences and is used for autoregressive tasks, where the model generates the next word or token in a sequence given the preceding ones.
Decoders rely on self-attention over past tokens and generate outputs one token at a time.

Use Cases:
• Text Generation: Autoregressively generating text, like completing a sentence or story
generation.
• Language Modeling: Predicting the next word or sequence of words in a sentence.
• Code Generation: Automatically generating code from a prompt or comments.
• Dialogue Systems: Used in chatbot and conversational systems to generate responses.

Examples:
• GPT (Generative Pretrained Transformer) & GROVER
Types of Transformers – Encoder + Decoder Transformer

These models use both the encoder and decoder components.


The encoder processes the input sequence and compresses it into a representation.
The decoder then takes this representation and generates an output sequence.
This architecture is most commonly used in tasks that require converting one
sequence into another, often of different lengths.

Use Cases:
• Machine Translation: Translating sentences from one language to another.
• Summarization: Converting long articles into shorter summaries.
• Text-to-Text: Tasks like sentence paraphrasing or converting a question into an answer.
• Speech-to-Text: Converting spoken language into written text (speech recognition).

Examples:
• T5 (Text-to-Text Transfer Transformer) & mBART (Multilingual BART)
References

• Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need,
2023. – Link
• Mitesh Khapra (IITM Professor), Transformer Slides – Link
• Polo Club (Georgia Tech), Interactive Transformer Explainer – Link
• Michael Nguyen (The AI Hacker), Illustrated Guide to Transformers - Link
• Grant Sanderson (3b1b), Neural Networks - Link
• Yannic Kilcher, Explaining “Attention is all you need paper” - Link
• Batool Haider (Hedu AI), A visual guide to transformers – Link
• Tom Yeh, AI by Hand - Link
• Fareed Khan, Solving Transformers by Hand - Link
THANK YOU
Generative AI and its Applications
LLM Architecture and Models
BERT
Dr. Arti Arya
Department of Computer Science and
Engineering
BERT

Bidirectional Encoder Representations from Transformers


• BERT utilizes the encoder component of the Transformer architecture, which allows it to
process text in a unique bidirectional manner.
• Unlike traditional models that read text either left-to-right or right-to-left, BERT analyzes text from
both directions simultaneously.
• This bidirectional approach enables BERT to grasp context more effectively and accurately
understand language nuances.

Introduced by Google AI in October 2018


BERT

Bidirectional Encoder Representations from Transformers


Example: In "The bank is by the river"
● Understands "bank" using both:
○ Previous words: "The"
○ Following words: "is by the river"
BERT

How BERT Learns Language: Two Key Training Methods


1. Masked Language Model
2. Next Sentence Prediction
BERT
How BERT Learns Language: Two Key Training Methods
Masked Language Modeling (MLM)
Think of this like a sophisticated version of fill-in-the-blanks.
For example:
● Original: "The cat sits on the mat"
● Masked: "The [MASK] sits on the mat"

BERT looks at all words (before and after the mask) to figure out what word
should go in the blank.

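During training:
● It randomly selects 15% of the tokens for prediction
● Most selected tokens are replaced with [MASK]; some are replaced with a random (incorrect) token or left unchanged, which helps the model learn better
● It learns to predict the original word by understanding the entire context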
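A rough sketch of this masking procedure is given below. The 80% / 10% / 10% split between `[MASK]`, a random token, and the unchanged token follows the BERT paper; the function name `mask_tokens`, the word-level tokens, and the tiny vocabulary are assumptions for illustration (BERT itself works on WordPiece tokens).

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels); labels are None where nothing is predicted."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)                    # the model must recover this token
            r = rng.random()
            if r < 0.8:
                masked.append("[MASK]")           # 80%: replace with [MASK]
            elif r < 0.9:
                masked.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                masked.append(tok)                # 10%: keep the token unchanged
        else:
            labels.append(None)                   # not selected, no loss here
            masked.append(tok)
    return masked, labels

tokens = "the cat sits on the mat".split()
# mask_prob is raised to 0.5 only so that this short sentence shows some masking.
masked, labels = mask_tokens(tokens, vocab=tokens, mask_prob=0.5)
print(masked)
print(labels)
```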
BERT

References:
Book: Build a Large Language Model (From Scratch)
BERT

2. Next Sentence Prediction (NSP)


This is like teaching BERT to understand if two sentences naturally follow each other.
For instance:
● Sentence A: "I love ice cream"
● Sentence B: "It's my favorite dessert" (This naturally follows)
vs.
● Sentence B: "The car needs gas" (This doesn't follow)

BERT learns to predict whether sentence B truly follows sentence A in the original text.
BERT

• Two sentences A and B, with N and M tokens, are separated by a [SEP] token.

• Some tokens are masked for the Masked Language Modelling task, and a [CLS] token is added at the beginning of the sequence for the Next Sentence Prediction task.
BERT

Architecture
BERT uses an encoder-only transformer architecture consisting of four key modules:
● A tokenizer that converts text into integer sequences.
● An embedding layer that transforms tokens into vector embeddings.
● A transformer encoder stack with self-attention.
● A task head that produces probability distributions over tokens.
BERT

BERT's Input Processing

BERT processes text in a special way:


● Every word gets three types of information:
1. The word itself (Token Embedding)
2. Which sentence it belongs to (Segment Embedding)
3. Where it appears in the sentence (Position Embedding)

Special markers are used:


● [CLS] at the start (like raising a flag saying "pay attention, new text coming")
● [SEP] between sentences (like a boundary marker)
● [MASK] for hidden words during training
Model Input Output Representation
The first token in each sequence is always a [CLS] token, which is used for classification.

● Telling the sentences apart is done in two steps:
○ Separating the sentences with a [SEP] token.
○ Adding a learned embedding to each token to state which sentence it belongs to.

Each token is represented by summing the corresponding token, segment and position embeddings, as seen in the figure.

[Link]
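The input layout described above can be sketched as follows. This is an illustration only: the helper name `build_bert_input` is hypothetical, the sentences are split on whitespace rather than with BERT's WordPiece tokenizer, and the code produces the IDs that would index the segment and position embedding tables rather than the embeddings themselves.

```python
def build_bert_input(tokens_a, tokens_b):
    # [CLS] A ... [SEP] B ... [SEP]
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and the first [SEP];
    # segment 1 covers sentence B and the final [SEP].
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    # Position IDs simply count from the start of the sequence.
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids

tokens, segment_ids, position_ids = build_bert_input(
    "I love ice cream".split(),
    "It's my favorite dessert".split(),
)
for tok, seg, pos in zip(tokens, segment_ids, position_ids):
    print(f"{pos:2d}  segment {seg}  {tok}")
```

In the model, each of the three ID sequences is looked up in its own embedding table, and the three resulting vectors are summed per token.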
BERT: Fine-tuning(Teaching BERT New Tasks)

BERT can be fine-tuned for specific tasks (downstream tasks), like:

● Understanding if a movie review is positive or negative: Sentiment Analysis
● Finding answers in a text: Question Answering
● Identifying names and places in sentences: NER
● POS tagging
This is like taking a well-educated student (pre-trained BERT) and giving them specialized training for a specific job.

• In contrast to ELMo, BERT’s fine-tuning process involves updating all the model parameters (both the pre-trained encoder and the added task-specific layers) during task-specific training.
BERT
BERT: Overview

● BERT was pre-trained on Wikipedia (2.5 billion words) and BookCorpus (800 million words); it can then be fine-tuned on question-answering datasets such as SQuAD.

● Researchers also compete over Natural Language Understanding with SQuAD (Stanford Question Answering Dataset).

● BERT was one of the first models to achieve state-of-the-art performance on the SQuAD benchmark.

● BERT now even beats the human performance baseline on SQuAD.


BERT

Many of the major AI companies are also building BERT versions:

● Microsoft extended BERT with MT-DNN (Multi-Task Deep Neural Network).
● RoBERTa from Facebook.
BERT: What Challenges Does BERT Help to Solve?

The Problem with Words


● The problem with words is that they’re everywhere. More and
more content is out there.
● Words are problematic because plenty of them are ambiguous,
polysemous, and synonymous.
● BERT is designed to help solve ambiguous sentences and phrases
that are made up of lots and lots of words with multiple meanings.
BERT: What Challenges Does BERT Help to Solve?
Ambiguity & Polysemy

● Almost every other word in the English language has multiple meanings. In spoken
word, it is even worse because of homophones and prosody.

● For instance, “four candles” and “fork handles” for those with an English accent.
Another example: comedians’ jokes are mostly based on the play on words because
words are very easy to misinterpret.

● It’s not very challenging for us because we have common sense and context so we
can understand all the other words that surround the context of the situation or the
conversation – but search engines and machines don’t.

● This does not bode well for conversational search into the future.
BERT

● Eg.
“I like the way that looks like the other one.”
By Stanford Part-of-Speech Tagger , “like” is considered to be two separate parts
of speech (POS).

● The word “like” may be used as different parts of speech including verb, noun,
and adjective.
● So literally, the word “like” has no meaning because it can mean whatever
surrounds it.
● The context of “like” changes according to the meanings of the words that
surround it.
● The longer the sentence is, the harder it is to keep track of all the different
parts of speech within the sentence.
BERT

BERT has almost the same model architecture across different tasks, with small changes between the pre-trained architecture and the final downstream architecture.
BERT

• The choice between encoder-only and decoder-only models depends on the task and the type of context required.
Generative AI and its Applications
• Transformers (2017): Introduced the fundamental architecture for LLMs.

• ELMo (2018): Leveraged context-sensitive embeddings using bi-directional LSTMs.

• GPT (2018): Built on transformer architecture for text generation.

• BERT (2018): Utilized transformers bidirectionally for better context understanding.

• RoBERTa (2019): Enhanced BERT with optimized training methods.

• BART (2019): Combined encoder-decoder frameworks from BERT and GPT.


GPT(Generative Pre-training Transformer)

The term "causal" in this context comes from the fact that the model
learns to predict the next token in the sequence based on the causal
influence of the previous tokens
GPT(Generative Pre-training Transformer)
GPT(Generative Pre-training Transformer)

• Following a similar idea to ELMo, OpenAI GPT (Radford et al., 2018) expands the unsupervised language model to a much larger scale by training on a giant collection of free text corpora.

• Despite the similarity, GPT has two major differences from ELMo.

1. The model architectures are different: ELMo uses a shallow concatenation of independently trained left-to-right and right-to-left multi-layer LSTMs, while GPT is a multi-layer transformer decoder.

2. The use of contextualized embeddings in downstream tasks is different: ELMo feeds embeddings into models customized for specific tasks as additional features, while GPT fine-tunes the same base model for all end tasks.
GPT(Generative Pre-training Transformer)

❖ Compared to the original transformer architecture, the transformer decoder model discards the encoder part, so there is only one single input sentence rather than two separate source and target sequences.
❖ This model applies multiple transformer blocks over the embeddings of input sequences.
❖ Each block contains a masked multi-headed self-attention layer and a pointwise feed-forward layer.
❖ The final output produces a distribution over target tokens after softmax normalization.
GPT(Generative Pre-training Transformer)

The loss is the negative log-likelihood, the same as in ELMo, but without the backward computation.

Let’s say the context window of size k is located before the target word; the loss would then be:
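Under the usual formulation (stated here as an assumption, following Radford et al., 2018), with a token sequence $x_1, \dots, x_n$ and a context window of size $k$, this loss can be written as:

$$\mathcal{L}_{LM} = -\sum_{i} \log p\left(x_i \mid x_{i-k}, \dots, x_{i-1}\right)$$

Each term is the negative log-probability that the model assigns to the true token $x_i$ given only the $k$ tokens before it, which is why no backward (right-to-left) pass is involved.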
GPT
• As discussed earlier, GPT is based on the Decoder component of the transformer architecture.
• At its heart, the model processes input tokens through multiple stacked transformer layers, each containing self-attention mechanisms and feedforward neural networks.

Autoregressive means that the model's predictions are conditioned on previous tokens.

References:
Book: Build a Large Language Model (From Scratch)
GPT

References:
Book: Build a Large Language Model (From Scratch)
GPT: Key Components
• Token Processing
When text enters the system, it undergoes two crucial encoding steps:
● Token embeddings convert words into vector representations
● Positional encodings add sequence information to maintain word order
• Self-Attention Mechanism
The self-attention mechanism is what gives GPT its contextual understanding
capabilities. It processes input through three main components:
● Query vectors
● Key vectors
● Value vectors
These vectors work together to compute attention scores, allowing the model
to weigh the importance of different words in relation to each other.

References:
Blog: A Deep Dive into GPT's Transformer Architecture: Understanding
Self-Attention Mechanisms
GPT Architecture

• GPT models are highly capable text completion models.


• They can perform a variety of tasks beyond text generation, such as:
● Spelling correction
● Text classification
● Language translation
This versatility is remarkable, considering that GPT models are trained on a relatively simple task:
next-word prediction.

References:
Book: Build a Large Language Model (From Scratch)
GPT Architecture

Next-Word Prediction Task


● The next-word prediction task is a form of self-supervised learning.
○ Self-supervised learning is a type of self-labeling, where explicit labels for the training data are not required.
● Instead of manually labeling data, the structure of the data itself is used:
○ The next word in a sentence or document serves as the label that the model predicts.
● Advantages of this approach:
○ Labels are created "on the fly."
○ Enables the use of massive unlabeled text datasets for training.
● This approach allows GPT models to learn from vast amounts of data, improving their performance and generalization.
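To illustrate how labels are created "on the fly", the sketch below turns a raw sentence into (context, next word) training pairs. The function name `next_word_pairs`, the whitespace splitting, and the `context_size` parameter are assumptions for illustration; real GPT training works on subword tokens and much longer contexts.

```python
def next_word_pairs(text, context_size=3):
    """Build (context, label) pairs where the label is simply the next word."""
    words = text.split()
    pairs = []
    for i in range(1, len(words)):
        context = words[max(0, i - context_size):i]   # up to context_size previous words
        pairs.append((context, words[i]))             # the next word is the label
    return pairs

for context, label in next_word_pairs("the dog safely crossed the road"):
    print(context, "->", label)
```

No human annotation is involved at any point: every label comes directly from the text itself.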

References:
Book: Build a Large Language Model (From Scratch)
GPT Architecture: A quick Review
● The GPT architecture is relatively simple compared to the original Transformer
architecture.

● It consists of only the decoder part of the Transformer model, without the
encoder.

● Key characteristics of GPT architecture:


○ It is a decoder-style model.

○ It generates text by predicting one word at a time, making it an autoregressive model.

References:
Book: Build a Large Language Model (From Scratch)
GPT Architecture

Autoregressive Models
● GPT models are considered autoregressive because:
○ They incorporate their previous outputs as inputs for future predictions.
○ Each new word is chosen based on the sequence of words that precedes it.

● This autoregressive nature improves the coherence of the generated text.

References:
Book: Build a Large Language Model (From Scratch)
GPT : Training Methodology
GPT models undergo a two-phase training process:
• Pre-training
During this phase, the model learns from vast amounts of text
data, developing an understanding of language patterns and
relationships.
• Fine-tuning
The model is then refined on specific tasks with human
feedback, improving its ability to generate contextually
appropriate responses
GPT

Figure:
Two Stage Process

References:
Book: Build a Large Language Model (From Scratch)
GPT: Recent Developments
• Recent GPT iterations have grown substantially in capability.
GPT-4, o1, etc.: although their exact architectures remain private, they
demonstrate significant improvements in reasoning and contextual
understanding over their predecessors, and GPT-4 is multimodal.

Emergent Behaviour:
Emergent behavior refers to skills or abilities that a model develops naturally
during training, even though they were not directly taught or targeted.
Exposure to massive amounts of multilingual data in various contexts allows
models like GPT to "learn" translation patterns between languages. This means
GPT can perform translation tasks without being explicitly trained for them.
Generative AI and its Applications
LLM Architecture and Models

Dr. Arti Arya


Department of Computer Science and
Engineering

RoBERTa
RoBERTa: A More Robust BERT

• When BERT was first introduced, it revolutionized the field of NLP.

• Facebook AI proposed "RoBERTa: A Robustly Optimized BERT Pretraining Approach" and showed how relatively simple modifications to BERT's training process can lead to significantly better performance.

• RoBERTa (Robustly Optimized BERT Approach) emerged from a careful replication study of BERT's pretraining process.

• The researchers found that BERT was significantly undertrained and that, with some key modifications, they could substantially improve its performance.

References:
Paper: RoBERTa: A Robustly Optimized BERT Pretraining
Approach
RoBERTa: What Makes RoBERTa Different?
• 1. Longer Training:

RoBERTa was trained for longer than the original BERT, giving it more opportunity to learn from the
training data. It also used larger batches during training, which helped stabilize the learning process.

• 2. NSP Removal:

They removed what's called the "Next Sentence Prediction" (NSP) objective. Originally, BERT was
trained to predict whether two text segments appeared next to each other in the original text. The
researchers found this task wasn't necessary and might even be harmful to the model's performance.

• 3. Dynamic masking:

In the original BERT, the masking pattern (which words get hidden from the model) was static: it was
created once during data preprocessing. RoBERTa uses dynamic masking, where the masking pattern
is generated every time the model sees a sequence. This prevents the model from memorizing
specific patterns and forces it to learn more robust features.
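The difference between static and dynamic masking can be sketched in a few lines. This is a simplified illustration, not the RoBERTa implementation: the helper `mask_tokens` and the 0.3 masking rate are assumptions chosen so the effect is visible on a short sentence, and every selected token is simply replaced with [MASK].

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, rng, mask_prob=0.15):
    """Replace each token with [MASK] with probability mask_prob."""
    return [MASK if rng.random() < mask_prob else tok for tok in tokens]


tokens = "the quick brown fox jumps over the lazy dog".split()

# Static masking: the pattern is fixed once during preprocessing
# and reused unchanged in every epoch.
static_masked = mask_tokens(tokens, random.Random(0), mask_prob=0.3)
static_epochs = [static_masked for _ in range(3)]

# Dynamic masking: a fresh pattern is sampled each time the
# sequence is fed to the model.
rng = random.Random(0)
dynamic_epochs = [mask_tokens(tokens, rng, mask_prob=0.3) for _ in range(3)]

for epoch, (s, d) in enumerate(zip(static_epochs, dynamic_epochs)):
    print(epoch, "static :", " ".join(s))
    print(epoch, "dynamic:", " ".join(d))
```

With static masking the model sees the same hidden positions in every epoch; with dynamic masking the positions are resampled on each pass.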
References:
Paper: RoBERTa: A Robustly Optimized BERT Pretraining
Approach
RoBERTa: The Data Advantage
RoBERTa wasn't just trained differently - it was trained on more data. The
researchers used multiple datasets totalling 160GB of text, including:

● BOOKCORPUS and Wikipedia (16GB)


● CC-NEWS (76GB)
● OPENWEBTEXT (38GB)
● STORIES (31GB)

The Results
The improvements paid off dramatically. RoBERTa achieved state-of-the-art
results on several key benchmarks:

● GLUE benchmark for natural language understanding
● SQuAD v1.1 and v2.0 for question answering
● RACE for reading comprehension

References:
Paper: RoBERTa: A Robustly Optimized BERT Pretraining
Approach
RoBERTa

• What's particularly interesting is that RoBERTa achieved these results without any major architectural changes to BERT.

• This suggests that sometimes, the key to better performance isn't inventing new model architectures, but rather optimizing the training process of existing ones.

References:
Paper: RoBERTa: A Robustly Optimized BERT Pretraining
Approach

BART
BART: Bidirectional and Auto-Regressive Transformers

• BART represents a significant advancement in pre-trained language models, combining the best aspects of BERT and GPT into a versatile seq-to-seq architecture.

• BART uses a standard seq2seq/machine translation architecture with a bidirectional encoder (like BERT) and a left-to-right decoder (like GPT).

References:
Paper: BART Paper
BART: Bidirectional and Auto-Regressive Transformers

Architecture
• BART employs a standard Transformer-based sequence-to-sequence model
with a bidirectional encoder and an auto-regressive decoder.

The model consists of:

Encoder: A bidirectional encoder, similar to BERT, that processes corrupted input text.

Decoder: A left-to-right autoregressive decoder, similar to GPT, that generates output text.

References:
Paper: BART Paper
BART: Pre-training Approach
BART's pre-training process involves two key steps:
1. Text Corruption: Documents are corrupted using
various noise functions:
● Token masking: Random tokens are replaced with [MASK]
● Token deletion: Random tokens are removed
● Text infilling: Each selected text span is replaced with a single [MASK]
token
● Sentence permutation: Sentences are shuffled randomly
● Document rotation: The text is rotated to begin at a random
token
2. Reconstruction:
The model learns to reconstruct the original text from the
corrupted input.

References:
Paper: BART Paper
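The five noise functions can be sketched as small list transformations. This is an illustrative simplification, not BART's actual preprocessing code: the function names are hypothetical, the probabilities are arbitrary, and the span for text infilling and the pivot for rotation are passed explicitly (rather than sampled) so that the example is deterministic.

```python
import random

MASK = "[MASK]"


def token_masking(tokens, rng, prob=0.3):
    """Replace random tokens with [MASK]; length is unchanged."""
    return [MASK if rng.random() < prob else t for t in tokens]


def token_deletion(tokens, rng, prob=0.3):
    """Drop random tokens; the model must infer where they were."""
    return [t for t in tokens if rng.random() >= prob]


def text_infilling(tokens, start, length):
    """Replace the span tokens[start:start+length] with ONE [MASK]."""
    return tokens[:start] + [MASK] + tokens[start + length:]


def sentence_permutation(sentences, rng):
    """Shuffle the order of sentences in a document."""
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled


def document_rotation(tokens, pivot):
    """Rotate so that the document starts at tokens[pivot]."""
    return tokens[pivot:] + tokens[:pivot]


tokens = "A B C D E".split()
infilled = text_infilling(tokens, start=1, length=3)
rotated = document_rotation(tokens, pivot=2)
print(infilled)
print(rotated)
```

In each case the training target is the original, uncorrupted token list, which the decoder learns to reproduce.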
BART: Fine-tuning Capabilities
BART demonstrates remarkable versatility in fine-tuning for different tasks:
Sequence Classification
● Input is fed to both encoder and decoder
● Final decoder token state used for classification
Generation Tasks
● Direct fine-tuning for summarization and dialogue
● Autoregressive decoder enables natural text generation
Machine Translation
● The entire BART model can serve as a pre-trained decoder
● A new encoder is learned for the source language

References:
Paper: BART Paper

LLM Architecture
LLM Architecture
• Large Language Models (LLMs) represent a revolutionary advancement in artificial intelligence,
built upon transformers (sophisticated neural networks designed to process and understand
human language).

• The architecture begins with tokenization, where text is broken into smaller units that the model
can process. These tokens are then transformed into numerical vectors through the embedding
layer, capturing semantic, syntactic, and contextual information.

• At the heart of modern LLMs lies the transformer architecture, introduced in 2017.
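The tokenization and embedding steps can be illustrated with a minimal sketch. It assumes whitespace tokenization and a randomly initialized lookup table; production LLMs typically use subword tokenizers, and the embedding vectors are learned during training rather than left random. All variable names here are chosen for illustration.

```python
import random

text = "the model reads the text"

# 1. Tokenization: split into units and map each unit to an integer ID.
tokens = text.split()
vocab = {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}
token_ids = [vocab[tok] for tok in tokens]

# 2. Embedding layer: a lookup table with one vector per vocabulary entry.
rng = random.Random(0)
embedding_dim = 4
embedding_table = [
    [rng.uniform(-1, 1) for _ in range(embedding_dim)] for _ in vocab
]
embedded = [embedding_table[i] for i in token_ids]

print(token_ids)
print(len(embedded), "vectors of size", embedding_dim)
```

Both occurrences of "the" map to the same ID and therefore to the same vector; it is the later attention layers that make the representation context dependent.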

References:
Book: Build a Large Language Model (From Scratch)
LLM Architecture

LLM architecture employs several key components:

● Embedding Layer: Converts input tokens into vectors with semantic meaning
● Positional Encoding: Adds sequential information to token embeddings
● Self-Attention Mechanism: Weighs token importance across sequences
● Feedforward Neural Networks: Process information through dense layers
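Two of these components, positional encoding and self-attention, can be sketched in plain Python. The sketch makes several simplifying assumptions: it uses the sinusoidal encoding from the original Transformer (GPT-style models often learn positional embeddings instead), a single attention head, and no learned query/key/value projections, so queries, keys, and values are all the input vectors themselves.

```python
import math


def positional_encoding(position, dim):
    """Sinusoidal encoding: sin on even indices, cos on odd indices."""
    return [
        math.sin(position / 10000 ** (i / dim)) if i % 2 == 0
        else math.cos(position / 10000 ** ((i - 1) / dim))
        for i in range(dim)
    ]


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def self_attention(vectors):
    """Scaled dot-product attention with Q = K = V = vectors."""
    d = len(vectors[0])
    outputs, weights = [], []
    for q in vectors:
        scores = [dot(q, k) / math.sqrt(d) for k in vectors]
        w = softmax(scores)  # how much this token attends to each token
        weights.append(w)
        outputs.append(
            [sum(w_j * v[c] for w_j, v in zip(w, vectors)) for c in range(d)]
        )
    return outputs, weights


# Hypothetical token embeddings (tokens 0 and 2 are the same word).
embeddings = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]]
inputs = [
    [e + p for e, p in zip(emb, positional_encoding(pos, 4))]
    for pos, emb in enumerate(embeddings)
]
outputs, weights = self_attention(inputs)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1 and shows how strongly one token attends to every token in the sequence, including itself.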

References:
Book: Build a Large Language Model (From Scratch)
LLM Architecture: Processing Mechanism

The processing flow in LLMs follows a sophisticated sequence:


1. Input Processing: Text undergoes tokenization and embedding
2. Attention Computation: Multiple attention heads operate in
parallel
3. Layer Processing: Information passes through multiple
transformer layers
4. Output Generation: The model predicts next tokens based on
learned patterns
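The final step of this flow, turning the model's output scores into a predicted token, can be sketched as below. The vocabulary and the `logits` values are made up for illustration; in a real model the logits come from the last transformer layer, and sampling strategies other than the greedy choice shown here are common.

```python
import math

vocab = ["cat", "dog", "sat", "mat"]
logits = [1.0, 0.5, 3.0, 0.2]  # hypothetical scores from the final layer


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax(logits)
# Greedy decoding: pick the token with the highest probability.
next_token = vocab[probs.index(max(probs))]
print(next_token, [round(p, 3) for p in probs])
```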

References:
Book: Build a Large Language Model (From Scratch)
LLM Architecture

1. Encoder Network (Red Block)


● Input (X0, X1, X2):
○ The encoder takes a sequence of inputs (e.g., words, tokens, or
embeddings) labeled as X0, X1, X2.
○ These inputs are processed sequentially by the encoder network.
Each layer refines the embeddings by capturing more complex
patterns and contextual information.
● Functionality:
○ The encoder processes the input sequence and converts it into a
hidden state (a fixed-size representation of the input sequence).
○ This hidden state captures the meaning and context of the input
sequence.
References:
Book: Build a Large Language Model (From Scratch)
LLM Architecture

2. Hidden State (Green Block)

● The hidden state is the output of the encoder and serves as the intermediate representation of the input sequence.
● It acts as a bridge between the encoder and decoder, transferring the learned information from the input sequence to the decoder.

References:
Book: Build a Large Language Model (From Scratch)
LLM Architecture

3. Decoder Network (Blue Block)

● Output (Y0, Y1, Y2, Y3):


○ The decoder takes the hidden state from the encoder as input and
generates an output sequence (e.g., translated text, predicted tokens, etc.).
○ The outputs are labeled as Y0, Y1, Y2, Y3.
● Functionality:
○ The decoder generates the output sequence step by step,
often using the hidden state and previously generated
outputs as context.
○ For example, in machine translation, the decoder generates
the translated sentence one word at a time.
References:
Book: Build a Large Language Model (From Scratch)
LLM Architecture: Key Features of the Architecture
1. Sequential Processing:
○ Both the encoder and decoder process sequences step by step, making this
architecture suitable for tasks involving sequential data.

2. Contextual Representation:
○ The hidden state captures the context of the input sequence, enabling the
decoder to generate contextually relevant outputs.

3. Flexibility:
○ This architecture can handle variable-length input and output sequences,
making it versatile for many NLP tasks.
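The encoder, hidden state, and decoder roles described in these slides can be sketched with a toy recurrent model. This follows the recurrent-style description given here (a fixed-size hidden state passed from encoder to decoder); the update rules and constants such as 0.5 and 0.1 are hand-picked for illustration and are not learned weights, and the function names `encode` and `decode` are hypothetical.

```python
import math


def encode(inputs, hidden_size=3):
    """Fold a variable-length list of numbers into a fixed-size hidden state."""
    hidden = [0.0] * hidden_size
    for x in inputs:  # sequential processing, one element at a time
        hidden = [
            math.tanh(0.5 * h + 0.1 * (j + 1) * x) for j, h in enumerate(hidden)
        ]
    return hidden


def decode(hidden, steps):
    """Generate outputs step by step from the hidden state and the previous output."""
    outputs, prev = [], 0.0
    for _ in range(steps):
        hidden = [math.tanh(0.5 * h + 0.1 * prev) for h in hidden]
        prev = sum(hidden)  # the new output also feeds the next step
        outputs.append(prev)
    return outputs


short_state = encode([1.0, 2.0])
long_state = encode([1.0, 2.0, 3.0, 4.0, 5.0])
outputs = decode(long_state, steps=4)

print(len(short_state), len(long_state))  # same size for both input lengths
print([round(y, 3) for y in outputs])
```

Inputs of length 2 and 5 both produce a hidden state of the same size, and the decoder can emit any number of outputs from it, which is the flexibility noted above.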

References:
Book: Build a Large Language Model (From Scratch)
Book: Sinan Ozdemir, Quick Start Guide to Large Language Models: Strategies and Best Practices for Using ChatGPT and Other LLMs, Addison-Wesley Professional (2023)
How LLM Works
➢ Prompt-Response Flow:

When a question is entered, it becomes part of a prompt within ChatGPT, which interacts with a
pre-trained LLM such as GPT-4 to generate a response.

➢ Pre-training of LLMs

LLMs are pre-trained on vast amounts of data at significant computational cost. They
are neural networks whose weights and biases are adjusted during training to improve the model's predictions.

➢ Fine-tuning vs Pre-training

Pre-training from scratch is costly and impractical for most users. Fine-tuning existing LLMs with
your data is a more viable option.

References:
Book: Build a Large Language Model (From Scratch)
LLM Architecture
● The original Transformer (encoder-decoder) model was specifically designed for
tasks like language translation.

● In contrast, GPT models use a simpler decoder-only architecture and are primarily
trained for next-word prediction. Despite this, GPT models are surprisingly
capable of performing translation tasks.

● This ability to handle translation was unexpected because GPT was not explicitly
trained for it. Instead, this capability is an example of what researchers call
"emergent behavior".

References:
Book: Build a Large Language Model (From Scratch)
LLM Architecture
● Emergent behavior refers to skills or abilities that a model develops
naturally during training, even though they were not directly taught or
targeted.

• As an analogy, in "Avengers: Age of Ultron," the Avengers must confront Ultron, an AI they
created that unexpectedly evolves into a powerful adversary with its own agenda,
highlighting the unpredictable nature of advanced technology.
References:
Book: Build a Large Language Model (From Scratch)
LLM Architecture

● In the case of GPT, its exposure to massive amounts of multilingual data in various contexts allows it to "learn" translation patterns between languages. This means GPT can perform translation tasks without being explicitly trained for them.

● This emergent behavior highlights the power of large-scale generative language models. They can handle a wide range of tasks, like translation, without needing separate models for each task. This versatility is one of the key advantages of GPT models.

References:
Book: Build a Large Language Model (From Scratch)

DeepSeek-R1
DeepSeek-R1: Released on 20th Jan 2025
• DeepSeek R1 is an open-source AI model that stands out for its reasoning-centric design.

• While many LLMs excel at language understanding, DeepSeek R1 goes a step further by focusing on logical inference, mathematical problem-solving, and reflection capabilities, features that are often guarded behind closed-source APIs.

References:
DeepSeek R1: All you need to know
DeepSeek-R1: Released on 20th Jan 2025
•1. Domain-Specific Optimization:
• DeepSeek-R1: Designed with industry-specific fine-tuning out-of-the-
box, making it highly effective for specialized tasks (e.g., healthcare,
finance, legal).
• ChatGPT: A general-purpose model that requires additional fine-tuning
for domain-specific applications, increasing time and resource costs.
•2. Computational Efficiency:
• DeepSeek-R1: Uses a Mixture-of-Experts architecture that activates only a
fraction of its parameters per token (about 37B of 671B), reducing
inference cost; smaller distilled variants target resource-constrained environments.
• ChatGPT: Requires significant computational resources due to its large-scale
architecture (parameter counts are undisclosed; GPT-3 had 175B), making it
expensive to deploy and maintain.
References:
DeepSeek R1: All you need to know
DeepSeek-R1

•3. Ethical and Safety Features:


• DeepSeek-R1: Incorporates built-in ethical AI safeguards and compliance
features, ensuring safer and more reliable outputs for sensitive
applications.
• ChatGPT: While it includes some safety measures, it often requires post-
training adjustments to address biases and ethical concerns.

•4. Multimodal Capabilities:


• DeepSeek-R1: A text-only reasoning model; it does not natively accept
image or audio inputs.
• ChatGPT: Supports multimodal inputs such as text and images in its
GPT-4-class models.

References:
DeepSeek R1: All you need to know
DeepSeek-R1: Use Cases and Applications

References:
DeepSeek R1: All you need to know
