U1 Merged
Unit 1
(UE22CS342BA9)
NLP Basics and Word Embeddings
These slides were prepared from course material from universities in India and abroad,
along with reliable internet resources used throughout this course. Some slides are
adapted from an NLP course. The slides were compiled by Gen AI TA Sai Yashwanth and
Dr. Pooja Agarwal, with a few inputs from Dr. Arti Arya.
Generative AI and Its Applications
Course Content
Evaluation Policy
Introduction to GenAI
• Generative AI builds upon decades of research, enabling machines not just to analyze but to
create original content across multiple domains, from writing and coding to creating art and
music.
• At its core, Generative AI is about creation. Unlike traditional ML systems that excel at specific
tasks, generative models can produce entirely new content.
• These systems learn patterns in their training data and use this understanding to
generate original outputs. Examples: Stable Diffusion, GANs, Transformer models.
• Generative AI models are trained on massive datasets using techniques like unsupervised
learning or self-supervised learning. They learn to predict the next element in a sequence
(e.g., the next word in a sentence or the next pixel in an image). Over time, they develop a
deep understanding of the structure and nuances of the data, enabling them to generate
outputs that are not mere reproductions but entirely new creations.
Introduction to GenAI : Basic Model
Notable Examples:
• Stable Diffusion: A widely-used diffusion model for text-to-image generation.
• DALL·E 2: Combines diffusion models with transformers for text-to-image tasks.
References:
Book: Build a Large Language Model (From Scratch)
LLM Basic and Evolution
Simple timeline which led to LLMs as we see them today
These models are designed to run efficiently on smaller hardware or to be fine-tuned for
specific tasks while maintaining good performance.
Local LLMs and Open Source LLMs:
● Local LLMs run on personal hardware without cloud dependency
● Open-source models allow customization and transparency
● Growing ecosystem of community-driven development
Ollama is a popular open-source framework that simplifies running and managing
various LLMs locally on your machine
Some basic examples:
LLMs: GPT family, Gemini family, Claude family, and many more
Multimodal LLMs: llava, llama3.2-vision
Small LLMs: Llama 3.2 1B and 3B, and many more
References:
Sebastian Raschka's blog on multimodal LLMs
Ollama
How to build an LLM?
The process of creating an LLM involves two main stages: pretraining and finetuning.
1. Pretraining
● Definition:
○ Pretraining is the initial phase where the model is trained on a large, diverse corpus of
text data (referred to as "raw text").
○ The goal is to develop a broad understanding of language by predicting the next word in
a sequence (next-word prediction).
● Dataset:
○ The dataset used for pretraining is typically massive and diverse, containing billions
of tokens from books, articles, websites, and other text sources.
○ Filtering may be applied to remove irrelevant or low-quality data (e.g., formatting
characters, unknown languages).
● Output:
○ The result of pretraining is a base model or foundation model that has a general
understanding of language.
○ Example: GPT-3, which can perform text completion and has limited few-shot
learning capabilities.
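The next-word prediction objective described above can be illustrated with a toy sketch. This is not how an LLM is actually pretrained (real models use neural networks over billions of tokens); it is a count-based bigram model, and the tiny corpus and function names are assumptions made for illustration only.

```python
from collections import Counter, defaultdict


def train_bigram_model(raw_text):
    """Count, for every word, which words follow it in the raw text."""
    words = raw_text.lower().split()
    follower_counts = defaultdict(Counter)
    for current_word, next_word in zip(words, words[1:]):
        follower_counts[current_word][next_word] += 1
    return follower_counts


def predict_next_word(model, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    followers = model.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical "raw text" corpus, far smaller than any real pretraining set.
raw_text = "the cat sat on the mat . the cat ate the fish ."
model = train_bigram_model(raw_text)

print(predict_next_word(model, "the"))  # the word seen most often after "the"
print(predict_next_word(model, "sat"))
```

The point of the sketch is only that no labels are needed: the "label" for each position is simply the next word in the raw text.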
2. Finetuning
● Definition:
○ Finetuning is the process of refining the pretrained model on a smaller, labeled
dataset that is specific to a particular task or domain.
○ This step adapts the general-purpose model to perform well on specific tasks.
● Types of Finetuning:
○ Instruction-Finetuning:
■ The labeled dataset consists of instruction-answer pairs.
■ Example: Training the model to translate text by providing queries and
their corresponding translations.
○ Classification Finetuning:
■ The labeled dataset consists of text and associated class labels.
■ Example: Training the model to classify emails as spam or non-spam.
● Output:
○ A specialized LLM that is optimized for a specific task or domain.
● Pretraining:
○ Provides the model with a broad understanding of language, enabling it to
generalize across tasks.
● Finetuning:
○ Adapts the model to specific tasks or domains, improving its performance in
those areas.
▪ Consider the sentence:
▪ They picnicked by the pool, then lay back on the grass
and looked at the stars.
▪ It has 18 tokens (counting punctuation).
▪ But we might also note that “the” is used 3 times, so there
are only 16 unique types (as opposed to tokens).
▪ Going forward, we’ll have occasion to focus on
counting both types and tokens of both words and N-grams.
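The token and type counts above can be checked with a short sketch. The regular expression used for splitting is an assumption (it treats every punctuation mark as its own token), not a tokenizer prescribed by the slides.

```python
import re


def tokenize(text):
    """Split text into word tokens and single punctuation-mark tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


sentence = ("They picnicked by the pool, then lay back on the grass "
            "and looked at the stars.")

tokens = tokenize(sentence)   # every occurrence counts
types = set(tokens)           # each distinct form counts once

print(len(tokens), "tokens")
print(len(types), "types")
```

Note that "They" and "the" stay distinct here because the sketch does not lowercase the text.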
Tokenization
• Tokenization is the process of breaking up the sequence of characters in a text by
locating the word boundaries, the points where one word ends and another begins.
Ambiguities
• Lexical
• Syntactic
• Semantic
• Anaphoric
Lexical Ambiguity
• Related to words.
• This type of ambiguity arises when a word can have multiple meanings or
parts of speech.
• For instance, in English, the word “back” can be a noun (backstage),
an adjective (back door), or an adverb (back away).
• Eg. “I heard his cell phone ring in my office.” Here “ring” can be read as a
verb or as a noun.
• The phrase "in my office" can also be read in another way, as modifying the verb.
(Here, "in my office" describes where you were when you heard the cell phone ring.)
Syntactic Ambiguity
• Another Example:
• “Mary ate a salad with spinach from California for lunch
on Tuesday.”
• Different meanings
• Mary ate a salad that contained spinach, and the spinach was
sourced from California. She had this salad for lunch on Tuesday.
• Mary ate a salad, and she also had spinach from California as a
side dish. She had this meal for lunch on Tuesday.
• "with spinach" can attach to "salad" or "ate",
• "from California" can attach to "spinach", "salad", or "ate",
• "for lunch" can attach to "California", "spinach", "salad", or
"ate",
• and "on Tuesday" can attach to "lunch", "California", "spinach",
"salad" or "ate".
• In total, there are 42 possible different parse trees
for this sentence.
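One way to see where the figure of 42 may come from: for a verb followed by an object noun phrase and n prepositional phrases, the number of attachment structures is commonly reported to follow the Catalan numbers, with Catalan(n + 1) parses for n prepositional phrases. The sketch below assumes that correspondence; the function names are illustrative.

```python
from math import comb


def catalan(n):
    """n-th Catalan number: C(2n, n) / (n + 1)."""
    return comb(2 * n, n) // (n + 1)


def pp_attachment_parses(num_prepositional_phrases):
    """Assumed count of parses for 'V NP PP PP ...' with the given number of PPs."""
    return catalan(num_prepositional_phrases + 1)


# "with spinach", "from California", "for lunch", "on Tuesday" -> 4 PPs
for n in range(1, 5):
    print(n, "PPs ->", pp_attachment_parses(n), "parses")
```

The growth is rapid, which is why prepositional-phrase attachment is a classic source of syntactic ambiguity.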
Semantic Ambiguity
• Related to the interpretation of a sentence, i.e., how you interpret the meaning
of the entire sentence.
• Eg.,
• I heard his cell phone ring in my office can be interpreted as if “I was
physically present in the office” or as if “the cell phone was in the
office”.
• Lucy owns a parrot (existentially quantified) that is larger than a
cat (either universally quantified or means "typical cats")
Another Example
• "The dog is chasing the cat." vs. "The dog has been domesticated for
10,000 years."
• In the first sentence, "The dog" refers to a particular dog;
• In the second, it refers to the species "dog".
Anaphoric Ambiguity
• A phrase or word refers to something previously mentioned, but
there is more than one possible referent, so the machine cannot tell
which one is meant.
• Eg.
• "Margaret invited Susan for a visit, and she gave her a good lunch."
(she = Margaret; her = Susan)
• "Margaret invited Susan for a visit, but she told her she had to go to
work" (she = Susan; her = Margaret.)
• "On the train to Boston, George chatted with another passenger. The
man turned out to be a professional hockey player."
• (The man = another passenger).
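Metonymy Ambiguity
• Metonymy: a phrase is used to stand for something closely associated with it.
• Eg., “Samsung is screaming for new management.” (Here “Samsung” stands for
people associated with the company, not the company as a literal speaker.)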
Word Embeddings
One-Hot Encodings
• Word vectors are vectors of weights: each word is described by its values along
n dimensions.
• In a simple 1-of-N (or one-hot) encoding, every element in the vector
is associated with a word in the vocabulary.
• Each word in the vocabulary is represented as a sparse vector with a single "1"
at its index in the vocabulary and "0"s elsewhere.
• To train embeddings, words are often converted to one-hot encodings, which
are then passed through an embedding layer in the neural network to learn
dense vector representations.
• With this encoding, computing semantic similarity is difficult:
Hotel [0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
Motel [0 0 0 0 0 0 0 0 1 0 0 0 0 0 0]
• Sim(hotel, motel) = 0
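A minimal sketch of why one-hot vectors cannot express similarity: any two different words have a dot product of zero. The vocabulary below is hypothetical and chosen only so that "hotel" and "motel" land at the same positions as in the vectors above.

```python
def one_hot(word, vocabulary):
    """Sparse vector with a single 1 at the word's index in the vocabulary."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(word)] = 1
    return vector


def dot(u, v):
    """Dot product, used here as the similarity score."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical vocabulary; only the positions of "hotel" and "motel" matter.
vocabulary = ["a", "the", "city", "room", "hotel", "book", "stay", "near", "motel"]

hotel = one_hot("hotel", vocabulary)
motel = one_hot("motel", vocabulary)

print(hotel)
print(motel)
print("Sim(hotel, motel) =", dot(hotel, motel))
print("Sim(hotel, hotel) =", dot(hotel, hotel))
```

Every pair of distinct words scores 0, so the encoding says nothing about which words are related; this is the motivation for dense embeddings.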
Limitations of One-Hot Encodings
Distributional Representation of words
• Dense Vectors are good at capturing synonyms and semantic similarity.
• Word embeddings (word vectors) are d-dimensional, where d may vary from 50 to 1000.
These d dimensions do not have a very clear interpretation.
• For illustration, consider 5 dimensions such as Royalty, Masculine, Feminine, Age and
Height, and describe some words based on them:
• King, Queen, Man
Word Embeddings
Vector Representation
● Words are represented as dense vectors in a continuous space
(Basically array of numbers)
● Similar words cluster together in this space.
● Easy for neural networks to perform operations.
● Example: "king" - "man" + "woman" ≈ "queen"
● Dimensions (features) can range from 2 to thousands
○ GPT-2 (small): 768 dimensions. Each token (word or subword) in the
input text is represented as a vector of size 768 in the model's
internal representation; this is the size of the embedding space
in the transformer model.
○ GPT-3 (large): 12,288 dimensions
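The "king" - "man" + "woman" ≈ "queen" example can be sketched with hand-made vectors. The four dimensions (royalty, masculine, feminine, age) and all numeric values below are invented for illustration; learned embeddings do not have such clean, nameable dimensions.

```python
import math

# Hypothetical 4-dimensional embeddings: (royalty, masculine, feminine, age).
embeddings = {
    "king":     [0.95, 0.90, 0.05, 0.70],
    "queen":    [0.95, 0.05, 0.90, 0.70],
    "man":      [0.05, 0.90, 0.05, 0.50],
    "woman":    [0.05, 0.05, 0.90, 0.50],
    "boy":      [0.05, 0.90, 0.05, 0.10],
    "princess": [0.90, 0.05, 0.90, 0.20],
}


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in
              zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = [w for w in embeddings if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(target, embeddings[w]))


print(analogy("king", "man", "woman"))
```

Similar words also end up close together under cosine similarity, which is the clustering behaviour described above.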
• Definition
• Real-valued, sub-symbolic representations of words as dense numeric vectors.
• Distributed representation of word meanings (not based on word-frequency
counts)
• Usually learned with neural networks.
• Specific dimensions of the resulting vectors cannot be directly mapped to
symbolic representation.
• Models that seek to predict between a center word and context words
(predict models)
• Key elements of deep learning models.
Neural Word embeddings
Word2vec
• Word2vec is a machine learning algorithm for generating dense, distributed vector
representations (embeddings) of words, such that words with similar meanings
are positioned close to one another in the vector space. It was introduced by Google in 2013.
• Its two architectures, CBOW and skip-gram, are not deep neural networks; they are
shallow neural networks which map word(s) to a target variable which is also a word(s).
• Both of these techniques learn weights which act as word vector representations.
▪ In contrast to language models (which predict the next word),
embedding models consider both the history (previous words) and the
future (following words) of a center word.
• Models like BERT and GPT generate different embeddings for the same word
depending on its context within a sentence.
• This allows them to handle polysemy (words with multiple meanings) more
effectively.
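The idea of a center word with context on both sides can be made concrete by generating (center, context) training pairs, as skip-gram style models do. This is a sketch under assumptions: the window size and the example sentence are arbitrary, and no network is trained here.

```python
def context_pairs(tokens, window=2):
    """Pair every center word with the words up to `window` positions before and after it."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # skip the center word itself
                pairs.append((center, tokens[j]))
    return pairs


tokens = "the quick brown fox jumps".split()
pairs = context_pairs(tokens, window=1)

for center, context in pairs:
    print(center, "->", context)
```

For the center word "brown" with a window of 1, both the previous word ("quick") and the following word ("fox") appear as context, unlike a left-to-right language model that only sees the history.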
Parts of Speech Tagging (PoS)
POS (parts of speech)
▪ Parts of speech (POS) are useful because they reveal a lot about a
word and its neighbors.
▪ Knowing a word's POS (noun, verb, ...) tells us about
▪ likely neighboring words (nouns are preceded by determiners and adjectives,
verbs by nouns) and
▪ syntactic structure (nouns are generally part of noun phrases), making POS
tagging a key aspect of parsing.
▪ Parts of speech are useful features
▪ for labeling named entities (NER) like people or organizations in information
extraction,
▪ for coreference resolution (the task of finding all expressions that refer to the
same entity in a text), and
▪ for sentiment analysis, question answering, and word sense disambiguation.
• A corpus is a large and structured set of machine-readable texts that have been
produced in a natural communicative setting.
• Its plural is corpora.
• They can be derived in different ways, such as text that was originally electronic,
transcripts of spoken language, and optical character recognition.
• Corpora are mainly used for statistical linguistic analysis and hypothesis
testing.
• Eg. the Brown Corpus, also known as the Brown University Standard
Corpus of Present-Day American English, is a collection of text samples from a wide
range of sources, compiled in the 1960s. It was one of the first major text corpora
created for linguistic research and has been widely used in the field of computational
linguistics.
• Another example is the British National Corpus (100 million words representing British English).
Different Corpora and Tree Banks
TreeBanks
• A treebank is a linguistically parsed text corpus: a large collection of sentences
annotated with syntactic or semantic structure.
• Geoffrey Leech coined the term ‘treebank’, reflecting the fact that the most common way
of representing grammatical analysis is a tree structure.
• Generally, treebanks are created on top of a corpus that has already been
annotated with part-of-speech tags.
POS (parts of speech)
▪ What are POS tags?
POS tags are also known as word classes, morphological classes, or lexical tags.
▪ The number of tags differs across systems, corpora, and languages:
▪ Penn Treebank (Wall Street Journal Newswire): 45 tags
▪ Brown corpus (Mixed genres like fiction, biographies, etc): 87 tags
▪ Lancaster UCREL C5: 61 tags
▪ Lancaster C7: 145 tags
Why is POS tagging hard?
• Ambiguity
1. “Plants/N need light and water.”
“Each one plant/V one.”
2. “Flies like a flower”
▪ Flies: noun or verb?
▪ like: preposition, adverb, conjunction, noun, or verb?
▪ a: article, noun, or preposition?
▪ flower: noun or verb?
3 approaches for POS tagging
1. Rule-based tagging
• The ENGTWOL tagger (Voutilainen, 1995) is a rule-based tagger
based on a two-stage architecture.
2. Stochastic tagging
3. Transformation-based tagging
• Brill tagger
Rule-based tagging (two stage - architecture)
▪ Stage 1:
▪ Start with a dictionary of Tagsets
▪ Assign all possible tags to words from the dictionary exploiting
morphological/orthographic rules.
▪ Stage 2:
▪ Write rules by hand to selectively remove tags.
▪ Disambiguation is done by analyzing the linguistic features of the word,
its preceding word, its following word and other aspects.
▪ For example, if the preceding word is an article, then the word in
question must be a noun. This information is coded in the form of rules.
▪ This leaves the correct tag for each word.
Example of rules: NP → Det (Adj*) N
For example: the clever student
Example 1:
Rule based: Start with a dictionary
• she: PRP (personal pronoun)
• promised: VBN (past participle), VBD (past tense verb)
• to: TO
• back: VB, JJ, RB (adverb), NN
• the: DT
• bill: NN, VB
Example 1 (continued): Write rules to eliminate tags
R1: Eliminate VBN if VBD is an option when VBN|VBD follows “<start> PRP”
Candidate tags after applying R1:
She/PRP promised/VBD to/TO back/{VB, JJ, RB, NN} the/DT bill/{NN, VB}
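The two stages of the rule-based example can be sketched as follows. Stage 1 looks up every candidate tag in the dictionary from the example; stage 2 applies only rule R1. The function names are hypothetical, and a real tagger such as ENGTWOL uses a far larger lexicon and rule set.

```python
# Stage 1 dictionary, copied from the example above.
lexicon = {
    "she": ["PRP"],
    "promised": ["VBN", "VBD"],
    "to": ["TO"],
    "back": ["VB", "JJ", "RB", "NN"],
    "the": ["DT"],
    "bill": ["NN", "VB"],
}


def assign_candidates(words):
    """Stage 1: give each word every tag the dictionary allows."""
    return [list(lexicon[word.lower()]) for word in words]


def apply_r1(candidates):
    """Stage 2, rule R1: after a sentence-initial PRP, drop VBN when VBD is also possible."""
    result = [list(tags) for tags in candidates]
    if len(result) >= 2 and result[0] == ["PRP"]:
        if "VBN" in result[1] and "VBD" in result[1]:
            result[1] = [tag for tag in result[1] if tag != "VBN"]
    return result


words = "She promised to back the bill".split()
after_r1 = apply_r1(assign_candidates(words))

for word, tags in zip(words, after_r1):
    print(word, tags)
```

After R1, "back" and "bill" are still ambiguous; further hand-written rules would be needed to remove their remaining tags.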
2. Stochastic tagging
The tag encountered most frequently with the word in the training set is
the one assigned to an instance of that word.
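The most-frequent-tag idea above can be sketched in a few lines. The tiny tagged training set is invented for illustration, and defaulting unknown words to NN is an assumption of this sketch, not something stated in the slides.

```python
from collections import Counter, defaultdict


def train_most_frequent_tag(tagged_words):
    """For each word, remember the tag it carries most often in the training set."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word.lower()][tag] += 1
    return {word: tag_counts.most_common(1)[0][0]
            for word, tag_counts in counts.items()}


def tag_sentence(words, model, default_tag="NN"):
    """Assign each word its most frequent training tag (default_tag if unseen)."""
    return [(word, model.get(word.lower(), default_tag)) for word in words]


# Hypothetical training data: "race" is seen twice as NN and once as VB.
training = [
    ("the", "DT"), ("race", "NN"), ("is", "VBZ"), ("over", "RB"),
    ("a", "DT"), ("race", "NN"),
    ("to", "TO"), ("race", "VB"),
]

model = train_most_frequent_tag(training)
print(tag_sentence("to race".split(), model))
```

Because context is ignored, "race" is tagged NN even after "to"; this is the kind of error that contextual rules or sequence models are meant to correct.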
3. Transformation-based tagging
• Basic idea:
• First use frequency, then revise the result using contextual rules.
• The Brill tagger was introduced by Eric Brill in 1993.
It can be summarized as an "error-driven transformation-based tagger".
• Example:
• It is expected to race tomorrow. The race for outer space.
• Tagging algorithm:
1. Tag all uses of “race” as NN (most likely tag in the Brown
corpus having approx. one million words.)
• It is expected to race/NN tomorrow
• the race/NN for outer space
2. Use a transformation rule to replace the tag NN with VB for
all uses of “race” preceded by the tag TO:
• It is expected to race/VB tomorrow
• the race/NN for outer space
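The two steps of the "race" example can be sketched as code. The most-likely-tag dictionary is hypothetical apart from race/NN, which follows the example, and the rule here is written by hand, whereas the Brill tagger learns its transformation rules from tagging errors.

```python
# Hypothetical most-likely tags; only race -> NN is taken from the example above.
most_likely_tag = {
    "it": "PRP", "is": "VBZ", "expected": "VBN", "to": "TO", "race": "NN",
    "tomorrow": "NN", "the": "DT", "for": "IN", "outer": "JJ", "space": "NN",
}


def initial_tagging(words):
    """Step 1: tag every word with its most likely tag."""
    return [(word, most_likely_tag[word.lower()]) for word in words]


def apply_transformation(tagged, from_tag, to_tag, previous_tag):
    """Step 2: change from_tag to to_tag whenever the preceding tag is previous_tag."""
    result = list(tagged)
    for i in range(1, len(result)):
        word, tag = result[i]
        if tag == from_tag and result[i - 1][1] == previous_tag:
            result[i] = (word, to_tag)
    return result


first = initial_tagging("It is expected to race tomorrow".split())
second = initial_tagging("the race for outer space".split())

# Transformation rule: NN -> VB when the previous tag is TO.
first = apply_transformation(first, "NN", "VB", "TO")
second = apply_transformation(second, "NN", "VB", "TO")

print(first)
print(second)
```

Only the occurrence of "race" that follows "to" changes to VB; the noun use in the second sentence keeps its initial NN tag.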
POS Tag using HMM
Named Entity Recognition(NER)
Information Extraction (IE)
Classic Tasks of IE
➢NER
➢Co-reference Resolution
➢Relation Extraction
➢Event Extraction
Applications?
1. An understanding of the Named Entities involved in a document provides
much richer analytical frameworks and cross-referencing.
2. NER is extensively used in QnA systems, document clustering and text
analytics applications.
3. In Sentiment analysis/ opinion mining, one might want to know a
consumer’s sentiment/opinion toward a particular entity.
4. Named entity tagging is also central to Natural Language Understanding
tasks of building semantic representations, like extracting events and the
relationship between participants.
5. Automation of customer support: Automatically tagged locations and
product names can help smoothly route customer queries to the right location
and people in a company with multiple branches and many employees.
What is Named Entity and Named Entity Recognition
• A named entity is anything that can be referred to with a
proper name: a person, a location, an organization.
Citing high fuel prices, [ORG United Airlines] said [TIME Friday] it has increased fares by [MONEY
$6] per round trip on flights to some cities also served by lower-cost carriers. [ORG American
Airlines], a unit of [ORG AMR Corp.], immediately matched the move, spokesman [PER Tim
Wagner] said. [ORG United], a unit of [ORG UAL Corp.], said the increase took effect [TIME
Thursday] and applies to most routes where it competes against discount carriers, such as
[LOC Chicago] to [LOC Dallas] and [LOC Denver] to [LOC San Francisco].
NER
• In NER, one has to find and label spans of text, which is difficult
because of the ambiguity of segmentation: we need to decide
what’s an entity and what isn’t, and where the boundaries are.
Some problems in identifying NE
• Variation of NE (same entity in different forms).
• Mahatma Gandhi, Gandhi, Bapu
• Ambiguity of NE types:
• 1945 (date vs. time)
• Washington (location vs. person)
• May (person vs. month)
• Tata (person vs. organization)
• Person vs Location
• Sir C. P. Ramaswamy was the Divan of Travancore (Per)
• Sir C. P. Ramaswamy Road is in Chennai (Loc)
• Person vs Organization
• Anil Ambani opened Reliance Fresh (Per)
• Reliance Fresh is under Anil Ambani Group Ltd (Org)
Tagset for Named Entity
• CLIA tagset
• The CLIA tagset is hierarchical, similar to ACE
• Developed for two domains: Tourism and Health
Named entity types
TIMEX
How to Annotate
• 1.1 Person
• 1.1.1 Individual
• These refer to names of each individual person,
• Tag Structure:
<ENAMEX TYPE= “PERSON” SUBTYPE_1= “INDIVIDUAL”> abc </ENAMEX>
Examples:
English:
<ENAMEX TYPE= “PERSON” SUBTYPE_1= “INDIVIDUAL”>Abdul
Kalam</ENAMEX>
Annotation continued
Family Name
• In general we find that a person name consists of a family name.
Whenever an instance of individual name occurs with family name, then
that part of the name, which refers to family name, must be tagged
specifically with subtag “FAMILYNAME” as shown below.
Tag Structure:
<ENAMEX TYPE= “PERSON” SUBTYPE_1= “INDIVIDUAL” >abc
<ENAMEX TYPE= “PERSON” SUBTYPE_2= “FAMILYNAME”> abc
</ENAMEX>
Examples:
English:
<ENAMEX TYPE=”PERSON” SUBTYPE_1=”INDIVIDUAL”> Lalu
Prasad<ENAMEX TYPE= “PERSON”
SUBTYPE_2=“FAMILYNAME”>Yadav</ENAMEX>
TAGSET
• ENAMEX
• Person
• Individual
• Family name
• Title
• Group
• Organization
• Government
• Public/private company
• Religious
• Non-government
• Political Party
• Para military
• Charitable
• Association
• GPE (Geo-political Social Entity)
• Media
• Location
• Place
• District
• City
• State
• Nation
• Continent
• Address
• Water-bodies
• Landscapes
• Celestial Bodies
• Manmade
• Religious Places
• Roads/Highways
• Museum
• Theme parks/Parks/Gardens
• Monuments
• Facilities
• Hospitals
• Institutes
• Library
• Hotel/Restaurants/Lodges
• Plant/Factories
• Police Station/Fire Services
• Public Comfort Stations
• Airports
• Ports
• Bus-Stations
• Locomotives
• Artifacts
• Implements
• Ammunition
• Paintings
• Sculptures
• Cloths
• Gems & Stones
• Entertainment
• Dance
• Music
• Drama/Cinema
• Sports
• Events/Exhibitions/Conferences
• Cuisines
• Animals
• Plants
Entity name types (ENAMEX Subtypes)
1. Persons are entities limited to humans. Individual refers to the name of
each individual person. Group refers to a set of individuals.
2. Location entities are limited to geographical entities such as
geographical areas like names of countries, cities, continents and
landmasses, bodies of water, and geological formations.
3. Organization entities are limited to corporations, agencies, and other
groups of people defined by an established organizational structure
4. Entertainment entities denote activities, which divert and hold
human attention or interest, giving pleasure, happiness, amusement
especially performance of some kind such as dance, music, sports,
events.
En: [Robin]PERSON is working at [HCL]ORGANIZATION , which is in
[Chennai] LOCATION
En: [Flower Exhibition] ENTERTAINMENT is held at [Hyderabad]LOCATION
Temporal Expressions
➢ Temporal expressions are entities that refer to time, date, year, month and day.
➢ Time: expressions of time in its different forms, including hours, minutes
and seconds.
➢ 5 o'clock in the morning
➢ 9.30 a.m.
➢ Evening 6.30 p.m.
➢ Date: expressions of date in different forms, such as 13/12/2001, including
month, date and year.
➢ August 15 1947
➢ 1956
➢ September 11
Rule Based NER
• A rule-based system consists of :
• Collection of rules
• A set of policies to control firing of multiple rules (The rule
getting triggered at one particular time)
• Create regular expressions to extract:
• Telephone number
• E-mail
• Capitalized names etc.
• Blocks of digits separated by hyphens
RegEx = (\d+\-)+\d+
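The regular expression above can be tried out with Python's `re` module. The sample text and the phone-like numbers in it are made up for illustration.

```python
import re

# Blocks of digits separated by hyphens, as in the rule above.
digit_blocks = re.compile(r"(\d+\-)+\d+")

# Hypothetical input text.
text = "Call 080-2672-1983 or 555-0100 before 2024."

# finditer is used so that the whole match is returned, not just the last group.
matches = [match.group(0) for match in digit_blocks.finditer(text)]
print(matches)
```

A plain number such as "2024" is not matched because the pattern requires at least one hyphen-terminated block of digits before the final block.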
Rule Based Approach pros and cons
• Advantages:
• Rich and expressive rules
• Good results
• Disadvantages:
• Requires huge experience and grammatical knowledge
• Experts to craft rules are expensive
• Highly domain specific ( not portable to a new domain)
Results for NE Detection ( Evaluation of NER)
Named entity recognizers are evaluated by recall, precision, and F1 measure
Recall is the ratio of the number of correctly labeled responses to the total
that should have been labeled;
precision is the ratio of the number of correctly labeled responses to the
total labeled;
and F-measure is the harmonic mean of the two.
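The three measures defined above can be computed as in the following sketch. It assumes exact-match scoring, where a predicted entity counts as correct only if both its span and its label match the gold annotation; the gold and predicted sets are invented for illustration.

```python
def evaluate(gold_entities, predicted_entities):
    """Return (precision, recall, f1) for exact-match entity scoring."""
    gold = set(gold_entities)
    predicted = set(predicted_entities)
    correct = len(gold & predicted)

    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Hypothetical gold annotations and system output as (text span, label) pairs.
gold = [("United Airlines", "ORG"), ("Friday", "TIME"),
        ("Tim Wagner", "PER"), ("Chicago", "LOC")]
predicted = [("United Airlines", "ORG"), ("Friday", "TIME"), ("Tim", "PER")]

precision, recall, f1 = evaluate(gold, predicted)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

The partial span "Tim" counts as an error under this scoring, which illustrates why boundary detection matters as much as labeling.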
Typical NER systems
1. The typical architecture for an information extraction system
begins by segmenting, tokenizing, and part-of-speech tagging the text.
Naive Bayes Classifier
• The Bayesian approach to classifying a new instance is to assign the most probable
target value, $v_{MAP}$, given the attribute values $(a_1, \ldots, a_n)$ that describe the
instance:
$$v_{MAP} = \arg\max_{v_j \in V} P(v_j \mid a_1, a_2, \ldots, a_n) = \arg\max_{v_j \in V} P(a_1, a_2, \ldots, a_n \mid v_j)\, P(v_j) \qquad (1)$$
• The Naive Bayes classifier is based on the simplifying assumption that the attribute values
(i.e. $a_1, a_2, a_3, \ldots, a_n$) are conditionally independent given the target value,
• i.e. the assumption is that, given the target value of the instance, the probability of
observing the conjunction $a_1, a_2, a_3, \ldots, a_n$ is just the product of the probabilities for
the individual attributes:
$$v_{MAP} = \arg\max_{v_j \in V} P(v_j) \prod_{i=1}^{n} P(a_i \mid v_j)$$
Example: Text Classification
➢ Consider the given data set; the task is to classify the sentence
“A very close game” as sports or not sports.
Sentence | Class
A great game | sports
The election is over | not sports
very clean match | sports
a clean but forgettable game | sports
it was a close election | not sports
➢ In this data set we do not have numbers, only text.
➢ We need to convert all this text into numbers that we can use for
calculation. How?
➢ One solution is to use the frequency of words.
➢ Ignore word order and sentence construction.
➢ Treat every document as the set of words it contains.
➢ The features used in this case are the counts of words (word
frequencies).
➢ It’s a simplistic approach, but works surprisingly well.
➢ Since our classifier is just trying to find out which category
has the bigger probability, we can discard the divisor.
➢ It is the same for both categories.
➢ We can compare
➢ P(A very close game | sports) × P(sports)
➢ with
➢ P(A very close game | not sports) × P(not sports)
➢ The probabilities can be calculated:
1. Count how many times the sentence 'A very close game' appears in the sports category.
2. Divide by the total.
3. Obtain P(a very close game | sports).
➢ PROBLEM: we do not have this sentence in the training set
=> the probability is zero.
➢ Unless every sentence we want to classify appears in the training set,
the model won’t be able to classify it.
➢ We assume that every word in a sentence is independent of the
other ones.
➢ We no longer look at entire sentences, only at individual words.
CALCULATING PROBABILITIES
➢ The final step is just to calculate every probability and see which one
turns out to be larger.
➢ First: calculate the a priori probability of each category from the sentences
given in the training set:
P(sports) = 3/5 = 0.6
P(not sports) = 2/5 = 0.4
➢ Calculate P(game | sports): count the number of times the word “game”
appears in the sports samples, divided by the total number of words in sports,
i.e. it appears twice in 11 words.
P(game | sports) = 2/11 ≈ 0.1818
➢ A problem again:
➢ the word “close” does not appear in any sports sentence, which would lead to 0
when multiplied with the other probabilities.
To resolve this we use Laplace smoothing:
➢ Add 1 to every count so it is never zero
➢ To balance this, add the number of possible words (the vocabulary size) to the divisor
➢ In our case the possible words are:
{a, great, game, the, election, is, over, ......., election} = 15
Applying smoothing we get:
WORD    P(word | sports)        P(word | not sports)
a       (2+1)/(11+15) = 3/26    (1+1)/(9+15) = 2/24
Now find whether the sentence belongs to the sports or the not sports class using naïve Bayes:
P(a | sports) x P(very | sports) x P(close | sports) x P(game | sports) x P(sports) =
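The calculation above can be sketched in a few lines of Python. This is a minimal illustration, not a production classifier: the training sentences are the five from the slides, while the function names (`train`, `score`, `classify`) are hypothetical helpers introduced here.

```python
from collections import Counter

# Training sentences and labels copied from the slides (lower-cased).
TRAIN = [
    ("a great game", "sports"),
    ("the election is over", "not sports"),
    ("very clean match", "sports"),
    ("a clean but forgettable game", "sports"),
    ("it was a close election", "not sports"),
]

def train(data):
    """Count words per class, sentences per class, and the vocabulary."""
    word_counts = {}        # class -> Counter of word frequencies
    doc_counts = Counter()  # class -> number of training sentences
    vocab = set()
    for sentence, label in data:
        words = sentence.lower().split()
        word_counts.setdefault(label, Counter()).update(words)
        doc_counts[label] += 1
        vocab.update(words)
    return word_counts, doc_counts, vocab

def score(sentence, label, word_counts, doc_counts, vocab):
    """P(label) times the product of Laplace-smoothed P(word | label)."""
    total_words = sum(word_counts[label].values())
    result = doc_counts[label] / sum(doc_counts.values())  # prior
    for word in sentence.lower().split():
        # Add 1 to the count and the vocabulary size to the divisor.
        result *= (word_counts[label][word] + 1) / (total_words + len(vocab))
    return result

def classify(sentence, model):
    _, doc_counts, _ = model
    return max(doc_counts, key=lambda label: score(sentence, label, *model))

model = train(TRAIN)
scores = {label: score("a very close game", label, *model) for label in model[1]}
prediction = classify("a very close game", model)
print(scores, prediction)
```

With these counts the sports score is larger than the not sports score, so the sentence is assigned to the sports class.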
As you increase the number of grams (n), the context becomes more relevant but the complexity increases.
➢ TF-IDF
Term frequency–inverse document frequency (TF-IDF) is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus.
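As a rough sketch of how such a statistic can be computed: the slides do not fix a formula, so this example assumes one common variant, tf = (count of word in document) / (document length) and idf = log(N / number of documents containing the word). The function name `tf_idf` and the sample documents are illustrative only.

```python
import math

def tf_idf(docs):
    """Return one {word: weight} dict per document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word occur?
    df = {}
    for words in tokenized:
        for word in set(words):
            df[word] = df.get(word, 0) + 1
    weights = []
    for words in tokenized:
        row = {}
        for word in set(words):
            tf = words.count(word) / len(words)
            row[word] = tf * math.log(n_docs / df[word])
        weights.append(row)
    return weights

docs = ["a great game", "the election is over", "a clean but forgettable game"]
weights = tf_idf(docs)
print(weights[0])
```

In the first document, "great" occurs in only one document and therefore gets a higher weight than "a" or "game", which occur in two.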
Advanced Techniques
We have covered:
LLM Basics and evolution
NLP: Word Embeddings
POS and NER
Text Classification
We will now cover:
ELMo, Transformer Anatomy, GPT
BERT, RoBERTa, BART architectures
LLM Architecture
Generative AI and its Applications
(UE22CS342BA9)
ELMo
References:
Generalized Language Models | Lil'Log
ELMo: Architecture and Working
Bidirectional Processing
❑ The model begins with character-level tokenization,
converting words into character embeddings before
processing them through:
➢ A forward pass capturing preceding context.
➢ A backward pass capturing subsequent context.
➢ Two LSTM layers with residual connections
ELMo: Architecture and Working
Bidirectional Processing
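• Given a sequence of N tokens $(t_1, t_2, \dots, t_N)$, a forward language model (LM) computes the probability of the sequence by modeling the probability of token $t_k$ given the history $(t_1, \dots, t_{k-1})$:
$p(t_1, t_2, \dots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_1, t_2, \dots, t_{k-1})$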
Context-Aware Understanding
Dynamic Word Representations:
Unlike traditional word embedding methods like Word2Vec or GloVe, ELMo
generates different representations for the same word based on its context. For
example, the word "trust" receives distinct embeddings in phrases like:
● "I can't trust you"
● "He has a trust fund"
This context sensitivity allows ELMo to handle:
● Polysemy (multiple word meanings)
● Complex linguistic patterns
● Contextual nuances
ELMo
● The core parameters of the pre-trained ELMo model (i.e., the weights of
the bidirectional language model) are kept fixed during the fine-tuning
phase. This ensures that the rich contextual information captured during
pre-training is preserved.
● A task-specific projection matrix is introduced, which can be adjusted
during fine-tuning. This matrix maps the ELMo embeddings to a space
that is more suitable for the specific task at hand. By adjusting this matrix,
the model can better align the pre-trained embeddings with the
requirements of the downstream task.
ELMo: Fine-tuning Phase
Fine-tuning Phase
Limitations:
1. Slower and harder to parallelize than transformer-based models such as BERT and GPT.
2. Generates context from sequential forward and backward passes, unlike transformer-based models, which use self-attention to consider all tokens simultaneously.
3. ELMo’s embeddings are large and computationally expensive because they combine multiple BiLM layers.
Generative AI and its Applications
(UE22CS342BA9)
Transformer Anatomy
“As the animal entered the forest, … blending it into the shadows of
the towering trees.”
• RNNs have a short reference window; if the input is too long, the word that “it” refers to will fall outside the context window.
• The RNN's window, the LSTM's window and the attention window are each shown on the same sentence (highlighting not reproduced here):
As the animal entered the forest, … blending it into the shadows of the towering trees.
Attention
Attention
• This dot product gives a measure of similarity between the two vectors,
which translates to how much attention one word should give to the other.
Transformer Architecture
Transformer is a deep learning model introduced in the 2017 paper "Attention Is All You Need". It has
become the foundation for most modern large language models (LLMs) due to its efficiency and
effectiveness in handling sequential data, such as text.
Originally, it was designed for machine translation, where it translated text from one language (e.g.,
English) to another (e.g., German or French).
The Transformer consists of two main parts:
● Encoder: Processes the input text and converts it into a numerical representation (embedding).
● Decoder: Uses the encoded representation to generate the output text, one word at a time.
Transformer Anatomy
Transformer Architecture
A simplified diagram
taken from:
Book: Build a Large Language Model (From Scratch)
Transformer Anatomy
Self-Attention Mechanism
● A key innovation of the Transformer is the self-attention mechanism.
● What it does:
○ It allows the model to weigh the importance of different words in a sequence relative to
each other.
○ For example, in the sentence "The cat sat on the mat," the word "cat" is more
important to "sat" than "mat" when determining the meaning of the sentence.
Why it matters:
○ It enables the model to capture long-range dependencies (e.g., relationships between
words far apart in a sentence).
○ This improves the model's ability to understand context and generate coherent,
contextually relevant output.
Architecture of Transformers
• Embeddings
• Positional Encoding
Encoder Components-
• Multi-Head (Self) Attention
• Fully Connected Feed Forward Network
• Residual Connections & Layer Normalization
Decoder Components-
• Masked Multi-Head (Self) Attention
• Multi-Head (Cross) Attention
Key Ideas:
Self-Attention: Each token in a sequence focuses on all other tokens to
capture relationships and context, regardless of their position.
Query, Key, and Value: These are vector representations of input
tokens. Attention scores are computed as a function of queries and keys,
determining how much "attention" each token pays to others.
Weighted Summation: The final output is a weighted sum of the values,
based on the computed attention scores.
Attention and Attention Mechanism
Attention:
➢ Refers to the ability of a model to focus on different parts of the input sequence when
making predictions, especially in sequence-to-sequence tasks like machine translation.
➢ It determines how much weight or importance each element in the input sequence
should have when predicting each element of the output sequence.
Attention Mechanism:
➢ This is used to take care of LONG RANGE DEPENDENCIES.
➢ This is the actual computational framework or algorithm that enables attention. It is the
mathematical structure that allows the model to compute which parts of the input
sequence should be given more importance (i.e., higher weights) when making a
decision at each step.
➢ It is realized through a scaled dot-product attention, where the input tokens are
transformed into query, key, and value vectors, and the attention score is computed
using the dot product of the query and key vectors.
Encoder and Decoder
Encoder is responsible for
➢ processing the input sequence
➢ converting it into a set of continuous representations (or embeddings)
➢ consists of N identical layers, each of which has two main sub-components:
➢ Self-Attention Layer: This layer allows the encoder to consider the entire sequence of input
tokens when computing its representation.
➢ Every token in the input can "attend" to all other tokens in the sequence, capturing
dependencies and relationships regardless of distance between tokens.
➢ This is done through the scaled dot-product attention mechanism.
The self-attention mechanism works by computing three vectors for each token in the sequence:
•Query (Q): Represents the current token, i.e. what it is looking for in the other tokens.
•Key (K): Represents each token that the query is compared against.
•Value (V): Represents the information that is passed on, weighted by the attention.
Attention is computed by taking the dot product of the query and key vectors, scaling the result, and applying a softmax function to generate attention weights. These weights are used to compute a weighted sum of the value vectors, which forms the new representation for each token.
Encoder and Decoder
➢ Feed-Forward Neural Network (FFN): This is applied to each token independently and consists of two fully connected layers with a ReLU activation function in between.
➢ The FFN allows for more complex transformations of the token representations.
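A minimal sketch of such a position-wise feed-forward network is shown below. The weights and the tiny sizes (d_model = 2, inner size 3) are hypothetical values chosen only so the arithmetic is easy to follow; the helper names `linear` and `ffn` are introduced here for illustration.

```python
def linear(x, W, b):
    """Return x @ W + b for a vector x and a weight matrix W (list of rows)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
            for j in range(len(b))]

def ffn(x, W1, b1, W2, b2):
    """Two fully connected layers with a ReLU in between."""
    hidden = [max(0.0, h) for h in linear(x, W1, b1)]  # first layer + ReLU
    return linear(hidden, W2, b2)                      # second layer

# Hypothetical toy weights: d_model = 2, inner dimension = 3.
W1 = [[1.0, -1.0, 0.5],
      [0.0, 2.0, -1.0]]
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 0.0],
      [0.0, 1.0],
      [1.0, 1.0]]
b2 = [0.1, -0.1]

# The same weights are applied to every token independently.
tokens = [[1.0, 2.0], [-1.0, 0.5]]
outputs = [ffn(t, W1, b1, W2, b2) for t in tokens]
print(outputs)
```

Because each token is transformed on its own, the outputs for the two tokens do not depend on each other; mixing information between tokens is left to the attention layer.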
Like the encoder, the decoder consists of N identical layers, with each layer having three main sub-components:
•Masked Self-Attention Layer: This layer performs self-attention over the decoder's own tokens, with future positions masked out.
•Encoder-Decoder Attention Layer: This layer performs attention over the encoder's output.
• It allows each token in the decoder to focus on different parts of the input sequence when making predictions.
•Feed-Forward Neural Network (FFN): Like in the encoder, the output of the attention layers is passed through a position-wise feed-forward network to transform the representations.
Each decoder layer also includes residual connections and layer normalization, which help improve training by mitigating the vanishing gradient problem and ensuring stable learning.
Generative AI and its Applications
(UE22CS342BA9)
GPT
GPT
Key Components
Token Processing
When text enters the system, it undergoes two crucial encoding steps:
● Token embeddings convert words into vector representations
● Positional encodings add sequence information to maintain word order
Self-Attention Mechanism
The self-attention mechanism is what gives GPT its contextual understanding capabilities. It processes input
through three main components:
● Query vectors
● Key vectors
● Value vectors
These vectors work together to compute attention scores, allowing the model to weigh the importance of different
words in relation to each other
References:
Blog: A Deep Dive into GPT's Transformer Architecture: Understanding
Self-Attention Mechanisms
GPT(Generative Pre-training Transformer)
GPT Architecture
Let's take a closer look at the general GPT (Generative Pretrained Transformer).
GPT models are highly capable text completion models.
They can perform a variety of tasks beyond text generation, such as:
● Spelling correction
● Text classification
● Language translation
This versatility is remarkable, considering that GPT models are trained on a relatively simple task: next-
word prediction.
References:
Book: Build a Large Language Model (From Scratch)
GPT Architecture
● The GPT architecture is relatively simple compared to the original Transformer architecture (More on
this later).
● It consists of only the decoder part (more on this in the next section) of the Transformer model,
without the encoder.
● Key characteristics of GPT architecture:
○ It is a decoder-style model.
○ It generates text by predicting one word at a time, making it an autoregressive model.
References:
Book: Build a Large Language Model (From Scratch)
GPT Architecture
Autoregressive Models
● GPT models are considered autoregressive because:
○ They incorporate their previous outputs as
inputs for future predictions.
○ Each new word is chosen based on the
sequence of words that precedes it.
● This autoregressive nature improves the
coherence of the generated text.
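The autoregressive loop can be sketched as follows. The function `toy_next_token_scores` is a hypothetical stand-in for a trained GPT (a hand-written lookup, not a real model); it exists only to show how each chosen token is appended to the sequence and fed back in for the next prediction.

```python
def toy_next_token_scores(tokens):
    """Hypothetical stand-in for a trained model.

    Returns a score for each candidate next token given ALL previous tokens.
    """
    last = tokens[-1]
    if last == "the":
        # The same word continues differently depending on earlier context.
        if "crossed" in tokens:
            return {"road": 0.8, "dog": 0.2}
        return {"dog": 0.7, "road": 0.3}
    table = {
        "dog": {"safely": 0.6, "crossed": 0.4},
        "safely": {"crossed": 0.9, "<end>": 0.1},
        "crossed": {"the": 0.95, "<end>": 0.05},
        "road": {"<end>": 1.0},
    }
    return table.get(last, {"<end>": 1.0})

def generate(prompt_tokens, max_new_tokens=10):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        scores = toy_next_token_scores(tokens)     # condition on the sequence so far
        next_token = max(scores, key=scores.get)   # greedy choice
        if next_token == "<end>":
            break
        tokens.append(next_token)                  # output becomes part of the input
    return tokens

result = generate(["the"])
print(" ".join(result))
```

Greedy selection is used here for simplicity; sampling from the scores would be an equally valid choice for the same loop.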
References:
Book: Build a Large Language Model (From Scratch)
GPT
Processing Pipeline
Input Processing
The model first tokenizes input text into smaller units and converts them into embeddings. These embeddings then
receive positional information to maintain sequence order.
Transformer Layers
Each transformer layer contains:
● A multi-head self-attention mechanism
● A feed-forward neural network
● Layer normalization components
● Residual connections
Output Generation
The final layer produces probability distributions over the vocabulary, enabling the model to predict the most likely next
token in a sequence
References:
Blog: A Deep Dive into GPT's Transformer Architecture: Understanding
Self-Attention Mechanisms
GPT
Figure:
Two Stage Process
References:
Book: Build a Large Language Model (From Scratch)
GPT
Emergent Behaviour:
Emergent behavior refers to skills or abilities that a model develops naturally during training,
even though they were not directly taught or targeted. Exposure to massive amounts of
multilingual data in various contexts allows models like GPT to "learn" translation patterns
between languages. This means GPT can perform translation tasks without being explicitly
trained for them
Generative AI and its Applications
(UE22CS342BA9)
BERT
BERT
References:
Book: Build a Large Language Model (From Scratch)
BERT
Transformers
Attention
Attention tells the model what parts of the sentence to focus on.
Attention
For each word in the sentence, we need to compute a relational score between
the word and every other word in the sentence.
The higher the score, the more related they are contextually or on some parameter.
Attention
To get a better intuition of why we’re doing this, let’s take a look at how information retrieval systems work, in particular Google Search.
Attention
• The database stores <key, value> pairs of documents.
• Keys can be specific characteristics / terms in the document and values
will be the document itself.
• For e.g. on previous slide, we take Keys as the title of the web page and the value
is the web page itself.
• When the user submits a Query, a similarity score is computed for each Key with
the Query and the top 𝑁 ranked keys are returned.
To compute a similarity score between the Query and Key vectors, we take their dot product (this equals the cosine similarity when both vectors have unit length).
We can compute the values for $a_1, a_2, \dots, a_n$ in parallel if we combine all the word vectors into a single matrix and perform matrix multiplication between the Query and Key matrices.
Let $H = [h_1, h_2, \dots, h_n]$, which is of dimension $n \times d_{model}$.
$Q = H \times W_q$
$K = H \times W_k$
$V = H \times W_v$
Hence the $Q$, $K$, $V$ matrices are of dimension $n \times d_q$, $n \times d_k$, $n \times d_v$ respectively.
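The projections can be illustrated with plain matrix multiplication. In this sketch the input matrix `H` holds made-up values for n = 3 tokens with d_model = 4, and `W_q`, `W_k`, `W_v` are hypothetical weights (in a real model they are learned); only the shapes matter here.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

# Illustrative input: n = 3 tokens, d_model = 4.
H = [[0.86, 0.34, 0.49, 0.57],
     [0.42, 0.34, 0.34, 0.49],
     [0.56, 0.98, 0.86, 0.42]]

# Hypothetical projection weights of shape d_model x 2.
W_q = [[1, 0], [0, 1], [1, 0], [0, 1]]
W_k = [[0, 1], [1, 0], [0, 1], [1, 0]]
W_v = [[1, 1], [0, 0], [0, 0], [1, -1]]

Q = matmul(H, W_q)  # n x d_q
K = matmul(H, W_k)  # n x d_k
V = matmul(H, W_v)  # n x d_v
print(len(Q), len(Q[0]))
```

All three projections are computed for every token in one matrix product, which is what makes the attention computation easy to parallelize.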
Attention
• Hence the attention filter,
$A = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)$
where the dimensions of $A$ are $n \times n$ ($n$ is the number of words in the input).
• This matrix now acts as a filter and tells the model which parts of the sentence are more important than others and which are less important.
• In the end, we concatenate the attention outputs from the different heads and pass the result through a last linear transformation ($W_O$).
Attention
Figure: different attention patterns from the Multi-Head Attention module.
Architecture of Transformers
Now that we know what Attention is, Let’s look at the architecture of Transformers.
For example:
Input – The dog safely
Output – crossed the road.
• First the sentence “the dog safely” is passed to the encoder which
generates a context and gives this to the decoder.
• In the 1st run, the decoder is given a special <start> token. It calculates a probability for
each word in the model’s vocabulary and selects the word with the highest probability.
• Let’s say this word is “crossed”. Then in the 2nd run, the decoder is given “crossed” and
the process repeats till a special <end> token.
Architecture of Transformers
• During Training however, we need to give both the input and the expected
output to the model.
• For an input “the dog safely”, let’s say the decoder generated the word
“touched”.
• If we continued to give the model its own wrong prediction, future generated words would be conditioned on this wrong word. During training, the decoder is therefore fed the expected (ground-truth) output instead.
• Embeddings
• Positional Encoding
Encoder Components-
• Multi-Head (Self) Attention
• Fully Connected Feed Forward Network
• Residual Connections & Layer Normalization
Decoder Components-
• Masked Multi-Head (Self) Attention
• Multi-Head (Cross) Attention
• For our example, we’ll use a vocabulary of only 5 words [‘crossed’, ‘dog’, ‘road’, ‘safely’, ‘the’].
• The vector embeddings are randomly initialized in the beginning and are learnt during training.
So, for an input of “the dog safely”, the following input matrix $H_{n \times d_{model}}$ (here $3 \times 4$) will be generated:
the      0.86   0.34   0.49   0.57
dog      0.42   0.34   0.34   0.49
safely   0.56   0.98   0.86   0.42
We need some way of encoding the positional data of the words in the sentence.
Hence, we use the following to encode the position of the word along with the embedding:
For even dimension indices, $PE(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)$
For odd dimension indices, $PE(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$
where $pos$ is the position of the word and $i$ indexes pairs of embedding dimensions.
Positional Encoding
‘the’ is in position 0 -
index     0                      1                      2                      3
H_0       0.86                   0.34                   0.49                   0.57
formula   sin(0/10000^(2*0/4))   cos(0/10000^(2*0/4))   sin(0/10000^(2*1/4))   cos(0/10000^(2*1/4))
PE        0                      1                      0                      1
Positional Encoding
‘dog’ is in position 1 -
index     0                      1                      2                      3
H_1       0.42                   0.34                   0.34                   0.49
formula   sin(1/10000^(2*0/4))   cos(1/10000^(2*0/4))   sin(1/10000^(2*1/4))   cos(1/10000^(2*1/4))
PE        0.8415                 0.5403                 0.01                   0.9999
‘safely’ is in position 2 -
index     0                      1                      2                      3
H_2       0.56                   0.98                   0.86                   0.42
formula   sin(2/10000^(2*0/4))   cos(2/10000^(2*0/4))   sin(2/10000^(2*1/4))   cos(2/10000^(2*1/4))
PE        0.9093                 −0.4161                0.02                   0.9998
Positional Encoding
We finally add the 𝑃𝐸 with the input embeddings:
𝐻3×4 = 𝐻3×4 + 𝑃𝐸
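The positional-encoding tables and the final addition can be reproduced with a short sketch. The embedding values are the ones from the slides (d_model = 4); the function name `positional_encoding` is a hypothetical helper introduced here.

```python
import math

def positional_encoding(n_positions, d_model):
    """Sinusoidal encoding: sin on even dimension indices, cos on odd ones."""
    pe = []
    for pos in range(n_positions):
        row = []
        for j in range(d_model):
            i = j // 2  # dimensions 2i and 2i+1 share the same frequency
            angle = pos / (10000 ** (2 * i / d_model))
            row.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe

# Embeddings for "the dog safely" as given in the slides.
H = [[0.86, 0.34, 0.49, 0.57],
     [0.42, 0.34, 0.34, 0.49],
     [0.56, 0.98, 0.86, 0.42]]

PE = positional_encoding(3, 4)
H_with_pos = [[h + p for h, p in zip(h_row, pe_row)]
              for h_row, pe_row in zip(H, PE)]

for row in PE:
    print([round(v, 4) for v in row])
```

The rows of `PE` should agree with the PE rows in the tables above up to rounding, and `H_with_pos` is the element-wise sum that is passed on to the encoder.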
The original paper uses 8 heads; for our example, we’ll use only 2 heads.
The original paper uses $d_q = d_k = d_v = d_{model}/h = 64$; we’ll use 2 as our dimension.
$W_{q0}, W_{k0}, W_{v0}$ and $W_{q1}, W_{k1}, W_{v1}$ are initialized randomly.
Assuming we get:
$Q_0 = HW_{q0} = \begin{bmatrix} 1.2922 & 1.6889 \\ 1.2541 & 2.5407 \\ 0.6617 & 2.7296 \end{bmatrix}$
$K_0 = HW_{k0} = \begin{bmatrix} 0.0182 & 0.4839 \\ 0.5597 & 0.8135 \\ 0.4560 & 1.2663 \end{bmatrix}$
$V_0 = HW_{v0} = \begin{bmatrix} 2.6708 & -1.1731 \\ 2.6051 & -0.5519 \\ 3.5441 & -1.0239 \end{bmatrix}$
Encoder – Multi-Head (Self) Attention
$\mathrm{softmax}\left(\frac{Q_0 K_0^T}{\sqrt{d_k}}\right) V_0 = \begin{bmatrix} 3.1075 & -0.8858 \\ 3.1800 & -0.9001 \\ 3.1928 & -0.9127 \end{bmatrix}$
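The result for head 0 can be checked numerically. The sketch below uses the Q0, K0 and V0 values given in the slides and a plain-Python scaled dot-product attention; the helper names (`matmul`, `softmax`, `attention`) are introduced here for illustration.

```python
import math

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, computed row by row."""
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)

# Values for head 0 as given in the slides.
Q0 = [[1.2922, 1.6889], [1.2541, 2.5407], [0.6617, 2.7296]]
K0 = [[0.0182, 0.4839], [0.5597, 0.8135], [0.4560, 1.2663]]
V0 = [[2.6708, -1.1731], [2.6051, -0.5519], [3.5441, -1.0239]]

head0 = attention(Q0, K0, V0)
for row in head0:
    print([round(v, 4) for v in row])
```

The printed rows should agree with the matrix above to about three decimal places; small differences in the last digit can come from rounding of the inputs.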
Encoder – Multi-Head (Self) Attention
The dimensionality of input and output is 𝑑𝑚𝑜𝑑𝑒𝑙 and the inner-layer has a
dimensionality 𝑑𝑓𝑓 . In the original paper, 𝑑𝑓𝑓 = 2048.
“the” 3.1521 3.9323 0.5346 −3.9235 “the” −2.3621 14.1629 −7.4911 −21.0523
“dog” 3.1898 3.9741 0.4894 −3.9718 “dog” −2.2789 14.1132 −7.5580 −21.4520
“safely” 3.4267 4.1227 0.6870 −4.1507 “safely” −2.5964 15.0122 −7.9764 −21.3564
Encoder – Residual Connections
Till now, we’ve seen operations which form one layer of the encoder.
Each layer consists of –
The Encoder is composed of a stack of multiple such layers.
In the original paper, it consists of 6 layers.
In the decoder block, during training, we provide the expected target (ground
truth). But to enforce the “Teacher Forcing” approach, we use masks to hide
future information so that the decoder only uses previous and current words.
By placing −∞ in the places where we don’t want the model to know the
attention scores, SoftMax computes them as 0, and hence future words are
successfully hidden from the model.
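A minimal sketch of this masking step, using hypothetical attention scores for three tokens: every entry above the diagonal (a future position) is set to −∞ before the softmax, so its weight becomes exactly 0.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def causal_mask(scores):
    """Keep score[i][j] only if j <= i; hide future positions with -inf."""
    n = len(scores)
    return [[scores[i][j] if j <= i else float("-inf") for j in range(n)]
            for i in range(n)]

# Hypothetical raw attention scores for 3 decoder tokens.
scores = [[0.5, 1.2, 0.3],
          [0.9, 0.1, 2.0],
          [1.5, 0.7, 0.4]]

weights = [softmax(row) for row in causal_mask(scores)]
for row in weights:
    print([round(w, 3) for w in row])
```

The first token can only attend to itself, the second to the first two tokens, and so on, while each row still sums to 1.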
• The last word embedding (1 × 𝑑𝑚𝑜𝑑𝑒𝑙 ) of the last output layer is passed
through a linear transformation 𝑊𝐷 of size 𝑑𝑚𝑜𝑑𝑒𝑙 × 𝑉𝑜𝑐𝑎𝑏 𝑠𝑖𝑧𝑒.
• This transformation outputs a vector which has a score for each word in the
vocabulary. SoftMax is applied on the vector and the word with the highest
probability is chosen as the predicted word.
• The predicted word is then passed as input to the decoder again until the
<end> token is chosen. (auto-regressive)
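One prediction step of this final layer can be sketched as below. The vocabulary is the five-word one from the example; the hidden vector `last_hidden` and the matrix `W_D` are hypothetical values chosen for illustration (in a real model both come from training).

```python
import math

VOCAB = ["crossed", "dog", "road", "safely", "the"]  # vocabulary from the example

def predict_next_word(last_hidden, W_D):
    """Linear layer (d_model x vocab size), softmax, then pick the best word."""
    logits = [sum(h * W_D[i][j] for i, h in enumerate(last_hidden))
              for j in range(len(VOCAB))]
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(VOCAB)), key=lambda j: probs[j])
    return VOCAB[best], probs

# Hypothetical last-token embedding (1 x d_model, with d_model = 4).
last_hidden = [0.2, -0.1, 0.4, 0.3]
# Hypothetical output weights of size d_model x vocab size (4 x 5).
W_D = [[2, 0, 0, 0, 0],
       [0, 1, 0, 0, 0],
       [1, 0, 1, 0, 0],
       [1, 0, 0, 1, 0]]

word, probs = predict_next_word(last_hidden, W_D)
print(word, [round(p, 3) for p in probs])
```

With these made-up weights the word "crossed" receives the highest probability; in the auto-regressive loop it would then be appended to the decoder input for the next step.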
Final Layer
Types of Transformers
Types of Transformers – Encoder Only Transformer
These models use only the encoder part of the transformer architecture.
The encoder focuses on understanding the input sequence by attending to
all parts of it, and learning deep contextualized representations of the input.
The output from the encoder can then be used for various downstream tasks.
Use Cases:
• Text Classification: Classifying the sentiment of a sentence or detecting spam.
• Named Entity Recognition (NER): Identifying entities like names, places, or
organizations in text.
• Token Classification: Part-of-speech tagging, labeling each word/token in a sentence.
• Masked Language Modeling (MLM): Predicting missing words in a sentence.
Examples:
• BERT (Bidirectional Encoder Representations from Transformers)
Types of Transformers – Decoder Only Transformer
These models use only the decoder part of the transformer architecture.
The decoder focuses on generating sequences and is often used for autoregressive tasks, where the model generates the next word or token in a sequence given the preceding ones.
Decoders rely on self-attention over past tokens and generate outputs one token at a time.
Use Cases:
• Text Generation: Autoregressively generating text, like completing a sentence or story
generation.
• Language Modeling: Predicting the next word or sequence of words in a sentence.
• Code Generation: Automatically generating code from a prompt or comments.
• Dialogue Systems: Used in chatbot and conversational systems to generate responses.
Examples:
• GPT (Generative Pretrained Transformer) & GROVER
Types of Transformers – Encoder + Decoder Transformer
Use Cases:
• Machine Translation: Translating sentences from one language to another.
• Summarization: Converting long articles into shorter summaries.
• Text-to-Text: Tasks like sentence paraphrasing or converting a question into an answer.
• Speech-to-Text: Converting spoken language into written text (speech recognition).
Examples:
• T5 (Text-to-Text Transfer Transformer) & mBART (Multilingual BART)
References
• Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin, Attention Is All You Need, 2017
• Mitesh Khapra (IITM Professor), Transformer Slides
• Polo Club (Georgia Tech), Interactive Transformer Explainer
• Michael Nguyen (The AI Hacker), Illustrated Guide to Transformers
• Grant Sanderson (3b1b), Neural Networks
• Yannic Kilcher, Explaining the “Attention Is All You Need” paper
• Batool Haider (Hedu AI), A visual guide to transformers
• Tom Yeh, AI by Hand
• Fareed Khan, Solving Transformers by Hand
Generative AI and its Applications
LLM Architecture and Models
BERT
Dr. Arti Arya
Department of Computer Science and
Engineering
BERT
BERT looks at all words (before and after the mask) to figure out what word
should go in the blank.
During training:
● It masks 15% of words randomly
● Sometimes it replaces words with incorrect ones to learn better
● It learns to predict the original word by understanding the entire context
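The masking procedure can be sketched as follows. The 80% / 10% / 10% split between [MASK], a random token and the unchanged token comes from the original BERT paper rather than from these slides, and selecting each token independently with probability 0.15 is a simplification assumed here; `mask_tokens` is a hypothetical helper name.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (inputs, labels); labels are None where nothing is predicted."""
    rng = rng or random.Random(0)
    inputs, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            labels.append(token)  # the model must recover the original token
            r = rng.random()
            if r < 0.8:
                inputs.append("[MASK]")           # usually: hide the token
            elif r < 0.9:
                inputs.append(rng.choice(vocab))  # sometimes: a random token
            else:
                inputs.append(token)              # sometimes: leave it as is
        else:
            inputs.append(token)
            labels.append(None)
    return inputs, labels

tokens = "the dog safely crossed the road".split()
vocab = sorted(set(tokens))
inputs, labels = mask_tokens(tokens, vocab, mask_prob=0.5, rng=random.Random(1))
print(inputs)
print(labels)
```

A higher `mask_prob` is used in the usage lines only so that a six-word sentence is likely to show at least one selected token; the training loss is computed only at positions where the label is not None.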
BERT
References:
Book: Build a Large Language Model (From Scratch)
BERT
Architecture
BERT uses an encoder-only transformer architecture consisting
of four key modules:
● A tokenizer that converts text into integer sequences.
● An embedding layer that transforms tokens into vector embeddings.
● A transformer encoder stack with self-attention.
● A task head that produces probability distributions
over tokens.
BERT
[Link]
BERT: Fine-tuning(Teaching BERT New Tasks)
• (In contrast to ELMo, BERT’s fine-tuning process involves updating all the
model parameters (both the pre-trained encoder and the added task-specific
layers) during task-specific training.)
BERT
BERT: Overview
● Almost every other word in the English language has multiple meanings. In spoken
word, it is even worse because of homophones and prosody.
● For instance, “four candles” and “fork handles” for those with an English accent.
Another example: comedians’ jokes are mostly based on the play on words because
words are very easy to misinterpret.
● It’s not very challenging for us because we have common sense and context so we
can understand all the other words that surround the context of the situation or the
conversation – but search engines and machines don’t.
● This does not bode well for conversational search into the future.
BERT
● E.g.
“I like the way that looks like the other one.”
The Stanford Part-of-Speech Tagger assigns the two occurrences of “like” in this sentence to two separate parts of speech (POS).
● The word “like” may be used as different parts of speech including verb, noun,
and adjective.
● So literally, the word “like” has no meaning because it can mean whatever
surrounds it.
● The context of “like” changes according to the meanings of the words that
surround it.
● The longer the sentence is, the harder it is to keep track of all the different
parts of speech within the sentence.
BERT
BERT has almost the same model architecture across different tasks, with only small changes between the pre-trained architecture and the final downstream architecture.
BERT
The term "causal" in this context comes from the fact that the model
learns to predict the next token in the sequence based on the causal
influence of the previous tokens
GPT(Generative Pre-training Transformer)
• Following a similar idea to ELMo, OpenAI GPT (Radford et al., 2018) expands the unsupervised language model to a much larger scale by training on a giant collection of free text corpora.
• Despite the similarity, GPT has two major differences from ELMo.
Let’s say the context window of size k is located before the target word; the loss would be:
$\mathcal{L}_{LM} = -\sum_{i} \log p(x_i \mid x_{i-k}, \dots, x_{i-1})$
GPT
• As discussed earlier, GPT is based on the Decoder component of the transformer architecture.
• At its heart, the model processes input tokens through multiple stacked transformer layers, each containing self-attention mechanisms and feedforward neural networks.
References:
Book: Build a Large Language Model (From Scratch)
GPT : Training Methodology
GPT models undergo a two-phase training process:
• Pre-training
During this phase, the model learns from vast amounts of text
data, developing an understanding of language patterns and
relationships.
• Fine-tuning
The model is then refined on specific tasks with human
feedback, improving its ability to generate contextually
appropriate responses
GPT: Recent Developments
• Modern GPT iterations have grown rapidly in capability.
GPT-4, o1, etc.: while their exact architectures remain private, they demonstrate significant improvements in reasoning and contextual understanding compared to their predecessors, along with multimodal capabilities.
Generative AI and its Applications
LLM Architecture and Models
RoBERTa
RoBERTa: A More Robust BERT
• The researchers found that BERT was significantly undertrained, and with
some key modifications, they could substantially improve its performance.
References:
Paper: RoBERTa: A Robustly Optimized BERT Pretraining
Approach
RoBERTa: What Makes RoBERTa Different?
• 1. Longer Training:
While the original BERT was trained for a fixed number of steps, RoBERTa was trained for much longer,
allowing it to better learn from the training data. They also used larger batches during training, which
helped stabilize the learning process.
• 2. NSP Removal:
They removed what's called the "Next Sentence Prediction" (NSP) objective. Originally, BERT was
trained to predict whether two text segments appeared next to each other in the original text. The
researchers found this task wasn't necessary and might even be harmful to the model's performance.
• 3. Dynamic masking:
In the original BERT, the masking pattern (which words get hidden from the model) was static: it was created once during data preprocessing. RoBERTa uses dynamic masking, where the masking pattern is generated every time the model sees a sequence. This prevents the model from memorizing specific patterns and forces it to learn more robust features.
References:
Paper: RoBERTa: A Robustly Optimized BERT Pretraining
Approach
RoBERTa: The Data Advantage
RoBERTa wasn't just trained differently - it was trained on more data. The
researchers used multiple datasets totalling 160GB of text, including:
The Results
The improvements paid off dramatically. RoBERTa achieved state-of-the-art
results on several key benchmarks:
References:
Paper: RoBERTa: A Robustly Optimized BERT Pretraining
Approach
Generative AI and its Applications
(UE22CS342BA9)
BART
BART: Bidirectional and Auto-Regressive Transformers
Architecture
• BART employs a standard Transformer-based sequence-to-sequence model
with a bidirectional encoder and an auto-regressive decoder.
References:
Paper: BART Paper
BART: Pre-training Approach
BART's pre-training process involves two key steps:
1. Text Corruption: Documents are corrupted using
various noise functions:
● Token masking: Random tokens replaced with [MASK]
● Token deletion: Random tokens removed
● Text infilling: Text spans replaced with single [MASK]
token
● Sentence permutation: Sentences shuffled randomly
● Document rotation: Text rotated around random
token
2. Reconstruction:
The model learns to reconstruct the original text from the corrupted input.
References:
Paper: BART Paper
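The five noise functions listed above can be sketched on a toy token list. This is a simplified illustration under stated assumptions: the probabilities and the fixed span length are arbitrary choices made here (the BART paper samples span lengths), and all function names are hypothetical.

```python
import random

def token_masking(tokens, rng, p=0.3):
    """Replace random tokens with [MASK]."""
    return ["[MASK]" if rng.random() < p else t for t in tokens]

def token_deletion(tokens, rng, p=0.3):
    """Remove random tokens entirely."""
    return [t for t in tokens if rng.random() >= p]

def text_infilling(tokens, rng, span_len=2):
    """Replace one span of tokens with a single [MASK] (fixed length here)."""
    start = rng.randrange(0, max(1, len(tokens) - span_len + 1))
    return tokens[:start] + ["[MASK]"] + tokens[start + span_len:]

def sentence_permutation(sentences, rng):
    """Shuffle the order of the sentences."""
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled

def document_rotation(tokens, rng):
    """Rotate the document so that it starts at a random token."""
    pivot = rng.randrange(len(tokens))
    return tokens[pivot:] + tokens[:pivot]

rng = random.Random(0)
tokens = "the dog safely crossed the road".split()
sentences = ["A great game.", "The election is over.", "Very clean match."]

masked = token_masking(tokens, rng)
deleted = token_deletion(tokens, rng)
infilled = text_infilling(tokens, rng)
permuted = sentence_permutation(sentences, rng)
rotated = document_rotation(tokens, rng)
print(masked, deleted, infilled, permuted, rotated, sep="\n")
```

In each case the training pair is (corrupted text, original text): the encoder reads the corrupted version and the decoder is trained to reproduce the original.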
BART: Fine-tuning Capabilities
BART demonstrates remarkable versatility in fine-tuning for different tasks:
Sequence Classification
● Input is fed to both encoder and decoder
● Final decoder token state used for classification
Generation Tasks
● Direct fine-tuning for summarization and dialogue
● Autoregressive decoder enables natural text generation
Machine Translation
● The entire BART model (encoder and decoder) serves as a single pre-trained decoder
● A new encoder is learned to map the source language into input that BART can denoise
References:
Paper: BART Paper
LLM Architecture
• Large Language Models (LLMs) represent a revolutionary advancement in artificial intelligence,
built upon transformers (sophisticated neural networks designed to process and understand
human language).
• The architecture begins with tokenization, where text is broken into smaller units that the model
can process. These tokens are then transformed into numerical vectors through the embedding
layer, capturing semantic, syntactic and contextual information.
• At the heart of modern LLMs lies the transformer architecture, introduced in 2017.
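The tokenization and embedding steps described above can be sketched as follows. This is a minimal illustration under stated assumptions: real LLMs use subword tokenizers rather than the word-level split here, and the embedding table below is random rather than learned. All function names are hypothetical.

```python
import random
import re

def tokenize(text):
    """Toy word-level tokenizer: lowercase, then split into words and punctuation."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def build_vocab(tokens):
    """Assign each distinct token an integer ID."""
    return {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}

def make_embedding_table(vocab_size, dim, seed=0):
    """One vector per token ID (random here; learned during training in a real model)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(vocab_size)]

text = "The model reads text, then the model predicts."
tokens = tokenize(text)                  # text -> tokens
vocab = build_vocab(tokens)
ids = [vocab[t] for t in tokens]         # tokens -> token IDs
table = make_embedding_table(len(vocab), dim=4)
vectors = [table[i] for i in ids]        # token IDs -> embedding vectors

if __name__ == "__main__":
    print(tokens)
    print(ids)
    print(len(vectors), "vectors of dimension", len(vectors[0]))
```

Repeated tokens such as "the" map to the same ID and therefore the same vector at this stage; it is the later transformer layers that make the representation depend on context.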
References:
Book: Build a Large Language Model (From Scratch)
LLM Architecture: Processing Mechanism
References:
Book: Build a Large Language Model (From Scratch)
LLM Architecture
2. Contextual Representation:
○ The hidden state captures the context of the input sequence, enabling the
decoder to generate contextually relevant outputs.
3. Flexibility:
○ This architecture can handle variable-length input and output sequences,
making it versatile for many NLP tasks.
References:
Book: Build a Large Language Model (From Scratch)
Book: Sinan Ozdemir, Quick Start Guide to Large Language Models: Strategies and Best Practices for Using ChatGPT and Other LLMs, Addison-Wesley Professional (2023)
How LLM Works
➢ Prompt Response Flow:
When a question is entered, it becomes part of a prompt within ChatGPT, which passes it to a
pre-trained LLM such as GPT-4 to generate a response.
➢ Pre-training of LLM
LLMs are pre-trained on vast amounts of data at significant computational cost. They are
neural networks whose weights and biases are adjusted during training to improve prediction.
Pre-training from scratch is costly and impractical for most users. Fine-tuning existing LLMs with
your data is a more viable option.
References:
Book: Build a Large Language Model (From Scratch)
LLM Architecture
● The original Transformer (encoder-decoder) model was specifically designed for
tasks like language translation.
● In contrast, GPT models use a simpler decoder-only architecture and are primarily
trained for next-word prediction. Despite this, GPT models are surprisingly
capable of performing translation tasks.
● This ability to handle translation was unexpected because GPT was not explicitly
trained for it. Instead, this capability is an example of what researchers call
"emergent behavior".
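Next-word prediction, the training objective mentioned above, can be illustrated with a toy generation loop. This sketch is an assumption-laden stand-in: it uses simple bigram counts in place of a transformer decoder, a made-up corpus, and greedy decoding (always picking the most frequent next word). The loop structure, predicting one token and feeding it back in, is the part that carries over to GPT-style models.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def generate(counts, start, max_new_tokens):
    """Greedy auto-regressive generation: append the most likely next word, repeat."""
    out = [start]
    for _ in range(max_new_tokens):
        followers = counts.get(out[-1])
        if not followers:
            break  # no known continuation for this word
        out.append(followers.most_common(1)[0][0])
    return out

corpus = "the cat sat on the mat and the cat sat on the rug".split()
model = train_bigram(corpus)

if __name__ == "__main__":
    print(" ".join(generate(model, "the", 4)))  # prints: the cat sat on the
```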
References:
Book: Build a Large Language Model (From Scratch)
LLM Architecture
● Emergent behavior refers to skills or abilities that a model develops
naturally during training, even though they were not directly taught or
targeted.
References:
Book: Build a Large Language Model (From Scratch)
DeepSeek-R1
DeepSeek-R1: Released on 20th Jan 2025
• DeepSeek R1 is an open-source AI model that stands out for its reasoning-
centric design.
References:
DeepSeek R1: All you need to know
DeepSeek-R1: Released on 20th Jan 2025
•1. Domain-Specific Optimization:
• DeepSeek-R1: A general-purpose reasoning model rather than an
industry-specific one. Because its weights are openly released, it can be
adapted to specialized tasks (e.g., healthcare, finance, legal).
• ChatGPT: A general-purpose assistant with closed weights. Adapting it to
a specific domain depends on prompting or the provider's own
customization options.
•2. Computational Efficiency:
• DeepSeek-R1: Uses a Mixture-of-Experts architecture with 671B total
parameters, of which about 37B are active per token, which reduces
compute per token at inference. Smaller distilled versions are available
for resource-constrained environments.
• ChatGPT: Parameter counts for its current models are not public (GPT-3
had 175B parameters). Large dense models of this scale are expensive
to deploy and maintain.
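To give a rough sense of why a parameter count such as the 175B quoted above translates into deployment cost, the memory needed just to hold the weights can be estimated. This is a back-of-the-envelope sketch: the function name is hypothetical, and it ignores activations, caches, and any optimizer state.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Memory needed to store the model weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

if __name__ == "__main__":
    params = 175_000_000_000
    print("16-bit weights:", weight_memory_gb(params, 2), "GB")  # 350.0 GB
    print("32-bit weights:", weight_memory_gb(params, 4), "GB")  # 700.0 GB
```

Either figure is far beyond a single consumer GPU, which is why models at this scale are usually served across several accelerators.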
References:
DeepSeek R1: All you need to know
DeepSeek-R1: Use Cases and Applications
References:
DeepSeek R1: All you need to know