What is NLP?
• NLP stands for Natural Language Processing.
Natural languages are the languages spoken by people.
• NLP encompasses everything a computer needs to understand natural
language (typed or spoken) and to generate it.
• Natural Language Processing (NLP) is a subfield of artificial intelligence
and linguistics devoted to making computers "understand" statements
written in human languages.
Language modeling:
Language modeling is the task of assigning a probability to
any sequence of words. It is used in a wide
variety of applications such as speech recognition, spam
filtering, information extraction, and prompt generation. In fact,
language modeling (hence the "LM" in LLM, large language model) is the key aim behind
many state-of-the-art Natural Language
Processing models.
N-grams are contiguous sequences of n items collected
from a text or speech corpus, or from almost any other type of
sequential data. The n in n-gram specifies the number of items to
consider: unigram for n = 1, bigram for n = 2, trigram for n = 3,
and so on. N-grams and n-gram models are widely used in
probability, communication theory, computational linguistics (for example,
statistical natural language processing), and computational biology.
The unigram case (n = 1), which ignores word order, is sometimes described as a "bag of words".
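The definition above can be illustrated with a minimal sketch that slides a window of size n over a list of tokens. The helper name `ngrams` and the example sentence are chosen here for illustration only; splitting on whitespace is an assumed, simplistic tokenizer.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token windows of `tokens` as tuples."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A sequence of length L has L - n + 1 windows (none if L < n).
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "I read a book".split()  # assumption: whitespace tokenization
unigrams = ngrams(tokens, 1)  # [('I',), ('read',), ('a',), ('book',)]
bigrams = ngrams(tokens, 2)   # [('I', 'read'), ('read', 'a'), ('a', 'book')]
trigrams = ngrams(tokens, 3)  # [('I', 'read', 'a'), ('read', 'a', 'book')]
```

A sequence shorter than n simply yields an empty list.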
Methods of Language Modeling:
Two types of language modeling:
Statistical language modeling: the development of probabilistic models
that predict the next word in a sequence given the
words that precede it. N-gram language modeling is an example.
Neural language modeling: neural network methods
achieve better results than classical methods, both as standalone
language models and when incorporated into larger
models for challenging tasks like speech recognition and machine
translation. Neural language models commonly build on
word embeddings.
N-Grams
The probability of a sentence can be calculated from the probability of the sequence of
words occurring in it.
We can use the Markov assumption that the probability of a word in a sentence
depends only on the word occurring just before it:
P(w(n) | w(1) ... w(n-1)) ≈ P(w(n) | w(n-1))
Such a model is called a first-order Markov model or bigram model.
Here, w(n) refers to the word token corresponding to the nth word in a sequence.
A combination of words forms a sentence. However, such a combination
is meaningful only when the words are arranged in a proper order.
Eg: Sit I car in the.
Such a sentence is not grammatically acceptable. However, some
perfectly grammatical sentences can be nonsensical too!
Eg: Colorless green ideas sleep furiously.
One easy way to handle such unacceptable sentences is to assign
probabilities to strings of words, i.e., to estimate how likely a sentence is in
that particular form.
Probability of a sentence
If we treat each word occurring at its position as an
event, the probability of the sentence is the joint probability of these events:
Sentence structure: P(w(1), w(2), ..., w(n-1), w(n))
Using the chain rule: = P(w(1)) * P(w(2) | w(1)) * P(w(3) | w(1) w(2)) * ... * P(w(n) | w(1) w(2) ... w(n-1))
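As a worked instance of the chain rule, the four-word sentence "I read a book" (an example chosen here for illustration) expands into one factor per word, each conditioned on the full preceding history:

```latex
\[
P(\text{I read a book})
  = P(\text{I})
    \cdot P(\text{read} \mid \text{I})
    \cdot P(\text{a} \mid \text{I read})
    \cdot P(\text{book} \mid \text{I read a})
\]
```

The last factor already conditions on three words; for longer sentences these histories become too long to estimate reliably from a finite corpus.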
Bigrams
We can avoid this very long calculation by approximating that the
probability of a given word depends only on the word
immediately preceding it.
This assumption is called the Markov assumption, and such a model is
called a Markov model (here, the bigram model).
Bigrams can be generalized to n-grams, which look at the (n-1)
preceding words.
A bigram model is a first-order Markov model.
Therefore, P(w(1), w(2), ..., w(n-1), w(n)) ≈ P(w(1)) * P(w(2) | w(1)) * P(w(3) | w(2)) * ... * P(w(n) | w(n-1))
We use the (eos) tag to mark the beginning and end of a sentence, so the first factor becomes P(w(1) | (eos)) and a final factor P((eos) | w(n)) is added.
A bigram table for a given corpus can be generated and used as a
lookup table for calculating the probability of sentences.
Eg: Corpus - (eos) You book a flight (eos) I read a book (eos) You read (eos)
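A bigram lookup table for this corpus could be sketched as follows. This is a minimal illustration, not a reference implementation: it estimates P(word | prev) as count(prev, word) / count(prev), treats "(eos)" as an ordinary token (so the trailing (eos) is included in its unigram count, an assumption), and the function names `bigram_prob` and `sentence_prob` are hypothetical.

```python
from collections import Counter

corpus = "(eos) You book a flight (eos) I read a book (eos) You read (eos)"
tokens = corpus.split()

unigram_counts = Counter(tokens)                   # count(w)
bigram_counts = Counter(zip(tokens, tokens[1:]))   # count(prev, w)


def bigram_prob(prev, word):
    """Unsmoothed estimate: count(prev, word) / count(prev)."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / unigram_counts[prev]


def sentence_prob(sentence):
    """Multiply bigram probabilities over a sentence wrapped in (eos) markers."""
    words = ["(eos)"] + sentence.split() + ["(eos)"]
    prob = 1.0
    for prev, word in zip(words, words[1:]):
        prob *= bigram_prob(prev, word)
    return prob


print(sentence_prob("You read a book"))  # 0.03125 (five factors of 0.5)
print(sentence_prob("I book a flight"))  # 0.0, the bigram (I, book) is unseen
```

The second sentence is perfectly acceptable English, yet it receives probability zero because one of its bigrams never occurs in the corpus.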
N-Grams Smoothing
One major problem with standard N-gram models is that they
must be trained on some corpus, and because any particular
training corpus is finite, some perfectly acceptable N-grams are
bound to be missing from it.
As a result, the bigram matrix for any given training corpus is
sparse: it contains a large number of zero-probability
bigrams that should really have some non-zero probability.
The unsmoothed method thus tends to underestimate the probability of strings whose
words happen not to have occurred next to each other in the training corpus.
There are techniques for assigning a non-zero
probability to these 'zero-probability bigrams'. This task of
re-evaluating some of the zero-probability and low-probability
N-grams, and assigning them non-zero values, is called smoothing.
Some of the techniques are:
1. Add-One Smoothing,
2. Witten-Bell Discounting,
3. Good-Turing Discounting.
Add-One Smoothing
In add-one smoothing, we add one to all the counts (unigram or bigram) before
normalizing them into probabilities.
Application on unigrams
The unsmoothed maximum likelihood estimate of the unigram
probability is computed by dividing the count of the word by
the total number of word tokens N:
P(wx) = c(wx) / sum_i c(wi) = c(wx) / N
Add-one smoothing defines an adjusted count c*:
c*(wi) = (c(wi) + 1) * N / (N + V)
where V is the total number of word types in the language.
The smoothed probabilities are then obtained by normalizing the adjusted counts by N:
p*(wi) = c*(wi) / N = (c(wi) + 1) / (N + V)
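The unigram formulas can be checked numerically with a small sketch. The token list and the extra unseen word "ticket" are assumptions made here for illustration; V is taken to be the size of this assumed vocabulary rather than of a whole language.

```python
from collections import Counter

tokens = "You book a flight I read a book You read".split()
counts = Counter(tokens)

# Assumption: the vocabulary is the observed words plus one unseen word, "ticket".
vocab = sorted(set(tokens) | {"ticket"})
N = len(tokens)  # total number of word tokens
V = len(vocab)   # number of word types


def mle_prob(word):
    """Unsmoothed estimate c(w) / N."""
    return counts[word] / N


def add_one_prob(word):
    """Add-one estimate (c(w) + 1) / (N + V)."""
    return (counts[word] + 1) / (N + V)


def adjusted_count(word):
    """Adjusted count c*(w) = (c(w) + 1) * N / (N + V)."""
    return (counts[word] + 1) * N / (N + V)


for w in vocab:
    print(f"{w:8s} mle={mle_prob(w):.3f} "
          f"add_one={add_one_prob(w):.3f} c*={adjusted_count(w):.3f}")
```

The unseen word moves from probability 0 to 1 / (N + V), while every seen word gives up a little probability mass; the smoothed probabilities still sum to 1 over the vocabulary.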
Application on bigrams
Unsmoothed bigram probabilities are computed by normalizing each row
of counts by the unigram count of the preceding word:
P(w(n) | w(n-1)) = C(w(n-1) w(n)) / C(w(n-1))
For add-one smoothed bigram counts, we need to augment the
unigram count by the total number of word types in the vocabulary, V:
p*(w(n) | w(n-1)) = (C(w(n-1) w(n)) + 1) / (C(w(n-1)) + V)
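A minimal sketch of the add-one bigram estimate, using the small (eos)-delimited example corpus from the bigram section. It assumes the vocabulary is exactly the set of observed types, with "(eos)" counted as a type; the function name `add_one_bigram_prob` is hypothetical.

```python
from collections import Counter

corpus = "(eos) You book a flight (eos) I read a book (eos) You read (eos)"
tokens = corpus.split()

unigram_counts = Counter(tokens)                   # C(prev)
bigram_counts = Counter(zip(tokens, tokens[1:]))   # C(prev, word)
V = len(unigram_counts)  # assumption: vocabulary = observed types, incl. (eos)


def add_one_bigram_prob(prev, word):
    """Add-one estimate (C(prev, word) + 1) / (C(prev) + V)."""
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + V)


print(add_one_bigram_prob("I", "read"))  # seen bigram:   (1 + 1) / (1 + 7) = 0.25
print(add_one_bigram_prob("I", "book"))  # unseen bigram: (0 + 1) / (1 + 7) = 0.125
```

With such a tiny corpus, adding V to the denominator moves a large share of the probability mass to unseen bigrams, which is a known weakness of add-one smoothing.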
POS Tagging - Hidden Markov Model
POS tagging or part-of-speech tagging is the procedure of
assigning a grammatical category like noun, verb, adjective etc. to
a word.
In this process both the lexical information and the context play an
important role as the same lexical form can behave differently in a
different context.
For example the word "Park" can have two different lexical
categories based on the context.
The boy is playing in the park. ('Park' is Noun)
Park the car. ('Park' is Verb)
Assigning part of speech to words by hand is a common exercise
one can find in an elementary grammar class.
But here we wish to build an automated tool which can assign the
appropriate part-of-speech tag to the words of a given sentence.
One could create hand-crafted rules by observing patterns
in the language, but this would limit the system's performance to
the quality and number of patterns identified by the rule crafter.
Thus, this approach is rarely adopted in practice for building a POS
tagger. Instead, a large corpus annotated with the correct POS tag for
each word is given to the computer, and algorithms then learn the
patterns automatically from the data and store them in the form of a
trained model.
Later, this model can be used to POS-tag new sentences.
A Hidden Markov Model (HMM) is a statistical Markov model in
which the system being modeled is assumed to be a Markov process
with unobserved (hidden) states. In a regular Markov model the state is
directly visible to the observer, and therefore the state transition
probabilities are the only parameters. In a hidden Markov model, the
state is not directly visible, but the output, which depends on the state, is
visible.
A Hidden Markov Model has two important components:
1) Transition probabilities: the one-step transition probability is the
probability of transitioning from one state to another in a single step.
2) Emission probabilities: the output probabilities for an observation
given a state. Emission probabilities B = { b_{i,k} = b_i(o_k) = P(o_k | q_i) },
where o_k is an observation. Informally, an entry of B is the probability that the
output is o_k given that the current state is q_i.
For POS tagging, it is assumed that the POS tags are generated by a random process, and that each tag
randomly generates a word.
Hence, the transition matrix holds the probability of moving from one POS tag to another, and the
emission matrix holds the probability that a given POS tag generates a particular word. The words
act as the observations.
Calculating the Probabilities
Consider the given corpus
EOS/eos They/pronoun cut/verb the/determiner paper/noun
EOS/eos He/pronoun asked/verb for/preposition his/pronoun
cut/noun. EOS/eos Put/verb the/determiner paper/noun
in/preposition the/determiner cut/noun EOS/eos
Calculating Emission Probability Matrix
Count the number of times a specific word occurs with a specific POS tag in the corpus.
Here, say for "cut":
count(cut,verb)=1
count(cut,noun)=2
count(cut,determiner)=0
and so on, zero for the other tags too.
count(verb) = total count of the tag 'verb' = 3
Now, calculating the probability:
Probability to be filled in the matrix cell at the intersection of cut and verb:
P(cut|verb)=count(cut,verb)/count(verb)=1/3=0.33
Similarly, the probability to be filled in the cell at the intersection of cut and determiner,
with count(determiner) = 3:
P(cut|determiner)=count(cut,determiner)/count(determiner)=0/3=0
Exercise: calculate P(cut|noun).
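The emission counts can be reproduced with a short sketch over the tagged corpus. It follows the definition of an emission probability as the probability of an observation given a state, i.e. count(word, tag) / count(tag); the parsing of "word/tag" tokens and the function name `emission_prob` are assumptions made for illustration.

```python
from collections import Counter

tagged_corpus = (
    "EOS/eos They/pronoun cut/verb the/determiner paper/noun "
    "EOS/eos He/pronoun asked/verb for/preposition his/pronoun cut/noun. "
    "EOS/eos Put/verb the/determiner paper/noun "
    "in/preposition the/determiner cut/noun EOS/eos"
)

# Split each "word/tag" token; a trailing period (as in "cut/noun.") is dropped.
pairs = [tuple(tok.rstrip(".").split("/")) for tok in tagged_corpus.split()]

pair_counts = Counter(pairs)                    # count(word, tag)
tag_counts = Counter(tag for _, tag in pairs)   # count(tag)


def emission_prob(word, tag):
    """P(word | tag) = count(word, tag) / count(tag)."""
    if tag_counts[tag] == 0:
        return 0.0
    return pair_counts[(word, tag)] / tag_counts[tag]


print(round(emission_prob("cut", "verb"), 2))  # 0.33
print(emission_prob("cut", "determiner"))      # 0.0
```

Calling `emission_prob("cut", "noun")` gives a way to check the exercise.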
Calculating Transition Probability Matrix
Count the number of times a specific tag comes after each other POS tag in the corpus.
Here, say for "determiner":
count(verb,determiner)=2
count(preposition,determiner)=1
count(determiner,determiner)=0
count(eos,determiner)=0
count(noun,determiner)=0
and so on, zero for the other tags too.
count(verb) = total count of the tag 'verb' = 3
Now, calculating the probability:
Probability to be filled in the cell at the intersection of
determiner (in the column) and verb (in the row):
P(determiner|verb)=count(verb,determiner)/count(verb)=2/3=0.67
Similarly, the probability to be filled in the cell at the intersection of determiner (in the column) and
noun (in the row), with count(noun) = 4:
P(determiner|noun)=count(noun,determiner)/count(noun)=0/4=0
Repeat the same for all the tags.
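The transition counts can be sketched the same way, using only the tag sequence of the corpus. The estimate normalizes by the count of the preceding tag, as in the bigram estimate C(w(n-1) w(n)) / C(w(n-1)); the tag list is transcribed by hand from the corpus, and the function name `transition_prob` is hypothetical.

```python
from collections import Counter

# Tag sequence of the example corpus, in order.
tags = (
    "eos pronoun verb determiner noun "
    "eos pronoun verb preposition pronoun noun "
    "eos verb determiner noun preposition determiner noun eos"
).split()

transition_counts = Counter(zip(tags, tags[1:]))  # count(prev_tag, next_tag)
tag_counts = Counter(tags)                        # count(tag)


def transition_prob(prev_tag, next_tag):
    """P(next_tag | prev_tag) = count(prev_tag, next_tag) / count(prev_tag)."""
    if tag_counts[prev_tag] == 0:
        return 0.0
    return transition_counts[(prev_tag, next_tag)] / tag_counts[prev_tag]


print(round(transition_prob("verb", "determiner"), 2))  # 0.67
print(transition_prob("noun", "determiner"))            # 0.0

# Full table: one row per previous tag, one column per next tag.
tagset = sorted(tag_counts)
table = {p: {n: transition_prob(p, n) for n in tagset} for p in tagset}
```

Here `table[prev][next]` plays the role of the transition matrix cell in row `prev` and column `next`.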