Understanding Natural Language Processing

Natural Language Processing (NLP) is a subfield of Artificial Intelligence focused on enabling computers to understand and generate human languages. Language modeling, including techniques like n-grams and smoothing methods, plays a crucial role in applications such as speech recognition and POS tagging. The document also discusses the use of Hidden Markov Models for automating part-of-speech tagging through statistical methods.

Uploaded by

62adsrtyuiop

What is NLP?

• NLP stands for Natural Language Processing. Natural languages are
those spoken by people.

• NLP encompasses anything a computer needs to understand natural
language (typed or spoken) and to generate natural language.

• Natural Language Processing (NLP) is a subfield of artificial
intelligence and linguistics devoted to making computers "understand"
statements written in human languages.
Language modeling:
Language modeling is the task of determining the probability of
any sequence of words. It is used in a wide variety of applications
such as speech recognition, spam filtering, information extraction,
and prompt generation. In fact, language modeling (the "LM" in LLM,
large language model) is the key aim behind many state-of-the-art
Natural Language Processing models.

N-grams are contiguous sequences of n items collected from a text
or speech corpus, or from almost any other type of sequential data.
The n in n-gram specifies the number of items to consider: unigram
for n = 1, bigram for n = 2, trigram for n = 3, and so on. N-grams
and n-gram models are widely used in probability, communication
theory, computational linguistics (for example, statistical natural
language processing), computational biology, etc. The unigram case
(n = 1), which ignores word order, is sometimes regarded as a
"bag of words" model.
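
As a minimal sketch of the definition above, the following Python snippet extracts unigrams, bigrams, and trigrams from a tokenized sentence. The helper name `ngrams` and the whitespace tokenization are illustrative assumptions, not part of the original material.

```python
def ngrams(tokens, n):
    """Return the list of contiguous n-token tuples found in `tokens`."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Slide a window of width n over the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Whitespace tokenization is an assumption made for this example.
tokens = "I read a book".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(unigrams)
print(bigrams)
print(trigrams)
```

A sentence shorter than n simply yields an empty list, which is usually a convenient behavior when counting n-grams over many sentences.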
Methods of Language Modeling:

Two types of language modeling:

Statistical Language Modeling: Statistical language modeling, or
simply language modeling, is the development of probabilistic models
that are able to predict the next word in a sequence given the words
that precede it. N-gram language modeling is an example.

Neural Language Modeling: Neural network methods are achieving
better results than classical methods, both as standalone language
models and when incorporated into larger models for challenging
tasks like speech recognition and machine translation. One way to
build a neural language model is through word embeddings.
N-Grams
The probability of a sentence can be calculated from the probabilities of the
sequence of words occurring in it.

We can use the Markov assumption that the probability of a word in a sentence
depends only on the word occurring just before it.

Such a model is called a first-order Markov model or the bigram model:

P(w(n) | w(1) w(2) ... w(n-1)) ≈ P(w(n) | w(n-1))

Here, w(n) refers to the word token corresponding to the nth word in a sequence.
A combination of words forms a sentence. However, such a formation
is meaningful only when the words are arranged in some order.
Ex: Sit I car in the.

Such a sentence is not grammatically acceptable. However, some
perfectly grammatical sentences can be nonsensical too!
Eg: Colorless green ideas sleep furiously.

One easy way to handle such unacceptable sentences is to assign
probabilities to strings of words, i.e., how likely the sentence is
in that particular form.
Probability of a sentence
If we treat each word occurring in its position as an event, the
probability of the sentence is the joint probability of these events:

Sentence structure: P(w(1), w(2), ..., w(n-1), w(n))

Using the chain rule: = P(w(1)) * P(w(2) | w(1)) * P(w(3) | w(1)w(2)) * ... * P(w(n) | w(1)w(2) ... w(n-1))
Bigrams

We can avoid this very long calculation by approximating that the
probability of a given word depends only on the word immediately
before it.

This assumption is called the Markov assumption, and such a model is
called a Markov model (here, the bigram model).

Bigrams can be generalized to the n-gram, which looks at (n-1)
words in the past.

A bigram is a first-order Markov model.

Therefore, P(w(1), w(2), ..., w(n-1), w(n)) ≈ P(w(1)) * P(w(2) | w(1)) * P(w(3) | w(2)) * ... * P(w(n) | w(n-1))


We use an (eos) tag to mark the beginning and end of a sentence, so the
first word of a sentence is conditioned on (eos).
A bigram table for a given corpus can be generated and used as a
lookup table for calculating the probability of sentences.

Eg: Corpus - (eos) You book a flight (eos) I read a book (eos) You read (eos)
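
The lookup-table idea can be sketched in Python for the corpus above. This is an illustrative sketch, not a reference implementation: the names `bigram_prob` and `sentence_prob` are hypothetical, and each bigram count is normalized by the unigram count of the preceding token, with (eos) counted like any other token.

```python
from collections import Counter

corpus = "(eos) You book a flight (eos) I read a book (eos) You read (eos)".split()

# Counts of single tokens and of adjacent token pairs.
unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))


def bigram_prob(prev, word):
    """Unsmoothed estimate P(word | prev) = C(prev word) / C(prev)."""
    if unigram_counts[prev] == 0:
        return 0.0  # unseen context: no estimate is available
    return bigram_counts[(prev, word)] / unigram_counts[prev]


def sentence_prob(sentence):
    """Multiply the bigram probabilities along an (eos)-delimited sentence."""
    tokens = ["(eos)"] + sentence.split() + ["(eos)"]
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= bigram_prob(prev, word)
    return prob


print(sentence_prob("You read a book"))
print(sentence_prob("I book a flight"))  # contains the unseen bigram (I, book)
```

The second sentence is perfectly acceptable English, yet it receives probability zero because the bigram (I, book) never occurs in the corpus.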
N-Grams Smoothing
One major problem with standard N-gram models is that they
must be trained from some corpus, and because any particular
training corpus is finite, some perfectly acceptable N-grams are
bound to be missing from it.

We can see that the bigram matrix for any given training corpus is
sparse. There are a large number of bigrams with zero probability
that should really have some non-zero probability.

This method tends to underestimate the probability of strings that
happen not to have occurred next to each other in the training corpus.

There are some techniques that can be used for assigning a non-zero
probability to these 'zero probability bigrams'. This task of
re-evaluating some of the zero-probability and low-probability
N-grams, and assigning them non-zero values, is called smoothing.

Some of the techniques are:

1. Add-One Smoothing,

2. Witten-Bell Discounting,

3. Good-Turing Discounting.

Add-One Smoothing
In add-one smoothing, we add one to all the bigram counts before
normalizing them into probabilities.
Application on unigrams
The unsmoothed maximum likelihood estimate of the unigram
probability can be computed by dividing the count of the word by
the total number of word tokens N:

$P(w_x) = \frac{c(w_x)}{\sum_i c(w_i)} = \frac{c(w_x)}{N}$

Let there be an adjusted count $c_i^*$, where $c_i = c(w_i)$:

$c_i^* = (c_i + 1)\,\frac{N}{N + V}$

where V is the total number of word types in the language.

Now, probabilities can be calculated by normalizing the adjusted counts by N:

$p_i^* = \frac{c_i^*}{N} = \frac{c_i + 1}{N + V}$
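
The unigram formulas can be checked with a small Python sketch. The toy token list, the extra unseen vocabulary word "ticket", and the function names are assumptions made purely for illustration.

```python
from collections import Counter

tokens = "You book a flight I read a book You read".split()

# Assumed vocabulary: the corpus word types plus one unseen word, "ticket".
vocab = sorted(set(tokens) | {"ticket"})

counts = Counter(tokens)
N = len(tokens)  # total number of word tokens
V = len(vocab)   # total number of word types


def mle_prob(word):
    """Unsmoothed estimate c(w) / N."""
    return counts[word] / N


def adjusted_count(word):
    """Add-one adjusted count (c + 1) * N / (N + V)."""
    return (counts[word] + 1) * N / (N + V)


def add_one_prob(word):
    """Add-one smoothed probability (c + 1) / (N + V)."""
    return (counts[word] + 1) / (N + V)


for word in vocab:
    print(f"{word:>7}  MLE={mle_prob(word):.3f}  "
          f"c*={adjusted_count(word):.3f}  p*={add_one_prob(word):.3f}")
```

The unseen word moves from probability zero to a small non-zero value, while the smoothed probabilities over the whole vocabulary still sum to one.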
Application on bigrams

Normal bigram probabilities are computed by normalizing each row
of counts by the unigram count:

$P(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n)}{C(w_{n-1})}$

For add-one smoothed bigram counts, we need to augment the
unigram count by the total number of word types in the vocabulary, V:

$p^*(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n) + 1}{C(w_{n-1}) + V}$
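
A short Python sketch of the smoothed bigram estimate, using the (eos)-delimited corpus from the bigram section. The function names are hypothetical, and treating (eos) as one of the V word types is an assumption of this sketch.

```python
from collections import Counter

corpus = "(eos) You book a flight (eos) I read a book (eos) You read (eos)".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

# Vocabulary size; (eos) is counted as a type here (an assumption).
V = len(unigram_counts)


def mle_bigram_prob(prev, word):
    """Unsmoothed C(prev word) / C(prev); `prev` must occur in the corpus."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]


def add_one_bigram_prob(prev, word):
    """Add-one smoothed (C(prev word) + 1) / (C(prev) + V)."""
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + V)


# The bigram (I, book) never occurs in the corpus.
print(mle_bigram_prob("I", "book"))
print(add_one_bigram_prob("I", "book"))
print(add_one_bigram_prob("I", "read"))
```

Smoothing gives the unseen bigram a non-zero probability, at the cost of lowering the estimate for the bigram (I, read) that was actually observed.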


POS Tagging - Hidden Markov Model

POS tagging, or part-of-speech tagging, is the procedure of
assigning a grammatical category such as noun, verb, or adjective
to a word.
In this process both the lexical information and the context play an
important role, as the same lexical form can behave differently in a
different context.

For example the word "Park" can have two different lexical
categories based on the context.

The boy is playing in the park. ('Park' is Noun)

Park the car. ('Park' is Verb)


Assigning parts of speech to words by hand is a common exercise
in an elementary grammar class.

But here we wish to build an automated tool that can assign the
appropriate part-of-speech tag to the words of a given sentence.
One can think of creating hand-crafted rules by observing patterns
in the language, but this would limit the system's performance to
the quality and number of patterns identified by the rule crafter.

Thus, this approach is rarely adopted in practice for building a POS
tagger. Instead, a large corpus annotated with the correct POS tag for
each word is given to the computer, and algorithms then learn the
patterns automatically from the data and store them in the form of a
trained model.

Later, this model can be used to POS-tag new sentences.


A Hidden Markov Model (HMM) is a statistical Markov model in
which the system being modeled is assumed to be a Markov process
with unobserved (hidden) states. In a regular Markov model the state is
directly visible to the observer, and therefore the state transition
probabilities are the only parameters. In a hidden Markov model, the
state is not directly visible, but output, dependent on the state, is
visible.

A Hidden Markov Model has two important components:

1) Transition probabilities: The one-step transition probability is the
probability of transitioning from one state to another in a single step.

2) Emission probabilities: The output probabilities for an observation
from a state. Emission probabilities $B = \{ b_{i,k} = b_i(o_k) = P(o_k \mid q_i) \}$,
where $o_k$ is an observation. Informally, $b_i(o_k)$ is the probability that the
output is $o_k$ given that the current state is $q_i$.

For POS tagging, it is assumed that the POS tags are generated by a random process, and
each tag in turn randomly generates a word.
Hence, the transition matrix denotes the transition probability from one POS tag to another,
and the emission matrix denotes the probability that a particular POS tag generates a given
word. Words act as the observations.
Calculating the Probabilities

Consider the given corpus:

EOS/eos They/pronoun cut/verb the/determiner paper/noun
EOS/eos He/pronoun asked/verb for/preposition his/pronoun cut/noun
EOS/eos Put/verb the/determiner paper/noun in/preposition the/determiner cut/noun EOS/eos
Calculating Emission Probability Matrix
Count the number of times a specific word occurs with a specific POS tag in the corpus.
Here, say for "cut":

count(cut, verb) = 1
count(cut, noun) = 2
count(cut, determiner) = 0
and so on, zero for the other tags too.

Since the emission probability is P(word | tag), each count is normalized by the
total count of the tag, for example:

count(verb) = total count of tag 'verb' = 3
count(determiner) = total count of tag 'determiner' = 3

Now, calculating the probability:

Probability to be filled in the matrix cell at the intersection of cut and verb:

P(cut | verb) = count(cut, verb) / count(verb) = 1/3 = 0.33

Similarly, the probability to be filled in the cell at the intersection of cut and determiner:

P(cut | determiner) = count(cut, determiner) / count(determiner) = 0/3 = 0

Exercise: calculate P(cut | noun).
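
The emission counts can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the corpus string is typed in without the stray period after "cut/noun", the function name `emission_prob` is hypothetical, and counts are normalized by the tag count so that the result matches the definition P(o_k | q_i) of the emission probability.

```python
from collections import Counter

tagged_corpus = (
    "EOS/eos They/pronoun cut/verb the/determiner paper/noun "
    "EOS/eos He/pronoun asked/verb for/preposition his/pronoun cut/noun "
    "EOS/eos Put/verb the/determiner paper/noun in/preposition "
    "the/determiner cut/noun EOS/eos"
)

# Split every "word/tag" token into a (word, tag) pair.
pairs = [tuple(token.rsplit("/", 1)) for token in tagged_corpus.split()]

word_tag_counts = Counter(pairs)
tag_counts = Counter(tag for _, tag in pairs)


def emission_prob(word, tag):
    """Estimate P(word | tag) = count(word, tag) / count(tag)."""
    if tag_counts[tag] == 0:
        return 0.0  # tag never seen in the corpus
    return word_tag_counts[(word, tag)] / tag_counts[tag]


for tag in ("verb", "noun", "determiner"):
    print(tag, round(emission_prob("cut", tag), 2))
```

Because each row is normalized by its tag count, the emission probabilities of all words for a fixed tag sum to one.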


Calculating Transition Probability Matrix
Count the number of times a specific tag comes after each of the other POS tags in the corpus.
Here, say for "determiner":

count(verb, determiner) = 2
count(preposition, determiner) = 1
count(determiner, determiner) = 0
count(eos, determiner) = 0
count(noun, determiner) = 0

and so on, zero for the other tags too.

Each count is normalized by the total count of the preceding tag (the row), for example:

count(verb) = total count of tag 'verb' = 3
count(noun) = total count of tag 'noun' = 4

Now, calculating the probability:

Probability to be filled in the cell at the intersection of determiner (in the column)
and verb (in the row):
P(determiner | verb) = count(verb, determiner) / count(verb) = 2/3 = 0.67

Similarly, the probability to be filled in the cell at the intersection of determiner
(in the column) and noun (in the row):
P(determiner | noun) = count(noun, determiner) / count(noun) = 0/4 = 0

Repeat the same for all the tags.
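
The transition matrix can be filled in the same way. The sketch below is illustrative only: the corpus string is typed in without the stray period after "cut/noun", the function name `transition_prob` is hypothetical, and each count is normalized by how often the preceding tag (the row) occurs with a successor, so that every row sums to one.

```python
from collections import Counter

tagged_corpus = (
    "EOS/eos They/pronoun cut/verb the/determiner paper/noun "
    "EOS/eos He/pronoun asked/verb for/preposition his/pronoun cut/noun "
    "EOS/eos Put/verb the/determiner paper/noun in/preposition "
    "the/determiner cut/noun EOS/eos"
)

# Keep only the tag of every "word/tag" token, in corpus order.
tags = [token.rsplit("/", 1)[1] for token in tagged_corpus.split()]

# Counts of adjacent tag pairs and of tags that have a successor.
transition_counts = Counter(zip(tags, tags[1:]))
prev_counts = Counter(tags[:-1])


def transition_prob(prev_tag, next_tag):
    """Estimate P(next_tag | prev_tag) = count(prev, next) / count(prev)."""
    if prev_counts[prev_tag] == 0:
        return 0.0  # tag never seen as a predecessor
    return transition_counts[(prev_tag, next_tag)] / prev_counts[prev_tag]


for prev_tag in ("verb", "preposition", "noun", "eos"):
    print(prev_tag, "-> determiner", round(transition_prob(prev_tag, "determiner"), 2))
```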
