Language Model
Probabilistic Language Models
• Today’s goal: assign a probability to a sentence
• Machine Translation:
• P(high winds tonite) > P(large winds tonite)
• Spell Correction
Why? • The office is about fifteen minuets from my house
• P(about fifteen minutes from) > P(about fifteen minuets from)
• Speech Recognition
• P(I saw a van) >> P(eyes awe of an)
• + Summarization, question-answering, etc., etc.!!
Probabilistic Language Modeling
• Goal: compute the probability of a sentence or sequence
of words:
P(W) = P(w1,w2,w3,w4,w5…wn)
• Related task: probability of an upcoming word:
P(w5|w1,w2,w3,w4)
• A model that computes either of these:
P(W) or P(wn|w1,w2…wn-1) is called a language model.
• Better: the grammar But language model or LM is standard
How to compute P(W)
• How to compute this joint probability:
• P(its, water, is, so, transparent, that)
• Intuition: let’s rely on the Chain Rule of Probability
Reminder: The Chain Rule
• Recall the definition of conditional probabilities
p(B|A) = P(A,B)/P(A) Rewriting: P(A,B) = P(A)P(B|A)
• More variables:
P(A,B,C,D) = P(A)P(B|A)P(C|A,B)P(D|A,B,C)
• The Chain Rule in General
P(x1,x2,x3,…,xn) = P(x1)P(x2|x1)P(x3|x1,x2)…P(xn|x1,…,xn-1)
The Chain Rule applied to compute joint
probability of words in sentence
P(w1w2 … wn ) = Õ P(wi | w1w2 … wi-1 )
i
P(“its water is so transparent”) =
P(its) × P(water|its) × P(is|its water)
× P(so|its water is) × P(transparent|its water is
so)
How to estimate these probabilities
• Could we just count and divide?
P(the | its water is so transparent that) =
Count(its water is so transparent that the)
Count(its water is so transparent that)
• No! Too many possible sentences!
• We’ll never see enough data for estimating these
Markov Assumption
•Simplifying assumption:
Andrei Markov
P(the | its water is so transparent that) » P(the | that)
•Or maybe
P(the | its water is so transparent that) » P(the | transparent that)
Markov Assumption
P(w1w2 … wn ) » Õ P(wi | wi-k … wi-1 )
i
•In other words, we approximate each
component in the product
P(wi | w1w2 … wi-1) » P(wi | wi-k … wi-1)
Language
Modeling
Introduction to N-grams
Language
Modeling
Estimating N-gram
Probabilities
Estimating bigram probabilities
• The Maximum Likelihood Estimate
count(wi-1,wi )
P(wi | w i-1) =
count(w i-1 )
c(wi-1,wi )
P(wi | w i-1 ) =
c(wi-1)
Build the Bigram Model:
Assume our training corpus contains the following
sentences:
"I am happy"
"I am sad"
"I am excited"
"You are happy"
Bigrams: Unigrams:
("I", "am"): 3 "I": 3
("am", "happy"): 2 "am": 3
("am", "sad"): 1 "happy": 2
("am", "excited"): 1 "sad": 1
("You", "are"): 1 "excited": 1
("are", "happy"): 1 "You": 1
"are": 1
Calculate Bigram Probabilities:
Calculate the Probability of the Full Sentence:
When some type the words I am, the probability of next word
is Happy or Sad or Excited
Given the following trigram probabilities derived from a training corpus,
calculate the probability of the sentence "She is feeling very good."
Trigrams:
Unigrams:
("She", "is", "feeling"): 2 "She": 2
("is", "feeling", "very"): 1 "is": 2
("feeling", "very", "good"): 2 "feeling": 2
"very": 2
Bigrams: "good": 2
•("She", "is"): 2
•("is", "feeling"): 1
•("feeling", "very"): 1
•("very", "good"): 2
P("She is feeling very good")=P("feeling"∣"She is")×P("very"∣"is feeling")
×P("good"∣"feeling very")
=1.0×1.0×2.0 =2.0
The probability of the sentence "She is feeling very good" using the
trigram model is 2.0.
Problem Statement
Given the partial sentence "She is feeling", predict the most likely next word using the trigram model.
Trigram Model Data
From the training corpus, we have the following trigrams and their counts:
Trigrams:
("She", "is", "feeling"): 2
("is", "feeling", "very"): 1
("is", "feeling", "good"): 1
("feeling", "very", "good"): 2
("feeling", "good", "happy"): 1
Bigrams:
("She", "is"): 2
("is", "feeling"): 1
("feeling", "very"): 1
("feeling", "good"): 1
("good", "happy"): 1
𝐶𝑜𝑢𝑛𝑡(“𝑖𝑠 𝑓𝑒𝑒𝑙𝑖𝑛𝑔 𝑔𝑜𝑜𝑑”)
P(“good”|”is feeling”)= =1
𝐶𝑜𝑢𝑛𝑡(“𝑖𝑠 𝑓𝑒𝑒𝑙𝑖𝑛𝑔 ”)
Example Sentence: The Cat