CSE751: ADVANCED NATURAL LANGUAGE
PROCESSING
Farig Sadeque
Associate Professor
Department of Computer Science and Engineering
BRAC University
Lecture 3: Sequence Learning
Parts-of-speech tagging
Why not just make a big table?
- badger is a NOUN, trip is a VERB, etc.
Because part-of-speech changes with the surrounding sequence:
- I saw a badger in the zoo.
- Don’t badger me about it!
- I saw him trip on his shoelaces.
- She said her trip to Greece was amazing.
How big is this ambiguity issue?
Part-of-speech ambiguity
Most words in the English vocabulary are unambiguous.
Part-of-speech ambiguity
But, most words in running text are ambiguous! That is, ambiguous words are more prevalent.
A big table is still a good start
- Only 30-40% of words in running text are unambiguous.
- What if, we have a table for all words, and for ambiguous words, store the
most commonly used tag for that word in there?
- This is called Most frequent tag baseline
- assign each token the tag that it appeared with most frequently in the training data.
- 92.34% accurate on WSJ corpus.
A big table is still a good start
- What’s the tag for cut?
10 cut NN
25 cut VB
13 cut VBD
7 cut VBN
Learning sequence taggers
- To improve over the most frequent tag baseline, we should take advantage of
the sequence.
- Some options we will cover:
- Hidden Markov models
- Parameters estimated by counting (like naïve Bayes)
- Recurrent neural networks
Why POS Tagging Must Model Sequences
Our running example:
Secretariat is expected to race tomorrow.
Secretariat is ________
Race is ________
To understand context, we will predict all tags together.
Approach 0: Rule-based baseline
- Assign each word a list of potential POS labels using the dictionary
- Winnow down the list to a single POS label for each word using lists of
hand-written disambiguation rules
You can learn these rules: see Transformation-based Learning: [Link]
Approach 1: Hidden Markov Models
- Let’s put the probability theory we covered in
the previous lecture to use!
- The resulting approach is called Hidden
Markov model
- Discovered by Andrey Markov
- Limited horizon
Markov Chain
- We often need to calculate the probability of a sequence: P(q1, q2, . . . , qi)
- Under the (first-order) Markov assumption, the future depends only on the
present, not the past. Formally, given a sequence of states, q, we assume:
P(qi|q1, q2, . . . , qi−1) ≝ P(qi|qi−1)
- A Markov chain applies the Markov assumption to estimate the probability of a
sequence;
e.g., P(q1, q2, q3, q4) = P(q4|q1, q2, q3)P(q3|q1,q2)P(q2|q1)P(q1)
≝ P(q4|q3)P(q3|q2)P(q2|q1)P(q1)
Markov Chain
A Markov chain consists of:
- Q = q1, q2, . . . , qN : a set of N states
- π = π1, π2, . . . , πN : an initial probability distribution
- A = a11a12 . . . aNN : a transition probability matrix where aij is the probability of
moving from state i to state j; each row of A sums to 1
Hidden Markov Model
- A Markov chain models a sequence of observations.
- A hidden Markov model assumes that there is a causal factor associated with
each observation in the sequence.
- For example, part-of-speech “causes” word:
Hidden Markov Model
- Q = q1, q2, . . . , qN a set of N states
- V = v1, v2, . . . , vM a set of M possible observations
- π = π1, π2, . . . , πN an initial probability distribution
- A = a11a12 . . . aNN a transition probability matrix where aij is the probability of
moving from state i to state j; each row of A sums to 1
- B = b11b12 . . . bNM an emission probability matrix where bij is the probability of
state i emitting observation j; each row of B sums to 1
Approach 1: Hidden Markov Models
• Sentence 1 contains n words
• - an assignment of POS tags to this sentence
• - the words in this sentence
• - the estimate of optimal tag assignment
Let’s formalize this
We have four probabilities: likelihood, prior, posterior and marginal likelihood.
- Prior: Probability distribution representing knowledge or uncertainty of a data object prior or
before observing it
- Likelihood: The probability of falling under a specific category or class.
- Posterior: Conditional probability distribution representing what parameters are likely after
observing the data object
- Marginal likelihood: likelihood function that has been integrated over the parameter space.
Does not affect inference
Three Approximations
- Words are independent of the words around them
- Words depend only on their POS tags, not on the neighboring POS tags
- A tag is dependent only on the previous tag
Replace in the original equation
Word Tag transition
likelihoods probabilities
Computing Tag Transition Probabilities
In the Brown corpus (1M words)
- DT occurs 116,454 times
- DT is followed by NN 56,509 times
Computing Word Likelihoods
In the Brown corpus (1M words)
- VBZ occurs 21,627 times
- VBZ is the tag for “is” 10,073 times
Example
Let’s see why VB is preferred in the first case
Example
Example
The first tag transition
- P(NN|TO) = 0.00047
- P(VB|TO) = .83
The word likelihood for “race”
- P(race|NN) = 0.00057
- P(race|VB) = 0.00012
The second tag transition
- P(NR|VB) = 0.0027
- P(NR|NN) = 0.0012
Example
P(VB|TO)P(NR|VB)P(race|VB) = 0.00000027
P(NN|TO)P(NR|NN)P(race|NN) = 0.00000000032
VB is more likely than NN, even though “race” appears more commonly as a noun!
Training/Testing an HMM
Just like with any machine learning algorithm, there are two important issues one
needs to do to build an HMM:
- Training:
- Estimating p(ti|ti-1) and p(wi|ti)
- Testing (predicting):
- Estimating the best sequence of tags for a sentence (or sequence or
words)
Training: Two Types of Probabilities
A: transition probabilities
- Used to compute the prior probabilities (probability of a tag)
- Often called tag transition probabilities
B: observation likelihoods
- Used to compute the likelihood probabilities (probability of a word given tag)
- Often called word likelihoods
Testing: Viterbi Algorithm
Viterbi algorithm
- Computes the argmax efficiently
- Example of dynamic programming
What is a viterbi?
Illustration of Search Space
Illustration of Search Space
This is
called a
One row for
trellis
each state
(tag)
One column for each observation (word)
Viterbi Algorithm
Input
- State (or tag) transition probabilities (A)
- Observation (or word) likelihoods (B)
- An observation sequence O
Output
- Most probable state sequence Q together with its probability
Both A and B are matrices with probabilities
Example of A and B matrices
A: The rows are labeled with the conditioning event, e.g., P(PPSS|VB) = .0070
B: same as A, rows: conditioning events, e.g. P(want|NN) = .000054
Example Trace
Summary of Viterbi Algorithm
• vt-1(i) – the previous Viterbi path probability from the previous time step t – 1
(i.e., the previous word)
• aij – the transition probability from previous state qi (i.e., the previous word
having POS tag i) to current state qj (i.e., the current word having POS tag j)
• bj(ot) – the state observation likelihood of the observation symbol ot (i.e., word
at position t) given the current state j (i.e., the j POS tag)
Problem for All HMMs
- Massive multiplication here:
Yet Another Problem: Unknown Words
- Solution 0 (not great): assume uniform emission probabilities (this is what
“add one” smoothing does)
- You can exclude closed-class POS tags such as…
- This does not use any lexical information such as suffixes
- Solution 1: capture lexical information:
- This reduces error rate for unknown words from 40% to 20%
Main Disadvantage of HMMs
Hard to add features in the model
- Capitalization, hyphenated, suffixes, etc.
It’s possible but every such feature must be encoded in the p(word|tag)
- Redesign the model for every feature!
- MEMMs avoid this limitation
Approach 2: Maximum Entropy Markov Model
Uses features:
- Capitalization suggests a proper noun
- The suffix ing suggests a verb
- The suffix ion suggests a noun
- The suffix able suggests an adjective
- etc.
Comparison of HMMs and MEMMs
HMM MEMM
Predicts
Estimates
Comparison of HMMs and MEMMs
HMM MEMM
Predicts
Estimates
Common Extra Features for MEMMs
Almost always useful:
- n previous tags (not just wt−1)
- n previous and following words
Especially useful for words not in the training data:
- Prefixes/suffixes up to length n
- Character classes: capital vs. lower, digits, etc.
Classifying with MEMMs
- Just train logistic regression to learn b and θ
Decoding with MEMMs
As with HMMs, the argmax means that we have to search through all possible tag
sequences of length T.
Options:
- Greedy: make a hard decision at each word
- Viterbi: try all sequences using dynamic programming
Bidirectionality
- You can stack MEMMs that traverse the text in opposite directions:
- Left-to-right direction (same as before)
- Right-to-left: uses the prediction(s) of the above system as features!
- What is the problem with the predictions of the left-to-right model here?
- Many state-of-the-art taggers use this approach: CoreNLP, processors,
SVMTool
Bidirectionality
Bidirectional options:
- Cyclic dependency network
- Bidirectional version of MEMM
- Not commonly used
- (Linear chain) conditional random field (CRF)
- Same model shape as MEMM, but undirected
- Still somewhat in use
- Bidirectional recurrent neural network
- Similar in spirit to forward MEMM + backward MEMM
- The most popular approach since ≈2015
Evaluation
- POS tagging accuracy = 100 x (number of correct tags) / (number of words in
dataset)
- Accuracy numbers currently reported for POS tagging are most often between
95% and 97%
- But they are much worse for “unknown” words
Evaluation example