Sequence Learning in POS Tagging

Uploaded by

wasifahad09

CSE751: ADVANCED NATURAL LANGUAGE PROCESSING
Farig Sadeque
Associate Professor
Department of Computer Science and Engineering
BRAC University
Lecture 3: Sequence Learning
Parts-of-speech tagging
Why not just make a big table?
- badger is a NOUN, trip is a VERB, etc.
Because part-of-speech changes with the surrounding sequence:
- I saw a badger in the zoo.
- Don’t badger me about it!
- I saw him trip on his shoelaces.
- She said her trip to Greece was amazing.
How big is this ambiguity issue?
Part-of-speech ambiguity

Most words in the English vocabulary are unambiguous.

But most words in running text are ambiguous! That is, the ambiguous words are the frequent ones.
A big table is still a good start
- Only 30-40% of words in running text are unambiguous.
- What if we keep a table of all words and, for each ambiguous word, store the tag it most commonly takes?
- This is called the most-frequent-tag baseline:
- assign each token the tag that it appeared with most frequently in the training data.
- 92.34% accurate on the WSJ corpus.
A big table is still a good start
- What’s the tag for cut? Counts in the training data:

Count  Word  Tag
10     cut   NN
25     cut   VB
13     cut   VBD
7      cut   VBN
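
A minimal sketch of the most-frequent-tag baseline, assuming the training data is a list of (word, tag) pairs. The function names and the `default_tag` fallback for unseen words are illustrative choices, not part of the slides.

```python
from collections import Counter, defaultdict

def train_most_frequent_tag(tagged_tokens):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for word, tag in tagged_tokens:
        counts[word][tag] += 1
    return {word: tag_counts.most_common(1)[0][0]
            for word, tag_counts in counts.items()}

def tag_sentence(words, table, default_tag="NN"):
    """Tag each word independently; unseen words get default_tag (an assumption)."""
    return [table.get(word, default_tag) for word in words]

# Toy training data reproducing the counts for "cut" listed above.
training = ([("cut", "NN")] * 10 + [("cut", "VB")] * 25
            + [("cut", "VBD")] * 13 + [("cut", "VBN")] * 7)
table = train_most_frequent_tag(training)
print(table["cut"])
```

With these counts the table stores VB for cut, since 25 is the largest count. Note that the baseline ignores context entirely, which is what the sequence models below try to fix.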
Learning sequence taggers
- To improve over the most frequent tag baseline, we should take advantage of the sequence.
- Some options we will cover:
  - Hidden Markov models
    - Parameters estimated by counting (like naïve Bayes)
  - Recurrent neural networks
Why POS Tagging Must Model Sequences
Our running example:

Secretariat is expected to race tomorrow.

Secretariat is ________

Race is ________

To understand context, we will predict all tags together.

Approach 0: Rule-based baseline
- Assign each word a list of potential POS labels using the dictionary
- Winnow down the list to a single POS label for each word using lists of
hand-written disambiguation rules

You can learn these rules: see Transformation-based Learning: [Link]

Approach 1: Hidden Markov Models
- Let’s put the probability theory we covered in the previous lecture to use!
- The resulting approach is called a hidden Markov model (HMM)
- Named after Andrey Markov, whose Markov chains it builds on
- Limited horizon: each state depends only on the previous state
Markov Chain
- We often need to calculate the probability of a sequence: P(q1, q2, . . . , qi)
- Under the (first-order) Markov assumption, the future depends only on the
present, not the past. Formally, given a sequence of states, q, we assume:
P(qi|q1, q2, . . . , qi−1) ≝ P(qi|qi−1)
- A Markov chain applies the Markov assumption to estimate the probability of a
sequence;
e.g., P(q1, q2, q3, q4) = P(q4|q1, q2, q3)P(q3|q1,q2)P(q2|q1)P(q1)
≝ P(q4|q3)P(q3|q2)P(q2|q1)P(q1)
Markov Chain
A Markov chain consists of:
- Q = q1, q2, . . . , qN : a set of N states
- π = π1, π2, . . . , πN : an initial probability distribution
- A = a11a12 . . . aNN : a transition probability matrix where aij is the probability of
moving from state i to state j; each row of A sums to 1
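
A small sketch of how a Markov chain scores a sequence, assuming π and A are stored as nested dictionaries. The two-state weather chain and its numbers are invented purely for illustration.

```python
def chain_probability(states, pi, A):
    """P(q1, ..., qn) = pi[q1] * product of A[q(i-1)][qi] (first-order Markov assumption)."""
    prob = pi[states[0]]
    for prev, cur in zip(states, states[1:]):
        prob *= A[prev][cur]
    return prob

# Toy parameters (assumed values); each row of A sums to 1.
pi = {"hot": 0.6, "cold": 0.4}
A = {
    "hot": {"hot": 0.7, "cold": 0.3},
    "cold": {"hot": 0.4, "cold": 0.6},
}

p = chain_probability(["hot", "hot", "cold"], pi, A)  # 0.6 * 0.7 * 0.3
print(p)
```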
Hidden Markov Model
- A Markov chain models a sequence of observations.
- A hidden Markov model assumes that there is a causal factor associated with
each observation in the sequence.
- For example, part-of-speech “causes” word:
Hidden Markov Model
- Q = q1, q2, . . . , qN a set of N states
- V = v1, v2, . . . , vM a set of M possible observations
- π = π1, π2, . . . , πN an initial probability distribution
- A = a11a12 . . . aNN a transition probability matrix where aij is the probability of
moving from state i to state j; each row of A sums to 1
- B = b11b12 . . . bNM an emission probability matrix where bij is the probability of
state i emitting observation j; each row of B sums to 1
Approach 1: Hidden Markov Models

• A sentence contains n words

• t1, …, tn - an assignment of POS tags to this sentence
• w1, …, wn - the words in this sentence
• t̂1, …, t̂n - the estimate of optimal tag assignment
Let’s formalize this

We have four probabilities: likelihood, prior, posterior and marginal likelihood.

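- Prior: the probability distribution over the quantity of interest before observing the data; here, the probability of a tag sequence
- Likelihood: the probability of the observed data given a specific category or class; here, the probability of the words given the tags
- Posterior: the conditional probability distribution over the quantity of interest after observing the data; here, the probability of the tags given the words
- Marginal likelihood: the likelihood integrated (or summed) over all values of the quantity of interest; here, the probability of the words. It is the same for every candidate tag sequence, so it does not affect inference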
Three Approximations
- Words are independent of the words around them
- Words depend only on their POS tags, not on the neighboring POS tags

- A tag is dependent only on the previous tag

Replace in the original equation; the two resulting factors are:

- Word likelihoods: P(wi|ti)
- Tag transition probabilities: P(ti|ti−1)
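
The equations on these slides did not survive extraction. The block below is a reconstruction of the standard derivation, assuming w_1^n denotes the words, t_1^n the tags, and \hat{t}_1^n the estimated best tag sequence. It is a sketch consistent with the approximations listed above, not a copy of the original slide.

```latex
\begin{align*}
\hat{t}_1^n
  &= \arg\max_{t_1^n} P(t_1^n \mid w_1^n) \\
  &= \arg\max_{t_1^n} \frac{P(w_1^n \mid t_1^n)\, P(t_1^n)}{P(w_1^n)}
     && \text{(Bayes' rule)} \\
  &= \arg\max_{t_1^n} P(w_1^n \mid t_1^n)\, P(t_1^n)
     && \text{(the denominator is the same for every tag sequence)} \\
  &\approx \arg\max_{t_1^n} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})
     && \text{(independence and limited-horizon approximations)}
\end{align*}
```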
Computing Tag Transition Probabilities
In the Brown corpus (1M words)

- DT occurs 116,454 times

- DT is followed by NN 56,509 times
Computing Word Likelihoods
In the Brown corpus (1M words)

- VBZ occurs 21,627 times

- VBZ is the tag for “is” 10,073 times
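
The resulting estimates are not preserved on these slides. Assuming the usual maximum-likelihood (relative frequency) estimate, where C(·) is a corpus count, the numbers above work out as follows:

```latex
\begin{align*}
P(\mathrm{NN} \mid \mathrm{DT})
  &= \frac{C(\mathrm{DT}, \mathrm{NN})}{C(\mathrm{DT})}
   = \frac{56{,}509}{116{,}454} \approx 0.49 \\
P(\text{is} \mid \mathrm{VBZ})
  &= \frac{C(\mathrm{VBZ}, \text{is})}{C(\mathrm{VBZ})}
   = \frac{10{,}073}{21{,}627} \approx 0.47
\end{align*}
```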
Example

Let’s see why VB is preferred in the first case

P(NN|TO)P(NR|NN)P(race|NN) = 0.00000000032

VB is more likely than NN, even though “race” appears more commonly as a noun!
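
The VB side of this comparison is missing from the slide. The sketch below fills it in with the factor values commonly quoted for this textbook example; treat every individual factor as an assumption, since only the NN product (0.00000000032) appears in the slide text.

```python
# Assumed factor values (not given on the slide): P(tag|TO), P(NR|tag), P(race|tag).
P_VB_GIVEN_TO, P_NR_GIVEN_VB, P_RACE_GIVEN_VB = 0.83, 0.0027, 0.00012
P_NN_GIVEN_TO, P_NR_GIVEN_NN, P_RACE_GIVEN_NN = 0.00047, 0.0012, 0.00057

# Score of tagging "race" as VB vs. NN in "... to race tomorrow".
score_vb = P_VB_GIVEN_TO * P_NR_GIVEN_VB * P_RACE_GIVEN_VB
score_nn = P_NN_GIVEN_TO * P_NR_GIVEN_NN * P_RACE_GIVEN_NN

print(f"VB: {score_vb:.2e}  NN: {score_nn:.2e}")
```

Under these assumed values the NN product reproduces the figure on the slide, and the VB product is larger mainly because the transition from TO to VB is far more probable than from TO to NN.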
Training/Testing an HMM
Just like with any machine learning algorithm, there are two important tasks one needs to handle to build an HMM:

- Training:
- Estimating p(ti|ti-1) and p(wi|ti)
- Testing (predicting):
- Finding the best sequence of tags for a sentence (or sequence of words)
Training: Two Types of Probabilities
A: transition probabilities

- Used to compute the prior probabilities (probability of a tag)

- Often called tag transition probabilities

B: observation likelihoods

- Used to compute the likelihood probabilities (probability of a word given tag)

- Often called word likelihoods
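
A minimal sketch of training by counting, assuming each training sentence is a list of (word, tag) pairs. The `<s>` start symbol, the function name, and the toy corpus are assumptions for illustration; no smoothing is applied.

```python
from collections import Counter, defaultdict

def train_hmm(tagged_sentences, start="<s>"):
    """Estimate A = p(tag | previous tag) and B = p(word | tag) by relative frequency."""
    transition_counts = defaultdict(Counter)  # previous tag -> counts of next tag
    emission_counts = defaultdict(Counter)    # tag -> counts of emitted word
    for sentence in tagged_sentences:
        prev = start
        for word, tag in sentence:
            transition_counts[prev][tag] += 1
            emission_counts[tag][word] += 1
            prev = tag

    def normalize(table):
        # Turn each row of counts into a probability distribution.
        return {ctx: {key: c / sum(counts.values()) for key, c in counts.items()}
                for ctx, counts in table.items()}

    return normalize(transition_counts), normalize(emission_counts)

# Toy corpus (invented).
corpus = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("race", "NN"), ("is", "VBZ"), ("on", "IN")],
]
A, B = train_hmm(corpus)
print(A["DT"]["NN"], B["VBZ"]["is"])
```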
Testing: Viterbi Algorithm

Viterbi algorithm

- Computes the argmax efficiently

- Example of dynamic programming
What is a viterbi?
Illustration of Search Space

This is called a trellis.
- One row for each state (tag)
- One column for each observation (word)

Viterbi Algorithm
Input
- State (or tag) transition probabilities (A)
- Observation (or word) likelihoods (B)
- An observation sequence O
Output
- Most probable state sequence Q together with its probability

Both A and B are matrices with probabilities

Example of A and B matrices

A: The rows are labeled with the conditioning event, e.g., P(PPSS|VB) = .0070

B: same as A, rows: conditioning events, e.g. P(want|NN) = .000054

Example Trace
Summary of Viterbi Algorithm

vt(j) = max over i of vt−1(i) · aij · bj(ot)

• vt-1(i) – the previous Viterbi path probability from the previous time step t – 1 (i.e., the previous word)
• aij – the transition probability from previous state qi (i.e., the previous word having POS tag i) to current state qj (i.e., the current word having POS tag j)
• bj(ot) – the state observation likelihood of the observation symbol ot (i.e., word at position t) given the current state j (i.e., the j POS tag)
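
The three quantities above combine into one dynamic-programming step per trellis cell. Below is a minimal sketch, assuming π, A and B are nested dictionaries and that missing entries mean probability zero. The toy parameters are invented, and a practical tagger would work in log space.

```python
def viterbi(observations, states, pi, A, B):
    """Return (best state sequence, its probability) for a dict-based HMM."""
    # v[t][j]: probability of the best path that ends in state j at time t.
    v = [{s: pi.get(s, 0.0) * B[s].get(observations[0], 0.0) for s in states}]
    backpointer = [{}]
    for t in range(1, len(observations)):
        v.append({})
        backpointer.append({})
        for j in states:
            # Best previous state i for reaching j: maximize v[t-1][i] * a_ij.
            best_i = max(states, key=lambda i: v[t - 1][i] * A[i].get(j, 0.0))
            v[t][j] = (v[t - 1][best_i] * A[best_i].get(j, 0.0)
                       * B[j].get(observations[t], 0.0))
            backpointer[t][j] = best_i
    # Follow the backpointers from the best final state.
    last = max(states, key=lambda s: v[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(backpointer[t][path[-1]])
    path.reverse()
    return path, v[-1][last]

# Toy parameters (assumed values).
states = ["TO", "VB", "NN"]
pi = {"TO": 0.8, "VB": 0.1, "NN": 0.1}
A = {
    "TO": {"TO": 0.1, "VB": 0.8, "NN": 0.1},
    "VB": {"TO": 0.2, "VB": 0.1, "NN": 0.7},
    "NN": {"TO": 0.2, "VB": 0.5, "NN": 0.3},
}
B = {
    "TO": {"to": 1.0},
    "VB": {"race": 0.3, "run": 0.7},
    "NN": {"race": 0.6, "dog": 0.4},
}
best_path, best_prob = viterbi(["to", "race"], states, pi, A, B)
print(best_path, best_prob)
```

In this toy setup race is emitted more readily by NN than by VB, yet the path through VB wins because of the strong TO to VB transition.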
Problem for All HMMs
- Massive multiplication here:
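
The equation for this slide is not preserved. One common reading of the problem is numerical underflow: multiplying many small probabilities rounds to zero in floating point. A minimal sketch of that effect and of the usual workaround (summing log probabilities), with invented values:

```python
import math

# 100 small factors, as might occur along a long tag sequence (toy values).
probs = [1e-5] * 100

product = 1.0
for p in probs:
    product *= p  # true value is 1e-500, below the double-precision range

# Summing logs keeps the score finite and preserves the ranking of sequences.
log_score = sum(math.log(p) for p in probs)

print(product, log_score)
```

Because log is monotonic, taking the argmax over summed log probabilities selects the same sequence as the argmax over the product.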
Yet Another Problem: Unknown Words
- Solution 0 (not great): assume uniform emission probabilities (this is what
“add one” smoothing does)
- You can exclude closed-class POS tags such as…
- This does not use any lexical information such as suffixes
- Solution 1: capture lexical information:

- This reduces error rate for unknown words from 40% to 20%
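
The details of Solution 1 are not preserved on the slide. As one possible illustration of using suffixes, the sketch below estimates a tag distribution from word endings seen in training. The function names, the maximum suffix length, and the toy data are assumptions, and the result is P(tag | suffix), which would still have to be converted (for example via Bayes' rule) before use as an HMM emission score.

```python
from collections import Counter, defaultdict

def train_suffix_model(tagged_tokens, max_len=3):
    """Count tags for every word-final suffix of length 1..max_len."""
    suffix_tags = defaultdict(Counter)
    for word, tag in tagged_tokens:
        for k in range(1, min(max_len, len(word)) + 1):
            suffix_tags[word[-k:]][tag] += 1
    return suffix_tags

def guess_tag_distribution(word, suffix_tags, max_len=3):
    """Back off from the longest known suffix to shorter ones; None if none is known."""
    for k in range(min(max_len, len(word)), 0, -1):
        counts = suffix_tags.get(word[-k:])
        if counts:
            total = sum(counts.values())
            return {tag: c / total for tag, c in counts.items()}
    return None

# Toy training data (invented).
training = [("running", "VBG"), ("walking", "VBG"),
            ("nation", "NN"), ("station", "NN")]
model = train_suffix_model(training)
print(guess_tag_distribution("jumping", model))
```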
Main Disadvantage of HMMs
Hard to add features in the model

- Capitalization, hyphenation, suffixes, etc.

It’s possible but every such feature must be encoded in the p(word|tag)

- Redesign the model for every feature!

- MEMMs avoid this limitation
Approach 2: Maximum Entropy Markov Model
Uses features:
- Capitalization suggests a proper noun
- The suffix ing suggests a verb
- The suffix ion suggests a noun
- The suffix able suggests an adjective
- etc.
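Comparison of HMMs and MEMMs

            HMM                              MEMM
Predicts    the most probable tag sequence   the most probable tag sequence
Estimates   P(wi|ti) and P(ti|ti−1)          P(ti|wi, ti−1) directly, from features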
Common Extra Features for MEMMs
Almost always useful:
- n previous tags (not just ti−1)
- n previous and following words
Especially useful for words not in the training data:
- Prefixes/suffixes up to length n
- Character classes: capital vs. lower, digits, etc.
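
A minimal sketch of a feature extractor covering the feature types listed above. The feature-name strings, the `<s>` and `</s>` padding symbols, and the default window sizes are illustrative assumptions; tokens are assumed to be non-empty strings, and `prev_tags` holds the tags already assigned to positions before i.

```python
def extract_features(words, i, prev_tags, n=2, affix_len=3):
    """Build a sparse feature dict for the word at position i."""
    word = words[i]
    feats = {"word=" + word.lower(): 1.0}
    for k in range(1, n + 1):
        # Previous tags and surrounding words, padded at the sentence edges.
        prev_tag = prev_tags[i - k] if i - k >= 0 else "<s>"
        prev_word = words[i - k].lower() if i - k >= 0 else "<s>"
        next_word = words[i + k].lower() if i + k < len(words) else "</s>"
        feats[f"tag-{k}={prev_tag}"] = 1.0
        feats[f"word-{k}={prev_word}"] = 1.0
        feats[f"word+{k}={next_word}"] = 1.0
    for k in range(1, min(affix_len, len(word)) + 1):
        # Prefixes and suffixes help with words not seen in training.
        feats[f"prefix={word[:k]}"] = 1.0
        feats[f"suffix={word[-k:]}"] = 1.0
    # Character classes.
    if word[0].isupper():
        feats["is_capitalized"] = 1.0
    if any(ch.isdigit() for ch in word):
        feats["has_digit"] = 1.0
    if "-" in word:
        feats["has_hyphen"] = 1.0
    return feats

words = ["Secretariat", "is", "expected", "to", "race", "tomorrow"]
feats = extract_features(words, 4, ["NNP", "VBZ", "VBN", "TO"])
print(sorted(feats))
```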
Classifying with MEMMs

- Just train logistic regression to learn b and θ
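
The formula for this slide is not preserved. Assuming the standard multinomial logistic regression form, with f(x_i) the feature vector for position i, θ_k the weight vector for tag k, and b_k its bias, the local model would be:

```latex
\[
P(t_i = k \mid x_i)
  = \frac{\exp\!\left(\theta_k \cdot f(x_i) + b_k\right)}
         {\sum_{k'} \exp\!\left(\theta_{k'} \cdot f(x_i) + b_{k'}\right)}
\]
```

Here x_i stands for whatever context the features look at (the current word, neighboring words, previous tags); that notation is an assumption of this sketch.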

Decoding with MEMMs
As with HMMs, the argmax means that we have to search through all possible tag
sequences of length T.

Options:
- Greedy: make a hard decision at each word
- Viterbi: try all sequences using dynamic programming
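
A minimal sketch of the greedy option. The `score_tags` interface (a function returning a tag-to-probability dict for one position) is a hypothetical stand-in for a trained MEMM, and `toy_scorer` is hand-written for illustration only.

```python
def greedy_decode(words, score_tags):
    """Commit to the single best tag at each word, left to right."""
    tags = []
    for i in range(len(words)):
        distribution = score_tags(words, i, tags)  # may look at earlier decisions
        tags.append(max(distribution, key=distribution.get))
    return tags

def toy_scorer(words, i, prev_tags):
    """Hand-written stand-in for a trained classifier (assumed probabilities)."""
    if words[i] == "to":
        return {"TO": 0.9, "NN": 0.1}
    if prev_tags and prev_tags[-1] == "TO":
        return {"VB": 0.8, "NN": 0.2}
    return {"NN": 0.6, "VB": 0.4}

print(greedy_decode(["to", "race", "tomorrow"], toy_scorer))
```

Greedy decoding is fast but cannot revise an early mistake; Viterbi keeps the best path into every tag at each position and so can recover from one.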
Bidirectionality
- You can stack MEMMs that traverse the text in opposite directions:
- Left-to-right direction (same as before)
- Right-to-left: uses the prediction(s) of the above system as features!
- What is the problem with the predictions of the left-to-right model here?
- Many state-of-the-art taggers use this approach: CoreNLP, processors,
SVMTool
Bidirectionality
Bidirectional options:
- Cyclic dependency network
  - Bidirectional version of MEMM
  - Not commonly used
- (Linear chain) conditional random field (CRF)
  - Same model shape as MEMM, but undirected
  - Still somewhat in use
- Bidirectional recurrent neural network
  - Similar in spirit to forward MEMM + backward MEMM
  - The most popular approach since ≈2015
Evaluation
- POS tagging accuracy = 100 × (number of correct tags) / (number of words in dataset)
- Accuracy numbers currently reported for POS tagging are most often between 95% and 97%
- But they are much worse for “unknown” words
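
The accuracy formula above can be sketched as follows, assuming gold and predicted tags are aligned lists. The function name and the toy tag sequences are illustrative.

```python
def tagging_accuracy(gold_tags, predicted_tags):
    """100 * (number of correct tags) / (number of words in dataset)."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("gold and predicted sequences must have the same length")
    if not gold_tags:
        raise ValueError("cannot compute accuracy on an empty dataset")
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return 100.0 * correct / len(gold_tags)

# Toy example: one of six tags is wrong.
gold = ["NNP", "VBZ", "VBN", "TO", "VB", "NR"]
pred = ["NNP", "VBZ", "VBN", "TO", "NN", "NR"]
accuracy = tagging_accuracy(gold, pred)
print(f"{accuracy:.2f}%")
```

Computing the same quantity over only the words absent from the training data gives the unknown-word accuracy mentioned above.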
Evaluation example
