WSD SUPERVISED LEARNING
Naive Bayes algorithm in NLP
A supervised Naive Bayes algorithm for Word Sense Disambiguation (WSD) identifies
the correct meaning of a word in a given context by calculating the probability of each
possible sense and choosing the most probable one.
It relies on Bayes' theorem and the "naive" assumption that features used for
classification are independent of each other.
"Naive biased algorithm" is not a standard term in Natural Language Processing (NLP).
It is best read as the Naive Bayes algorithm applied to WSD in a way that incorporates
a specific form of bias.
"Naive" refers to the assumption that all contextual features (the surrounding words)
are conditionally independent given the sense. The "bias" can be introduced by leveraging
prior knowledge, such as the most frequent sense (MFS) heuristic, which corresponds to
favoring the sense with the highest prior probability.
How Naive Bayes works for WSD
The core task of WSD is to assign the correct meaning, or sense, to an ambiguous
word in a given context. Naive Bayes accomplishes this by treating it as a
classification problem, where the features are the words surrounding the ambiguous
word and the classes are the possible senses.
The process involves these steps:
Feature Extraction:
For a target ambiguous word, features are extracted from its surrounding
context. These features can include:
Collocation features: Words, their part-of-speech (POS) tags, or other lexical
information at specific positions relative to the target word (e.g., the word
immediately to the left, the POS tag of the word two positions to the right).
Bag-of-words features: The presence or frequency of words within a defined
window around the target word, without regard to their specific position.
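The two feature types above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `extract_features`, the `<pad>` placeholder, and the window size of 3 are assumptions made for the example, and POS-tag features are left out because they would require a tagger.

```python
def extract_features(tokens, target_index, window=3):
    """Return collocation and bag-of-words features for tokens[target_index]."""
    features = {}
    # Collocation features: the word at a fixed offset from the target.
    for offset in (-2, -1, 1, 2):
        pos = target_index + offset
        word = tokens[pos].lower() if 0 <= pos < len(tokens) else "<pad>"
        features[f"word_at_{offset:+d}"] = word
    # Bag-of-words features: words inside the window, position ignored.
    start = max(0, target_index - window)
    end = min(len(tokens), target_index + window + 1)
    features["bag"] = {
        tokens[i].lower() for i in range(start, end) if i != target_index
    }
    return features


tokens = "The fisherman sat on the river bank".split()
feats = extract_features(tokens, tokens.index("bank"))
print(feats["word_at_-1"])   # river
print(sorted(feats["bag"]))  # ['on', 'river', 'the']
```

The collocation features keep positional information ("river" is immediately to the left), while the bag only records that "on", "the", and "river" occur nearby.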
Training Phase (Supervised WSD):
A sense-tagged corpus is used, where instances of ambiguous words are manually
labeled with their correct sense.
For each sense of a target word, the Naive Bayes model learns the conditional
probability of observing each feature given that sense. This involves calculating the
frequency of features associated with each sense.
Disambiguation Phase:
When presented with a new instance of an ambiguous word in a sentence, the
algorithm extracts the relevant features.
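It then calculates the posterior probability of each possible sense s given the observed
features f1, ..., fn using Bayes' theorem. Under the independence assumption, and dropping
the denominator P(f1, ..., fn), which is the same for every sense:
P(s | f1, ..., fn) ∝ P(s) × P(f1 | s) × ... × P(fn | s)
The sense with the highest resulting value is selected.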
Advantages and disadvantages of Naive Bayes for WSD
Example
Using the word "bank" as an example, here is how a Naive Bayes algorithm would perform
Word Sense Disambiguation (WSD).
1. Training phase
The algorithm must first be trained on a corpus of text where the ambiguous word's senses have
already been manually tagged. Suppose the word "bank" has two senses:
Sense 1: Financial Institution (FI)
Sense 2: River Bank (RB)
The training data would look like this:
"I went to the bank to deposit money." (Sense: FI)
"The fisherman sat on the river bank." (Sense: RB)
"My savings account is at the local bank." (Sense: FI)
"The boat ran aground on the muddy bank." (Sense: RB)
From this data, the algorithm learns the following probabilities:
Prior probabilities: The overall likelihood of each sense.
o P(FI) = (2 financial examples) / (4 total examples) = 0.5
o P(RB) = (2 river examples) / (4 total examples) = 0.5
Likelihoods: The probability of context words appearing with each sense. A "Bag of
Words" model is used, which means the position of the words is ignored.
The algorithm calculates the conditional probability of each surrounding word for each sense:
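These estimates can be reproduced from the four training sentences. The sketch below counts every word other than "bank" as a context word; the simple tokenizer, the absence of stop-word filtering, and the helper names `context_words` and `likelihood` are assumptions made for illustration. The likelihoods are plain relative frequencies with no smoothing.

```python
from collections import Counter

training = [
    ("I went to the bank to deposit money.", "FI"),
    ("The fisherman sat on the river bank.", "RB"),
    ("My savings account is at the local bank.", "FI"),
    ("The boat ran aground on the muddy bank.", "RB"),
]


def context_words(sentence, target="bank"):
    """Lowercase, strip punctuation, and drop the target word itself."""
    words = [w.strip(".,").lower() for w in sentence.split()]
    return [w for w in words if w != target]


sense_counts = Counter(sense for _, sense in training)
word_counts = {sense: Counter() for sense in sense_counts}
for sentence, sense in training:
    word_counts[sense].update(context_words(sentence))

# Prior: fraction of training examples tagged with each sense.
priors = {s: n / len(training) for s, n in sense_counts.items()}


def likelihood(word, sense):
    """Unsmoothed P(word | sense): count of word over all context words of sense."""
    total = sum(word_counts[sense].values())
    return word_counts[sense][word] / total


print(priors)                    # {'FI': 0.5, 'RB': 0.5}
print(likelihood("money", "RB"))  # 0.0
print(likelihood("money", "FI") > 0, likelihood("river", "RB") > 0)  # True True
```

Note that "money" has zero likelihood under RB and "river" has zero likelihood under FI, which is what drives the classification in the next phase.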
2. Disambiguation (classification) phase
Now, consider a new, unseen sentence: "The money was deposited in the bank."
The algorithm needs to determine if "bank" in this sentence means "financial institution" or
"river bank".
Step 1: Identify context words
The context words (features) surrounding "bank" are "money," "deposited," and "in."
Step 2: Calculate posterior probability for each sense
Using Bayes' theorem, the algorithm calculates the probability of each sense given the context
words. Under the "naive" assumption of feature independence, it multiplies the prior
probability of the sense by the likelihoods of the individual context words.
Step 3: Apply the probabilities
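Because the prior probabilities are equal (as they were in the training data), the comparison
depends only on the likelihoods:
P(FI | context) ∝ P(money | FI) × P(deposited | FI) × P(in | FI)
P(RB | context) ∝ P(money | RB) × P(deposited | RB) × P(in | RB)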
Step 4: Compare and classify
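P(FI | context) > 0
P(RB | context) = 0, because "money" never occurs with the RB sense in the training data,
and without smoothing a single zero likelihood makes the whole product zero.
The algorithm chooses Sense 1 (Financial Institution) as the correct sense because it has
the higher probability.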
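The whole example can be run end to end as a rough sketch. Two details here are assumptions that the text does not specify: add-one (Laplace) smoothing is applied to the likelihoods, and context words that never occur in the training data ("deposited" and "in") are skipped, since otherwise they would contribute a zero to both senses. Scores are summed in log space to avoid multiplying many small numbers. The names `train` and `classify` are hypothetical.

```python
import math
from collections import Counter

training = [
    ("I went to the bank to deposit money.", "FI"),
    ("The fisherman sat on the river bank.", "RB"),
    ("My savings account is at the local bank.", "FI"),
    ("The boat ran aground on the muddy bank.", "RB"),
]


def context_words(sentence, target="bank"):
    words = [w.strip(".,").lower() for w in sentence.split()]
    return [w for w in words if w != target]


def train(examples):
    sense_counts, word_counts, vocab = Counter(), {}, set()
    for sentence, sense in examples:
        words = context_words(sentence)
        sense_counts[sense] += 1
        word_counts.setdefault(sense, Counter()).update(words)
        vocab.update(words)
    return sense_counts, word_counts, vocab


def classify(context, sense_counts, word_counts, vocab):
    total_examples = sum(sense_counts.values())
    scores = {}
    for sense, n in sense_counts.items():
        # Add-one smoothing: every vocabulary word gets one extra count.
        denom = sum(word_counts[sense].values()) + len(vocab)
        score = math.log(n / total_examples)  # log prior
        for word in context:
            if word not in vocab:
                continue  # assumption: ignore words unseen in training
            score += math.log((word_counts[sense][word] + 1) / denom)
        scores[sense] = score
    return max(scores, key=scores.get), scores


model = train(training)
# Context words as listed in Step 1 of the example.
best, scores = classify(["money", "deposited", "in"], *model)
print(best)  # FI
```

With smoothing, the RB score is no longer exactly zero, but FI still wins because "money" was observed with the FI sense and never with RB.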
DECISION LIST
Supervised word sense disambiguation (WSD) using decision lists is a machine learning
approach that creates an ordered list of weighted "if-then-else" rules to determine the correct
meaning of a word in context. It is particularly effective for words with a limited number of
distinct meanings.
Word Sense Disambiguation (WSD): A task in natural language processing (NLP)
that identifies the correct meaning (or sense) of a word in a specific context. For
example, determining if the word "bank" in "river bank" refers to a financial institution
or the side of a river.
Supervised Learning: This approach requires a pre-labeled training dataset, known as
a sense-tagged corpus, where each instance of an ambiguous word is manually tagged
with its correct sense. The algorithm learns a classifier from this data.
Decision Lists: A decision list is a sequence of rules, each of which has a
corresponding classification (or sense) and a confidence score. The rules are ordered
from most to least reliable. To classify a new instance, the system iterates through the
list and applies the first rule that matches the features of the input.
How decision lists work for WSD
1. Feature extraction: Before training, features are extracted from the sense-tagged
training data. These features represent the context surrounding the ambiguous word and
can include:
o Collocation features: The specific words that appear immediately to the left or
right of the target word.
o Bag-of-words features: The words that appear within a fixed-size window
around the target word, without regard to their position.
o Syntactic features: Part-of-speech (POS) tags of the surrounding words.
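The text does not say how rules are scored and ordered once features have been extracted. One widely used choice, assumed here, is to rank each feature by the absolute log-likelihood ratio of the two senses given that feature, in the style of Yarowsky's decision lists. The sketch below handles two senses only; the function name `learn_decision_list`, the smoothing constant `alpha`, and the last two training examples are hypothetical and serve only to make the ranking visible.

```python
import math
from collections import Counter, defaultdict


def learn_decision_list(examples, senses, alpha=0.1):
    """examples: list of (set_of_features, sense). Returns rules, best first."""
    counts = defaultdict(Counter)  # feature -> how often it occurs with each sense
    for features, sense in examples:
        for feature in features:
            counts[feature][sense] += 1
    s1, s2 = senses
    rules = []
    for feature, c in counts.items():
        # Smoothed log-likelihood ratio; alpha avoids division by zero.
        ratio = math.log((c[s1] + alpha) / (c[s2] + alpha))
        predicted = s1 if ratio > 0 else s2
        rules.append((abs(ratio), feature, predicted))
    rules.sort(key=lambda rule: rule[0], reverse=True)  # most reliable first
    return rules


examples = [
    ({"caught", "large"}, "fish"),           # "I caught a large bass."
    ({"plays", "guitar", "band"}, "music"),  # "He plays the bass guitar in a band."
    ({"guitar", "loud"}, "music"),           # hypothetical extra example
    ({"caught", "river"}, "fish"),           # hypothetical extra example
]

rules = learn_decision_list(examples, senses=("fish", "music"))
for score, feature, sense in rules[:2]:
    print(f"IF '{feature}' in context THEN {sense} (score {score:.2f})")
```

Features seen twice with a single sense ("caught", "guitar") end up at the top of the list, ahead of features seen only once.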
Example: Disambiguating the word "bass"
Consider the ambiguous word "bass," which can refer to a type of fish or the
sound/instrument.
Sense 1: fish
Sense 2: musical instrument/sound
A sense-tagged corpus would contain sentences like:
"I caught a large bass." (Sense 1)
"He plays the bass guitar in a band." (Sense 2)
A decision list learned from such data might contain rules like these, ordered from most to
least confident:
Rule 1 (high confidence): IF "guitar" is in the context THEN assign Sense 2
Rule 2 (medium confidence): IF "caught" is in the context THEN assign Sense 1
Rule 3 (medium confidence): IF "band" is in the context THEN assign Sense 2
Rule 4 (low confidence): IF the word immediately to the left is "large" THEN
assign Sense 1
Default rule (fallback): ASSIGN Most Frequent Sense
Applying the decision list:
Input sentence: "The bass sounded terrible."
The system checks Rule 1 for "guitar." No match.
It checks Rule 2 for "caught." No match.
It checks Rule 3 for "band." No match.
It checks Rule 4 for "large." No match.
Since no rule matches, the default rule assigns the most frequent sense, which may be
wrong for this sentence.
With more training data, the learner could discover a stronger feature such as "sounded,"
giving a rule like IF "sounded" is in the context THEN assign Sense 2, which would classify
this sentence correctly.
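The walk-through above can be expressed as a short first-match loop. This is a sketch under stated assumptions: the helper names (`tokenize`, `contains`, `word_at`, `classify`) are hypothetical, Rule 4 is encoded as the word immediately to the left of the target (as in "large bass"), and because the text does not say which sense is most frequent, `"fish"` is used for the default rule purely as a placeholder.

```python
def tokenize(sentence):
    return [w.strip(".,").lower() for w in sentence.split()]


def contains(word):
    """Rule test: the word occurs anywhere in the sentence."""
    return lambda tokens, idx: word in tokens


def word_at(offset, word):
    """Rule test: the word occurs at a fixed offset from the target."""
    def test(tokens, idx):
        pos = idx + offset
        return 0 <= pos < len(tokens) and tokens[pos] == word
    return test


# Rules in order of decreasing confidence, as listed in the text.
rules = [
    ("Rule 1", contains("guitar"), "music"),
    ("Rule 2", contains("caught"), "fish"),
    ("Rule 3", contains("band"), "music"),
    ("Rule 4", word_at(-1, "large"), "fish"),
]
DEFAULT_SENSE = "fish"  # placeholder for the most frequent sense (assumption)


def classify(sentence, target="bass"):
    tokens = tokenize(sentence)
    idx = tokens.index(target)
    for name, test, sense in rules:
        if test(tokens, idx):
            return sense, name  # the first matching rule decides
    return DEFAULT_SENSE, "default"


print(classify("The bass sounded terrible."))           # ('fish', 'default')
print(classify("He plays the bass guitar in a band."))  # ('music', 'Rule 1')
```

Only the first matching rule is used, so the second sentence is decided by Rule 1 even though Rule 3 ("band") would also match.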
Advantages:
Simplicity and interpretability: The rule-based nature of decision lists makes them
easy to understand and debug, a key advantage over more complex "black box" models.
Effectiveness with specific features: For many WSD problems, a few very strong
contextual features can provide accurate disambiguation. Decision lists are optimized to
exploit these features by placing the most confident rules at the top.
Scalability: Learning a decision list is computationally efficient and can handle a large
number of features. Because only the single strongest matching rule is used, it avoids
combining many sparse probability estimates, as Naive Bayes must.
Disadvantages:
Knowledge acquisition bottleneck: Creating the large sense-tagged corpora required
for training supervised models is time-consuming and expensive.
Limited expressiveness: The sequential "if-then" structure may not capture complex
interactions between multiple features as effectively as other machine learning models.
Manual feature engineering: The quality of the model heavily depends on the manual
selection and engineering of relevant contextual features.