Word-Level Analysis in NLP Techniques

nlp notes

Uploaded by

vishv0111
Copyright
© All Rights Reserved

Module : 2

Word-Level Analysis
What is Word-Level Analysis?

• Breaking down text into individual words (tokens)
• Examining their properties, such as their meaning, structure, and relationships within a sentence
• This fundamental step is crucial for various NLP tasks, such as:
  • Text analysis
  • Sentiment analysis
  • Machine translation
Components of Word-Level Analysis
1. Tokenization
2. Morphological Analysis
3. Lexical Semantics/Word Sense Disambiguation
(Module 4)
4. Part-of-Speech Tagging (Module 3)
5. Stemming and Lemmatization
6. Word Embeddings
1. Tokenization

• Process of breaking down a continuous stream of text into smaller, meaningful units called "tokens."
• These tokens can be words, subwords, characters, or sentences, depending on the specific type of tokenization employed and the requirements of the NLP task.
• Converts human-readable language into a format that machines can understand and process.
• It lays the groundwork for all subsequent NLP operations, allowing algorithms to effectively analyze, interpret, and generate human language.
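The three granularities mentioned above (words, characters, sentences) can be sketched with Python's standard `re` module. This is a minimal illustration, not a production tokenizer: the function names and the regular expressions are assumptions chosen for this example, and subword tokenization is left out because it normally needs a trained vocabulary.

```python
import re

def word_tokenize(text):
    """Split text into word tokens and standalone punctuation marks."""
    # \w+(?:'\w+)? keeps contractions such as "Don't" together;
    # [^\w\s] matches any single punctuation character.
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

def char_tokenize(text):
    """Split text into individual characters, ignoring whitespace."""
    return [ch for ch in text if not ch.isspace()]

def sentence_tokenize(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

text = "NLP is fun. Don't you agree?"
print(word_tokenize(text))
print(char_tokenize("NLP is"))
print(sentence_tokenize(text))
```

The sentence splitter is deliberately naive: abbreviations such as "e.g." would be split incorrectly, which is one reason real tokenizers are more elaborate.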
2. Morphological Analysis
• Morphology is the study of how words are formed and how they relate to other words in the same language.

[Link]
[Link]
Concatenative Morphology

This refers to constructing words by joining morphemes linearly (in sequence).

• A morpheme is the smallest meaningful unit in a language.
  • Free morphemes can stand alone (e.g., write, like)
  • Bound morphemes must be attached to other morphemes (e.g., re-, -s, -ly)
• An affix is a type of bound morpheme.
  • Prefix: added before the root (e.g., re- in rewrite)
  • Suffix: added after the root (e.g., -s in writes)
Examples of Concatenative Morphology

• unlikely = un + like + ly
• rewrites = re + write + s

These words are formed by concatenating multiple morphemes.
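The two decompositions above can be reproduced with a toy affix-stripping routine. This is only a sketch under strong assumptions: the `PREFIXES` and `SUFFIXES` lists are hypothetical and cover just the affixes used in these examples, and the minimum-length check is an ad hoc guard, not a linguistic rule.

```python
# Hypothetical, illustrative affix lists (only the affixes from the examples).
PREFIXES = ("un", "re")
SUFFIXES = ("ly", "s")

def segment(word):
    """Split a word into [prefix?, root, suffix?] by stripping known affixes."""
    before, after = [], []
    root = word
    for prefix in PREFIXES:
        # Require at least 3 characters to remain so short words stay intact.
        if root.startswith(prefix) and len(root) > len(prefix) + 2:
            before.append(prefix)
            root = root[len(prefix):]
            break
    for suffix in SUFFIXES:
        if root.endswith(suffix) and len(root) > len(suffix) + 2:
            after.append(suffix)
            root = root[:-len(suffix)]
            break
    return before + [root] + after

print(segment("unlikely"))  # un + like + ly
print(segment("rewrites"))  # re + write + s
```

Because it has no lexicon, this sketch over-segments words that merely begin with the letters of a prefix (for example "reads"); a real morphological analyzer checks candidate roots against a dictionary.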
Two Broad Ways to Form Words Using Morphemes
• 2.1 Inflectional Morphology
• 2.2 Derivational Morphology
2.1 Inflectional Morphology
• Inflectional morphology deals with modifying
existing words to express different grammatical
functions without changing their core meaning or
part of speech.
• Key Characteristics:
  • No change in word class, e.g., "write" (verb) → "writes" (still a verb)
  • Predictable changes based on rules
  • Often involves affixation (adding prefixes/suffixes) or vowel changes
  • Expresses tense, number, person, gender, case, etc.
Inflectional Categories (Common in English)
• Number – singular vs. plural (cat → cats)
• Tense – present vs. past (walk → walked)
• Person – I write vs. He writes
• Aspect – She is running (progressive)
• Case – He (subjective) vs. Him (objective)
Inflectional Morphology Example

Type | Regular Nouns | Irregular Nouns
Singular | cat, thrush | mouse, ox
Plural | cats, thrushes | mice, oxen

Notes:
• Regular nouns form plurals by adding -s or -es: cat → cats, thrush → thrushes
• Irregular nouns use vowel change or other forms: mouse → mice, ox → oxen
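The regular/irregular split in the table can be expressed as a rule plus an exception list. The sketch below is an illustration only: `IRREGULAR_PLURALS` holds just the two irregular nouns from the table, and the rule for choosing -es (after s, x, z, ch, sh) is a simplifying assumption that ignores other English patterns such as y → ies.

```python
# Exception list: only the irregular nouns from the table above.
IRREGULAR_PLURALS = {"mouse": "mice", "ox": "oxen"}

def pluralize(noun):
    """Return the plural of a noun: irregular lookup first, then regular rules."""
    if noun in IRREGULAR_PLURALS:
        return IRREGULAR_PLURALS[noun]
    # Regular nouns ending in a sibilant sound take -es, all others take -s.
    if noun.endswith(("s", "x", "z", "ch", "sh")):
        return noun + "es"
    return noun + "s"

for noun in ["cat", "thrush", "mouse", "ox"]:
    print(noun, "->", pluralize(noun))
```

Checking the exception list before applying the rule matters: without it, "ox" would be treated as a regular noun and receive -es.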
2.2 Derivational Morphology
• It involves the creation of new words (lexemes) by adding derivational affixes to a base or root word.
• It often results in a change of word class (part of speech) and meaning.
Key Features:
• Creates new lexemes, not just variations of the same word.
• Often changes the syntactic category (e.g., verb → noun).
• Adds new, non-grammatical meaning to the base.
• Less predictable than inflectional morphology.
• Different from compounding, which combines full words (e.g., "toothbrush").
Derivational Morphology Examples

Suffix | Base (Verb/Adjective) | Derived Noun
-ation | computerize (V) | computerization
-ee | appoint (V) | appointee
-er | kill (V) | killer
-ness | fuzzy (A) | fuzziness

• These suffixes transform verbs/adjectives into nouns.
• Each derived word has a meaning related to the base, but with added nuance or a role.
• By contrast, inflection only produces forms of the same word. Regularly inflected verb forms: walk, walks, walking, walked.
• Irregularly inflected verb forms: eat, eats, eating, ate, eaten; catch, catches, catching, caught; cut, cuts, cutting, cut.
Feature | Inflectional Morphology | Derivational Morphology
Changes meaning | ❌ No | ✅ Yes
Changes word class | ❌ No | ✅ Often
Predictable and rule-based | ✅ Yes | ❌ Less predictable
Example | write → writes | write → writer
5. Stemming
• Definition: Stemming is a heuristic process
that chops off suffixes and sometimes
prefixes from words to obtain a "stem,"
which is often a crude form of the word's
root. The resulting stem may not necessarily
be a valid word in the language.
• Algorithms apply a set of rules to remove
common affixes.
• For example, the Porter Stemmer uses rules to remove
English suffixes like "-ing," "-ed," or "-s".
• Example: The words "running," "runs," and
"runner" might all be stemmed to "run."
However, "universal," "university," and
"universe" could all be stemmed to
"univers," which isn't a valid word and
conflates different meanings.
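The idea of rule-based suffix chopping can be illustrated with a very small stemmer. This is explicitly not the Porter algorithm: the suffix list, its ordering, the minimum stem length, and the consonant-undoubling step are all assumptions made for this sketch, chosen so that it reproduces the examples above.

```python
# Toy suffix list, checked in order (NOT the Porter stemmer's rule set).
SUFFIXES = ("ity", "ing", "al", "ed", "er", "es", "s", "e")
VOWELS = "aeiou"

def stem(word):
    """Chop the first matching suffix, keeping at least 3 characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            result = word[:-len(suffix)]
            # Undo consonant doubling: "runn" -> "run".
            if result[-1] == result[-2] and result[-1] not in VOWELS:
                result = result[:-1]
            return result
    return word

for w in ["running", "runs", "runner", "universal", "university", "universe"]:
    print(w, "->", stem(w))
```

The output shows both behaviours described above: the "run" family collapses to a real word, while the "univers" family collapses to a non-word that merges three different meanings.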
Lemmatization
• Definition: Lemmatization is a more sophisticated
process that reduces words to their "lemma," which is
the base or dictionary form of a word, always ensuring
the output is a valid word.
• Lemmatization uses a vocabulary (like WordNet) and
morphological analysis, often incorporating Part-of-
Speech (POS) tagging to understand the word's context
and return its correct base form.
• Example: "Running," "ran," and "runs" would all be
lemmatized to "run." Crucially, "better" would be
lemmatized to "good," recognizing its base form as an
adjective.
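A dictionary-based lemmatizer that depends on the POS tag can be sketched as a lookup keyed by (word, tag). The `LEMMA_TABLE` below is a hypothetical hand-written stand-in for a real vocabulary such as WordNet, and the tag names "VERB" and "ADJ" are assumptions for this example.

```python
# Hypothetical mini-vocabulary: (surface form, POS tag) -> lemma.
LEMMA_TABLE = {
    ("running", "VERB"): "run",
    ("ran", "VERB"): "run",
    ("runs", "VERB"): "run",
    ("better", "ADJ"): "good",
}

def lemmatize(word, pos):
    """Look up the lemma for a word with a given POS tag.

    Unknown (word, pos) pairs fall back to the lowercased word itself.
    """
    key = (word.lower(), pos)
    return LEMMA_TABLE.get(key, word.lower())

print(lemmatize("Running", "VERB"))
print(lemmatize("ran", "VERB"))
print(lemmatize("better", "ADJ"))
print(lemmatize("better", "VERB"))  # no entry for this tag, returned unchanged
```

The last call shows why the POS tag matters: the same surface form maps to "good" only when it is tagged as an adjective.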
Stemming vs. Lemmatization – Key Differences

Aspect | Stemming | Lemmatization
Approach | Rule-based (removal of suffixes/prefixes) | Dictionary-based with morphological and context analysis
Output | May produce non-words (e.g., "argu" from "arguing") | Always yields valid dictionary words (e.g., "good" from "better")
Speed | Faster (simpler and lightweight) | Slower (requires lookups and POS tagging)
Accuracy | Less accurate, may over- or under-stem | More accurate, context- and grammar-aware
•Stemming is good for speed-focused applications (e.g., search engines).
•Lemmatization is better for semantic accuracy and linguistic correctness.
•In most modern NLP tasks, lemmatization is preferred, especially when meaning matters.
Regular Expression
A regular expression (regex) is a sequence of
characters that forms a search pattern, used for:
• Matching
• Extracting
• Replacing
• Validating
...text based on specific patterns.
Uses of Regular Expressions in NLP and Programming
• Tokenizing text
• Searching for specific word patterns (e.g., emails, phone
numbers)
• Cleaning data (e.g., removing punctuation)
• Validating inputs (e.g., password rules, dates)
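Each of the four uses listed above can be shown with Python's standard `re` module. The patterns below are simplified assumptions for illustration (for instance, the email pattern is far looser than real address rules, and the date check only verifies the YYYY-MM-DD shape, not that the date exists).

```python
import re

text = "Contact alice@example.com or bob@example.org before 2024-05-17!"

# 1. Tokenizing: every run of word characters becomes a token.
tokens = re.findall(r"\w+", "Hello, world!")

# 2. Searching: a deliberately simple email pattern.
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", text)

# 3. Cleaning: remove everything that is neither a word character nor whitespace.
cleaned = re.sub(r"[^\w\s]", "", "Hello, world!")

# 4. Validating: the whole string must look like YYYY-MM-DD.
def is_iso_date(s):
    return re.fullmatch(r"\d{4}-\d{2}-\d{2}", s) is not None

print(tokens)
print(emails)
print(cleaned)
print(is_iso_date("2024-05-17"), is_iso_date("17/05/2024"))
```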
Finite automata
• Finite automata (FA), also known as finite state
automata (FSA), are abstract machines that play a
foundational role in NLP, particularly in tasks involving
patterns and regular languages.
• They are computational models used to recognize
patterns in sequences of input symbols
A Simple Finite Automaton
• A finite automaton is a 5-tuple (Q, Σ, q₀, F, δ)
• Q = set of all states = {q0, q1, q2, q3}
• Σ = input alphabet = {0, 1}
• q₀ = start state / initial state = q0
• F = set of final states = {q3}
• δ = transition function from Q × Σ → Q, given by the table:

State | 0 | 1
q0 | q1 | q3
q1 | q0 | q2
q2 | q3 | q1
q3 | q2 | q0
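The automaton defined above can be simulated directly from its transition table. The sketch below encodes δ as a nested dictionary; the names `TRANSITIONS`, `START`, `FINAL`, and `accepts` are chosen for this example.

```python
# Transition function δ: state -> {input symbol -> next state}
TRANSITIONS = {
    "q0": {"0": "q1", "1": "q3"},
    "q1": {"0": "q0", "1": "q2"},
    "q2": {"0": "q3", "1": "q1"},
    "q3": {"0": "q2", "1": "q0"},
}
START = "q0"
FINAL = {"q3"}

def accepts(string):
    """Run the automaton on the input and report whether it ends in a final state."""
    state = START
    for symbol in string:
        if symbol not in TRANSITIONS[state]:
            return False  # symbol outside the alphabet {0, 1}
        state = TRANSITIONS[state][symbol]
    return state in FINAL

for s in ["1", "010", "01", "11", ""]:
    print(repr(s), accepts(s))
```

For example, "010" is accepted because the run is q0 → q1 → q2 → q3, while "11" is rejected because it goes q0 → q3 → q0 and ends outside F.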
A Finite State Automaton (FSA) designed to accept strings matching the pattern "baa+!"
(a "b", followed by two or more "a"s, followed by "!"), specifically:
✅ Accepted Strings:
baa!
baaa!
baaaaa...!

❌ Rejected Strings:
ba!
b!
Transition Table
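One possible transition table for this automaton is sketched below. The state names q0 to q4 and the exact layout are assumptions (the usual textbook construction for this pattern), not necessarily the table from the original slide. The regular expression `baa+!` is included as a cross-check, since both describe the same language.

```python
import re

# Assumed transition table: q0 -b-> q1 -a-> q2 -a-> q3, q3 -a-> q3, q3 -!-> q4
SHEEP_TRANSITIONS = {
    "q0": {"b": "q1"},
    "q1": {"a": "q2"},
    "q2": {"a": "q3"},
    "q3": {"a": "q3", "!": "q4"},
    "q4": {},
}
SHEEP_FINAL = {"q4"}

def accepts_sheep(string):
    """Return True if the FSA ends in the final state after reading the string."""
    state = "q0"
    for symbol in string:
        next_state = SHEEP_TRANSITIONS[state].get(symbol)
        if next_state is None:
            return False  # no transition defined: reject immediately
        state = next_state
    return state in SHEEP_FINAL

for s in ["baa!", "baaa!", "baaaaa!", "ba!", "b!"]:
    # The FSA and the regular expression should always agree.
    print(s, accepts_sheep(s), re.fullmatch(r"baa+!", s) is not None)
```

The self-loop on q3 is what allows any number of additional "a"s, and the missing "!" transition out of q2 is what rejects "ba!".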
N-gram Model (N-gram Language Model: Exercises Using Bi-gram, Tri-gram & Four-gram)
Understanding language patterns through word sequences
Understanding N-gram Models
Definition of N-gram
Definition and Scope
An n-gram is a sequence of n contiguous items from text or speech, such as words or letters.

Applications in NLP
N-grams are essential for tasks like text
prediction, speech recognition, and machine
translation.

Probability Modeling
N-grams model sequence probabilities to help
predict or evaluate language data
computationally.
Conditional Probability in N-gram Models
Conditional Probability Formula
Conditional probability calculates the
likelihood of an event given a preceding event
using P(B|A) = P(A,B)/P(A).

Joint Probability Decomposition


Joint probability can be decomposed into P(A)
multiplied by P(B|A), simplifying complex
probability calculations.

Bigram Model Simplification


Bigram models predict a word based solely on
the immediately preceding word, capturing
useful context efficiently.
Multi-variable Probability Formulation
Extending N-gram Models
N-gram models extend to multiple variables to
capture deeper contextual relationships in
language sequences.

Probability Chain Formula


The multi-variable probability is expressed as
a chain of conditional probabilities capturing
sequence dependencies.

Applications and Challenges


Multi-variable models enhance speech
recognition and text generation but increase
computational complexity.
N-gram Model
• An n-gram is a contiguous sequence of n items from a given sample of text
or speech. The items can be phonemes, syllables, letters, words, or base
pairs according to the application. The n-grams typically are collected from
a text or speech corpus.
• Conditional Probability:
  P(B | A) = P(A, B) / P(A)
• And:
  P(A, B) = P(A) × P(B | A)
• More variables:
  P(A, B, C, D) = P(A) × P(B | A) × P(C | A, B) × P(D | A, B, C)
• Chain Rule:
  P(X1, X2, ... , Xn) = P(X1) × P(X2 | X1) × P(X3 | X1, X2) × ... × P(Xn | X1, ... , Xn-1)
• Example:
  P(about five minutes from) = P(about) × P(five | about) × P(minutes | about five) × P(from | about five minutes)
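Both ideas in this section, extracting contiguous n-grams and expanding a word sequence with the chain rule, can be sketched in a few lines. The function names `ngrams` and `chain_rule_factors` are chosen for this example; the second function only builds the symbolic factors and does not compute any probabilities.

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def chain_rule_factors(tokens):
    """Write out the chain-rule factors P(w_i | w_1 ... w_(i-1)) as strings."""
    factors = []
    for i, word in enumerate(tokens):
        history = " ".join(tokens[:i])
        factors.append(f"P({word} | {history})" if history else f"P({word})")
    return factors

tokens = "about five minutes from".split()
print(ngrams(tokens, 2))  # bigrams
print(ngrams(tokens, 3))  # trigrams
print(" × ".join(chain_rule_factors(tokens)))
```

The last line prints the same decomposition as the "about five minutes from" example, which makes it easy to see how the history grows by one word with every factor.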
Probability of words in sentences

• P(W1, W2, ... , Wn) = Π P(Wi | W1, W2, ... , Wi-1)
• Unigram (1-gram): No history is used.
• Bigram (2-gram): One word history.
• Trigram (3-gram): Two words history.
• Four-gram (4-gram): Three words history.
• Five-gram (5-gram): Four words history.
• If you want to compute the probability of the sentence "I love Mumbai", you would calculate:
  P(I love Mumbai) = P(I) × P(love | I) × P(Mumbai | I love)
Practical applications of N-grams

• Usually the following are used:
  • Bi-gram (previous one word)
  • Tri-gram (previous two words)
  • Four-gram (previous three words)
Unigram (1-gram) Model
• Unigram (1-gram): No history is used.
• Example: "about five minutes from ..."
• Assume the word "dinner" has the highest probability in the corpus.
• A unigram model does not take into account the preceding words, such as "from" or "minutes".
• The unigram model will therefore predict "dinner": "about five minutes from dinner"
Bigram-Based Word Prediction
Example Sentences and Word Frequencies
Next Word Prediction Probabilities
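Bigram-based next-word prediction can be sketched by counting word pairs and normalizing by how often the first word is followed by anything, i.e. P(next | prev) = count(prev next) / count(prev followed by any word). As an assumption for this example, the corpus reuses the five sentences from the spelling-correction exercise in these notes, lowercased and split on whitespace; the function names are chosen for this sketch.

```python
from collections import Counter, defaultdict

# Assumed toy corpus: the five sentences from the exercise in these notes.
CORPUS = [
    "I like college",
    "I like Henry",
    "Do I like college",
    "I do like Henry",
    "Henry likes college",
]

# bigram_counts[prev][nxt] = number of times nxt directly follows prev
bigram_counts = defaultdict(Counter)
for sentence in CORPUS:
    tokens = sentence.lower().split()
    for prev, nxt in zip(tokens, tokens[1:]):
        bigram_counts[prev][nxt] += 1

def next_word_probs(prev):
    """Maximum-likelihood estimate of P(next | prev) for every observed next word."""
    counts = bigram_counts[prev.lower()]
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()} if total else {}

def predict_next(prev):
    """Return the most probable next word, or None if prev was never followed by a word."""
    probs = next_word_probs(prev)
    return max(probs, key=probs.get) if probs else None

print(next_word_probs("I"))
print(predict_next("I"))
```

In this corpus "i" is followed by "like" three times and by "do" once, so the model estimates P(like | i) = 3/4 and predicts "like".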
• You are given the sentence:
"I liek college"
• The word "liek" is likely misspelled. Use a trigram
model to find the most probable correction from the
following candidates:
• like
• lick
• lie
• life
• Corpus
• I like college
• I like Henry
• Do I like college
• I do like Henry
• Henry likes college
• 1. Trigram Context
• We are interested in the trigram: "I [candidate] college"
• 2. Count Trigrams in Corpus
• From the corpus, we extract trigram frequencies, sentence by sentence:
• "I like college" → contains "I like college"
• "I like Henry" → contains "I like Henry" (different final word, not a match)
• "Do I like college" → contains "I like college" again
• "I do like Henry" → contains "I do like" and "do like Henry" (not relevant here)
• "Henry likes college" → not relevant
• So:
• "I like college" = 2 occurrences
• "I lick college" = 0
• "I lie college" = 0
• "I life college" = 0
3. Apply Probabilities

Assuming raw frequency as probability:

Candidate | Trigram | Count | Probability
like | I like college | 2 | High
lick | I lick college | 0 | Low
lie | I lie college | 0 | Low
life | I life college | 0 | Low

Answer:
The most probable correction is "like", giving the
corrected sentence:
"I like college"
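The counting steps of this exercise can be checked with a short script. It is a sketch under the same simplification used above (raw trigram counts stand in for probabilities); tokens are lowercased and split on whitespace, and the function name `best_correction` is chosen for this example.

```python
from collections import Counter

CORPUS = [
    "I like college",
    "I like Henry",
    "Do I like college",
    "I do like Henry",
    "Henry likes college",
]

# Count every trigram (three consecutive words) in the corpus.
trigram_counts = Counter()
for sentence in CORPUS:
    tokens = sentence.lower().split()
    trigram_counts.update(zip(tokens, tokens[1:], tokens[2:]))

def best_correction(prev, nxt, candidates):
    """Score each candidate by the count of the trigram (prev, candidate, nxt)."""
    scores = {
        cand: trigram_counts[(prev.lower(), cand.lower(), nxt.lower())]
        for cand in candidates
    }
    return max(scores, key=scores.get), scores

best, scores = best_correction("I", "college", ["like", "lick", "lie", "life"])
print(scores)
print("I", best, "college")
```

With zero counts for three of the four candidates, a real system would normally apply smoothing so that unseen trigrams do not all tie at zero; that step is outside this exercise.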
[Link]
natural_language_processing.html
