Word-Level Analysis in NLP Techniques

nlp notes

Uploaded by

vishv0111

Module 2: Word-Level Analysis
What is Word-Level Analysis?

• Breaking down text into individual words (tokens)
• Examining their properties, such as their meaning, structure, and relationships within a sentence
• This fundamental step is crucial for various NLP tasks, such as:
  • Text analysis
  • Sentiment analysis
  • Machine translation
Components of Word-Level Analysis
1. Tokenization
2. Morphological Analysis
3. Lexical Semantics/Word Sense Disambiguation
(Module 4)
4. Part-of-Speech Tagging (Module 3)
5. Stemming and Lemmatization
6. Word Embeddings
1. Tokenization

• Process of breaking down a continuous stream of text into smaller, meaningful units called "tokens."
• These tokens can be words, subwords, characters, or sentences, depending on the type of tokenization employed and the requirements of the NLP task.
• Converts human-readable language into a format that
machines can understand and process.
• It lays the groundwork for all subsequent NLP
operations, allowing algorithms to effectively analyze,
interpret, and generate human language.
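A minimal sketch of word-, character-, and sentence-level tokenization using only Python's standard `re` module. The function names and the sample text are illustrative assumptions, and the regular expressions are deliberately naive (for example, abbreviations such as "e.g." would be split incorrectly).

```python
import re

def word_tokenize(text):
    # A word (optionally with an internal apostrophe) or a single punctuation mark.
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

def char_tokenize(text):
    # Every character, including spaces, becomes a token.
    return list(text)

def sentence_tokenize(text):
    # Naive rule: split on whitespace that follows ., ! or ?
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

text = "Tokenization isn't hard. It splits text!"
words = word_tokenize(text)
sentences = sentence_tokenize(text)
print(words)
print(sentences)
```

Subword tokenization (for example, byte-pair encoding) needs a learned vocabulary and is not covered by this sketch.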
2. Morphological Analysis
• Morphology is the study of how words are formed and how they relate to other words in the same language.
Concatenative Morphology

• This refers to constructing words by joining morphemes linearly (in sequence).
• A morpheme is the smallest meaningful unit in a language.
  • Free morphemes can stand alone (e.g., write, like).
  • Bound morphemes must be attached to other morphemes (e.g., re-, -s, -ly).
• An affix is a type of bound morpheme.
  • Prefix: added before the root (e.g., re- in rewrite)
  • Suffix: added after the root (e.g., -s in writes)
Examples of Concatenative Morphology
• unlikely = un + like + ly
• rewrites = re + write + s
• These words are formed by concatenating multiple morphemes.
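The decomposition above can be sketched as a tiny affix-stripping routine. The affix and root lists are hypothetical toy data covering only the two example words. A real morphological analyzer would need a full lexicon and spelling rules.

```python
# Toy lexicon (assumption): just enough to analyze the two examples.
PREFIXES = ("un", "re")
SUFFIXES = ("ly", "s")
ROOTS = {"like", "write"}

def segment(word):
    """Return the list of morphemes in `word`, or None if no analysis is found."""
    for prefix in ("",) + PREFIXES:
        if not word.startswith(prefix):
            continue
        rest = word[len(prefix):]
        for suffix in ("",) + SUFFIXES:
            if suffix and not rest.endswith(suffix):
                continue
            root = rest[:len(rest) - len(suffix)]
            if root in ROOTS:
                # Drop empty pieces (no prefix or no suffix).
                return [m for m in (prefix, root, suffix) if m]
    return None

print(segment("unlikely"))
print(segment("rewrites"))
```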
Two Broad Ways to Form Words Using Morphemes
2.1 Inflectional Morphology
2.2 Derivational Morphology
2.1 Inflectional Morphology
• Inflectional morphology deals with modifying
existing words to express different grammatical
functions without changing their core meaning or
part of speech.
• Key Characteristics:
  • No change in word class, e.g., "write" (verb) → "writes" (still a verb)
  • Predictable changes based on rules
  • Often involves affixation (adding prefixes/suffixes) or vowel changes
  • Expresses tense, number, person, gender, case, etc.
Inflectional Categories (Common in English)
• Number – singular vs. plural (cat → cats)
• Tense – present vs. past (walk → walked)
• Person – I write vs. He writes
• Aspect – She is running (progressive)
• Case – He (subjective) vs. Him (objective)
Inflectional Morphology Example
Type | Regular Nouns | Irregular Nouns
Singular | cat, thrush | mouse, ox
Plural | cats, thrushes | mice, oxen
Notes:
• Regular nouns form plurals by adding -s or -es: cat → cats, thrush → thrushes
• Irregular nouns use a vowel change or other forms: mouse → mice, ox → oxen
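A minimal sketch of how the regular rule and the irregular exceptions might be combined in code. The exception table holds only the two irregular nouns from the example, and the -es rule is a simplification that ignores other English spelling patterns (such as -y → -ies).

```python
# Irregular plurals must be listed explicitly (toy table, assumption).
IRREGULAR_PLURALS = {"mouse": "mice", "ox": "oxen"}

def pluralize(noun):
    # 1. Irregular nouns: look the form up.
    if noun in IRREGULAR_PLURALS:
        return IRREGULAR_PLURALS[noun]
    # 2. Regular nouns ending in a sibilant sound take -es.
    if noun.endswith(("s", "sh", "ch", "x", "z")):
        return noun + "es"
    # 3. All other regular nouns take -s.
    return noun + "s"

for noun in ["cat", "thrush", "mouse", "ox"]:
    print(noun, "->", pluralize(noun))
```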
2.2 Derivational Morphology
• It involves the creation of new words (lexemes) by adding derivational affixes to a base or root word.
• Often results in a change of word class (part of speech) and meaning.
• Key Features:
  • Creates new lexemes, not just variations of the same word.
  • Often changes the syntactic category (e.g., verb → noun).
  • Adds new, non-grammatical meaning to the base.
  • Less predictable than inflectional morphology.
  • Different from compounding, which combines full words (e.g., "toothbrush").
Derivational Morphology Examples
Suffix | Base (Verb/Adjective) | Derived Noun
-ation | computerize (V) | computerization
-ee | appoint (V) | appointee
-er | kill (V) | killer
-ness | fuzzy (A) | fuzziness

• These suffixes transform verbs/adjectives into nouns.
• Each derived word has a meaning related to the base, but with added nuance or a role.
• Regular verbs inflect predictably, e.g., walk, walks, walking, walked.
• Irregular verbs do not follow the regular -ed pattern, e.g., eat, eats, eating, ate, eaten; catch, catches, catching, caught; cut, cuts, cutting, cut.
Feature | Inflectional Morphology | Derivational Morphology
Changes meaning | ❌ No | ✅ Yes
Changes word class | ❌ No | ✅ Often
Predictable and rule-based | ✅ Yes | ❌ Less predictable
Example | write → writes | write → writer
5. Stemming
• Definition: Stemming is a heuristic process
that chops off suffixes and sometimes
prefixes from words to obtain a "stem,"
which is often a crude form of the word's
root. The resulting stem may not necessarily
be a valid word in the language.
• Algorithms apply a set of rules to remove
common affixes.
• E.g., the Porter Stemmer uses rules to remove English suffixes like "-ing," "-ed," or "-s".
• Example: The words "running," "runs," and "runner" might all be stemmed to "run," depending on the stemmer's rules. However, "universal," "university," and "universe" could all be stemmed to "univers," which isn't a valid word and conflates different meanings.
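A minimal sketch of rule-based suffix stripping. This is a toy rule set written for illustration, not the Porter algorithm, and the function name and length threshold are assumptions.

```python
def crude_stem(word):
    """Strip one common suffix; a toy rule set, NOT the Porter algorithm."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        # Only strip if at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[:-len(suffix)]
            # Undo consonant doubling, e.g. "runn" -> "run".
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiouls":
                stem = stem[:-1]
            return stem
    return word

for w in ["running", "runs", "walked", "arguing"]:
    print(w, "->", crude_stem(w))
```

Note that "arguing" becomes "argu", which is not a dictionary word. This is the typical trade-off of stemming: the rules are fast but blind to the vocabulary.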
Lemmatization
• Definition: Lemmatization is a more sophisticated
process that reduces words to their "lemma," which is
the base or dictionary form of a word, always ensuring
the output is a valid word.
• Lemmatization uses a vocabulary (like WordNet) and
morphological analysis, often incorporating Part-of-
Speech (POS) tagging to understand the word's context
and return its correct base form.
• Example: "Running," "ran," and "runs" would all be
lemmatized to "run." Crucially, "better" would be
lemmatized to "good," recognizing its base form as an
adjective.
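A minimal sketch of dictionary-based lemmatization keyed by word and part of speech. The lookup table is a hypothetical stand-in for a real resource such as WordNet, and the POS labels are assumptions. It only illustrates why the POS tag matters.

```python
# Hypothetical mini-dictionary: (surface form, POS) -> lemma.
LEMMA_TABLE = {
    ("running", "VERB"): "run",
    ("ran", "VERB"): "run",
    ("runs", "VERB"): "run",
    ("better", "ADJ"): "good",
    ("better", "VERB"): "better",  # as in "to better oneself"
}

def lemmatize(word, pos):
    # Fall back to the lowercased word when the dictionary has no entry.
    return LEMMA_TABLE.get((word.lower(), pos), word.lower())

print(lemmatize("Running", "VERB"))
print(lemmatize("better", "ADJ"))
print(lemmatize("better", "VERB"))
```

The same surface form "better" maps to different lemmas depending on its tag, which a pure suffix-stripping stemmer cannot capture.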
Stemming vs. Lemmatization – Key Differences
Aspect | Stemming | Lemmatization
Approach | Rule-based (removal of suffixes/prefixes) | Dictionary-based with morphological and context analysis
Output | May produce non-words (e.g., "argu" from "arguing") | Always yields valid dictionary words (e.g., "good" from "better")
Speed | Faster (simpler and lightweight) | Slower (requires lookups and POS tagging)
Accuracy | Less accurate, may over- or under-stem | More accurate, context- and grammar-aware
• Stemming is good for speed-focused applications (e.g., search engines).
• Lemmatization is better for semantic accuracy and linguistic correctness.
• When meaning matters, lemmatization is generally preferred over stemming.
Regular Expression
A regular expression (regex) is a sequence of characters that forms a search pattern. It is used to match, extract, replace, or validate text based on specific patterns.
Uses of Regular Expressions in NLP and Programming
• Tokenizing text
• Searching for specific word patterns (e.g., emails, phone numbers)
• Cleaning data (e.g., removing punctuation)
• Validating inputs (e.g., password rules, dates)
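The four uses listed above can be sketched with Python's standard `re` module. The sample text and the patterns are simplified assumptions. Real-world email, phone, and date formats are more varied than these patterns allow.

```python
import re

text = "Contact us at info@example.com or call 555-123-4567!"

# Tokenizing: pull out runs of word characters.
tokens = re.findall(r"\w+", "Hello, world!")

# Searching: simplified email and phone-number patterns.
emails = re.findall(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", text)
phones = re.findall(r"\b\d{3}-\d{3}-\d{4}\b", text)

# Cleaning: remove everything that is neither a word character nor whitespace.
no_punct = re.sub(r"[^\w\s]", "", text)

# Validating: the whole string must look like YYYY-MM-DD.
is_date = bool(re.fullmatch(r"\d{4}-\d{2}-\d{2}", "2024-01-31"))

print(tokens, emails, phones, no_punct, is_date, sep="\n")
```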
Finite automata
• Finite automata (FA), also known as finite state
automata (FSA), are abstract machines that play a
foundational role in NLP, particularly in tasks involving
patterns and regular languages.
• They are computational models used to recognize
patterns in sequences of input symbols
A Simple Finite Automaton
• A finite automaton is a 5-tuple (Q, Σ, q₀, F, δ)
• Q = set of all states = {q0, q1, q2, q3}
• Σ = input alphabet = {0, 1}
• q₀ = start state / initial state = q0
• F = set of final states = {q3}
• δ = transition function from Q × Σ → Q:
State | 0 | 1
q0 | q1 | q3
q1 | q0 | q2
q2 | q3 | q1
q3 | q2 | q0
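A minimal sketch that simulates the automaton defined above, with the transition function stored as a Python dictionary. The function and variable names are assumptions. Tracing the table suggests that this machine ends in q3 exactly when the input contains an even number of 0s and an odd number of 1s.

```python
# Transition function δ: (state, input symbol) -> next state.
TRANSITIONS = {
    ("q0", "0"): "q1", ("q0", "1"): "q3",
    ("q1", "0"): "q0", ("q1", "1"): "q2",
    ("q2", "0"): "q3", ("q2", "1"): "q1",
    ("q3", "0"): "q2", ("q3", "1"): "q0",
}
START = "q0"
FINAL = {"q3"}

def accepts(string):
    state = START
    for symbol in string:
        if (state, symbol) not in TRANSITIONS:
            return False  # symbol outside the alphabet {0, 1}
        state = TRANSITIONS[(state, symbol)]
    return state in FINAL

for s in ["1", "010", "", "11", "01"]:
    print(repr(s), accepts(s))
```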
Finite State Automaton (FSA) for "baa+!"
An FSA designed to accept the strings described by the pattern "baa+!" (a "b", followed by two or more "a"s, followed by "!"), specifically:
✅ Accepted Strings:
baa!
baaa!
baaaaa...!

❌ Rejected Strings:
ba!
b!
Transition Table
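A minimal sketch of one possible transition table for this automaton, written as a Python dictionary. The state names q0 to q4 and the function name are assumptions, not taken from the slides. Missing entries in the dictionary mean the input is rejected.

```python
# (state, symbol) -> next state; q3 loops on "a" to allow any number of extra a's.
SHEEP_TRANSITIONS = {
    ("q0", "b"): "q1",
    ("q1", "a"): "q2",
    ("q2", "a"): "q3",
    ("q3", "a"): "q3",
    ("q3", "!"): "q4",
}
SHEEP_START = "q0"
SHEEP_FINAL = {"q4"}

def accepts_sheep(string):
    state = SHEEP_START
    for symbol in string:
        state = SHEEP_TRANSITIONS.get((state, symbol))
        if state is None:
            return False  # no transition defined: reject
    return state in SHEEP_FINAL

for s in ["baa!", "baaa!", "ba!", "b!"]:
    print(s, accepts_sheep(s))
```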
N-gram Model (N-gram Language Model: Exercises Using Bi-gram, Tri-gram & Four-gram)
Understanding language patterns through word sequences
Understanding N-gram Models
Definition of N-gram
Definition and Scope
An n-gram is a sequence of n contiguous
items from text or speech, such as words or
letters.

Applications in NLP
N-grams are essential for tasks like text
prediction, speech recognition, and machine
translation.

Probability Modeling
N-grams model sequence probabilities to help
predict or evaluate language data
computationally.
Conditional Probability in N-gram Models
Conditional Probability Formula
Conditional probability calculates the
likelihood of an event given a preceding event
using P(B|A) = P(A,B)/P(A).

Joint Probability Decomposition

Joint probability can be decomposed into P(A)
multiplied by P(B|A), simplifying complex
probability calculations.

Bigram Model Simplification

Bigram models predict a word based solely on
the immediately preceding word, capturing
useful context efficiently.
Multi-variable Probability Formulation
Extending N-gram Models
N-gram models extend to multiple variables to
capture deeper contextual relationships in
language sequences.

Probability Chain Formula

The multi-variable probability is expressed as
a chain of conditional probabilities capturing
sequence dependencies.

Applications and Challenges

Multi-variable models enhance speech
recognition and text generation but increase
computational complexity.
N-gram Model
• An n-gram is a contiguous sequence of n items from a given sample of text
or speech. The items can be phonemes, syllables, letters, words, or base
pairs according to the application. The n-grams typically are collected from
a text or speech corpus.
• Conditional Probability:
  P(B | A) = P(A, B) / P(A)
• And therefore:
  P(A, B) = P(A) ⋅ P(B | A)
• More variables:
  P(A, B, C, D) = P(A) ⋅ P(B | A) ⋅ P(C | A, B) ⋅ P(D | A, B, C)
• Chain Rule:
  P(X1, X2, ... , Xn) = P(X1) ⋅ P(X2 | X1) ⋅ P(X3 | X1, X2) ⋅ ... ⋅ P(Xn | X1, ... , Xn-1)
• Example:
  P(about five minutes from) = P(about) × P(five | about) × P(minutes | about five) × P(from | about five minutes)
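A minimal sketch of the chain rule applied to the phrase above, estimating each conditional probability as a ratio of sequence counts. The three-sentence corpus is hypothetical, and the function names are assumptions. The estimate is a plain relative frequency with no smoothing.

```python
# Hypothetical toy corpus (assumption), already lowercased.
corpus = [
    "about five minutes from now",
    "about five minutes from home",
    "about ten minutes from now",
]
sentences = [s.split() for s in corpus]

def count_seq(seq):
    """Count how often the word sequence `seq` occurs contiguously in the corpus."""
    n = len(seq)
    return sum(
        1
        for sent in sentences
        for i in range(len(sent) - n + 1)
        if sent[i:i + n] == seq
    )

def chain_rule_prob(words):
    total_words = sum(len(s) for s in sentences)
    prob = count_seq(words[:1]) / total_words          # P(w1)
    for i in range(1, len(words)):
        history = count_seq(words[:i])
        if history == 0:
            return 0.0
        prob *= count_seq(words[:i + 1]) / history     # P(wi | w1 ... wi-1)
    return prob

p = chain_rule_prob("about five minutes from".split())
print(p)
```

Here P(about) = 3/15, P(five | about) = 2/3, and the remaining two factors are 1, so the product is 2/15.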
Probability of words in sentences
• P(W1, W2, ... , Wn) = Π P(Wi | W1, W2, ... , Wi-1)
• Unigram (1-gram): No history is used.
• Bigram (2-gram): One word of history.
• Trigram (3-gram): Two words of history.
• Four-gram (4-gram): Three words of history.
• Five-gram (5-gram): Four words of history.
• If you want to compute the probability of the sentence "I love Mumbai", you would calculate:
  P(I love Mumbai) = P(I) × P(love | I) × P(Mumbai | I love)
Practical applications of N-grams
• Usually bi-grams (previous one word), tri-grams (previous two words), and four-grams (previous three words) are used.
Unigram (1-gram) Model
• Unigram (1-gram): No history is used.
• Context: “about five minutes from ...”
• Assume the word “dinner” has the highest probability in the corpus.
• A unigram model does not take the preceding words (such as “from” or “minutes”) into account.
• The unigram model will therefore predict “dinner”: “about five minutes from dinner”
Bigram-Based Word Prediction
Example Sentences and Word Frequencies
Next Word Prediction Probabilities
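A minimal sketch of bigram-based next-word prediction. The corpus below is a hypothetical stand-in chosen for illustration, and the function names are assumptions. Probabilities are estimated as count(previous word, next word) / count(previous word followed by any word), with no smoothing.

```python
from collections import Counter, defaultdict

# Hypothetical example sentences (assumption).
corpus = [
    "about five minutes from home",
    "about five minutes from now",
    "five minutes from now",
]

# bigram_counts[prev][nxt] = number of times `nxt` follows `prev`.
bigram_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        bigram_counts[prev][nxt] += 1

def next_word_probs(prev):
    counts = bigram_counts[prev]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}

def predict_next(prev):
    probs = next_word_probs(prev)
    return max(probs, key=probs.get) if probs else None

print(next_word_probs("from"))
print(predict_next("from"))
```

Unlike a unigram model, the prediction here depends on the preceding word: after "from", the toy corpus favors "now" (2 of 3 occurrences) over "home".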
• You are given the sentence: "I liek college"
• The word "liek" is likely misspelled. Use a trigram model to find the most probable correction from the following candidates:
  • like
  • lick
  • lie
  • life
• Corpus:
  • I like college
  • I like Henry
  • Do I like college
  • I do like Henry
  • Henry likes college
1. Trigram Context
• We are interested in the trigram: "I [candidate] college"
2. Count Trigrams in Corpus
• From the corpus, we extract trigram frequencies:
  • "I like college" → appears 2 times (once in "I like college" and once inside "Do I like college")
  • "I like Henry" → appears 1 time
  • "I do like Henry" → yields "I do like" and "do like Henry" (not relevant here)
  • "Henry likes college" → not relevant
• So:
  • "I like college" = 2 occurrences
  • "I lick college" = 0
  • "I lie college" = 0
  • "I life college" = 0
3. Apply Probabilities
Assuming raw frequency as probability:
Candidate | Trigram | Count | Probability
like | I like college | 2 | High
lick | I lick college | 0 | Low
lie | I lie college | 0 | Low
life | I life college | 0 | Low
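The counting steps above can be sketched in a few lines of Python. The corpus and candidates are taken from the exercise. The variable names are assumptions, and raw trigram counts stand in for probabilities, as in the table.

```python
from collections import Counter

corpus = [
    "I like college",
    "I like Henry",
    "Do I like college",
    "I do like Henry",
    "Henry likes college",
]

# Count every trigram (three consecutive words) in the corpus.
trigram_counts = Counter()
for sentence in corpus:
    words = sentence.split()
    trigram_counts.update(zip(words, words[1:], words[2:]))

# Score each candidate by the count of the trigram "I <candidate> college".
candidates = ["like", "lick", "lie", "life"]
scores = {c: trigram_counts[("I", c, "college")] for c in candidates}
best = max(scores, key=scores.get)

print(scores)
print(best)
```

With zero counts for three of the four candidates, a real system would normally add smoothing so that unseen trigrams do not get probability zero. That step is left out of this sketch.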

Answer:
The most probable correction is "like", giving the
corrected sentence:
"I like college"
