Word Level Analysis and Morphology

Uploaded by

ronit.jariwala9755

Module 2 - Part 1

Word Level Analysis

2 Word Level Analysis
• Basic Terms: Tokenization, Stemming, Lemmatization
• Survey of English Morphology
• Inflectional Morphology, Derivational Morphology
• Regular expressions with types
• Morphological Models: Dictionary lookup
• Finite-state morphology
• Morphological parsing with FST (Finite State Transducer)
• Lexicon-free FST: Porter Stemmer algorithm
3 Word Level Analysis
• N-grams and their variations: Bigram, Trigram; Simple (Unsmoothed) N-grams
• N-gram Sensitivity to the Training Corpus
• Unknown Words: Open versus closed vocabulary tasks
• Evaluating N-grams: Perplexity
• Smoothing: Laplace Smoothing
• Good-Turing Discounting
4 Morphology
• Studies how words are constructed from subword units.
• The study of the internal structure of words.
• “A writer is someone who writes, and a stinger is something that stings. But fingers don’t fing, grocers don’t groce, haberdashers don’t haberdash, hammers don’t ham”
  - Richard Lederer
5 Need for Morphological Analysis
• Information retrieval: search
  • Systems benefit from being able to search for both singular and plural forms of search terms
  • Generally fairly easy in English
• Complications: irregular plurals are handled via morphological rules
  1. goose -> geese
  2. fish -> fish
  3. ox -> oxen
• Spelling rules are needed
  1. fox +PL -> foxes
  2. fly +PL -> flies
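The irregular forms and spelling rules on this slide can be sketched in a few lines of Python. This is only a minimal illustration: the irregular table holds just the three slide examples, the function name `pluralize` is hypothetical, and the two spelling rules (e-insertion after sibilants, y -> ie after a consonant) cover only the common cases.

```python
# Toy irregular-plural table with only the examples from the slide (an assumption;
# a real system would need a much larger lexicon).
IRREGULAR_PLURALS = {"goose": "geese", "fish": "fish", "ox": "oxen"}

VOWELS = "aeiou"


def pluralize(noun: str) -> str:
    """Return the plural of an English noun using a lookup plus two spelling rules."""
    # Irregular forms are checked first, so "ox" never reaches the -es rule.
    if noun in IRREGULAR_PLURALS:
        return IRREGULAR_PLURALS[noun]
    # E-insertion: fox +PL -> foxes
    if noun.endswith(("s", "x", "z", "ch", "sh")):
        return noun + "es"
    # y -> ie after a consonant: fly +PL -> flies (but boy -> boys)
    if len(noun) > 1 and noun.endswith("y") and noun[-2] not in VOWELS:
        return noun[:-1] + "ies"
    # Default: just add -s
    return noun + "s"


for word in ["goose", "fish", "ox", "fox", "fly", "cat"]:
    print(word, "->", pluralize(word))
```

The lookup comes before the rules on purpose: exceptions have to override the productive pattern, otherwise ox would become "oxes".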
6 Need for Morphological Analysis
• Listing all of the plural forms of English nouns, all of the verb forms for a particular stem, etc. is a waste of space (and of time, if the entries are being made by hand).
• Suffixes are productive
• The situation is much worse in other languages,
  • e.g. agglutinative languages like Turkish
7 Morpheme
• Morphology is the study of the structure and formation of words.
• Its most important unit is the morpheme.
• Morpheme: the smallest meaningful unit in a language; the minimal unit of grammatical analysis; the minimal unit of meaning.
• A morpheme cannot be broken down further into smaller meaningful parts.
• E.g.: cat -> cats
  • Category
  • Categorize
  • Categorized
• E.g.: overestimating
  • Keyboard
  • Cranberry (1 morpheme)
8 Types of Free Morpheme
• Free morphemes can stand alone as individual words in the language.
• E.g.: pen, bottle, vital, laugh
• Lexical morphemes:
  - free morphemes that carry the content of our utterances
  - nouns, verbs, adjectives, adverbs
  - open class: new members can be added to these categories, e.g., slang
• Functional morphemes:
  - help to connect words together in the sentence
  - prepositions, conjunctions and articles
9 Bound Morphemes
• Cannot stand alone as individual words
• Affixes – supply “additional” meanings
  • Prefixes – precede the stem
  • Suffixes – follow the stem
  • Circumfixes – precede and follow the stem
  • Infixes – inserted inside the stem
• Types of bound morphemes
  1. Derivational morphemes
  2. Inflectional morphemes
10 Derivational Morphemes
• Help create new words in the language
• Can change the lexical category (part of speech) from one to another
• Combination of a word stem with a grammatical morpheme, usually resulting in a word of a different class, often with a meaning that is hard to predict exactly.
• Nominalization
  • organize (V) + -ation = organization (new word with a different meaning)
  • kill (V) + -er = killer (N)
  • silly (ADJ) + -ness = silliness (N)
  • teach (V) + -er = teacher (N)
• pre- + determine (V) = predetermine (new word)
11 Inflectional Morphemes
• Serve a grammatical role in the language
• Do not create new words in the language
• Do not change the lexical category (part of speech) from one to another
• Inflection
  • Combination of a word stem with a grammatical morpheme, usually resulting in a word of the same class, and usually filling some syntactic function.
• There are only eight inflectional morphemes in English.
12 Eight Inflectional Morphemes
1. Plural: -s
2. Possessive: -'s
3. 3rd person singular: -s
4. Past tense: -ed
5. Present participle: -ing
6. Past participle: -en
7. Comparative: -er
8. Superlative: -est
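As a rough illustration of why this list alone is not enough for analysis, the sketch below matches the eight suffixes against the end of a word and returns every candidate reading. The function name `candidate_inflections` is hypothetical, and the matching is deliberately naive: it uses no lexicon, so it over-generates (it would also split "red" into "r" + -ed).

```python
# The eight English inflectional suffixes, paired with their grammatical labels.
INFLECTIONAL_SUFFIXES = [
    ("'s", "possessive"),
    ("s", "plural"),
    ("s", "3rd person singular"),
    ("ed", "past tense"),
    ("ing", "present participle"),
    ("en", "past participle"),
    ("er", "comparative"),
    ("est", "superlative"),
]


def candidate_inflections(word: str) -> list[tuple[str, str]]:
    """Return (remaining stem, label) for every suffix the word ends with.

    No lexicon is consulted, so the result is a list of candidates, not a parse.
    """
    return [
        (word[: -len(suffix)], label)
        for suffix, label in INFLECTIONAL_SUFFIXES
        if word.endswith(suffix) and len(word) > len(suffix)
    ]


for word in ["walks", "walked", "taller", "tallest"]:
    print(word, "->", candidate_inflections(word))
```

Note that "walks" comes back with two readings (plural and 3rd person singular): the same surface suffix -s realizes different morphemes, and only the stem's word class can disambiguate.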
13 Lemmatization
• Stemming and lemmatization were developed in the 1960s.
• They are text normalization and text mining procedures in the field of Natural Language Processing.
• They are applied to prepare text, words, and documents for further processing (pre-processing).
• They are widely used for tagging, SEO, web search results, and information retrieval.
• Text normalization means converting text into a more convenient, standard form.
14 Lemmatization
• Lemmatization is one of the most common text pre-processing techniques used in Natural Language Processing (NLP) and machine learning.
• Lemmatization is the task of determining that two words have the same root, despite their surface differences.
• For example,
  • sang, sung, and sings are forms of the verb sing.
  • The word sing is the common lemma of these words, and a lemmatizer maps all of them to sing.
15 Lemmatization
• For grammatical reasons, documents are going to use different forms of a word, such as organize, organizes, and organizing.
• Additionally, there are families of derivationally related words with similar meanings, such as democracy, democratic, and democratization.
• In many situations, it seems as if it would be useful for a search for one of these words to return documents that contain another word in the set.
• The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. For instance:
  • am, are, is => be
  • car, cars, car's, cars' => car
16 Stemming vs. Lemmatization
1. Stemming: faster, because it chops words without knowing the context of the word in the given sentence.
   Lemmatization: slower than stemming, but it takes the context of the word into account before proceeding.
2. Stemming: a rule-based approach.
   Lemmatization: a dictionary-based approach.
3. Stemming: accuracy is lower.
   Lemmatization: accuracy is higher than with stemming.
4. Stemming: converting a word to its root form may produce a form that does not exist as a word.
   Lemmatization: always yields a dictionary word when converting to the root form.
5. Stemming: preferred when the meaning of the word is not important for the analysis. Example: spam detection.
   Lemmatization: recommended when the meaning of the word is important for the analysis. Example: question answering.
6. Stemming: “Studies” => “Studi”
   Lemmatization: “Studies” => “Study”
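The contrast in the comparison above can be sketched without any NLP library. In the sketch below, `crude_stem` applies only the four plural-suffix rules of Porter's step 1a (SSES -> SS, IES -> I, SS -> SS, S -> nothing), and `lemmatize` is a plain dictionary lookup. Both function names and the tiny `LEMMA_DICT` are assumptions made for illustration; this is neither the full Porter algorithm nor a real lemmatizer.

```python
# Rule-based stemming: the suffix rules of Porter's step 1a, tried longest first.
STEP_1A_RULES = (("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", ""))


def crude_stem(word: str) -> str:
    """Chop a plural-like suffix by rule; the result may not be a real word."""
    word = word.lower()
    for suffix, replacement in STEP_1A_RULES:
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word


# Dictionary-based lemmatization: a hypothetical toy lemma table.
LEMMA_DICT = {
    "studies": "study",
    "am": "be", "are": "be", "is": "be",
    "sang": "sing", "sung": "sing", "sings": "sing",
}


def lemmatize(word: str) -> str:
    """Look the word up; unknown words are returned unchanged."""
    word = word.lower()
    return LEMMA_DICT.get(word, word)


print(crude_stem("Studies"), lemmatize("Studies"))
print(crude_stem("sang"), lemmatize("sang"))
```

The stemmer needs no data but returns the non-word "studi"; the lemmatizer returns "study" and handles irregular forms like "sang", but only for words its dictionary happens to contain.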
17 Finite Automata
• The theory of automata provides efficient and convenient tools for representing linguistic phenomena.
• It plays a significant role in solving many problems in natural language processing, for example speech recognition, spelling correction, and information retrieval.
• Finite-state methods are useful in processing natural language because modeling information with rules has many advantages for language modeling.
• A finite-state automaton is a mathematical model that allows data to be represented in compact form.
18
• A Finite Automaton (FA) is the simplest machine for recognizing patterns. It is used to characterize a regular language, for example /baa+!/. It is also used to analyze and recognize natural language expressions.
• It has a set of states and rules for moving from one state to another, depending on the input symbol applied. Based on the states and the set of rules, the input string is either accepted or rejected. Basically, it is an abstract model of a digital computer that reads an input string and changes its internal state depending on the current input symbol.
• For example, construct a DFA which accepts the language of all strings ending with 'a'.
  Given: Σ = {a, b}, Q = {q0, q1}, start state q0, F = {q1}
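The DFA from this example can be simulated directly. The slide's state diagram did not survive extraction, so the transition table below is a reconstruction of the standard two-state machine for this language (an assumption, not the slide's own figure): q1 means "the last symbol read was a". The regular expression /baa+!/ mentioned earlier is checked with Python's `re` module for comparison.

```python
import re

# Reconstructed transition function for "all strings over {a, b} ending in 'a'".
DELTA = {
    ("q0", "a"): "q1",
    ("q0", "b"): "q0",
    ("q1", "a"): "q1",
    ("q1", "b"): "q0",
}
START = "q0"
FINAL = {"q1"}


def accepts(string: str) -> bool:
    """Run the DFA over the string and report whether it halts in a final state."""
    state = START
    for symbol in string:
        if (state, symbol) not in DELTA:
            return False  # symbol outside the alphabet: reject
        state = DELTA[(state, symbol)]
    return state in FINAL


# The regular language /baa+!/ expressed as a Python pattern.
SHEEP_TALK = re.compile(r"baa+!")

print(accepts("abba"), accepts("ab"))
print(bool(SHEEP_TALK.fullmatch("baaa!")), bool(SHEEP_TALK.fullmatch("ba!")))
```

`fullmatch` is used rather than `search` so that the whole string must belong to the language, which mirrors what acceptance by an automaton means.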
19 Formal languages
20 Automata and Languages
• An automaton is an abstract model of a computer which reads an input string and changes its internal state depending on the current input symbol.
• It can either accept or reject the input string.
• Every automaton defines a language (the set of strings it accepts).
• Different automata define different language classes:
  • Finite-state automata define regular languages
  • Pushdown automata define context-free languages
  • Turing machines define recursively enumerable languages
23 Finite State Automata (FSAs)
24 State Transition Diagram
25 Finite-State Methods for Morphology
26 Finite-State Morphological Parsing
27 Finite State Automata for Morphology
28 Union: Merging Automata
29 Stem Changes
30 Finite Automata: Language Recognizer

Fig: NFA for the words Boy and Bat

32 FSAs for derivational morphology
33 Recognition vs. Analysis
• FSAs can recognize (accept) a string, but they don’t provide its internal structure.
• Transducer: a machine that maps (transduces) the input string into an output string that encodes its structure.
34 Applications
FSTs are used for a variety of different applications:
• Word inflections, for example pluralizing words (cat -> cats, dog -> dogs, goose -> geese, etc.)
• Morphological parsing, i.e., extracting the “properties” of a word (e.g., computers -> computer + [Noun] + [Plural])
• Simple word translation, e.g., translating US English to UK English
• Simple commands made to a computer
37 Finite-State Morphological Parsing
• We need at least the following to build a morphological parser:
1. Lexicon: the list of stems and affixes, together with basic information about them (Noun stem or Verb stem, etc.)
2. Morphotactics: the model of morpheme ordering that explains which classes of morphemes can follow other classes of morphemes inside a word. E.g., the rule that the English plural morpheme follows the noun rather than preceding it.
3. Orthographic rules: these spelling rules are used to model the changes that occur in a word, usually when two morphemes combine (e.g., the y→ie spelling rule changes city + -s to cities).
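The three ingredients can be sketched together in one small parser for English nouns. Everything named below is illustrative: `NOUN_STEMS` is a toy lexicon, the morphotactics are hard-coded as "noun stem, optionally followed by the plural", and the orthographic rules are limited to e-insertion and y→ie. The parser works by generate-and-test (it spells out each stem's plural and compares it with the input), which is a simplification of this sketch and not how a finite-state parser is actually built.

```python
# 1. Lexicon: a hypothetical handful of noun stems.
NOUN_STEMS = {"city", "fox", "cat", "fly"}


# 3. Orthographic rules: how stem + plural -s is actually spelled.
def plural_surface(stem: str) -> str:
    if stem.endswith(("s", "x", "z", "ch", "sh")):
        return stem + "es"                      # e-insertion: fox + -s -> foxes
    if len(stem) > 1 and stem.endswith("y") and stem[-2] not in "aeiou":
        return stem[:-1] + "ies"                # y -> ie: city + -s -> cities
    return stem + "s"


# 2. Morphotactics: a noun is a stem (+SG) or a stem followed by the plural (+PL).
def parse_noun(word: str) -> list[str]:
    """Return every analysis of `word` licensed by the lexicon and the rules."""
    analyses = []
    for stem in sorted(NOUN_STEMS):
        if word == stem:
            analyses.append(f"{stem} +N +SG")
        if word == plural_surface(stem):
            analyses.append(f"{stem} +N +PL")
    return analyses


for word in ["cities", "foxes", "cat", "foxs"]:
    print(word, "->", parse_noun(word))
```

Because the spelling rules are applied while generating, a misspelled form such as "foxs" receives no analysis, even though its stem is in the lexicon.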
38 Construction of a Finite State Lexicon
• A lexicon is a repository for words.
• The simplest one would consist of an explicit list of every word of the language.
  • Inconvenient or impossible!
• Computational lexicons are usually structured with
  • a list of each of the stems
  • the affixes of the language, together with a representation of morphotactics telling us how they can fit together.
• The most common way of modeling morphotactics is the finite-state automaton.

Reg-noun    Irreg-pl-noun    Irreg-sg-noun    Plural
fox         geese            goose            -s
fat         sheep            sheep
fog         mice             mouse
fardvark
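The noun lexicon above can drive a small finite-state automaton over morpheme classes. The arcs below follow the usual textbook layout for English nominal inflection (regular noun, optionally followed by the plural; irregular singular and plural forms go straight to a final state). Since the slide's diagram did not survive extraction, the state names and arcs are an assumption of this sketch.

```python
# Morpheme classes and their members, as listed in the lexicon table.
LEXICON = {
    "reg-noun": {"fox", "fat", "fog", "fardvark"},
    "irreg-pl-noun": {"geese", "sheep", "mice"},
    "irreg-sg-noun": {"goose", "sheep", "mouse"},
    "plural": {"-s"},
}

# Arcs are labelled with morpheme classes, not with individual words.
ARCS = {
    ("q0", "reg-noun"): "q1",
    ("q0", "irreg-pl-noun"): "q2",
    ("q0", "irreg-sg-noun"): "q2",
    ("q1", "plural"): "q2",
}
START = "q0"
FINAL = {"q1", "q2"}


def accepts(morphemes: list[str]) -> bool:
    """Check whether a sequence of morphemes is a legal noun form."""
    states = {START}  # a set, because "sheep" belongs to two classes
    for morpheme in morphemes:
        states = {
            target
            for (source, cls), target in ARCS.items()
            if source in states and morpheme in LEXICON[cls]
        }
    return bool(states & FINAL)


print(accepts(["fox", "-s"]), accepts(["geese", "-s"]), accepts(["goose"]))
```

Labelling arcs with classes keeps the automaton small: adding a new regular noun changes only the lexicon, not the machine.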
39 Construction of a Finite state Lexicon

Reg-verb-stem    Irreg-verb-stem    Irreg-past-verb    Past    Past-part    Pres-part    3sg
walk             cut                caught             -ed     -ed          -ing         -s
fry              speak              ate
talk             sing               eaten
impeach          sang
                 spoken
40 Construction of a Finite State Lexicon
41 Construction of a Finite State Lexicon
42 Finite-State Morphological Parsing

An FSA for another fragment of English derivational morphology

43 Finite-State Morphological Parsing
44 Finite - State Transducers
45 Finite State Transducers (FST)
• An FST is like an FSA, but it defines regular relations, not regular languages
• It has two alphabet sets
• It has a transition function relating input to states
• It has an output function relating state and input to output
• It can be used to recognize, generate, translate or relate sets.
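The components listed above map onto a small data structure. In the sketch below the transition function and the output function are stored together in one table, `delta[(state, input)] = (next_state, output)`. The machine itself is a made-up, lexicon-free toy that turns lexical symbols such as `c a t +N +PL` into the surface string `cats`; the names `build_noun_fst` and `transduce` are hypothetical, and no spelling rules are applied.

```python
import string


def build_noun_fst() -> dict:
    """Build delta: (state, input symbol) -> (next state, output string)."""
    delta = {}
    # Input alphabet: letters plus the tags +N, +SG, +PL. Output alphabet: letters.
    for letter in string.ascii_lowercase:
        delta[("q0", letter)] = ("q0", letter)  # copy stem letters unchanged
    delta[("q0", "+N")] = ("q1", "")            # the category tag has no surface form
    delta[("q1", "+SG")] = ("q2", "")           # singular: nothing is added
    delta[("q1", "+PL")] = ("q2", "s")          # plural: emit -s
    return delta


def transduce(delta, symbols, start="q0", finals=frozenset({"q2"})):
    """Translate an input symbol sequence; return None if it is not accepted."""
    state, output = start, []
    for symbol in symbols:
        if (state, symbol) not in delta:
            return None
        state, emitted = delta[(state, symbol)]
        output.append(emitted)
    return "".join(output) if state in finals else None


fst = build_noun_fst()
print(transduce(fst, list("cat") + ["+N", "+PL"]))
print(transduce(fst, list("cat") + ["+N", "+SG"]))
```

An input such as `c a t` with no tags ends in a non-final state and is rejected, which is the recognizer half of the machine; the emitted string is the translator half.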
46 Finite - State Transducers
47 Morphological parsing with FST
• The objective of morphological parsing is to produce, for a single input word (its surface form), the corresponding output lexical form or forms (the stem plus its morphological features).
48 Morphological parsing with FST
• Lexical: represents a simple concatenation of the morphemes making up a word
• Surface: represents the actual spelling of the final word.
49 Morphological Parsing with FST
• The FST is a multi-function device and can be viewed in the following ways:
  • Translator: it reads one string on one tape and outputs another string.
  • Recognizer: it takes a pair of strings on two tapes and accepts or rejects the pair according to whether the strings match.
  • Generator: it outputs pairs of strings on two tapes, along with a yes/no result indicating whether they match.
  • Relater: it computes the relation between two sets of strings on the two tapes.
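The four views can be made concrete by letting a finite set of (lexical, surface) string pairs stand in for the transducer. This is a deliberate simplification: a real FST encodes a possibly infinite relation in its states and arcs. The pairs and the function names below are invented for illustration.

```python
# A tiny regular relation: pairs of (lexical tape, surface tape).
RELATION = {
    ("cat +N +SG", "cat"),
    ("cat +N +PL", "cats"),
    ("fox +N +PL", "foxes"),
    ("sheep +N +SG", "sheep"),
    ("sheep +N +PL", "sheep"),
}


def recognize(lexical: str, surface: str) -> bool:
    """Recognizer: accept or reject a pair of strings."""
    return (lexical, surface) in RELATION


def generate() -> list[tuple[str, str]]:
    """Generator: output the pairs of strings that belong to the relation."""
    return sorted(RELATION)


def translate(lexical: str) -> list[str]:
    """Translator: read a string on one tape, output the string(s) on the other."""
    return sorted(surface for lex, surface in RELATION if lex == lexical)


def relate(lexical_set: set[str]) -> set[str]:
    """Relater: map a set of lexical strings to the set of related surface strings."""
    return {surface for lex, surface in RELATION if lex in lexical_set}


print(recognize("cat +N +PL", "cats"))
print(translate("fox +N +PL"))
print(relate({"sheep +N +SG", "sheep +N +PL"}))
```

All four functions read the same relation; only the question asked of it changes.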
50 Morphological Parsing with FST
• Inversion is useful for converting an FST used as a parser into an FST used as a generator; composition lets two transducers that run in series be replaced by a single transducer.
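Two standard operations on transducers are relevant here: inversion swaps the two tapes, and composition chains two transducers into one. The sketch below shows both on finite relations, again with a set of string pairs standing in for a real FST. The intermediate forms such as `fox^s#` (with `^` as a morpheme boundary and `#` as a word boundary) and all of the data are assumptions made for illustration.

```python
def invert(relation: set) -> set:
    """Swap input and output: a parser (surface -> lexical) becomes a generator."""
    return {(output, inp) for inp, output in relation}


def compose(first: set, second: set) -> set:
    """Feed the output of `first` into `second`, yielding one combined relation."""
    return {
        (inp, final)
        for inp, middle in first
        for middle2, final in second
        if middle == middle2
    }


# A toy parser relation: surface form -> lexical form.
PARSER = {("cats", "cat +N +PL"), ("foxes", "fox +N +PL")}

# Two toy stages run in series: lexical -> intermediate -> surface.
LEXICAL_TO_INTERMEDIATE = {("cat +N +PL", "cat^s#"), ("fox +N +PL", "fox^s#")}
INTERMEDIATE_TO_SURFACE = {("cat^s#", "cats"), ("fox^s#", "foxes")}

generator = invert(PARSER)
single_step = compose(LEXICAL_TO_INTERMEDIATE, INTERMEDIATE_TO_SURFACE)

print(sorted(generator))
print(single_step == generator)
```

In this toy setting the composed relation and the inverted parser coincide: both map lexical forms directly to surface forms.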
51 Morphological Parsing with FST
