Module 2 - Part 1
Word Level Analysis
2 Word Level Analysis
Basic Terms: Tokenization, Stemming, Lemmatization
Survey of English Morphology
Inflectional Morphology, Derivational Morphology
Regular expression with types
Morphological Models: Dictionary lookup
Finite state morphology
Morphological parsing with FST (Finite State Transducer)
Lexicon free FST Porter Stemmer algorithm
3 Word Level Analysis
N-grams and their variations: Bigram, Trigram; Simple (Unsmoothed) N-grams
N-gram Sensitivity to the Training Corpus
Unknown Words: Open versus closed vocabulary tasks
Evaluating N-grams: Perplexity
Smoothing: Laplace Smoothing
Good-Turing Discounting
4 Morphology
Studies how words are constructed from subword units.
The study of internal structure of words.
“A writer is someone who writes, and a
stinger is something that stings. But fingers
don’t fing, grocers don’t groce, haberdashers
don’t haberdash, hammers don’t ham”
-Richard Lederer
5 Need for Morphological Analysis
Information retrieval: search
Systems benefit from being able to search for
singular and plural forms of search terms
Generally fairly easy in English
Complications: irregular plurals are handled via
morphological rules
1. goose -> geese
2. fish -> fish
3. ox -> oxen
Spelling rules needed
1. fox +PL -> foxes
2. fly +PL -> flies
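The irregular plurals and spelling rules above can be sketched as a small function. This is only an illustration: the name `pluralize`, the table `IRREGULAR_PLURALS`, and the exact suffix conditions are assumptions made for this sketch, and it covers only the cases listed on the slide.

```python
# Toy pluralizer: irregular lookup first, then spelling rules, then plain -s.
# The names and the rule conditions are illustrative assumptions.
IRREGULAR_PLURALS = {"goose": "geese", "fish": "fish", "ox": "oxen"}
VOWELS = "aeiou"


def pluralize(noun: str) -> str:
    # Irregular plurals must be listed explicitly.
    if noun in IRREGULAR_PLURALS:
        return IRREGULAR_PLURALS[noun]
    # Spelling rule: insert "e" after s, x, z, ch, sh (fox +PL -> foxes).
    if noun.endswith(("s", "x", "z", "ch", "sh")):
        return noun + "es"
    # Spelling rule: consonant + y becomes "ies" (fly +PL -> flies).
    if len(noun) > 1 and noun.endswith("y") and noun[-2] not in VOWELS:
        return noun[:-1] + "ies"
    # Default regular plural.
    return noun + "s"


for word in ["goose", "fish", "ox", "fox", "fly", "cat"]:
    print(word, "->", pluralize(word))
```

The irregular table is checked first so that the regular rules never apply to listed exceptions.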
6 Need for Morphological Analysis
Listing all of the plural forms of English nouns,
all of the verb forms for a particular stem,
etc. is a waste of space (and of time, if the
entries are being made by hand).
Suffixes are productive
Situation is much worse in other languages,
e.g. agglutinative languages like Turkish
7 Morpheme
Morphology is the study of the structure and formation of
words.
Its most important unit is the morpheme.
Morpheme: the smallest meaningful unit in the language / minimal
unit of grammatical analysis / minimal unit of meaning.
A morpheme cannot be broken down further into smaller meaningful parts.
E.g.: cat -> cats
Category
Categorize
Categorized
E.g.: overestimating
Keyboard
Cranberry (1 Morpheme)
8 Types of free Morpheme
Free morphemes can stand alone as individual words in the
language.
E.g.: pen, bottle, vital, laugh
Lexical Morphemes:
- free morphemes that carry the content of our utterances
- nouns, verbs, adjectives, adverbs
- open class: new members can be added to these categories
E.g.: slang
Functional Morphemes:
- help to connect words together in the sentence
- prepositions, conjunctions and articles
9 Bound Morphemes
Cannot stand alone as individual words
Affixes – supply “additional” meanings
Prefixes – precede the stem
Suffixes – follow the stem
Circumfixes – precede and follow the stem
Infixes – inserted inside the stem
Types of bound morphemes
1. Derivational Morphemes
2. Inflectional Morphemes
10 Derivational Morphemes
Help to create new words in the language
Can change the lexical category from one to another
(parts of speech)
Combination of a word stem with a grammatical
morpheme, usually resulting in a word of a different
class, often with a meaning that’s hard to predict
exactly.
Nominalization
organize (V) + -ation = organization (new word with
different meaning)
kill (V) + -er = killer (N)
silly (ADJ) + -ness = silliness (N)
pre- + determine (V) = predetermine (new word)
teach (V) + -er = teacher (N)
11 Inflectional Morphemes
Serve grammatical role in the language
Don't create new words in the language
Don’t change the lexical category from one to
another (parts of speech)
Inflection
Combination of a word stem with a grammatical
morpheme, usually resulting in a word of the same
class, and usually filling some syntactic function.
There are only eight inflectional morphemes in
English.
12 The 8 Inflectional Morphemes
1. -s (plural)
2. -'s (possessive)
3. -s (3rd person singular)
4. -ed (past tense)
5. -ing (present participle)
6. -en (past participle)
7. -er (comparative)
8. -est (superlative)
13 Lemmatization
Stemming and lemmatization were
developed in the 1960s.
They are text normalization and text mining procedures in
the field of Natural Language Processing.
They are applied to prepare text, words, and documents for
further processing (pre-processing).
They are widely used for tagging, SEO, web
search results, and information retrieval.
Text normalization means converting text
into a more convenient, standard form.
14 Lemmatization
Lemmatization is one of the most common
text pre-processing techniques used in
Natural Language Processing (NLP) and
machine learning.
Lemmatization is the task of determining that
two words have the same root, despite their
surface differences.
For example,
sang, sung, and sings are forms of the verb sing.
The word sing is the common lemma of these
words, and a lemmatizer maps from all of these to
sing.
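The mapping from surface forms to a common lemma can be sketched as a dictionary lookup. This is a minimal illustration, not how a full lemmatizer works: `LEMMA_TABLE` and `lemmatize` are hypothetical names, and the table contains only the forms from the example above.

```python
# Minimal dictionary-based lemmatizer sketch.
# LEMMA_TABLE is a hand-made, illustrative table, not a real lexicon.
LEMMA_TABLE = {"sang": "sing", "sung": "sing", "sings": "sing"}


def lemmatize(word: str) -> str:
    key = word.lower()
    # Unknown words are returned unchanged (an assumption of this sketch).
    return LEMMA_TABLE.get(key, key)


for form in ["sang", "sung", "sings", "sing"]:
    print(form, "->", lemmatize(form))
```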
15 Lemmatization
For grammatical reasons, documents are going to use
different forms of a word, such as organize, organizes,
and organizing.
Additionally, there are families of derivationally related
words with similar meanings, such
as democracy, democratic, and democratization.
In many situations, it seems as if it would be useful for a
search for one of these words to return documents that
contain another word in the set.
The goal of both stemming and lemmatization is to reduce
inflectional forms and sometimes derivationally related
forms of a word to a common base form. For instance:
E.g.: am, are, is => be
car, cars, car's, cars' => car
16 Stemming vs. Lemmatization
1. Stemming: faster, because it chops words without knowing the context of the word in the given sentence.
   Lemmatization: slower than stemming, but it knows the context of the word before proceeding.
2. Stemming: a rule-based approach.
   Lemmatization: a dictionary-based approach.
3. Stemming: accuracy is lower.
   Lemmatization: accuracy is higher than with stemming.
4. Stemming: converting a word to its root form may produce a form that is not a real word.
   Lemmatization: always gives a dictionary word when converting to the root form.
5. Stemming: preferred when the meaning of the word is not important for the analysis. Example: spam detection.
   Lemmatization: recommended when the meaning of the word is important for the analysis. Example: question answering.
6. Stemming: “Studies” => “Studi”
   Lemmatization: “Studies” => “Study”
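The "Studies" contrast can be reproduced with a few lines. The stemmer below implements only the first plural-stripping step of the Porter stemmer (step 1a: SSES -> SS, IES -> I, SS -> SS, S -> nothing), so it is a partial sketch rather than a full stemmer; the lemma table is a hypothetical one-entry dictionary.

```python
# Rule-based stemming (Porter step 1a only) vs. dictionary-based lemmatization.
def porter_step_1a(word: str) -> str:
    word = word.lower()
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # studies -> studi
    if word.endswith("ss"):
        return word        # caress -> caress
    if word.endswith("s"):
        return word[:-1]   # cats -> cat
    return word


# Illustrative one-entry lemma dictionary (an assumption of this sketch).
LEMMAS = {"studies": "study"}


def lemma_lookup(word: str) -> str:
    key = word.lower()
    return LEMMAS.get(key, key)


print(porter_step_1a("Studies"))  # rule output, not a dictionary word
print(lemma_lookup("Studies"))    # dictionary form
```

The rule needs no lexicon but can return a non-word; the lookup returns a real word but only for forms the dictionary knows.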
17 Finite Automata
The theory of automata provides efficient and convenient tools for
the representation of linguistic phenomena.
It plays a significant role in solving many problems in natural
language processing, for example speech recognition, spelling
correction, and information retrieval.
Finite state methods are useful in processing natural language because
modeling information with rules has many advantages for
language modeling.
A finite state automaton is a mathematical model that allows data
to be represented in a compact form.
18
A finite automaton (FA) is the simplest machine to recognize
patterns. It is used to characterize a Regular Language, for
example: /baa+!/.
It is also used to analyze and recognize natural language
expressions.
It has a set of states and rules for moving from one state
to another, depending on the applied input symbol.
Based on the states and the set of rules, the input string
can be either accepted or rejected. Basically, it is an
abstract model of a digital computer that reads an input
string and changes its internal state depending on the
current input symbol.
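The regular language /baa+!/ mentioned above can be tested with Python's standard `re` module. The helper name `is_sheep_talk` is an assumption for this sketch.

```python
import re

# /baa+!/ : "b", then two or more "a", then "!".
SHEEP_TALK = re.compile(r"baa+!")


def is_sheep_talk(text: str) -> bool:
    # fullmatch requires the whole string to be in the language.
    return SHEEP_TALK.fullmatch(text) is not None


for candidate in ["baa!", "baaaa!", "ba!", "baa"]:
    print(candidate, is_sheep_talk(candidate))
```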
For example, construct a DFA which accepts the language of
all strings ending with ‘a’.
Given: Σ = {a, b}, Q = {q0, q1}, start state q0, F = {q1}
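One possible DFA for this language can be written as a transition table. The slide gives only the alphabet, the states, the start state and the final state; the transitions below are an assumption chosen so that the machine is in q1 exactly when the last symbol read was 'a'.

```python
# DFA over {a, b} accepting strings that end with "a".
# TRANSITIONS is an assumed transition function; the slide does not list it.
TRANSITIONS = {
    ("q0", "a"): "q1",
    ("q0", "b"): "q0",
    ("q1", "a"): "q1",
    ("q1", "b"): "q0",
}
START = "q0"
FINAL = {"q1"}


def accepts(string: str) -> bool:
    state = START
    for symbol in string:
        if (state, symbol) not in TRANSITIONS:
            return False  # symbol outside the alphabet
        state = TRANSITIONS[(state, symbol)]
    return state in FINAL


for s in ["a", "abba", "ab", ""]:
    print(repr(s), accepts(s))
```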
19 Formal languages
20 Automata and Languages
An automaton is an abstract model of a computer
which reads an input string, and changes its
internal state depending on the current input
symbol.
It can either accept or reject the input string.
Every automaton defines a language (the set of
strings it accepts).
Different automata define different language
classes:
Finite-state automata define regular languages
Pushdown automata define context-free languages
Turing machines define recursively enumerable
languages
23 Finite State Automata (FSAs)
24 State Transition Diagram
25 Finite-State Methods for Morphology
26 Finite-State Morphological Parsing
27 Finite State Automata for Morphology
28 Union: Merging Automata
29 Stem Changes
30 Finite Automata: Language Recognizer
Fig.: NFA for the words Boy and Bat
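Since the figure itself is not reproduced here, the following is a hedged sketch of what an NFA for these two words could look like. The state names and the choice to branch nondeterministically on the shared first letter are assumptions, not a transcription of the figure.

```python
# NFA recognizing exactly the words "boy" and "bat".
# Reading "b" in q0 leads to two states at once (the nondeterministic step).
NFA = {
    ("q0", "b"): {"q1", "q4"},
    ("q1", "o"): {"q2"},
    ("q2", "y"): {"q3"},
    ("q4", "a"): {"q5"},
    ("q5", "t"): {"q6"},
}
START = "q0"
FINAL = {"q3", "q6"}


def nfa_accepts(word: str) -> bool:
    current = {START}
    for ch in word.lower():
        # Follow every possible transition from every active state.
        current = set().union(*(NFA.get((s, ch), set()) for s in current))
        if not current:
            return False
    return bool(current & FINAL)


for w in ["Boy", "Bat", "bot", "bo"]:
    print(w, nfa_accepts(w))
```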
32 FSAs for derivational morphology
33 Recognition vs. Analysis
FSAs can recognize (accept) a string, but
they don’t provide its internal structure.
Transducer: a machine that maps
(transduces) the input string into an output
string that encodes its structure.
34 Applications
FSTs are used for a variety of different applications:
Word inflections
For example, pluralizing words (cat -> cats, dog -> dogs,
goose -> geese, etc.).
Morphological parsing
i.e., extracting the “properties” of a word
(e.g., computers -> computer + [Noun] + [Plural])
Simple word translation
e.g., translating US English to UK English.
Simple commands made to a computer
37 Finite-State Morphological Parsing
We need at least the following to build a
morphological parser:
1. Lexicon: the list of stems and affixes, together with basic
information about them (Noun stem or Verb stem, etc.)
2. Morphotactics: the model of morpheme ordering that explains
which classes of morphemes can follow other classes of
morphemes inside a word. E.g., the rule that English plural
morpheme follows the noun rather than preceding it.
3. Orthographic rules: these spelling rules are used to model the
changes that occur in a word, usually when two morphemes
combine (e.g., the y→ie spelling rule changes city + -s to cities).
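The three components can be sketched together in a few lines. This is a toy under stated assumptions: `NOUN_STEMS` is a hypothetical three-word lexicon, the only morphotactic rule is "noun stem, optionally followed by the plural", and only two spelling rules are undone.

```python
# Toy morphological parser: lexicon + morphotactics + orthographic rules.
NOUN_STEMS = {"city", "fox", "cat"}  # lexicon (illustrative)


def candidate_splits(surface: str):
    """Undo orthographic rules to propose (stem, has_plural) splits."""
    yield surface, False                  # bare stem, no suffix
    if surface.endswith("ies"):
        yield surface[:-3] + "y", True    # cities -> city + -s
    if surface.endswith("es"):
        yield surface[:-2], True          # foxes -> fox + -s
    if surface.endswith("s"):
        yield surface[:-1], True          # cats -> cat + -s


def parse(surface: str) -> list:
    analyses = []
    for stem, has_plural in candidate_splits(surface):
        # Morphotactics: the plural morpheme may only follow a noun stem.
        if stem in NOUN_STEMS:
            analyses.append(stem + " +N " + ("+PL" if has_plural else "+SG"))
    return analyses


for word in ["cities", "foxes", "cats", "cat", "dogs"]:
    print(word, "->", parse(word))
```

The sketch overgenerates (it would also accept "foxs"), which is the kind of case that properly stated orthographic rules are meant to exclude.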
38 Construction of a Finite state Lexicon
A lexicon is a repository for words.
The simplest one would consist of an explicit list of every word of the language.
Inconvenient or impossible!
Computational lexicons are usually structured with
a list of each of the stems and
affixes of the language, together with a representation of the morphotactics telling us how they can fit
together.
The most common way of modeling morphotactics is the finite-state automaton.
Reg-noun    Irreg-pl-noun    Irreg-sg-noun    plural
fox         geese            goose            -s
fat         sheep            sheep
fog         mice             mouse
aardvark
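The lexicon classes above can label the arcs of a small automaton for noun inflection. The state layout below (q0, q1, q2) is an assumption for this sketch, since the corresponding diagram is not included in the text: a regular noun may be followed by the plural, while irregular singular and plural forms are accepted as they are.

```python
# FSA for English noun inflection with arcs labelled by lexicon classes.
REG_NOUN = {"fox", "fog"}
IRREG_SG_NOUN = {"goose", "sheep", "mouse"}
IRREG_PL_NOUN = {"geese", "sheep", "mice"}
PLURAL = {"-s"}

# state -> list of (morpheme class, next state); the layout is assumed.
ARCS = {
    "q0": [(REG_NOUN, "q1"), (IRREG_SG_NOUN, "q2"), (IRREG_PL_NOUN, "q2")],
    "q1": [(PLURAL, "q2")],
    "q2": [],
}
FINAL = {"q1", "q2"}


def accepts(morphemes: list) -> bool:
    states = {"q0"}
    for m in morphemes:
        states = {dst for s in states for cls, dst in ARCS[s] if m in cls}
        if not states:
            return False
    return bool(states & FINAL)


for seq in [["fox"], ["fox", "-s"], ["geese"], ["goose", "-s"]]:
    print(seq, accepts(seq))
```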
39 Construction of a Finite state Lexicon
Reg-verb-stem    Irreg-verb-stem    Irreg-past-verb    past    Past-part    Pres-part    3sg
walk             cut                caught             -ed     -ed          -ing         -s
fry              speak              ate
talk             sing               eaten
impeach                             sang
                                    spoken
40 Construction of a Finite state Lexicon
41 Construction of a Finite state Lexicon
42 Finite-State Morphological Parsing
An FSA for another fragment of English derivational morphology
43 Finite-State Morphological Parsing
44 Finite - State Transducers
45 Finite State Transducers (FST)
It is like an FSA but defines regular
relations, not regular languages
It has two alphabet sets
It has a transition function relating
input to states
It has an output function relating state
and input to output
It can be used to recognize, generate,
translate or relate sets.
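The definition above, with a transition function and a separate output function, can be sketched as two lookup tables. The names `DELTA`, `OMEGA` and `transduce`, the state names, and the single-word example are assumptions for illustration; the machine maps the lexical symbols of "cat" plus its features to a surface spelling.

```python
# Toy FST: DELTA is the transition function, OMEGA the output function.
# Input alphabet: letters and feature tags; output alphabet: letters.
DELTA = {  # (state, input symbol) -> next state
    ("q0", "c"): "q1", ("q1", "a"): "q2", ("q2", "t"): "q3",
    ("q3", "+N"): "q4", ("q4", "+SG"): "q5", ("q4", "+PL"): "q5",
}
OMEGA = {  # (state, input symbol) -> output string ("" means no output)
    ("q0", "c"): "c", ("q1", "a"): "a", ("q2", "t"): "t",
    ("q3", "+N"): "", ("q4", "+SG"): "", ("q4", "+PL"): "s",
}
START = "q0"
FINAL = {"q5"}


def transduce(symbols: list):
    state, output = START, []
    for sym in symbols:
        if (state, sym) not in DELTA:
            return None  # input rejected
        output.append(OMEGA[(state, sym)])
        state = DELTA[(state, sym)]
    return "".join(output) if state in FINAL else None


print(transduce(["c", "a", "t", "+N", "+PL"]))
print(transduce(["c", "a", "t", "+N", "+SG"]))
```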
46 Finite - State Transducers
47 Morphological parsing with FST
The objective of morphological parsing is to produce the output
lexical form(s) for a single input surface form.
48 Morphological parsing with FST
Lexical: Representing a simple concatenation of morphemes
making up a word
Surface: Representing the actual spelling of the final word.
49 Morphological Parsing with FST
The FST is a multi-function device, and can be
viewed in the following ways:
Translator: It reads one string on one tape and
outputs another string,
Recognizer: It takes a pair of strings as two tapes
and accepts/rejects based on their matching.
Generator: It outputs a pair of strings on two tapes
along with yes/no result based on whether they are
matching or not.
Relater: It compares the relation between two sets
of strings available on two tapes.
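50 Morphological parsing with FST
Inversion (swapping the input and output labels) is useful to convert
an FST used as a parser into an FST used as a generator.
Composition replaces two transducers that run in series with a single transducer.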
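Turning a generator into a parser can be sketched by swapping the input and output label on every arc, the operation usually called inversion. The arc list, state names and the `run` helper are assumptions for this sketch; `run` does a simple search and assumes there are no cycles of empty-input arcs.

```python
# Arcs: (source, input, output, target); "" stands for the empty string.
# This toy generator maps c a t +N +PL -> c a t s.
ARCS = [
    ("q0", "c", "c", "q1"), ("q1", "a", "a", "q2"), ("q2", "t", "t", "q3"),
    ("q3", "+N", "", "q4"), ("q4", "+SG", "", "q5"), ("q4", "+PL", "s", "q5"),
]
START = "q0"
FINAL = {"q5"}


def invert(arcs: list) -> list:
    """Swap input and output labels on every arc."""
    return [(src, out, inp, dst) for src, inp, out, dst in arcs]


def run(arcs: list, symbols: list) -> list:
    """Return every output sequence the transducer produces for symbols."""
    results = []
    stack = [(START, 0, [])]
    while stack:
        state, pos, out = stack.pop()
        if pos == len(symbols) and state in FINAL:
            results.append(out)
        for src, inp, o, dst in arcs:
            if src != state:
                continue
            new_out = out + ([o] if o else [])
            if inp == "":
                stack.append((dst, pos, new_out))      # consume nothing
            elif pos < len(symbols) and symbols[pos] == inp:
                stack.append((dst, pos + 1, new_out))  # consume one symbol
    return results


print(run(ARCS, ["c", "a", "t", "+N", "+PL"]))  # generation
print(run(invert(ARCS), list("cats")))          # parsing
```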
51 Morphological Parsing with FST