Natural Language Processing
Lexical Analysis
CSCI372
Mariam Fakih
Outline
• Introduction
• Finite State Morphonology
• Difficult Morphology and Lexical Analysis
• Paradigm-Based Lexical Analysis
• Conclusion
Introduction
• Lexical analysis covers the techniques and mechanisms
for analyzing text at the level of the word.
• Words are the building blocks of natural language
texts.
• A basic task of lexical analysis is to relate
morphological variants to their lemma, which is listed in a
lemma dictionary:
– “blend” is the lemma, and “blending” is one of the
word forms of its lexeme.
• Lexical analysis may be used for generation or
parsing.
Stemming and lemmatization
• Stemming and lemmatization are methods used by search
engines and chatbots to reduce the different forms of a word
to a common base before analyzing its meaning.
• Stemming strips affixes to get the stem of the word, while
lemmatization uses a dictionary and the context in which the
word is used to find its lemma.
• Stemming is a text processing task in which you reduce
words to their root, which is the core part of a word. For
example,
– the words “helping” and “helper” share the root
“help.”
• A lemma is a word that represents a whole group of word forms,
and that group of forms is called a lexeme.
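The contrast between the two can be sketched in a few lines of Python. This is a toy illustration, not a production stemmer: the suffix list `SUFFIXES` and the lemma dictionary `LEMMAS` are hypothetical and cover only a handful of words.

```python
# Toy suffix-stripping stemmer (assumed, deliberately tiny suffix list).
SUFFIXES = ("ing", "er", "ed", "s")

def stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Toy lemma dictionary (hypothetical): word form -> lemma of its lexeme.
LEMMAS = {
    "blending": "blend",
    "blended": "blend",
    "blends": "blend",
    "sang": "sing",
}

def lemmatize(word: str) -> str:
    """Look the word form up in the dictionary; fall back to the form itself."""
    return LEMMAS.get(word, word)

print(stem("helping"), stem("helper"))   # help help
print(stem("sang"), lemmatize("sang"))   # sang sing
```

Suffix stripping cannot relate "sang" to "sing", while the dictionary lookup can, which is the practical difference between the two methods.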
Filtering Stop Words
• Stop words are words that you want to ignore, so you
filter them out of your text when you’re processing it.
• Very common words like 'in', 'is', and 'an' are often
used as stop words since they don’t add a lot of meaning
to a text in and of themselves.
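A minimal sketch of stop-word filtering follows. The stop-word set here is an assumption for illustration; real systems use much longer, language-specific lists.

```python
# Assumed stop-word list, for illustration only.
STOP_WORDS = {"in", "is", "an", "a", "the", "of", "and", "to"}

def remove_stop_words(text: str) -> list:
    """Lowercase the text, split on whitespace, and drop stop words."""
    return [tok for tok in text.lower().split() if tok not in STOP_WORDS]

print(remove_stop_words("The cat is in an old box"))
# ['cat', 'old', 'box']
```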
Lexical analysis
• Lexical analysis in NLP studies text at the level of
words, with respect to their lexical meaning and part of
speech (noun, adverb, ...).
• This level of linguistic processing uses a language's lexicon,
which is a collection of individual lexemes.
• The lexicon of a language is the collection of words and
phrases in that language.
• Lexical analysis is defined as the process of breaking down a
text into words, phrases, and other meaningful elements.
• It involves identifying and analyzing the structure of words.
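As a rough sketch of these two steps, the snippet below splits a text into word tokens and looks each one up in a lexicon. The `LEXICON` entries and the tokenizing pattern are assumptions made for this example only.

```python
import re

# Hypothetical toy lexicon: word form -> possible parts of speech.
LEXICON = {
    "they": ["Pronoun"],
    "walk": ["Noun", "Verb"],
    "quickly": ["Adverb"],
}

def tokenize(text: str) -> list:
    """Split text into lowercase word tokens, discarding punctuation."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def analyze(text: str) -> list:
    """Pair each token with its lexicon entry, or ['Unknown'] if it has none."""
    return [(tok, LEXICON.get(tok, ["Unknown"])) for tok in tokenize(text)]

print(analyze("They walk quickly."))
# [('they', ['Pronoun']), ('walk', ['Noun', 'Verb']), ('quickly', ['Adverb'])]
```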
Lexicon of stems and affixes
• Morphological parsing is the task of breaking a word
down into its component morphemes.
• Morphemes:
➢ Stem
➢ Affixes: prefixes and suffixes
Example: unhappiness -> un- + happi + -ness
• Morphotactics: the model of morpheme ordering
• Orthographic rules: spelling changes upon combination
(city + -s -> *citys -> cities)
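The orthographic step can be sketched as a small function that attaches the suffix -s and then repairs the spelling. Only two English spelling rules are covered, and the function name `add_plural_s` is made up for this example.

```python
VOWELS = set("aeiou")

def add_plural_s(stem: str) -> str:
    """Attach the suffix -s, applying two English spelling rules."""
    # Consonant + y: the y becomes ie before -s (city + -s -> cities).
    if stem.endswith("y") and len(stem) > 1 and stem[-2] not in VOWELS:
        return stem[:-1] + "ies"
    # Sibilant spellings: an e is inserted before -s (fox + -s -> foxes).
    if stem.endswith(("s", "x", "z", "ch", "sh")):
        return stem + "es"
    # Default: plain concatenation (cat + -s -> cats).
    return stem + "s"

for stem in ("city", "fox", "cat", "boy"):
    print(stem, "->", add_plural_s(stem))
```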
Finite State Morphonology
• It is the area of linguistics that deals with the relations and
interaction of morphology with phonology.
• In Russian, word-final voiced obstruents become
voiceless, but they are spelled as if they stayed
voiced.
• The English plural affix can be pronounced in three different
ways, depending on the stem it attaches to: as /z/ in flags,
as /ɪz/ in glasses, and as /s/ in cats.
• Conversely, spelling can vary while pronunciation does not: the
stem spellings flie- (in flies) and fly are pronounced the same.
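The choice between the three pronunciations of the English plural depends on the last sound of the stem: a sibilant takes the syllabic variant (usually transcribed /ɪz/, as in glasses), any other voiceless sound takes /s/, and everything else takes /z/. The sketch below assumes the caller supplies the final phoneme as an IPA symbol; the phoneme sets are a simplification.

```python
# Simplified phoneme classes (assumed inventory, IPA symbols).
SIBILANTS = {"s", "z", "ʃ", "ʒ", "tʃ", "dʒ"}
VOICELESS = {"p", "t", "k", "f", "θ"}

def plural_allomorph(final_phoneme: str) -> str:
    """Return the pronunciation of plural -s after the given stem-final sound."""
    if final_phoneme in SIBILANTS:
        return "ɪz"
    if final_phoneme in VOICELESS:
        return "s"
    return "z"

for word, final in [("flag", "g"), ("glass", "s"), ("cat", "t")]:
    print(f"{word}: /{plural_allomorph(final)}/")
```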
Finite State Morphonology
• The favored model for handling morphonology
in the orthography, or morphology-based
orthographic spelling variation, is a specific type
of finite state machine known as a finite state
transducer (FST).
• It is assumed that the reader is familiar with
finite state automata (FSAs).
Finite State Automata
<g l a s s #> = g l a s s.
<g l a s s ˆ s #> = g l a s s e s.
<f o x #> = f o x.
<f o x ˆ s #> = f o x e s.
<c a t #> = c a t.
<c a t ˆ s #> = c a t s.
This correspondence of ˆ to ε labels the transition from State
6 to State 7.
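The mappings listed above can be imitated with a short hand-written function. This is a sketch of the same lexical-to-surface correspondence, not a compiled transducer: it uses the ASCII characters `^` and `#` as stand-ins for the morpheme boundary and end-of-word symbols, and the name `to_surface` is made up for the example.

```python
def to_surface(lexical: str) -> str:
    """Map a lexical string such as 'fox^s#' to its surface spelling.

    The boundary '^' is realized as 'e' between s, x or z and the
    suffix s, and as nothing (epsilon) everywhere else. The end-of-word
    marker '#' never appears on the surface.
    """
    out = []
    for i, ch in enumerate(lexical):
        if ch == "#":
            continue                        # # : epsilon
        if ch == "^":
            prev = lexical[i - 1] if i > 0 else ""
            following = lexical[i + 1 : i + 3]
            if prev in ("s", "x", "z") and following == "s#":
                out.append("e")             # ^ : e  (e-insertion)
            continue                        # otherwise ^ : epsilon
        out.append(ch)                      # every other symbol maps to itself
    return "".join(out)

for lexical in ("glass#", "glass^s#", "fox#", "fox^s#", "cat#", "cat^s#"):
    print(lexical, "->", to_surface(lexical))
```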
Finite State Morphonology
• FST: finite state transducer
• An FST represents a set of pairs of strings (think of them as
(input, output) pairs):
• { (walk, walk+V+PL), (walk, walk+N+SG), (walked,
walk+V+PAST) ...}
• A transducer function maps an input to zero or more outputs:
• Transduce(walk) --> {walk+V+PL, walk+N+SG}
– It can return multiple answers if the input is ambiguous, e.g.
– if you don’t have POS-tagged input, “walk” could be the verb in
“They walk to the store” or the noun in “I took a walk”.
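Viewed as a relation, the transducer can be sketched as a plain list of pairs that is read in either direction. The list below contains only the three pairs given above; the function names are chosen for this example.

```python
# The relation from the example: (surface form, lexical analysis).
PAIRS = [
    ("walk", "walk+V+PL"),
    ("walk", "walk+N+SG"),
    ("walked", "walk+V+PAST"),
]

def transduce(surface: str) -> set:
    """Parsing direction: surface form -> set of analyses (possibly empty)."""
    return {analysis for form, analysis in PAIRS if form == surface}

def generate(analysis: str) -> set:
    """Generation direction: the same relation read from right to left."""
    return {form for form, a in PAIRS if a == analysis}

print(sorted(transduce("walk")))      # ['walk+N+SG', 'walk+V+PL']
print(sorted(transduce("walks")))     # []
print(sorted(generate("walk+V+PAST")))  # ['walked']
```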
Finite State Morphonology
• An FSA recognizes (accepts or rejects) an
input expression; it doesn’t produce any
other output.
• An FST, on the other hand, also
produces an output expression.
• An FSA is a recognizer,
• whereas an FST translates from one
expression to another.
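The difference can be made concrete with one transition table used in two ways. The table below is a hypothetical machine for just "walk" and "walked": the recognizer ignores the output column, while the transducer concatenates it.

```python
from typing import Optional

# (state, input symbol) -> (next state, output string); assumed toy machine.
TRANSITIONS = {
    (0, "w"): (1, "w"),
    (1, "a"): (2, "a"),
    (2, "l"): (3, "l"),
    (3, "k"): (4, "k"),
    (4, "e"): (5, ""),          # e : epsilon
    (5, "d"): (6, "+V+PAST"),   # d : +V+PAST
}
FINAL_STATES = {4, 6}

def fsa_accepts(word: str) -> bool:
    """Recognizer: follow the transitions and report accept or reject."""
    state = 0
    for symbol in word:
        if (state, symbol) not in TRANSITIONS:
            return False
        state, _ = TRANSITIONS[(state, symbol)]
    return state in FINAL_STATES

def fst_translate(word: str) -> Optional[str]:
    """Transducer: same walk through the states, but collect the outputs."""
    state, output = 0, []
    for symbol in word:
        if (state, symbol) not in TRANSITIONS:
            return None
        state, out = TRANSITIONS[(state, symbol)]
        output.append(out)
    return "".join(output) if state in FINAL_STATES else None

print(fsa_accepts("walked"), fst_translate("walked"))  # True walk+V+PAST
print(fsa_accepts("walke"), fst_translate("walke"))    # False None
```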
Finite State Morphonology
• FSTs are also used in the lexical analysis phase
of compilers to associate semantic values with
the discovered tokens.
• Morphology is the study of the way words are
built up from smaller meaning-bearing units,
morphemes.
• Two broad classes of morphemes:
– Stems: the “main” morpheme of the word,
supplying the main meaning
– Affixes: add “additional” meaning of various kinds
Morphology rules
Morphology: Phenomena
• Inflection
• Derivation
• Compounding
• Cliticization
Morphology: Phenomena
• Inflection
Derivational data model
• Derivational morphology: modifies a
root to form a word of a different class
➢ derivational = derive + -ation + -al
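A small sketch of this chain follows. It assumes that -ation turns a verb into a noun and -al turns a noun into an adjective, and it applies one spelling rule (a final e is dropped before a vowel-initial suffix). The table `DERIVATIONS` and the function `derive` are hypothetical names for this example.

```python
# Assumed derivational suffixes: suffix -> (input class, output class).
DERIVATIONS = {
    "-ation": ("V", "N"),
    "-al": ("N", "Adj"),
}

def derive(word: str, pos: str, suffix: str) -> tuple:
    """Attach a derivational suffix and return (new word, new word class)."""
    source, target = DERIVATIONS[suffix]
    if pos != source:
        raise ValueError(f"{suffix} attaches to {source}, not {pos}")
    ending = suffix.lstrip("-")
    # Spelling rule: drop a final 'e' before a vowel-initial suffix.
    if word.endswith("e") and ending[0] in "aeiou":
        word = word[:-1]
    return word + ending, target

noun = derive("derive", "V", "-ation")
adjective = derive(noun[0], noun[1], "-al")
print(noun, adjective)  # ('derivation', 'N') ('derivational', 'Adj')
```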
Morphology: Phenomena
• Compounding: baseball, desktop
• Cliticization
Finite State Automata
• FSA for inflection
• FSA for derivation
Morphological parsing
Difficult Morphology and Lexical Analysis
• As NLP systems are increasingly multilingual, it becomes more
and more important to explore the challenges other languages
pose for finite state models. Finite state models work best when
two ideal conditions hold:
• Isomorphism: a one-to-one correspondence between units of form
(morphemes) and units of meaning.
– Holds for regular pairs: dream -> dreamed, walk -> walked
– Breaks down for the verb "to be", where one form covers several
person/number combinations:
• ‘I was’
• ‘you were’
• ‘he/she was’
• Contiguity: the morphemes of a word are strung together one after
another, so that stem and affix can be segmented.
– Breaks down for stem-internal changes: sing -> sang
– ring -> rang
Paradigm-Based Lexical Analysis
• Presenting a lexeme’s word forms
as a paradigm provides an
alternative way of capturing word
structure that does not rely on either
isomorphism or contiguity.
Paradigm-Based Lexical Analysis
• A suitable representation language that has been used
extensively for paradigm-based morphology is the lexical
knowledge representation language DATR, which up until
now we have used to demonstrate finite state models.
• Using symbol classes, FSTs can assign stems and affixes to
categories, and encode operations over these categories.
• In this way, they capture classes of environments and
changes for morphonological rules, and morpheme orderings
that hold for classes of items, as well as selections when
there is a choice of affixes for a given feature.
Conclusion
• Words are important units of information, and language-based applications
should include some mechanism for registering their structural properties.
• Finite state techniques have long been used to provide such a mechanism
because of their computational efficiency and their invertibility: they can
be used both to generate morphologically complex forms from underlying
representations and to parse morphologically complex forms into underlying
representations.
• Three approaches to morphology (item and arrangement, item and process,
and word and paradigm) find their way into the assumptions that underlie a
given model.
• An item and arrangement approach (I&A) views analysis as combining the
information conveyed by a word’s stem morpheme with that of its affix
morpheme. Finite state morphology (FSM) incorporates this view using
FSTs. This works well for the ‘ideal’ situation outlined above: looked is a
stem plus a suffix, and the information that the word
conveys is simply a matter of combining the information conveyed by both
morphemes.
Conclusion
• Item and process approaches (I&P) account for the kind of stem and
affix variation that can happen inside a complex word: for example,
sing becomes sang in the past tense, and a vowel is added to the
suffix -s when it attaches to fox. The emphasis is on the phonological
processes that are associated with affixation (or other morphological
operations), which is known as morphonology.
• Finally, in word and paradigm approaches (W&P), a lemma is
associated with a table, or paradigm, that associates each morphological
variant of the lemma with a morphosyntactic property set. So looked
occupies the cell in the paradigm that contains the pairing of LOOK
with {Past, Simple}, and by the same token sang occupies the
equivalent cell in the SING paradigm. Meaning is derived from the
definition of the cell, not from the meaning of stem plus meaning of suffix,
hence no special status is given to affixes.
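The word and paradigm view can be sketched as a lookup table keyed by lemma and property set. The {Past, Simple} cells come from the text; the {Present, Simple} cells are an assumption added so that each table has more than one cell.

```python
# lemma -> {morphosyntactic property set -> word form}
PARADIGMS = {
    "LOOK": {
        frozenset({"Past", "Simple"}): "looked",
        frozenset({"Present", "Simple"}): "look",   # assumed extra cell
    },
    "SING": {
        frozenset({"Past", "Simple"}): "sang",
        frozenset({"Present", "Simple"}): "sing",   # assumed extra cell
    },
}

def generate(lemma: str, properties: set) -> str:
    """Generation: read the form stored in the requested cell."""
    return PARADIGMS[lemma][frozenset(properties)]

def parse(form: str) -> list:
    """Parsing: find every cell, in any paradigm, that holds this form."""
    return [
        (lemma, sorted(props))
        for lemma, cells in PARADIGMS.items()
        for props, cell_form in cells.items()
        if cell_form == form
    ]

print(generate("SING", {"Past", "Simple"}))  # sang
print(parse("looked"))                       # [('LOOK', ['Past', 'Simple'])]
```

Regular looked and irregular sang are handled identically, because the table never splits a form into stem and affix.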