Word Level Analysis and Morphology


Module 2 - Part 1

Word Level Analysis


2 Word Level Analysis

 Basic Terms: Tokenization, Stemming, Lemmatization


 Survey of English Morphology
 Inflectional Morphology, Derivational Morphology
 Regular expression with types
 Morphological Models: Dictionary lookup
 Finite state morphology
 Morphological parsing with FST (Finite State Transducer)
 Lexicon-free FST: Porter Stemmer algorithm
3 Word Level Analysis

 N-grams and their variations: Bigram, Trigram; Simple (Unsmoothed) N-grams
 N-gram Sensitivity to the Training Corpus
 Unknown Words: Open versus closed vocabulary tasks
 Evaluating N-grams: Perplexity
 Smoothing: Laplace Smoothing
 Good-Turing Discounting
4 Morphology
 Studies how words are constructed from subword units.
 The study of the internal structure of words.
 “A writer is someone who writes, and a stinger is something that stings. But fingers don’t fing, grocers don’t groce, haberdashers don’t haberdash, hammers don’t ham.”
- Richard Lederer
5 Need for Morphological Analysis


 Information retrieval: search
 Systems benefit from being able to search for singular and plural forms of search terms
 Generally fairly easy in English
 Complications: irregular plurals, handled via morphological rules
1. goose -> geese
2. fish -> fish
3. ox -> oxen
 Spelling rules needed
1. fox +PL -> foxes
2. fly +PL -> flies
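The two complications above (irregular plurals and spelling rules) can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `pluralize`, the tiny irregular table, and the simplified e-insertion and y -> ie rules are assumptions made for the example, not a complete model of English.

```python
# Irregular plurals are listed explicitly (examples taken from the slide).
IRREGULAR_PLURALS = {"goose": "geese", "fish": "fish", "ox": "oxen"}

def pluralize(noun):
    """Return the plural of `noun` using a lookup table plus two spelling rules."""
    # 1. Irregular forms are looked up before any rule applies.
    if noun in IRREGULAR_PLURALS:
        return IRREGULAR_PLURALS[noun]
    # 2. E-insertion: fox +PL -> foxes (simplified list of endings).
    if noun.endswith(("s", "x", "z", "ch", "sh")):
        return noun + "es"
    # 3. y -> ie after a consonant: fly +PL -> flies.
    if noun.endswith("y") and len(noun) > 1 and noun[-2] not in "aeiou":
        return noun[:-1] + "ies"
    # 4. Default: just add -s.
    return noun + "s"

for word in ["goose", "fish", "ox", "fox", "fly", "cat"]:
    print(word, "->", pluralize(word))
```

Checking the irregular table first matters: "fish" ends in "sh" and would otherwise become "fishes" under the e-insertion rule.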
6 Need for Morphological Analysis

 Listing all of the plural forms of English nouns, all of the verb forms for a particular stem, etc. is a waste of space (and of time if the entries are being made by hand).
 Suffixes are productive
 The situation is much worse in other languages,
 e.g. agglutinative languages like Turkish


7 Morpheme
 Morphology is the study of the structure and formation of words.
 Its most important unit is the morpheme.
 Morpheme: the smallest meaningful unit in a language; the minimal unit of grammatical analysis and of meaning.
 A morpheme cannot be broken down into smaller meaningful parts.
 E.g.: cat -> cats
 category
 categorize
 categorized
 E.g.: overestimating
 keyboard
 cranberry (1 morpheme)
8 Types of free Morpheme
 Free morphemes can stand alone as individual words in the language.
 E.g.: pen, bottle, vital, laugh
 Lexical morphemes:
- free morphemes that carry the content of our utterances
- nouns, verbs, adjectives, adverbs
- open class: new members can be added to these categories
E.g.: slang
 Functional morphemes:
- help to connect words together in a sentence
- prepositions, conjunctions and articles
9 Bound Morphemes
 Cannot stand alone as individual words
 Affixes – supply “additional” meanings
 Prefixes – precede the stem
 Suffixes – follow the stem
 Circumfixes – precede and follow the stem
 Infixes – inserted inside the stem
 Types of bound morphemes
1. Derivational Morphemes
2. Inflectional Morphemes
10 Derivational Morphemes
 Help to create new words in the language
 Can change the lexical category (part of speech) from one to another
 Combination of a word stem with a grammatical morpheme, usually resulting in a word of a different class, often with a meaning that’s hard to predict exactly.
 Nominalization
 organize (V) + -ation = organization (new word with different meaning)
 kill (V) + -er = killer (N)
 silly (ADJ) + -ness = silliness (N)
 pre- + determine (V) = predetermine (new word)
 teach (V) + -er = teacher (N)
11 Inflectional Morphemes
 Serve a grammatical role in the language
 Don’t create new words in the language
 Don’t change the lexical category (part of speech) from one to another
 Inflection
 Combination of a word stem with a grammatical morpheme, usually resulting in a word of the same class, and usually filling some syntactic function.
 There are only eight inflectional morphemes in English.
12 8 Inflectional Morphemes
1. -s (plural)
2. -'s (possessive)
3. -s (3rd person singular)
4. -ed (past tense)
5. -ing (present participle)
6. -en (past participle)
7. -er (comparative)
8. -est (superlative)
13 Lemmatization
 Stemming and lemmatization were developed in the 1960s.
 They are text normalization and text mining procedures in the field of Natural Language Processing.
 They are applied to prepare text, words, and documents for further processing (pre-processing).
 They are widely used for tagging, SEO, web search results, and information retrieval.
 Text normalization means converting text into a more convenient, standard form.
14 Lemmatization
 Lemmatization is one of the most common text pre-processing techniques used in Natural Language Processing (NLP) and machine learning.
 Lemmatization is the task of determining that two words have the same root, despite their surface differences.
 For example,
 sang, sung, and sings are forms of the verb sing.
 The word sing is the common lemma of these words, and a lemmatizer maps all of them to sing.
15 Lemmatization
 For grammatical reasons, documents are going to use different forms of a word, such as organize, organizes, and organizing.
 Additionally, there are families of derivationally related words with similar meanings, such as democracy, democratic, and democratization.
 In many situations, it seems as if it would be useful for a search for one of these words to return documents that contain another word in the set.
 The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. For instance:
 am, are, is => be
 car, cars, car's, cars' => car
16 Stemming vs. Lemmatization
1. Stemming: faster, because it chops words without knowing the context of the word in the given sentence.
   Lemmatization: slower than stemming, but it knows the context of the word before proceeding.
2. Stemming: a rule-based approach.
   Lemmatization: a dictionary-based approach.
3. Stemming: accuracy is lower.
   Lemmatization: accuracy is higher than with stemming.
4. Stemming: converting a word to its root form may produce a form that is not a real word.
   Lemmatization: always gives a dictionary word when converting to the root form.
5. Stemming: preferred when the meaning of the word is not important for analysis. Example: spam detection.
   Lemmatization: recommended when the meaning of the word is important for analysis. Example: question answering.
6. Stemming: “Studies” => “Studi”
   Lemmatization: “Studies” => “Study”
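The contrast in the comparison can be made concrete with a toy example. The sketch below is not the Porter stemmer or any real lemmatizer: `toy_stem` applies a handful of assumed suffix-stripping rules, and `toy_lemmatize` looks words up in a small hand-written dictionary (`LEMMA_DICT`), both invented here purely for illustration.

```python
# Rule-based: strip or rewrite a suffix without consulting a dictionary.
SUFFIX_RULES = (("ies", "i"), ("ing", ""), ("ed", ""), ("s", ""))

def toy_stem(word):
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        # Only strip when at least two characters of stem remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: len(word) - len(suffix)] + replacement
    return word

# Dictionary-based: map each known surface form to its lemma.
LEMMA_DICT = {
    "studies": "study",
    "am": "be", "are": "be", "is": "be",
    "sang": "sing", "sung": "sing", "sings": "sing",
    "cars": "car",
}

def toy_lemmatize(word):
    word = word.lower()
    return LEMMA_DICT.get(word, word)  # unknown words are returned unchanged

for w in ["Studies", "sang", "cars"]:
    print(w, "| stem:", toy_stem(w), "| lemma:", toy_lemmatize(w))
```

The stemmer turns "Studies" into the non-word "studi" and leaves "sang" untouched, while the dictionary lookup returns "study" and "sing". The price is that the lemmatizer only knows the forms it has been given.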
17 Finite Automata

 The theory of automata provides efficient and convenient tools for the representation of linguistic phenomena.
 It plays a significant role in solving many problems in natural language processing, for example speech recognition, spelling correction, and information retrieval.
 Finite-state methods are useful in processing natural language because modeling information with rules has many advantages for language modeling.
 A finite-state automaton is a mathematical model that allows data to be represented in a compact form.
18
 A finite automaton (FA) is the simplest machine for recognizing patterns. It is used to characterize a regular language, for example /baa+!/. It is also used to analyze and recognize natural language expressions.
 It has a set of states and rules for moving from one state to another, depending on the input symbol applied. Based on the states and the set of rules, the input string is either accepted or rejected. Basically, it is an abstract model of a digital computer that reads an input string and changes its internal state depending on the current input symbol.
 For example, construct a DFA that accepts the language of all strings ending with ‘a’.
Given: Σ = {a, b}, start state q0, F = {q1}, Q = {q0, q1}
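One way to realize this DFA is sketched below in Python. The slide gives only the alphabet, the states, the start state, and the final state. The transition table `DELTA` is an assumption filled in here: reading 'a' always leads to q1 and reading 'b' always leads back to q0, so the machine is in q1 exactly when the last symbol read was 'a'.

```python
START = "q0"
FINALS = {"q1"}

# Assumed transition function: (state, symbol) -> next state.
DELTA = {
    ("q0", "a"): "q1",
    ("q0", "b"): "q0",
    ("q1", "a"): "q1",
    ("q1", "b"): "q0",
}

def accepts(string):
    """Run the DFA over `string` and report whether it ends in a final state."""
    state = START
    for symbol in string:
        if (state, symbol) not in DELTA:
            return False  # symbol outside the alphabet {a, b}
        state = DELTA[(state, symbol)]
    return state in FINALS

for s in ["a", "ba", "abba", "ab", ""]:
    print(repr(s), accepts(s))
```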
19 Formal languages
20 Automata and Languages
 An automaton is an abstract model of a computer
which reads an input string, and changes its
internal state depending on the current input
symbol.
 It can either accept or reject the input string.
 Every automaton defines a language (the set of
strings it accepts).
 Different automata define different language
classes:
 Finite-state automata define regular languages
 Pushdown automata define context-free languages
 Turing machines define recursively enumerable
languages
23 Finite State Automata (FSAs)
24 State Transition Diagram
25 Finite-State Methods for Morphology
26 Finite-State Morphological Parsing
27 Finite State Automata for Morphology
28 Union: Merging Automata
29 Stem Changes
30 Finite Automata: Language Recognizer
Fig: NFA for the words Boy and Bat
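Since the figure itself is not reproduced here, the following Python sketch shows one possible NFA for the two words. The state names and the layout (a shared start state whose 'b' transition leads to two branches at once) are assumptions, chosen to illustrate nondeterminism and the merging of two word automata.

```python
START = "q0"
FINALS = {"q3", "q6"}

# (state, symbol) -> set of possible next states.
# From q0, reading 'b' nondeterministically enters both branches.
DELTA = {
    ("q0", "b"): {"q1", "q4"},
    ("q1", "o"): {"q2"},
    ("q2", "y"): {"q3"},   # b-o-y
    ("q4", "a"): {"q5"},
    ("q5", "t"): {"q6"},   # b-a-t
}

def nfa_accepts(word):
    """Track the set of all states the NFA could be in after each symbol."""
    current = {START}
    for symbol in word.lower():
        current = set().union(*(DELTA.get((s, symbol), set()) for s in current))
    return bool(current & FINALS)

for w in ["Boy", "Bat", "bay", "bo"]:
    print(w, nfa_accepts(w))
```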


32 FSAs for derivational morphology
33 Recognition vs. Analysis
 FSAs can recognize (accept) a string, but they don’t provide its internal structure.
 Transducer: a machine that maps (transduces) the input string into an output string that encodes its structure.
34 Applications
FSTs are used for a variety of different applications:
 Word inflections
For example, pluralizing words (cat -> cats, dog -> dogs, goose -> geese, etc.).
 Morphological parsing
i.e., extracting the “properties” of a word (e.g., computers -> computer + [Noun] + [Plural])
 Simple word translation
e.g., translating US English to UK English.
 Simple commands made to a computer
37 Finite-State Morphological Parsing
 We need at least the following to build a
morphological parser:
1. Lexicon: the list of stems and affixes, together with basic
information about them (Noun stem or Verb stem, etc.)
2. Morphotactics: the model of morpheme ordering that explains
which classes of morphemes can follow other classes of
morphemes inside a word. E.g., the rule that English plural
morpheme follows the noun rather than preceding it.
3. Orthographic rules: these spelling rules are used to model the
changes that occur in a word, usually when two morphemes
combine (e.g., the y→ie spelling rule changes city + -s to cities).
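The three ingredients can be seen working together in a toy parser. The Python sketch below is a simplified generate-and-test illustration, not a finite-state implementation: the lexicon `NOUN_STEMS`, the helper names, and the output notation are assumptions. It handles only noun stems followed by an optional plural -s.

```python
# 1. Lexicon: a tiny list of noun stems (assumed for the example).
NOUN_STEMS = {"city", "fox", "cat", "computer"}

# 3. Orthographic rules applied when stem and plural -s combine.
def attach_plural(stem):
    if stem.endswith("y") and len(stem) > 1 and stem[-2] not in "aeiou":
        return stem[:-1] + "ies"          # y -> ie: city + -s -> cities
    if stem.endswith(("s", "x", "z", "ch", "sh")):
        return stem + "es"                # e-insertion: fox + -s -> foxes
    return stem + "s"

# 2. Morphotactics: the plural morpheme may only follow a noun stem.
def parse_noun(surface):
    """Return every analysis of `surface` licensed by the lexicon and rules."""
    analyses = []
    for stem in sorted(NOUN_STEMS):
        if surface == stem:
            analyses.append(f"{stem} + [Noun] + [Singular]")
        if surface == attach_plural(stem):
            analyses.append(f"{stem} + [Noun] + [Plural]")
    return analyses

for word in ["cities", "foxes", "computers", "cat", "scities"]:
    print(word, "->", parse_noun(word))
```

A word that cannot be built from a known stem, such as "scities", receives no analysis.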
38 Construction of a Finite state Lexicon
 A lexicon is a repository for words.
 The simplest one would consist of an explicit list of every word of the language.
 Inconvenient or impossible!
 Computational lexicons are usually structured with
 a list of each of the stems and
 affixes of the language, together with a representation of morphotactics telling us how they can fit together.
 The most common way of modeling morphotactics is the finite-state automaton.

reg-noun: fox, fat, fog, aardvark
irreg-pl-noun: geese, sheep, mice
irreg-sg-noun: goose, sheep, mouse
plural: -s
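A morphotactic FSA over these classes can be sketched as follows. The arcs are labeled with morpheme classes, not letters. The three-state layout (q0, q1, q2) is an assumption made for this example, and only a subset of the lexicon entries is included. The automaton works on morpheme sequences, so spelling changes such as fox + -s -> foxes are not modeled.

```python
LEXICON = {
    "reg-noun": {"fox", "fat", "fog"},
    "irreg-pl-noun": {"geese", "sheep", "mice"},
    "irreg-sg-noun": {"goose", "sheep", "mouse"},
    "plural": {"-s"},
}

# Assumed morphotactics: (state, morpheme class) -> next state.
TRANSITIONS = {
    ("q0", "reg-noun"): "q1",
    ("q0", "irreg-sg-noun"): "q2",
    ("q0", "irreg-pl-noun"): "q2",
    ("q1", "plural"): "q2",      # only regular nouns may take -s
}
START, FINALS = "q0", {"q1", "q2"}

def accepts(morphemes):
    """Check whether a sequence of morphemes is a licensed noun form."""
    states = {START}
    for morpheme in morphemes:
        next_states = set()
        for state in states:
            for cls, members in LEXICON.items():
                if morpheme in members and (state, cls) in TRANSITIONS:
                    next_states.add(TRANSITIONS[(state, cls)])
        states = next_states
    return bool(states & FINALS)

print(accepts(["fox", "-s"]))    # regular noun + plural
print(accepts(["geese"]))        # irregular plural on its own
print(accepts(["goose", "-s"]))  # irregular singular cannot take -s
```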
39 Construction of a Finite state Lexicon

reg-verb-stem: walk, fry, talk, impeach
irreg-verb-stem: cut, speak, sing
irreg-past-verb: caught, ate, eaten, sang, spoken
past: -ed
past-part: -ed
pres-part: -ing
3sg: -s
40 Construction of a Finite state Lexicon
41 Construction of a Finite state Lexicon
42 Finite-State Morphological Parsing

An FSA for another fragment of English derivational morphology


43 Finite-State Morphological Parsing
44 Finite - State Transducers
45 Finite State Transducers (FST)
 It is like an FSA but defines regular relations, not regular languages
 It has two alphabet sets (input and output)
 It has a transition function relating state and input to the next state
 It has an output function relating state and input to output
 It can be used to recognize, generate, translate or relate sets.
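The definition above maps directly onto two lookup tables. The Python sketch below is a minimal deterministic transducer for a single stem, under assumed names: `DELTA` is the transition function, `SIGMA` is the output function, and the input alphabet contains letters plus the illustrative tags +N, +SG and +PL. It translates a lexical string into a surface string.

```python
START, FINALS = 0, {5}

# Transition function: (state, input symbol) -> next state.
DELTA = {
    (0, "c"): 1, (1, "a"): 2, (2, "t"): 3,
    (3, "+N"): 4, (4, "+SG"): 5, (4, "+PL"): 5,
}

# Output function: (state, input symbol) -> output string ("" means no output).
SIGMA = {
    (0, "c"): "c", (1, "a"): "a", (2, "t"): "t",
    (3, "+N"): "", (4, "+SG"): "", (4, "+PL"): "s",
}

def run_fst(symbols):
    """Translate a list of input symbols; return None if the input is rejected."""
    state, output = START, []
    for symbol in symbols:
        if (state, symbol) not in DELTA:
            return None
        output.append(SIGMA[(state, symbol)])
        state = DELTA[(state, symbol)]
    return "".join(output) if state in FINALS else None

print(run_fst(["c", "a", "t", "+N", "+PL"]))  # cats
print(run_fst(["c", "a", "t", "+N", "+SG"]))  # cat
print(run_fst(["c", "a", "t"]))               # None: stops in a non-final state
```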
46 Finite - State Transducers
47 Morphological parsing with FST
 The objective of morphological parsing is to produce the lexical-level analysis (stem plus morphological features) for a given input surface form
48 Morphological parsing with FST
 Lexical: representing a simple concatenation of morphemes making up a word
 Surface: representing the actual spelling of the final word.
49 Morphological Parsing with FST

 The FST is a multi-function device, and can be viewed in the following ways:
 Translator: It reads one string on one tape and outputs another string.
 Recognizer: It takes a pair of strings as two tapes and accepts/rejects based on their matching.
 Generator: It outputs a pair of strings on two tapes along with a yes/no result based on whether they match or not.
 Relater: It computes the relation between two sets of strings available on two tapes.
50 Morphological parsing with FST
 Inversion is useful for converting an FST-as-parser into an FST-as-generator.
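Turning a parser into a generator amounts to swapping the input and output label on every arc, an operation usually called inversion. The Python sketch below illustrates this on a one-word transducer. The arc list, the tag names, and the helper functions `invert` and `transduce` are assumptions made for the example, and the search is a simple depth-first walk that allows empty-input arcs.

```python
EPS = ""  # the empty symbol

# Parser arcs: (source state, surface symbol, lexical symbol, target state).
PARSER_ARCS = [
    (0, "c", "c", 1), (1, "a", "a", 2), (2, "t", "t", 3),
    (3, EPS, "+N", 4),
    (4, EPS, "+SG", 5),
    (4, "s", "+PL", 5),
]
START, FINALS = 0, {5}

def invert(arcs):
    """Swap input and output labels on every arc."""
    return [(src, out, inp, dst) for src, inp, out, dst in arcs]

def transduce(arcs, symbols):
    """Return every output sequence the transducer relates to `symbols`."""
    results = []

    def walk(state, pos, output):
        if pos == len(symbols) and state in FINALS:
            results.append(output)
        for src, inp, out, dst in arcs:
            if src != state:
                continue
            emitted = output + ([out] if out != EPS else [])
            if inp == EPS:
                walk(dst, pos, emitted)          # consume nothing
            elif pos < len(symbols) and symbols[pos] == inp:
                walk(dst, pos + 1, emitted)      # consume one symbol

    walk(START, 0, [])
    return results

print(transduce(PARSER_ARCS, list("cats")))                           # parsing
print(transduce(invert(PARSER_ARCS), ["c", "a", "t", "+N", "+PL"]))   # generation
```

The same arc list serves both directions, so no separate generator has to be written by hand.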
51 Morphological Parsing with FST