Chapter 3.
Morphology and
Finite-State Transducers
From: Chapter 3 of An Introduction to Natural Language
Processing, Computational Linguistics, and Speech
Recognition, by Daniel Jurafsky and James H. Martin
Background
• Morphology — knowledge of the meaningful components of words
• The problem of recognizing that foxes breaks down into the two
morphemes fox and -es is called morphological parsing.
• The technique of retrieval of stem or root word by removing prefixes
and suffixes: stemming
Morphology and FSTs 2
3.1 Survey of (Mostly) English Morphology
• Morphology is the study of the way words are built up from smaller
meaning-bearing units, morphemes.
• Two broad classes of morphemes:
– The stems: the “main” morpheme of the word, supplying the main
meaning, while
– The affixes: add “additional” meaning of various kinds.
• Affixes are further divided into prefixes, suffixes, infixes, and
circumfixes.
– Suffix: eat-s
– Prefix: un-buckle
– Circumfix: ge-sag-t (said) sagen (to say) (in German)
– Infix: hingi (borrow) humingi (the agent of an action) )in Philippine
language Tagalog)
Morphology and FSTs 3
3.1 Survey of (Mostly) English Morphology
• Circumfixes: Circumfixes are affixes that attach to both the beginning
and the end of a base word to create a new word. While true
circumfixes are rare in English, some linguistic scholars argue that
certain combinations of prefixes and suffixes can function similarly.
For example:
– "Un-" + "-ed": In words like "unraveled" or "untangled,".
• Infixes: Infixes are affixes that are inserted into the middle of a base
word to create a new word.
Morphology and FSTs 4
3.1 Survey of (Mostly) English Morphology
• Prefixes and suffixes are often called concatenative morphology.
• A number of languages have extensive non-concatenative
morphology
– Ablaut: A process involving vowel alternation within a root to indicate
grammatical information or create related words.
• Example: "sing" (base form), "sang" (past tense), "sung" (past participle).
– Reduplication: The repetition of all or part of a word to indicate plurality,
intensification, or other grammatical features.
• Example: Tagalog "lakad" (walk) → "lakad-lakad" (stroll).
– Suppletion: The use of entirely different morphemes to express related
meanings.
• Example: English "to be" → "am," "is," "are," "was," "were."
Morphology and FSTs 5
3.1 Survey of (Mostly) English Morphology
• Two broad classes of ways to form words from morphemes:
– Inflection: Inflection involves adding affixes to a word to indicate
grammatical information such as tense, number, case, gender, or mood.
– Derivation: Derivation involves adding affixes to a word to create a new
word with a different meaning or part of speech.
– Inflectional morphemes typically do not change the part of speech or
meaning of the word, while derivational morphemes often do.
– Example: In English, "-er" can be inflectional, as in "bigger"
(comparative), or derivational, as in "teacher" (noun derived from
"teach").
Morphology and FSTs 6
3.1 Survey of (Mostly) English Morphology
Inflectional Morphology
• In English, only nouns, verbs, and sometimes adjectives can be
inflected, and the number of affixes is quite small.
• Inflections of nouns in English:
– An affix marking plural,
• Suffix "-s": The most common plural marker in
English, added to the end of most nouns.
– "cat" → "cats", "dog" → "dogs", "book" → "books"
• Suffix "-es": Used for nouns ending in sibilant
sounds (s, sh, ch, x, z).
– "bus" → "buses", "box" → "boxes", "watch" → "watches"
• Suffix "-en": Very rarely used
– "child" → "children", "ox" → "oxen", "brother" →
"brethren“
– An affix marking possessive
• llama’s, children’s, llamas’, Euripides’ comedies
Morphology and FSTs 7
3.1 Survey of (Mostly) English Morphology
Inflectional Morphology
• Verbal inflection is more complicated than
nominal inflection.
– English has three kinds of verbs:
• Main verbs: Express the main action or state in a sentence,
– eat, sleep, impeach
• Modal verbs: Auxiliary verbs that express necessity,
possibility, permission, or ability
– can will, should
• Primary verbs: Act as both main and auxiliary verbs (be,
have, do).
– be, have, do
Morphology and FSTs 8
3.1 Survey of (Mostly) English Morphology
Inflectional Morphology
– Morphological forms of regular verbs
stem walk merge try map
-s form walks merges tries maps
-ing principle walking merging trying mapping
Past form or –ed participle walked merged tried mapped
– These regular verbs and forms are significant in the morphology of
English because of their majority and being productive.
Morphology and FSTs 9
3.1 Survey of (Mostly) English Morphology
Inflectional Morphology
– Morphological forms of irregular verbs
stem eat catch cut
-s form eats catches cuts
-ing principle eating catching cutting
Past form ate caught cut
–ed participle eaten caught cut
Morphology and FSTs 10
3.1 Survey of (Mostly) English Morphology
Derivational Morphology
• Nominalization in English:
– The formation of new nouns, often from verbs or adjectives
Suffix Base Verb/Adjective Derived Noun
-action computerize (V) computerization
-ee appoint (V) appointee
-er kill (V) killer
-ness fuzzy (A) fuzziness
– Adjectives derived from nouns or verbs
Suffix Base Noun/Verb Derived Adjective
-al computation (N) computational
-able embrace (V) embraceable
-less clue (A) clueless
Morphology and FSTs 11
3.1 Survey of (Mostly) English Morphology
Derivational Morphology
• Derivation in English is more complex than inflection because
– Generally less productive
• A nominalizing affix like –ation can not be added to absolutely every verb.
eatation(*)
– There are subtle and complex meaning differences among nominalizing
suffixes. For example, sincerity has a subtle difference in meaning from
sincereness.
Morphology and FSTs 12
Terminology
• Parsing English morphology
Meaning Example
SG Singular (N) Fox
PL Plural (N) Foxes
------ ------------------------------- --------------------------------------
1SG First-Person Singular I walk
2SG Second-Person SG You walk
3SG Third-Person SG He (She | It) walks
1PL First-Person Plural We walk
2PL Second-Person PL You walk
3PL Third-Person PL They walk
Morphology and FSTs 13
Morphological Parsing
• Parsing English morphology
Input Morphological parsed output
cats cat +N +PL
cat cat +N +SG
cities city +N +PL
geese goose +N +PL
goose (goose +N +SG) or (goose +V)
gooses goose +V +3SG She gooses the engine to make it start
merging merge +V +PRES-PART
caught (caught +V +PAST-PART) or (catch +V +PAST)
h o l ogical features
d mo r p
Stems an
Morphology and FSTs 14
Morphological Parsing
• We need at least the following to build a morphological parser:
1. Lexicon: the list of stems and affixes, together with basic information
about them (Noun stem or Verb stem, etc.)
2. Morphotactics: The rules and patterns governing the combination and
arrangement of morphemes within a word. It involves the study of how
morphemes are structured and ordered to create words in a particular
language. It involves: Ordering of Morphemes, Co-occurrence
Restrictions etc.
3. Orthographic rules: These spelling rules are used to model the changes
that occur in a word, usually when two morphemes combine (e.g., the
y→ie spelling rule changes city + -s to cities).
Morphology and FSTs 15
Morphological Parsing
Lexicon Morphotactics Orthographic Rules
Set of rules that decide
Stores basic information Set of rules to make
the changes to the
about a word decisions
spelling
Decide whether the word can City 🡪 Cities
appear/not appear City 🡪 Citys
Word is stem or affix?
before/after/in-between other Knife 🡪 Knives
words Knife 🡪 Knifes
If it is stem, whether a Use-full-ness 🡪
Use-full-ness
noun stem or a verb stem? Usefulness
If affix, whether a prefix,
full-use-ness
suffix, circumfix or infix?
Morphology and FSTs 16
Finite-State Automata for Morphology
Morphology and FSTs 17
3.2 Finite-State Morphological Parsing
The Lexicon and Morphotactics
• A lexicon is a repository for words.
– The simplest one would consist of an explicit list of every word of the language.
– Computational lexicons are usually structured with
• a list of each of the stems and
• Affixes of the language together with a representation of morphotactics telling us how
they can fit together.
– The most common way of modeling morphotactics is the finite-state automaton.
Reg-noun Irreg-pl-noun Irreg-sg-noun plural
fox geese goose -s
fat sheep sheep
fog Mice mouse
An FSA for English nominal inflection
Morphology and FSTs 18
3.2 Finite-State Morphological Parsing
The Lexicon and Morphotactics
Reg-noun Irreg-pl-noun Irreg-sg-noun plural
dog geese goose -s
cat
Morphology and FSTs 19
3.2 Finite-State Morphological Parsing
The Lexicon and Morphotactics
An FSA for English verbal inflection
Reg-verb-stem Irreg-verb-stem Irreg-past-verb past Past-part Pres-part 3sg
walk cut caught -ed -ed -ing -s
fry speak ate
talk sing eaten
impeach sang
spoken
Morphology and FSTs 20
3.2 Finite-State Morphological Parsing
The Lexicon and Morphotactics
• English derivational morphology is more complex than English
inflectional morphology, and so automata of modeling English
derivation tends to be quite complex.
big, bigger, biggest
cool, cooler, coolest, coolly
red, redder, reddest
clear, clearer, clearest, clearly, unclear, unclearly
happy, happier, happiest, happily
unhappy, unhappier, unhappiest, unhappily
An FSA for a fragment of English adjective real, unreal, really
Morphology #1
Morphology and FSTs 21
3.2 Finite-State Morphological Parsing
• The FSA#1 recognizes all the listed adjectives, and ungrammatical forms
like unbig, redly, and realest.
• Thus #1 is revised to become #2.
• The complexity is expected from English derivation.
An FSA for a fragment of English adjective
Morphology #2
Morphology and FSTs 22
3.2 Finite-State Morphological Parsing
An FSA for another fragment of English derivational
morphology
Morphology and FSTs 23
3.2 Finite-State Morphological Parsing
• We can now use these FSAs to
solve the problem of
morphological recognition:
– Determining whether an input
string of letters makes up a
legitimate English word or not
– We do this by taking the
morphotactic FSAs, and plugging
in each “sub-lexicon” into the FSA.
– The resulting FSA can then be
defined as the level of the
individual letter.
Morphology and FSTs 24
3.2 Finite-State Morphological Parsing
Morphological Parsing with FST
• Given the input, for example, cats, we would like to produce cat +N +PL.
• Two-level morphology
• Representing a word as a correspondence between
– a lexical level
• Representing a simple concatenation of morphemes making up a word, and
– The surface level
• Representing the actual spelling of the final word.
• Morphological parsing is implemented by building mapping rules that maps
letter sequences like cats on the surface level into morpheme and features
sequence like cat +N +PL on the lexical level.
Morphology and FSTs 25
3.2 Finite-State Morphological Parsing
Morphological Parsing with FST
• The automaton we use for performing the mapping between these two
levels is the finite-state transducer or FST.
– A transducer maps between one set of symbols and another;
– An FST does this via a finite automaton.
• Thus an FST can be seen as a two-tape automaton which recognizes or
generates pairs of strings.
• The FST has a more general function than an FSA:
– An FSA defines a formal language
– An FST defines a formal relation between sets of strings.
• Another view of an FST:
– A machine reads one string and generates another.
Morphology and FSTs 26
3.2 Finite-State Morphological Parsing
Morphological Parsing with FST
• FST as recognizer:
– a transducer that takes a pair of strings as input and output accept if the
string-pair is in the string-pair language, and a reject if it is not.
• FST as generator:
– a machine that outputs pairs of strings of the language. Thus the output is
a yes or no, and a pair of output strings.
• FST as transducer:
– A machine that reads a string and outputs another string.
• FST as set relater:
– A machine that computes relation between sets.
Morphology and FSTs 27
3.2 Finite-State Morphological Parsing
Morphological Parsing with FST
• A formal definition of FST:
– Q: a finite set of N states q0, q1,…, qN
– Σ: a finite alphabet of complex symbols. Each complex symbol is
composed of an input-output pair i : o; one symbol I from an input
alphabet I, and one symbol o from an output alphabet O, thus Σ ⊆ I×O. I
and O may each also include the epsilon symbol ε.
– q0: the start state
– F: the set of final states, F ⊆ Q
– δ(q, i:o): the transition function or transition matrix between states. Given
a state q ∈ Q and complex symbol i:o ∈ Σ, δ(q, i:o) returns a new state q’
∈ Q. δ is thus a relation from Q × Σ to Q.
Morphology and FSTs 28
3.2 Finite-State Morphological Parsing
Morphological Parsing with FST
• FSAs are isomorphic to regular languages, FSTs are isomorphic to
regular relations.
• Regular relations are sets of pairs of strings, a natural extension of the
regular language, which are sets of strings.
• FSTs are closed under union, but generally they are not closed under
difference, complementation, and intersection.
• Two useful closure properties of FSTs:
– Inversion: If T maps from I to O, then the inverse of T, T-1 maps from O
to I.
– Composition: If T1 is a transducer from I1 to O1 and T2 a transducer from
I2 to O2, then T1 。 T2 maps from I1 to O2
Morphology and FSTs 29
3.2 Finite-State Morphological Parsing
Morphological Parsing with FST
• Inversion is useful because it makes it easy to convert a FST-as-parser into an
FST-as-generator.
• Composition is useful because it allows us to take two transducers than run in series
and replace them with one complex transducer.
– T1。T2(S) = T2(T1(S) )
Reg-noun Irreg-pl-noun Irreg-sg-noun
fox g o:e o:e s e goose
fat sheep sheep
fog m o:i u:εs:c e mouse
aardvark
A transducer for English nominal
number inflection Tnum
Morphology and FSTs 30
3.2 Finite-State Morphological Parsing
Morphological Parsing with FST
^: morpheme boundary
#: word boundary
A fleshed-out English nominal inflection FST
Tlex = Tnum。Tstems
Morphology and FSTs 31
3.2 Finite-State Morphological Parsing
Orthographic Rules and FSTs
• Spelling rules (or orthographic rules)
Name Description of Rule Example
Consonant doubling 1-letter consonant doubled before -ing/-ed beg/begging
E deletion Silent e dropped before -ing and -ed make/making
E insertion e added after -s, -z, -x, -ch, -sh, before -s watch/watches
Y replacement -y changes to -ie before -s, -i before -ed try/tries
K insertion Verb ending with vowel + -c add -k panic/panicked
– These spelling changes can be thought as taking as input a simple concatenation of
morphemes and producing as output a slightly-modified concatenation of
morphemes.
Morphology and FSTs 32
3.2 Finite-State Morphological Parsing
Orthographic Rules and FSTs
• “insert an e on the surface tape just when the lexical tape has a
morpheme ending in x (or z, etc) and the next morphemes is -s”
Morphology and FSTs 33
3.2 Finite-State Morphological Parsing
Orthographic Rules and FSTs
The transducer for the E-insertion rule
Morphology and FSTs 34
3.3 Combining FST Lexicon and Rules
Morphology and FSTs 35
3.3 Combining FST Lexicon and Rules
Morphology and FSTs 36
3.3 Combining FST Lexicon and Rules
Morphology and FSTs 37
3.3 Combining FST Lexicon and Rules
• The power of FSTs is that the exact same cascade with the same state
sequences is used
– when machine is generating the surface form from the lexical tape, or
– When it is parsing the lexical tape from the surface tape.
• Parsing can be slightly more complicated than generation, because of
the problem of ambiguity.
– For example, foxes could be fox +V +3SG as well as fox +N +PL
Morphology and FSTs 38
3.4 Lexicon-Free FSTs: the Porter Stemmer
• One of the most widely used stemming algorithms is the simple and
efficient Porter (1980) algorithm, which is based on a series of simple
cascaded rewrite rules.
• Information retrieval
• The algorithm applies a series of rules to remove suffixes in a
step-by-step manner.
• The process consists of five steps, each handling different suffixes.
Morphology and FSTs 39
3.4 Lexicon-Free FSTs: the Porter Stemmer
• Main Rules Applied in Porter Stemming
• The algorithm works through five steps of suffix removal, applying
different rules in each step. Below are some common transformations:
Step 1a:
1. SSES → SS (e.g., "caresses" → "caress")
2. IES → I (e.g., "ponies" → "poni")
3. SS → SS (unchanged, e.g., "happiness" → "happiness")
4. S → (remove S if not preceded by a vowel-consonant pair)
Step 1b:
5. EED → EE (e.g., "agreed" → "agree")
6. ING → (remove if preceded by a vowel) (e.g., "running" → "run")
Morphology and FSTs 40
3.4 Lexicon-Free FSTs: the Porter Stemmer
• Main Rules Applied in Porter Stemming
• The algorithm works through five steps of suffix removal, applying
different rules in each step. Below are some common transformations:
Step 2:
1. TIONAL → TION (e.g., "national" → "nation")
2. MENT → (remove) (e.g., "enjoyment" → "enjoy")
Step 3:
3. ICATE → IC (e.g., "duplicate" → "duplic")
Step 4:
4. AL → (remove) (e.g., "formal" → "form")
Step 5:
5. E → (remove final 'e' if preceded by a vowel) (e.g., "hope" → "hop")
Morphology and FSTs 41
3.4 Lexicon-Free FSTs: the Porter Stemmer
• Use Cases of Porter Stemming
• Search Engines: Reduces word variations for better indexing (e.g.,
"running" and "runs" become "run").
• Text Mining & NLP: Helps in sentiment analysis and topic modeling.
• Spam Filtering: Detects patterns by reducing words to common forms.
Morphology and FSTs 42
Tokenization
• Tokenization:
– Tokenization is the process of breaking a text into smaller units called
tokens.
– These tokens can be words, sentences, or even characters, depending on
the level of tokenization.
Morphology and FSTs 43
Word Tokenization
• Word Tokenization:
– Word tokenization splits a sentence or text into individual words.
– This helps in text analysis, NLP, and machine learning applications.
• Example:
– Input: "Natural Language Processing is fun!“
– Word Tokens: ["Natural", "Language", "Processing", "is", "fun", "!"]
Morphology and FSTs 44
Sentence Tokenization
• Sentence Tokenization:
– Sentence tokenization splits a paragraph or text into individual
sentences.
– It is useful for text summarization, translation, and sentiment analysis.
• Example:
– Input: "Hello world. How are you? I'm learning NLP!“
– Sentence tokens: ["Hello world.", "How are you?", "I'm learning NLP!"]
Morphology and FSTs 45
Minimum Distance Edit
• Minimum Edit Distance, also known as Levenshtein Distance, is a
method used to measure the difference between two strings by
counting the minimum number of operations required to convert one
string into another.
• Allowed Operations
– Insertion → Add a character (e.g., kit → kits)
– Deletion → Remove a character (e.g., kits → kit)
– Substitution → Replace one character with another (e.g., kit → bit)
Morphology and FSTs 46
Minimum Distance Edit
• Example:
– The gap between the words intention and execution is five operations
– Insertion/ Deletion costs 1
– Substitution costs 2
Morphology and FSTs 47
Human Morphological Processing
• How Multi-Morphemic Words Are Represented in the Mind
• Do we store all word forms separately (walk, walks, walked)?
• Two competing hypotheses:
• Full Listing Hypothesis → All words are stored separately.
• Minimum Redundancy Hypothesis → Only base forms &
morphemes are stored
Morphology and FSTs 48
Evidence from Experiments
🔹 Speech Errors (Slips of the Tongue)
• Errors suggest morphemes can separate from their base forms.
– Screw looses (for screws loose)
– Easy enoughly (for easily enough)
🔹 Experimental Studies
– Stanners et al. (1979): Regular inflections (burned → burn) are stored
together, but some derived words (happiness → happy) are separate.
• Marslen-Wilson et al. (1994): Priming occurs only if derived words
retain meaning (government → govern, but department ≠ depart).
Morphology and FSTs 49
Morphological Family & Word Recognition
🔹 Morphological Family Size Effect
– Larger families (e.g., fear → fearless, fearful, fearsome) lead to faster
recognition.
– Supported by Baayen et al. (1997), De Jong et al. (2002).
🔹 Key Takeaways
– Inflectional morphology plays an active role in processing.
– Some derived forms are stored separately, while others prime their base.
– Morphological structure in the mind is neither fully listed nor fully
redundant.
Morphology and FSTs 50