0% found this document useful (0 votes)
8 views64 pages

Computer Engineering Morphology Guide

The document outlines the vision and mission of a Computer Engineering department, emphasizing the development of competent citizens and fostering research and innovation. It covers various aspects of English morphology, including word formation, inflectional and derivational morphology, and techniques like stemming and lemmatization. Additionally, it discusses regular expressions and finite state automata, highlighting their roles in computational linguistics and morphological parsing.

Uploaded by

22301011ankit
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
8 views64 pages

Computer Engineering Morphology Guide

The document outlines the vision and mission of a Computer Engineering department, emphasizing the development of competent citizens and fostering research and innovation. It covers various aspects of English morphology, including word formation, inflectional and derivational morphology, and techniques like stemming and lemmatization. Additionally, it discusses regular expressions and finite state automata, highlighting their roles in computational linguistics and morphological parsing.

Uploaded by

22301011ankit
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Vision and Mission (Department)

Vision
To develop competent citizens who will be valuable contributors in the field of
Computer Engineering.

Mission
To create an environment which will stimulate research and innovation.
To provide students with comprehensive knowledge of the latest developments in
the field of Computer Engineering.
Module # 2

Word Level Analysis


CO : Students will be able to analyze the functioning of a language model for word-level text analysis.
English Morphology

► Morphology is the study of the way words are built up from smaller
meaning bearing units, morphemes.
► A morpheme is often defined as the minimal meaning-bearing unit in
a language.
► So for example the word fox consists of a single morpheme (the
morpheme fox) while the word cats consists of two: the morpheme
cat and the morpheme -s.

► There are two broad classes of Morphemes: Stems and Affixes.


English Morphology (Contd…)

► Affixes are further divided into prefixes, suffixes, infixes, and


circumfixes.
► Prefixes precede the stem, suffixes follow the stem, circumfixes do
both, and infixes are inserted inside the stem.
► For example, the word eats is composed of a stem eat and the suffix
-s.
► The word unbuckle is composed of a stem buckle and the prefix un-.
► English doesn’t have any good examples of circumfixes and infixes,
but many other languages do.
English Morphology (Contd…)

► A word can have more than one affix.


► For example, the word rewrites has the prefix re-, the stem write,
and the suffix -s.
► The word unbelievably has a stem (believe) plus three affixes (un-,
-able, and -ly).
► While English doesn’t tend to stack more than 4 or 5 affixes,
languages like Turkish can have words with 9 or 10 affixes, as we saw
above.
► Languages that tend to string affixes together like Turkish does are
called agglutinative languages.
► There are two broad (and partially overlapping) classes of ways to
form words from morphemes: inflection and derivation.
Inflectional Morphology

► Inflection is the combination of a word stem with a grammatical


morpheme, usually resulting in a word of the same class as the
original stem, and usually filling some syntactic function like
agreement.
► For example, English has the inflectional morpheme -s for marking
the plural on nouns, and the inflectional morpheme -ed for marking
the past tense on verbs.
► English has a relatively simple inflectional system; only nouns, verbs,
and sometimes adjectives can be inflected, and the number of
possible inflectional affixes is quite small.
Inflectional Morphology (Contd…)

► English nouns have only two kinds of inflection: an affix that marks
plural and an affix that marks possessive.

► While the regular plural is spelled -s after most nouns, it is spelled


–es after words ending in -s (ibis/ibises) , -z, (waltz/waltzes) -sh,
(thrush/thrushes) -ch, (finch/finches) and sometimes -x (box/boxes).
► Nouns ending in -y preceded by a consonant change the -y to -i
(butterfly/butterflies).
Inflectional Morphology (Contd…)

► The possessive suffix is realized by apostrophe + -s for regular


singular nouns (llama’s) and plural nouns not ending in -s (children’s)
and often by a lone apostrophe after regular plural nouns (llamas’)
and some names ending in -s or -z (Euripides’ comedies).
► English verbal inflection is more complicated than nominal inflection.
► First, English has three kinds of verbs; main verbs, (eat, sleep,
impeach), modal verbs (can, will, should), and primary verbs (be,
have, do)
► We will mostly be concerned with the main and primary verbs,
because it is these that have inflectional endings.
► Of these verbs a large class are regular, that is to say all verbs of this
class have the same endings marking the same functions.
Inflectional Morphology (Contd…)

► These regular verbs (e.g. walk, or inspect), have four morphological


forms, as follows:

► The irregular verbs are those that have some more or less
idiosyncratic (individual) forms of inflection
Derivational Morphology

► Derivation is the combination of a word stem with a grammatical


morpheme, usually resulting in a word of a different class, often with
a meaning hard to predict exactly.
► A very common kind of derivation in English is the formation of new
nouns, often from verbs or adjectives. This process is called
nominalization.
Derivational Morphology (Contd…)

► Adjectives can also be derived from nouns and verbs.


► Here are examples of a few suffixes deriving adjectives from nouns or
verbs.
Stemming and Lemmatization

► Stemming algorithms work by cutting off the end or the beginning of


the word, taking into account a list of common prefixes and suffixes
that can be found in an inflected word.
► This indiscriminate cutting can be successful in some occasions, but
not always, and that is why we affirm that this approach presents
some limitations.
Stemming and Lemmatization (Contd…)

► Lemmatization, on the other hand, takes into consideration the


morphological analysis of the words.
► To do so, it is necessary to have detailed dictionaries which the
algorithm can look through to link the form back to its lemma
Porter Stemmer

► STEMMING a simplified form of morphological analysis – simply find


the stem.
► Porter Stemmer is a simple rule-based algorithm for stemming

► Based on rules like:


► ATIONAL -> ATE (e.g., relational -> relate)

► The algorithm consists of five sets of rules, applied in order.


Porter Stemmer (Contd…)

► Definitions:
► CONSONANT: a letter other than A, E, I, O, U, and Y preceded by
consonant
► VOWEL: any other letter
► With this definition, all words are of the form:
► (C)(VC)m(V)
► C=string of one or more consonants (con+)
► V=string of one or more vowels
► M=measure of word or word part, when represented in the form (VC)
Porter Stemmer (Contd…)
Porter Stemmer (Contd…)

► The rules are of the form: (condition) S1 -> S2


► Where S1 and S2 are suffixes

► Conditions:
Porter Stemmer (Contd…)
Porter Stemmer (Contd…)
if the second or third rule of step 1b was used
Porter Stemmer (Contd…)
Porter Stemmer (Contd…)
Porter Stemmer (Contd…)
Porter Stemmer (Contd…)
Porter Stemmer (Contd…)

► Drawbacks:
► gas (noun) → ga
► gases (plural) → gase
► gasses (verb, present tense) → gass
► gassing (verb, present continuous) → gass
► gaseous (adjective) → gaseou
Regular Expressions

► A regular expression is a formula in a special language that is used for


specifying simple classes of strings. A string is a sequence of symbols;
for the purpose of most text-based search techniques, a string is any
sequence of alphanumeric characters (letters, numbers, spaces, tabs,
and punctuation).
► For these purposes a space is just a character like any other, and we
represent it with the symbol .
► Formally, a regular expression is an algebraic notation for
characterizing a set of strings.
► Thus they can be used to specify search strings as well as to define a
language in a formal way.
Examples of Regular Expressions
Regular Expressions Regular Set

(0 + 10*) {0, 1, 10, 100, 1000, 10000, … }

(0*10*) {1, 01, 10, 010, 0010, …}

(0 + ε)(1 + ε) {ε, 0, 1, 01}

It would be set of strings of a’s and b’s of any length which also
(a+b)*
includes the null string i.e. {ε, a, b, aa , ab , bb , ba, aaa…….}

It would be set of strings of a’s and b’s ending with the string abb i.e.
(a+b)*abb
{abb, aabb, babb, aaabb, ababb, …………..}

It would be set consisting of even number of 1’s which also includes an


(11)*
empty string i.e. {ε, 11, 1111, 111111, ……….}

It would be set of strings consisting of even number of a’s followed by


(aa)*(bb)*b odd number of b’s i.e. {b, aab, aabbb, aabbbbb, aaaab, aaaabbb,
…………..}

It would be string of a’s and b’s of even length that can be obtained by
(aa + ab + ba + bb)* concatenating any combination of the strings aa, ab, ba and bb
including null i.e. {aa, ab, ba, bb, aaab, aaba, …………..}
Properties of regular sets
• If we do the union of two regular sets then the resulting set would also
be regular.
• If we do the intersection of two regular sets then the resulting set
would also be regular.
• If we do the complement of regular sets, then the resulting set would
also be regular.
• If we do the difference of two regular sets, then the resulting set would
also be regular.
• If we do the reversal of regular sets, then the resulting set would also
be regular.
• If we take the closure of regular sets, then the resulting set would also
be regular.
• If we do the concatenation of two regular sets, then the resulting set
would also be regular.
Regular Expressions (Contd…)

► The simplest kind of regular expression is a sequence of simple


characters.

► Regular expressions are case sensitive.


► This means that the pattern /woodchucks/ will not match the string
Woodchucks.
Regular Expressions (Contd…)

► The brackets can be used with the dash ( - ) to specify any one character in
a range.

► The square brackets can also be used to specify what a single character
cannot be, by use of the caret ^.
► If the caret (^) is the first symbol after the open square brace, the
resulting pattern is negated.
► For example, the pattern /[^a]/ matches any single character (including
special characters) except a.
Regular Expressions (Contd…)

► The use of square brackets solves our capitalization problem for


woodchucks.
► But we still haven’t answered our original question; how do we
specify both woodchuck and woodchucks?
► We can’t use the square brackets, because while they allow us to say
‘s or S’, they don’t allow us to say ‘s or nothing’.
► For this we use the question-mark, which means ‘the preceding
character or nothing’.
Regular Expressions (Contd…)
► The set of operators that allow us to say things like “some number of ‘a’s”
are based on the asterisk or *, commonly called the Kleene * (pronounced
“cleany star”).
► The Kleene star means ‘zero or more occurrences of the immediately
previous character or regular expression’.
► Recall that the regular expression for an individual digit was /[0-9]/.
► So the regular expression for an integer (a string of digits) is /[0-9] [0-9]*/.
(Why isn’t it just ) /[0-9]*/?
► Sometimes it’s annoying to have to write the regular expression for digits
twice, so there is a shorter way to specify ‘at least one’ of some character.
► This is the Kleene +, which means ‘one or more of the previous character’.
► Thus the expression is /[0-9]+/ the normal way to specify ‘a sequence of
digits’.
Regular Expressions (Contd…)
► Anchors are special characters that anchor regular expressions to
particular places in a string.
► The most common anchors are the caret ^ and the dollar-sign $.
► The caret ^ matches the start of a line. The pattern /^The/ matches
the word ‘The’ only at the start of a line.
► Thus there are three uses of the caret ^:
► to match the start of a line,
► as a negation inside of square brackets, and
► just to mean a caret.
► The dollar sign $ matches the end of a line.
► So the pattern is a useful pattern (␣$) for matching a space at the
end of a line.
Regular Expressions (Contd…)
► If we want to search for either the string cat or the string dog.
► Since we can’t use the square-brackets to search for ‘cat or dog’ (why
not?) we need a new operator, the disjunction operator, also called the
pipe symbol |.
► The pattern matches /cat | dog/ either the string cat or the string dog.
► Sometimes we need to use this disjunction operator in the midst of a larger
sequence.
► How can we specify both guppy and guppies?
► We cannot simply say /guppy | ies/ , because that would match only the
strings guppy and ies. This is because sequences like take precedence over
the disjunction operator .
► In order to make the disjunction operator apply only to a specific pattern,
we need to use the parenthesis operators (and).
Regular Expressions (Contd…)

► Operator precedence
Finite State Automata

► The regular expression is more than just a convenient metalanguage


for text searching.
► A regular expression is one way of describing a finite-state automaton
(FSA).
► Finite-state automata are the theoretical foundation of a good deal
of the computational work.
► Any regular expression can be implemented as a finite-state
automaton.
Finite State Automata (Contd…)
Finite State Automata (Contd…)
Finite State Transducers

► A transducer maps between one representation and another: a


finite-state transducer, or FST, is a type of finite automaton which
maps between two sets of symbols.
► The FST thus has a more general function than an FSA; where an FSA
defines a formal language by defining a set of strings, an FST defines
a relation between sets of strings.
► Another way of looking at an FST is as a machine that reads one
string and generates another.
Finite State Transducers (Contd…)
► Here’s a summary of this fourfold way of thinking about transducers:
• FST as recognizer: a transducer that takes a pair of strings as input
and outputs; accept if the string-pair is in the string-pair language, and
reject if it is not.
• FST as generator: a machine that outputs pairs of strings of the
language. Thus, the output is a yes or no, and a pair of output strings.
• FST as translator: a machine that reads a string and outputs another
string.
• FST as set relater: a machine that computes relations between sets.
Finite State Transducers (Contd…)

► Let’s now turn to the task of morphological parsing.


► Given the input cats, for instance, we’d like to output cat +N +PI,
telling us that cat is a plural noun.
► In the finite-state morphology paradigm that we use, we represent a
word as a correspondence between a lexical level, which represents a
concatenation of morphemes making up a word,
► and the surface level, which represents the concatenation of letters
making up the actual spelling of the word.
Finite State Transducers (Contd…)
► Transducers are similar to finite state machines. They have states ,
inputs and transition between states. They also have output.
Expand previous FST so that it performs the following
transformations:
Finite State Transducers


WFST
We have FST below, which accept the strings {cat,cap}
WFST
► A weighted finite state transducer takes probabilities into account
► WFST are used in many NLP applications, such as speech recognition.
► WFST includes weight for each transition. this weight for each transition
tells you how frequent this transition is.
N-Gram

► In the fields of computational linguistics and probability, an n-gram is


a contiguous sequence of n items from a given sample of text or
speech. The items can be phonemes, syllables, letters, words or base
pairs according to the application.
► The word N-gram approach to spelling error detection and correction
was proposed by Mays et al. (1991).
► The idea is to generate every possible misspelling of each word in a
sentence either just by typographical modifications (letter insertion,
deletion, substitution), or by including homophones as well, (and
presumably including the correct spelling), and then choosing the
spelling that gives the sentence the highest prior probability.
N-gram Sensitivity to the Training Corpus
► In speech recognition, input may be noisy and this can lead to wrong
speech-to-text conversion.
► N-gram models are used in machine translation to produce more natural
sentences in the target language.
► N-gram models are generally at word-level. It has also been used at
character level to do stemming, i.e. separate the root word from the
suffix.
N-gram Sensitivity to the Training Corpus
The N-gram model is statistical model, like other statistical model its dependent on
training corpus.
There are two implications
the probabilities often encode specific facts about a given training corpus
N-grams do better and better job of modeling the training corpus as we increase the
value of N
N-gram Sensitivity to the Training Corpus
• The longer the context on which we train the model, the more coherent
the sentences. In the unigram sentences, there is no coherent relation
between words or any sentence-final punctuation.

• The bigram sentences have some local word-to-word coherence


(especially if we consider that punctuation counts as a word).

• The trigram and quadrigram sentences are beginning to look a lot like
Shakespeare. Indeed, a careful investigation of the quadrigram sentences
shows that they look a little too much like Shakespeare.
N-gram
• We will use corpora to calculate probabilities
• We will calculate probabilities of

unigram 1 word units {salt, table, lock}


1 character units {t, p, s}

bigram 2 word units {salt and, table and, lock and}


2 character units {th, he, ho}

trigram 3 word units {salt and pepper, table and chairs, lock and
key}
3 character units {the, tho, thp}
Unknown Words: Open Versus Closed Vocabulary Tasks
Closed Vocabulary
Sometimes we have a language task in which we know all the words that can occur,
Closed vocabulary and hence we know the vocabulary size V in advance.
The closed vocabulary assumption is the assumption that we have such a lexicon and that
the test set can only contain words from this lexicon. The closed vocabulary task thus
assumes there are no unknown words.
The number of unseen words grows constantly, so we can’t possibly know in advance
exactly how many there are, and we’d like our model to do something reasonable with
them. We call these oov unseen events unknown words, or out of vocabulary (OOV)
words.
The percentage of OOV words that appear in the test set is called the OOV rate.
Unknown Words: Open Versus Closed Vocabulary Tasks
Open Vocabulary
An open vocabulary system is one in which we model these potential unknown words
in the test set by adding a pseudo-word called <UNK>.
We can train the probabilities of the unknown word model <UNK> as follows:
1. Choose a vocabulary (word list) that is fixed in advance.
2. Convert in the training set any word that is not in this set (any OOV word) to the
unknown word token <UNK> in a text normalization step.
3. Estimate the probabilities for <UNK> from its counts just like any other regular
word in the training set.

An alternative that doesn’t require choosing a vocabulary is to replace the first


occurrence of every word type in the training data by <UNK>.
Evaluating n-Grams: Perplexity
❏ The best way to evaluate the performance of a language model is to embed it in an
application and measure the total performance of the application.
❏ Such end-to-end evaluation is called extrinsic evaluation, and also sometimes called in
vivo evaluation
❏ Extrinsic evaluation is the only way to know if a particular improvement in a component is
really going to help the task at hand.
❏ Thus, for speech recognition, we can compare the performance of two language models by
running the speech recognizer twice, once with each language model, and seeing which
gives the more accurate transcription.
Evaluating n-Grams: Perplexity
❏ End-to-end evaluation is often very expensive; evaluating a large speech recognition test set,
for example, takes hours or even days.
❏ Thus, we would like a metric that can be used to quickly evaluate potential improvements in a
language model.
❏ An intrinsic evaluation metric is one that measures the quality of a model independent of any
application. Perplexity is the most common intrinsic evaluation metric for N-gram language
models.
❏ While an (intrinsic) improvement in perplexity does not guarantee an (extrinsic) improvement
in speech recognition performance (or any other end-to-end metric), it often correlates with
such improvements.
❏ Thus, it is commonly used as a quick check on an algorithm, and an improvement in perplexity
can then be confirmed by an end-to-end evaluation.
Evaluating n-Grams: Perplexity
The intuition of perplexity is that given two probabilistic models, the better model is the one
that has a tighter fit to the test data or that better predicts the details of the test data.
We can measure better prediction by looking at the probability the model assigns to the test
data; the better model will assign a higher probability to the test data.
the perplexity (sometimes called PP for short) of a language model on a test set is a function of
the probability that the language model assigns to that test set.
For a test set W = w1,W2 -..wn, the perplexity is the probability of the test set, normalized by
the number of words:

Note : Inverse in Eq. The higher the conditional


probability of the word sequence, the lower the
perplexity.
Thus, minimizing perplexity is equivalent to
maximizing the test set probability according to
the language model.
Smoothing
There is a major problem with the maximum likelihood estimation process we
have seen for training the parameters of an n-gram model.
This is the problem of sparse data data caused by the fact that our maximum
likelihood estimate was based on a particular set of training data.
For any n-gram that occurred a sufficient number of times, we might have a
good estimate of its probability.
But because any corpus is limited, some perfectly acceptable English word
sequences are bound to be missing from it.
This missing data means that the n-gram matrix for any given training corpus is
bound to have a very large number of cases of putative “zero probability
n-grams” that should really have some non-zero probability.
Furthermore, the MLE method also produces poor estimates when the counts
are non-zero but still small.
Smoothing
❏ We need a method which can help get better estimates for these zero or low frequency counts.
Zero counts turn out to cause another huge problem.
❏ The perplexity metric defined above requires that we compute the probability of each test
sentence. But if a test sentence has an n-gram that never appeared in the training set, the
maximum likelihood estimate of the probability for this N-gram, and hence for the whole test
sentence, will be zero!
❏ This means that in order to evaluate our language models, we need to modify the MLE method
to assign some non-zero probability to any n-gram, even one that was never observed in
training.
❏ For these reasons, we'll want to modify the maximum likelihood estimates for computing
n-gram probabilities, focusing on the n-gram events that we incorrectly assumed had zero
probability.
❏ We use the term smoothing for such modifications that address the poor estimates that are due
to variability in small data sets.
❏ The name comes from the fact that (looking ahead a bit) we will be shaving a little bit of
probability mass from the higher counts and piling it instead on the zero counts, making the
distribution a little less jagged.
Smoothing Techniques
Smoothing techniques
• Laplace Smoothing
• Good-Turing Discounting
• Backoff
Laplace Smoothing
● One simple way to do smoothing might be just to take our matrix of bigram
counts, before we normalize them into probabilities, and add 1 to all the
counts. This algorithm is called Laplace smoothing, or Laplace’s Law.
● For Unigram
○ Add 1 to every word (type) count to ge an adjusted count *c
○ Normalize by N(#tokens) + V (#types)
○ Original unigram probability P(Wi) = Ci/N
○ New unigram probability P(Wi) = C+1/N+V

You might also like