0% found this document useful (0 votes)
103 views34 pages

Module Ii

Module II covers word-level and syntactic analysis in Natural Language Processing (NLP), focusing on the use of regular expressions for text search and the properties of regular sets. It explains finite state automata, including deterministic and non-deterministic types, and discusses morphological parsing and syntactic analysis, detailing the requirements for building a morphological parser and the concept of parsing. The module emphasizes the importance of grammar in defining syntactic structures and the relationship between finite automata, regular grammars, and regular expressions.

Uploaded by

PRABHAKAR K
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
103 views34 pages

Module Ii

Module II covers word-level and syntactic analysis in Natural Language Processing (NLP), focusing on the use of regular expressions for text search and the properties of regular sets. It explains finite state automata, including deterministic and non-deterministic types, and discusses morphological parsing and syntactic analysis, detailing the requirements for building a morphological parser and the concept of parsing. The module emphasizes the importance of grammar in defining syntactic structures and the relationship between finite automata, regular grammars, and regular expressions.

Uploaded by

PRABHAKAR K
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd

MODULE-II

WORD LEVEL AND SYNTACTIC ANALYSIS


A Word-level Analysis Task is a Linguistic Analysis Task that requires inferences about the
Word Mentions in a Linguistic Expression. AKA: Word-level Analysis, Bag-of-Words
Analysis. Context: Input: A Linguistic Expression.
Syntactic analysis, also referred to as syntax analysis or parsing, is the process of analyzing
natural language with the rules of a formal grammar. Grammatical rules are applied to categories
and groups of words, not individual words.

NLP - WORD LEVEL ANALYSIS


In this Module, we will understand world level analysis in Natural Language

Processing. Regular Expressions

A regular expression (RE) is a language for specifying text search strings. RE helps us to match
or find other strings or sets of strings, using a specialized syntax held in a pattern. Regular
expressions are used to search texts in UNIX as well as in MS WORD in identical way. We have
various search engines using a number of RE features.

Properties of Regular Expressions

Followings are some of the important properties of RE −

∙ American Mathematician Stephen Cole Kleene formalized the Regular Expression


language.
∙ RE is a formula in a special language, which can be used for specifying simple classes of
strings, a sequence of symbols. In other words, we can say that RE is an algebraic
notation for characterizing a set of strings.
∙ Regular expression requires two things, one is the pattern that we wish to search and other
is a corpus of text from which we need to search.

Mathematically, A Regular Expression can be defined as follows −

∙ε is a Regular Expression, which indicates that the language is having an empty string.
∙ φ is a Regular Expression which denotes that it is an empty language.

∙ If X and Y are Regular Expressions, then


o X, Y
o X.Y(Concatenation of XY)
o X+Y (Union of X and Y)
o X*, Y* (Kleen Closure of X and Y)
are also regular expressions.

∙ If
a string is derived from above rules then that would also be a regular expression.
Examples of Regular Expressions

The following table shows a few examples of Regular Expressions −

Regular Expressions Regular Set


(0 + 10*) {0, 1, 10, 100, 1000, 10000, … }
(0*10*) {1, 01, 10, 010, 0010, …}
(0 + ε)(1 + ε) {ε, 0, 1, 01}

(a+b)* It would be set of strings of a’s and b’s of any length which also includes the null string
i.e. {ε, a, b, aa , ab , bb , ba, aaa…….}

(a+b)*abb It would be set of strings of a’s and b’s ending with the string abb i.e. {abb, aabb,
babb, aaabb, ababb, …………..}

(11)* It would be set consisting of even number of 1’s which also includes an empty string i.e.
{ε, 11, 1111, 111111, ……….}
It would be set of strings consisting of even number of a’s followed by
(aa)*(bb)*b aaaab, aaaabbb, …………..}
It would be string of a’s and b’s of even length
that can be obtained by concatenating any
(aa + ab + ba + bb)* combination of the strings aa, ab, ba and bb
odd number of b’s i.e. {b, aab, aabbb, aabbbbb, including null i.e. {aa, ab, ba, bb, aaab, aaba,
…………..}

Regular Sets & Their Properties

It may be defined as the set that represents the value of the regular expression and consists
specific properties.

Properties of regular sets

∙ Ifwe do the union of two regular sets then the resulting set would also be regula. ∙ If we
do the intersection of two regular sets then the resulting set would also be regular. ∙ If we do
the complement of regular sets, then the resulting set would also be regular. ∙ If we do the
difference of two regular sets, then the resulting set would also be regular. ∙ If we do the
reversal of regular sets, then the resulting set would also be regular. ∙ If we take the closure
of regular sets, then the resulting set would also be regular. ∙ If we do the concatenation of
two regular sets, then the resulting set would also be regular.
Finite State Automata

The term automata, derived from the Greek word "αὐτόματα" meaning "self-acting", is the plural
of automaton which may be defined as an abstract self-propelled computing device that follows a
predetermined sequence of operations automatically.
An automaton having a finite number of states is called a Finite Automaton (FA) or Finite State
automata (FSA).

Mathematically, an automaton can be represented by a 5-tuple (Q, Σ, δ, q0, F), where −

∙Q is a finite set of states.


∙Σ is a finite set of symbols, called the alphabet of the automaton.
∙δ is the transition function
∙ q0 is the initial state from where any input is processed (q0 ∈ Q).
∙F is a set of final state/states of Q (F ⊆ Q).

Relation between Finite Automata, Regular Grammars and Regular Expressions

Following points will give us a clear view about the relationship between
finite automata, regular grammars and regular expressions −

∙ As we know that finite state automata are the theoretical foundation of computational work
and regular expressions is one way of describing them.
∙ We can say that any regular expression can be implemented as FSA and any FSA can be
described with a regular expression.
∙ On the other hand, regular expression is a way to characterize a kind of language called
regular language. Hence, we can say that regular language can be described with the help
of both FSA and regular expression.
∙ Regular grammar, a formal grammar that can be right-regular or left-regular, is another
way to characterize regular language.

Following diagram shows that finite automata, regular expressions and regular grammars are the
equivalent ways of describing regular languages.
Types of Finite State Automation (FSA)

Finite state automation is of two types. Let us see what the types are.
Deterministic Finite automation (DFA)

It may be defined as the type of finite automation wherein, for every input symbol we can
determine the state to which the machine will move. It has a finite number of states that is why
the machine is called Deterministic Finite Automaton (DFA).

Mathematically, a DFA can be represented by a 5-tuple (Q, Σ, δ, q0, F), where −

∙Q is a finite set of states.


∙Σ is a finite set of symbols, called the alphabet of the automaton.
∙δ is the transition function where δ: Q × Σ → Q .
∙ q0 is the initial state from where any input is processed (q0 ∈ Q).
∙F is a set of final state/states of Q (F ⊆ Q).

Whereas graphically, a DFA can be represented by diagraphs called state


diagrams where −

∙ The states are represented by vertices.


∙ The transitions are shown by labeled arcs.
∙ The initial state is represented by an empty incoming arc.
∙ The final state is represented by double circle.

Example of DFA

Suppose a DFA be
∙Q = {a, b, c},
∙Σ = {0, 1},
∙ q0 = {a},
∙F = {c},
∙ Transition function δ is shown in the table as follows −

Current State Next State for Input 0 Next State for Input 1
AaB
BbA
CcC

The graphical representation of this DFA would be as follows −

Non-deterministic Finite Automation (NDFA)


It may be defined as the type of finite automation where for every input symbol we cannot
determine the state to which the machine will move i.e. the machine can move to any
combination of the states. It has a finite number of states that is why the machine is called Non-
deterministic Finite Automation (NDFA).

Mathematically, NDFA can be represented by a 5-tuple (Q, Σ, δ, q0, F), where −

∙Q is a finite set of states.


∙Σ is a finite set of symbols, called the alphabet of the automaton.
∙δ :-is the transition function where δ: Q × Σ → 2 Q.
∙ q0 :-is the initial state from where any input is processed (q0 ∈ Q).
∙F :-is a set of final state/states of Q (F ⊆ Q).

Whereas graphically (same as DFA), a NDFA can be represented by diagraphs


called state diagrams where −
∙ The states are represented by vertices.
∙ The transitions are shown by labeled arcs.
∙ The initial state is represented by an empty incoming arc.
∙ The final state is represented by double circle.

Example of NDFA

Suppose a NDFA be

∙Q = {a, b, c},
∙Σ = {0, 1},
∙ q0 = {a},
∙F = {c},
∙ Transitionfunction δ is shown in the table as follows −
Current State Next State for Input 0 Next State for Input 1
A a, b B
B C a, c
C b, c C

The graphical representation of this NDFA would be as follows −

Morphological Parsing
The term morphological parsing is related to the parsing of morphemes. We can define
morphological parsing as the problem of recognizing that a word breaks down into smaller
meaningful units called morphemes producing some sort of linguistic structure for it. For
example, we can break the word foxes into two, fox and -es. We can see that the word foxes, is
made up of two morphemes, one is fox and other is -es.

In other sense, we can say that morphology is the study of −


∙ The formation of words.
∙ The origin of the words.
∙ Grammatical forms of the words.
∙ Use of prefixes and suffixes in the formation of words.
∙ How parts-of-speech (PoS) of a language are formed.

Types of Morphemes

Morphemes, the smallest meaning-bearing units, can be divided into two


types −

∙ Stems

∙ Word Order

Stems
It is the core meaningful unit of a word. We can also say that it is the root of the word. For
example, in the word foxes, the stem is fox.

∙ Affixes− As the name suggests, they add some additional meaning and
grammatical functions to the words. For example, in the word foxes,
the affix is − es.

Further, affixes can also be divided into following four types −

o Prefixes − As the name suggests, prefixes precede the stem. For


example, in the word unbuckle, un is the prefix.
o Suffixes − As the name suggests, suffixes follow the stem. For
example, in the word cats, -s is the suffix.
o Infixes − As the name suggests, infixes are inserted inside the
stem. For example, the word cupful, can be pluralized as cupsful
by using -s as the infix.
o Circumfixes − They precede and follow the stem. There are very
less examples of circumfixes in English language. A very
common example is ‘A-ing’ where we can use -A precede and -
ing follows the stem.

Word Order

The order of the words would be decided by morphological parsing. Let us


now see the requirements for building a morphological parser −

Lexicon
The very first requirement for building a morphological parser is lexicon, which includes the list
of stems and affixes along with the basic information about them. For example, the information
like whether the stem is Noun stem or Verb stem, etc.

Morphotactics

It is basically the model of morpheme ordering. In other sense, the model explaining which
classes of morphemes can follow other classes of morphemes inside a word. For example, the
morphotactic fact is that the English plural morpheme always follows the noun rather than
preceding it.

Orthographic rules

These spelling rules are used to model the changes occurring in a word. For example, the rule of
converting y to ie in word like city+s = cities not citys.

Natural Language Processing - Syntactic Analysis

Syntactic analysis or parsing or syntax analysis is the third phase of NLP. The purpose of this
phase is to draw exact meaning, or you can say dictionary meaning from the text. Syntax analysis
checks the text for meaningfulness comparing to the rules of formal grammar. For example, the
sentence like “hot ice-cream” would be rejected by semantic analyzer.

In this sense, syntactic analysis or parsing may be defined as the process of analyzing the strings
of symbols in natural language conforming to the rules of formal grammar. The origin of the
word ‘parsing’ is from Latin word ‘pars’ which means ‘part’.

Concept of Parser

It is used to implement the task of parsing. It may be defined as the software component designed
for taking input data (text) and giving structural representation of the input after checking for
correct syntax as per formal grammar. It also builds a data structure generally in the form of parse
tree or abstract syntax tree or other hierarchical structure.

The main roles of the parse include −


∙ To report any syntax error.
∙ To recover from commonly occurring error so that the processing of the remainder of
program can be continued.
∙ To create parse tree.

∙ To create symbol table.


∙ To produce intermediate representations (IR).

Types of Parsing

Derivation divides parsing into the followings two types −

∙ Top-down Parsing
∙ Bottom-up Parsing

Top-down Parsing

In this kind of parsing, the parser starts constructing the parse tree from the start symbol and then
tries to transform the start symbol to the input. The most common form of topdown parsing uses
recursive procedure to process the input. The main disadvantage of recursive descent parsing is
backtracking.

Bottom-up Parsing

In this kind of parsing, the parser starts with the input symbol and tries to construct the parser tree
up to the start symbol.

Concept of Derivation

In order to get the input string, we need a sequence of production rules. Derivation is a set of
production rules. During parsing, we need to decide the non-terminal, which is to be replaced
along with deciding the production rule with the help of which the non-terminal will be replaced.

Types of Derivation

In this section, we will learn about the two types of derivations, which can be
used to decide which non-terminal to be replaced with production rule −

Left-most Derivation

In the left-most derivation, the sentential form of an input is scanned and replaced from the left to
the right. The sentential form in this case is called the left-sentential form.

Right-most Derivation

In the left-most derivation, the sentential form of an input is scanned and replaced from right to
left. The sentential form in this case is called the right-sentential form.

Concept of Parse Tree

It may be defined as the graphical depiction of a derivation. The start symbol of derivation serves
as the root of the parse tree. In every parse tree, the leaf nodes are terminals and interior nodes are
non-terminals. A property of parse tree is that in-order traversal will produce the original input
string.

Concept of Grammar

Grammar is very essential and important to describe the syntactic structure of well-formed
programs. In the literary sense, they denote syntactical rules for conversation in natural
languages. Linguistics have attempted to define grammars since the inception of natural
languages like English, Hindi, etc.

The theory of formal languages is also applicable in the fields of Computer Science mainly in
programming languages and data structure. For example, in ‘C’ language, the precise grammar
rules state how functions are made from lists and statements.
A mathematical model of grammar was given by Noam Chomsky in 1956, which is effective for
writing computer languages.

Mathematically, a grammar G can be formally written as a 4-tuple (N, T, S, P)


where −

∙ N or VN = set of non-terminal symbols, i.e., variables.


∙T or ∑ = set of terminal symbols.
∙S = Start symbol where S ∈ N
∙Pdenotes the Production rules for Terminals as well as Non-terminals. It has the form
α → β, where α and β are strings on VN ∪ ∑ and least one symbol of α belongs to
VN

Phrase Structure or Constituency Grammar

Phrase structure grammar, introduced by Noam Chomsky, is based on the constituency relation.
That is why it is also called constituency grammar. It is opposite to dependency grammar.

Example

Before giving an example of constituency grammar, we need to know the fundamental points
about constituency grammar and constituency relation.

∙ Allthe related frameworks view the sentence structure in terms of constituency relation. ∙
The constituency relation is derived from the subject-predicate division of Latin as well as
Greek grammar.
∙ The basic clause structure is understood in terms of noun phrase NP and verb phrase VP. We

can write the sentence “This tree is illustrating the constituency relation” as follows −

Dependency Grammar
It is opposite to the constituency grammar and based on dependency relation. It was introduced
by Lucien Tesniere. Dependency grammar (DG) is opposite to the constituency grammar
because it lacks phrasal nodes.

Example

Before giving an example of Dependency grammar, we need to know the fundamental points
about Dependency grammar and Dependency relation.

∙ InDG, the linguistic units, i.e., words are connected to each other by directed links.
∙ The verb becomes the center of the clause structure.

∙ Every other syntactic units are connected to the verb in terms of directed link. These
syntactic units are called dependencies.

We can write the sentence “This tree is illustrating the dependency relation” as follows;
Parse tree that uses Constituency grammar is called constituency-based parse tree; and the parse
trees that uses dependency grammar is called dependency-based parse tree.

Context Free Grammar

Context free grammar, also called CFG, is a notation for describing languages
and a superset of Regular grammar. It can be seen in the following diagram

Definition of CFG

CFG consists of finite set of grammar rules with the following

four components − Set of Non-terminals


It is denoted by V. The non-terminals are syntactic variables that denote the sets of strings, which
further help defining the language, generated by the grammar.

Set of Terminals

It is also called tokens and defined by Σ. Strings are formed with the basic symbols of

terminals. Set of Productions

It is denoted by P. The set defines how the terminals and non-terminals can be combined. Every
production(P) consists of non-terminals, an arrow, and terminals (the sequence of terminals). Non
terminals are called the left side of the production and terminals are called the right side of the
production.

Start Symbol

The production begins from the start symbol. It is denoted by symbol S. Non-terminal symbol is
always designated as start symbol.

SPELLING ERROR DETECTION AND CORRECTION:


Spelling Correction is a very important task in Natural Language Processing. It is used in various
tasks like search engines, sentiment analysis, text summarization, etc. As the name suggests, we
try to detect and correct spelling errors in spelling correction. In real-world NLP tasks, we often
deal with data having typos, and their spelling correction comes to the rescue to improve model
performance. For example, if we want to search apple and type “aple,” we will wish that the
search engine suggests “apple” instead of giving no results.

Different approaches to spelling correction

Norvig’s approach

In this approach, we consider all possible words that can be formed using edits - insert, delete,
replace, split and transpose using the brute-force method.

For example, for word abc, all the possible words after edit are - ab ac bc bac cba acb a_bc ab_c
aabc abbc acbc adbc aebc etc.

These words are then added to a list. We again repeat the same thing for the words still having
errors. We then estimate all the words using the unigram language model. We pre-calculate the
frequency of every word from a large corpus. The word having the highest frequency is chosen.

∙ Adding more context

We can use n-gram models with n>1, instead of the unigram language model, providing
better accuracy. But, at the same time, the model becomes heavy due to a higher number
of calculations.

∙ Increasing speed: SymSpell Approach (Symmetric Delete Spelling Correction) Here


we pre-calculate all delete typos, unlike extracting all possible edits every time we
encounter any error. This will cost additional memory consumption.

∙ Improving memory consumption

We require large datasets ( at least some GBs) to obtain good accuracy. If we train the n
gram language models on small datasets like a few MBs, it leads to high memory
consumption. The symspell index and the language models occupy half-half of the size.
This increased usage is because we store frequencies instead of simple text. We can
compress our n-gram model by using paper. A perfect hash is a hash that can never have
collisions. We can use a perfect hash to store the counts of n-grams. As we don’t have any
collisions, we store the count frequencies instead of the original n-grams. We need to
make sure that the hash of unknown words does not match with those of known words, so
for that, we use a bloom filter with known words. A bloom filter utilizes space efficiently
to check whether an element belongs to a set or not. We can reduce the memory usage by
using a nonlinear quantization to pack the 32-bit long count frequencies into 16-bit values.

∙ Improving accuracy

We can use machine learning algorithms to improve accuracy further. We can start with a
machine learning classifier that decides whether a given the word has an error or not. Then we
can use a regression model to rank the words. This technique mimics the role of smoothing in
language models as it uses all the grams as input, and then a classifier decides the rank,
meaning how powerful each gram is. For word ranking, we will train a catboost(gradient
boosted decision trees) ranking model with multiple features like word frequency, word
length, edit distance between the source word and modified word(i.e., number of changes
required to convert the source to modified), n-gram language model prediction, the
frequencies of neighboring words, etc.

∙ We can gather a large corpus with errors and their corresponding corrected text and then
train a model to improve accuracy.
We can also have dynamic learning options. We can learn on the way while making
corrections. We can also have two pass corrections by learning some statistics in the first
pass and then making the actual correction in the second pass.

A spell checker is a tool used for analyzing and validating spelling mistakes in the text.
Recently, the role of a spell checker has diversified and it is also used to suggest possible
corrections to the detected spelling mistakes, auto-correct functionality, and even sentence
correction as a whole based on context and also grammar suggestions.
What are We Building?

Most spoken and written languages in the world are diverse, contain a lot of nuisances and are
grammatically rich, hence are very complex to handle.

∙ Grammar especially is vital for effective communication and transmission of


information and is used differently by different speakers in written text.
∙ To learn the rules of the different languages and styles of the different writers is a very
challenging task for algorithms and machines.
∙ Fortunately, these days with the utility of huge computing power and an arsenal of natural
language processing (NLP) techniques provides a solution to this problem.

Spell checkers ingrained into the text processors comes to aid with error detection and correction
as basic tasks necessary for any text processing software or tool.

Applications of Spell Checker

∙ We can use an arsenal of NLP techniques to detect wrongly spelled words and then
provide possible correct word suggestions and the probability of occurrence of each word
in the corpus.
∙ We can use spell checker algorithms to implement the autocorrect functionality that can
be used every day on our laptops, cell phones, computers, etc.
o One example of autocorrect is, if you type the word rnunnuing, chances are very
high that you meant to write running.

In this article, we aim to create a basic auto-correct spell-checker tool and explore the intuition
and workings of the algorithm.

Description of Problem Statement

∙ Utility of spelling correction: In the modern internet era, high-quality content is an


important asset in dispersing information and keeping people abreast of important things
happening around them.
o The quality of the content is mainly determined by the typos, misspelled words,
and grammatical mistakes made by the authors of the text.
o Hence we need to add an additional layer if we are to fabricate the content error
free and make things professional both from the viewpoints of the author and the
consumer.
∙ Examples of the need for spelling correction: Misspelt searches result in difficulty in
finding relevant information.
o Some basic exampleslike when users are searching for restaurants on Yelp, finding
friends on Facebook, shopping on multiple websites, and finding songs on Spotify. o
It is also relevant and necessary for tasks like machine learning / NLP as a
preprocessing step to tasks like sentiment analysis, intent detection, speech
recognition disambiguation, chatbots etc.
∙ Classification of misspelled words: Misspelt words are classified into two groups, non
word errors where the resulting token does not exist in the target language and real-
word errors where the resulting token is a valid word of the target language but different
from the intended word.
o The non-word errors are either not valid or not present in the lexicon.
o Real-word errors are usually more difficult to detect and correct than non-word
errors.

Error classification for Spell Checkers

∙ The different sources of noise in the human-written text can be categorized as


typographical and cognitive errors.
o Typographic errors also known as typos, consists of operations like: Insertion of
a character, deletion of a character, replacement of a character, and transposition
which is swapping two neighboring characters.
▪ Common spelling errors - These are the ones that have an error in the
spelling by 1-2 characters and which occur commonly. Example - viruz
( virus ) , treatmeit ( treatment )
▪ Words with a character in between them - These are the ones with a
character exists between two tokens/strings. Example - sanitized/dry
( sanitized dry ) , handuclean ( hand clean )
▪ Words with recurring characters - Words with characters that occur
repeatedly. Usually used when trying to express some sort of emotion in
text. Example - feeeevvvveeerrrrr ( fever ), facemaskk ( facemask )
o Cognitive errors entail operations like Phonetic errors where wrong word choice
is due to the similar pronunciation of words (e.g. confusing there with their) and
Grammatical errors where wrong word choice or wrong sentence structure due
to the lack of knowledge of grammatical rules (e.g. confusing his with her).

Pre-requisites

Context-free vs. context-dependent error correction

Approaches to spelling correction can be divided into context-free error correction algorithms
(also called isolated-word error correction algorithms) and context-dependent error correction
algorithms.

∙ Context-free error correction algorithms: Address the problem of detecting and


correcting misspelled words without considering the context a word appears. o Error
detection is done by a lookup in a dictionary of correctly spelled words, or by analyzing
whether a word’s character n-grams are all legal n-grams of the language at hand.
o Candidate generation and ranking is done based on the minimum edit distance
to the misspelled word, similarity keys, character n-grams, confusion probabilities,
word frequencies, or rule-based.
∙ Context dependent error correction algorithms: Context-free error correction algorithms
can only detect and correct a fraction of all spelling errors and will be unable to detect
real-word errors.
o Context is also needed to dissolve ambiguities. For example, consider the
misspelled word yello.
o Is the intended word hello, yellow, or yell? The answer depends on the context:
different corrections are plausible for the sequences Bananas are yello., yello
world and Don’t yello at me..

Some common algorithms for spell-checking include:

∙ We can build a basic spell checker simply by using a dictionary of words and then use
minimum edit distance algorithm which can be very simple yet powerful. ∙ We can also
build a spell checker that uses a combination of techniques like edit distance, phonetics,
and fuzzy search which gives superior performance with a balance of accuracy and speed.
∙ We can also train a sequence to sequence-based deep learning algorithms from a corpus
of data that can handle different kinds of errors directly.
∙ Off late, large language models proved powerful in utilizing the context of the word also
well and correcting all the errors in a sentence directly. These are called contextual spell
checkers.
∙ The only drawback of deep learning based models is that they are slow and there is
difficulty in implementing them as they have huge overhead during in runtime.

How Are We Going to Build This?

Traditional spell-checking algorithms typically divide the spell-checking task into three main
subtasks:

∙ Error detection: This is for distinguishing whether or not a word is misspelled. ∙


Candidate generation: This is for generating a set of candidate words that are similar to the
misspelled word.
∙ Candidate ranking: This is the final step for ranking the candidate words according to the
probability that they are the intended word for the misspelled word, or determining only
the most likely candidate word.

Minimum Edit Distance for Spell Checker

Let us implement a spell checker algorithm using the edit distance method which takes a word
as misspelled by default and generates candidates by making edits at different lengths and it
compares against the dictionary and assigns probabilities based on their occurrences in the
corpus.
∙ Edit Distance (also called Levenshtein Distance): Edit Distance is a measure of
similarity between two strings referred to as the source string and the target string. o The
distance between the source string and the target string is the minimum number of
edit operations (deletions, insertions, or substitutions) required to transform the source
into the target. The lower the distance, the more similar the two strings.
o Some other common applications of the Edit Distance algorithm are spell
checking, plagiarism detection, and translation memory systems.

Let us see an example of edit distance between the words ellu and hello:

∙ We simply need to count the total edits that need to go from source string ellu to correct
string hello.
∙ We can see from the illustration that we need one insertion, one substitution, and one
deletion, so in total 3.

Minimum Edit Distance Algorithm (MED)

∙ The algorithm checks if the word is in the dictionary, and exits further processing if it is
present in the dictionary.
∙ It then generates all the words one Levenshtein distance away. If any of these are in
dictionary, then it outputs these words.
∙ A distance matrix is created between the misspelled word and all possible permutations
(one Levenshtein) of the word.
o If we take a word and for basic implementation, we can use brute force to find all
possible edits, such as delete, insert, transpose, replace, and split.
o Example: For the word abc, possible candidates will be: ab ac bc bac cba acb a_bc
ab_c aabc abbc acbc adbc aebc etc.
o Each and every word is added to a candidate list.
∙ We repeat this procedure for every word for a second time to get candidates with bigger
edit distance (for cases with two errors).\
o The algorithm then generates all the words two Levenshtein distance away. If
any of these are in the dictionary, then it outputs these words.
o Each candidate is estimated with a unigram language model (or single-word
corpus).
∙ If there is more than one candidate in each step, the algorithm takes the ones with the
highest score (score can be computed based on the corpus)
o For each vocabulary word, frequencies are pre-calculated, based on the big text
corpora we have collected. The candidate word with the highest frequency is taken
as an answer.
∙ Dynamic Programming (DP): Instead of brute force we have taken as an example, we can
use DP for calculating the least possible changes needed to correct the misspelled words,
and suggesting the most appropriate words as the corrections.
Steps for building the Minimum Edit Distance algorithm

∙ Step 1: Preparation of the word database.


o The database works great if we can use get the search words and their occurrence
based on the actual usage or domain of application.
o If there is no such database, we can instead use a large set of text from any
available corpus and count the occurrence/popularity of each word from that. ∙ Step 2:
Word checking - Finding words that are similar to the one checked. o Similar means that
the edit distance is low (typically 0-1 or 0-2).
o The edit distance is the minimum number of inserts/deletes/changes/swaps needed
to transform one word to another.
o Choose the most popular word from the previous step and suggest it as a
correction (if other than the word itself).

Final Output

Let me show you the final output which will be building.

Input Text: "I came home so that as I would rather participate in the function the next dys."

1
{dys: 'day'}
I came home so that as I would rather participate in the function the next

day Building a Spell Checker with NLP

Libraries and Imports

Jupyter Notebook can be installed from the popular setup of Anaconda or Jupyter directly.
Reference instructions for setting up from source can be followed from here. Once installed we
can spin up an instance of the jupyter notebook server and open a python notebook instance and
run the following code for setting up basic libraries and functionalities.

# Import the libraries for processing data and other utilities

import os, sys, gc, warnings


import logging, math, re, heapq
import pandas as pd
import numpy as np
import seaborn as sns
import [Link] as plt
from [Link] import display, HTML
from [Link] import InteractiveShell
from collections import Counter
from [Link] import word_tokenize
# These settings help in proper formatting and display of the output of code we
run [Link]("ignore")
pd.set_option('max_rows', None)
pd.set_option('max_columns', None)
InteractiveShell.ast_node_interactivity = "all"
display(HTML(data="""<style> div#notebook-container{width:95%;}</style>"""))

# This is to download stop words for cleaning the tweets


import nltk
[Link]('stopwords')

If a particular library is not found, it can simply be downloaded from within the notebook by
running a system command pip install sample_library command. For example to install seaborn,
run the command from a cell within jupyter notebook like:

! pip install seaborn

Dataset

∙ There is no specific dedicated dataset for building and evaluating spell checkers.
Practitioners usually use data from various sources like Wikipedia, news articles, short
stories, newspapers, online websites, and others.
∙ The dataset we use is a Kaggle dataset based on the complete works of Shakespeare and
corpus contains text collected from various writings. The dataset can be downloaded
from here The Complete Works of William Shakespeare. It can be downloaded and kept
in the same folder from which the code we write can be run.
∙ Change the location of the file to where it is present in your system here

input_file_location = 'spell_checker/[Link]'
1
From fairest creatures we desire increase,
That thereby beauty's rose might never die,
But as the riper should by time decease,
His tender heir might bear his memory:
But thou contracted to thine own bright eyes,
Feed'st thy light's flame with self-substantial fuel,
Making a famine where abundance lies,
Thy self thy foe, to thy sweet self too cruel:
Thou that art now the world's fresh ornament,
And only herald to the gaudy spring,
Within thine own bud buriest thy content,
And tender churl mak'st waste in niggarding:
Pity the world, or else this glutton be,
To eat the world's due, by the grave and thee.
2
When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a tattered weed of small worth held:
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days;

Data Preprocessing

∙ Real-world data are often incomplete, inconsistent, inaccurate, and lack specifically
required trend. Therefore, data preprocessing is a primary and most significant step in
natural language processing (NLP). It is a crucial process as it directly affects the success
rate of the model. The typical steps involved in data preprocessing are:
∙ Lexical Processing tasks like tokenization, stop word removal, stemming, lemmatization,
TF-IDF representation
∙ Syntactic Processing like POS (Part of Speech) Tagging and NER (Named Entity
Recognition)

# We will preprocess the data by reading the file from the default directory and then prepare a
list of words in the
# entire corpus, we will also lowercase the words while processing them

# Initiating the word list here


word_list = []
# read the file and append words one by one
with open(file_name, 'r') as file:
file_read = [Link]()
pre_process_file = [Link](r"\w+", file_read)
word_list = pre_process_file
for word in word_list:
word_list.append([Link]())

# This will be our new vocabulary


vocab = set(word_list)

Now we will take the set of words representing the corpus prepared from earlier and prepare a
word count dictionary where key is the word and values are the frequencies of the words.

# Initiating the word_count dictionary and populating it


word_count_dict = {}
word_count_dict = Counter(word_list)
print(f"There are {len(word_count_dict)} key values pairs")
print(f"The count for the word 'thee' is {word_count_dict.get('thee',0)}")
We will now use the word count dictionary where the key is the word and value is its frequency
and prepare a new dictionary where keys are the words and the values are the probability
that a word will occur.

# Initalize the probability dictionary


probs = {}
total_words = sum(word_count_dict.values())

for word, word_count in word_count_dict.items():


word_prob = word_count/total_words
probs[word] = word_prob
print(f"Length of probs is {len(probs)}")

# Let us use both the dictionaries for both word counts and probabilities and display an example
word.
print(f"P('thee') is {probs['thee']:.4f}")
print(word_count_dict['thee'])

Named-entity recognition (NER) NER is a problem that has a goal to locate and classify named
entities mentioned in unstructured text into pre-defined categories such as person names,
organizations, locations, medical codes, time expressions, etc.

∙ With NER, we can scan through large documents and finding people, organizations, and
locations available and then later we needn't run spelling corrections on these entities.

Let us see an example of how to do NER in python using Spacy library.


import spacy
nlp = [Link]('en_core_web_sm')
doc = nlp("I live in New York and work as a data scientist.")
for word in [Link]:
print([Link], word.label_)

∙ The above gives output as New York is GPE where GPE is a Geo-Political Entity. This
way we can incorporate NER in the pipeline for spell checker algorithms.
∙ Spacy and most other libraries also support the many entity types for example like:
PERSON, NORP (nationalities, religious and political groups), FAC (buildings, airports
etc.), ORG (organizations), GPE (countries, cities etc.), etc, for NER.

Building the Spell Checker Model

Let us write a few helper functions here that will do the different operations required for
suggesting the correct words in the spell checker algorithm. Let us look at the brief summary of
each of these functions:

∙ delete_letter - When we give a word, this function will return all the possible strings that
have one character removed.
∙ switch_letter - When we give a word, this function will return all the possible strings that
have two adjacent letters switched.
∙ replace_letter - When we give a word, this function will return all the possible strings that
have one character replaced by another different letter.
∙ insert_letter - When we give a word, this function will return all the possible strings that
have an additional character inserted.
∙ edit_one_letter - This function will give all possible edits that are one edit away from a
word such that the edits consist of the replace, insert, delete, and optionally the switch
operation.
∙ edit_two_letters - We can then generalize the edit_one_letter function to implement to get
two edits on a word. We will have to get all the possible edits on a single word and then,
for each modified word, we would have to modify it again.

def delete_letter(word):
delete_list = []
split_list = []
split_list = [(word[:i], word[i:]) for i in range(len(word))]
delete_list = [L+R[1:] for L, R in split_list]
return delete_list

def switch_letter(word):
switch_list = []
split_list = []
split_list = [(word[:i], word[i:]) for i in range(len(word))]
switch_list = [L + R[1] + R[0] + R[2:] for L, R in split_list if len(R)>=2]
return switch_list
def replace_letter(word):
letters = 'abcdefghijklmnopqrstuvwxyz'
replace_list = []
split_list = []
split_list = [(word[0:i], word[i:]) for i in range(len(word))]
replace_list = [L + letter + (R[1:] if len(R)>1 else '') for L, R in split_list if R for letter in letters]
replace_set = set(replace_list)
replace_list = sorted(list(replace_set))
return replace_list

def insert_letter(word):
letters = 'abcdefghijklmnopqrstuvwxyz'
insert_list = []
split_list = []
split_list = [(word[0:i], word[i:]) for i in range(len(word)+1)]
insert_list = [L + letter + R for L, R in split_list for letter in letters]
return insert_list
def edit_one_letter(word, allow_switches = True):
edit_one_set = set()
edit_one_set.update(delete_letter(word))
if allow_switches: edit_one_set.update(switch_letter(word))
edit_one_set.update(replace_letter(word))
edit_one_set.update(insert_letter(word))
if word in edit_one_set: edit_one_set.remove(word)
return edit_one_set

def edit_two_letter(word, allow_switches = True):


edit_two_set = set()
edit_one = edit_one_letter(word, allow_switches=allow_switches)
for word in edit_one:
if word:
edit_two = edit_one_letter(word, allow_switches=allow_switches)
edit_two_set.update(edit_two)

return edit_two_set

Let us wrap all the helper functions into a single function here.

∙ This function takes the word, the dictionary of probabilities, the vocabulary set of the
overall corpus, and the other input parameters if any, we want to incorporate into a
spelling suggestion function.
∙ This function will output again with the most probable n corrected words and their
probabilities which we need to pick again based on our need.
def get_spelling_suggestions(word, probs, vocab, n=2,):
suggestions = []
top_n_suggestions = []
suggestions = list((word in vocab and word) or
edit_one_letter(word).intersection(vocab) or
edit_two_letter(word).intersection(vocab))
top_n_suggestions = [[s, probs[s]] for s in list(suggestions)]
return top_n_suggestions

Let us run the overall function on a few words here.

my_words = ['dys','furthar','mercuryn','disdaain','tumtultous']
tmp_corrections = []
for word_c in my_words:
tmp_corrections.append(get_spelling_suggestions(word_c, probs, vocab, 3))
for i, word in enumerate(my_words):
print(' ')
print(f'Word - {my_words[i]}')
for j, word_prob in enumerate(tmp_corrections[i]):
print(f"word - {j}: {word_prob[0]}, probability {word_prob[1]:.6f}")

The sample input and output look like below:

∙ When we provide an input word that is spelled wrong, the spell checker function we wrote
provides all the alternate suggestions based on the dictionary and edits.
∙ It also provides a probability for each word from which we can pick top word or the next
ones and incorporate into our workflow.

Output:

Word - dys
word - 0: dye, probability $0.000005$
word - 1: days, probability $0.000225$
word - 2: dy, probability $0.000005$
word - 3: dis, probability $0.000005$
word - 4: des, probability $0.000003$
Word - furthar
word - 0: further, probability $0.000204$
Word - mercuryn
word - 0: mercury, probability $0.000016$
Word - disdaain
word - 0: disdain, probability $0.000042$
Word - tumtultous
word - 0: tumultuous, probability $0.000004$
Contextual models for Spell Check

There are many implementations based on deep learning language models that can efficiently
handle out-of-vocabulary words and also take context into consideration in correcting the
spelling.

Let us look at an example of one such open-source implementation using Spacy.

∙ The contextual spell check package focuses on Out of Vocabulary (OOV) word or non
word error (NWE) correction using the BERT model.
∙ The library performs this Contextual spell correction using BERT (bidirectional
representations). The idea of using BERT was to use the context when correcting NWE.

try:
import contextualSpellCheck
except:
! pip install contextualSpellCheck
import contextualSpellCheck

try:
import spacy
except:
! pip install contextualSpellCheck
import spacy

nlp = [Link]('en_core_web_sm')
contextualSpellCheck.add_to_pipe(nlp)
doc = nlp('I came home so that as I would rather participate in the function the next dys.')

# This shows the number of corrections in the input text.


print(len(doc._.suggestions_spellCheck))

# This shows all the actual corrections that were made with the associated
mapping. print(doc._.suggestions_spellCheck)

This displays the outcome after spelling correction


print(doc._.outcome_spellCheck)

Following is the output of the above code. We can see that it could handle the dys word and
interpret it as the day, whereas our corpus-based edit distance algorithm gave the highest
probability of dye. The only caveat to the deepest learning-based implementations is that they can
be expensive and slow to run.

1
{dys: 'day'}
I came home so that as I would rather participate in the function the next day.
What’s Next?

∙ The algorithm we have written is basic and, to a large extent depends on the corpus and
dictionary we have built, and if a certain word we give as input is not present in the
corpus, it doesn't give any alternate suggestion. This can be solved by giving a large
enough corpus as input.
∙ The model also doesn't give great performance and will be unable to scale beyond
distance two.
∙ The model will also be slow and spell checking a 200 word paragraph on average takes
about 5 to 10 seconds.
∙ We can take the help of many techniques to augment the base algorithm and also use
other techniques and libraries utilizing NLP & Machine learning techniques to implement
the spell checker.
∙ One such technique is dynamic programming which can help in finding the correct word
in the minimum steps possible.
o To find the shortest path to go from the word, a dynamic programming system
helps us in finding the minimum number of edits required to convert a string into
another string.
∙ Fast fuzzy search is another technique that can be used to boost the accuracy of spell
checkers. It also helps in other tools like indexed DB searches, elastic search, NLP, and
speech recognition.

∙ Spell checker is a combination of features and tools that checks for misspellings in a text
and is often needed in text preprocessing & also embedded in software such as word
processors, text clients, dictionaries or search engines, etc.
∙ Different algorithms can be used for handling tasks of spell checker like error
identification, candidate generation, and candidate ranking.
∙ Levenshtein distance between two words is the minimum number of single-character edits
(insertions, deletions, or substitutions) required to change one word into the other. ∙ Dynamic
programming algorithm can be combined with Levenshtein distance to quickly compute the
lowest cost path and suggest possible corrections in spell checkers. ∙ Distance-based methods
with optimizations to handle different scenarios are the best algorithms that can achieve a
good balance of accuracy and speed.

WORD CLASSES AND PART-OF-SPEECH TAGGING IN NLP

Part-of-speech (POS) tagging is an important Natural Language Processing (NLP) concept that
categorizes words in the text corpus with a particular part of speech tag (e.g., Noun, Verb,
Adjective, etc.)

POS tagging could be the very first task in text processing for further downstream tasks in NLP,
like speech recognition, parsing, machine translation, sentiment analysis, etc.
The particular POS tag of a word can be used as a feature by various Machine Learning
algorithms used in Natural Language Processing.

Introduction

Simply put, In Parts of Speech tagging for English words, we are given a text of English words
we need to identify the parts of speech of each word.

Example Sentence : Learn NLP from Scaler

Learn -> ADJECTIVE NLP -> NOUN from -> PREPOSITION Scaler -> NOUN

Although it seems easy, Identifying the part of speech tags is much more complicated than
simply mapping words to their part of speech tags.

Why Difficult ?

Words often have more than one POS tag. Let’s understand this by taking an easy
example. In the below sentences focus on the word “back” :

Sentence POS tag of word “back”


The “back” door ADJECTIVE
On my “back” NOUN
Win the voters “back” ADVERB
Promised to “back” the bill VERB

The relationship of “back” with adjacent and related words in a phrase, sentence, or paragraph is
changing its POS tag.

It is quite possible for a single word to have a different part of speech tag in different sentences
based on different contexts. That is why it is very difficult to have a generic mapping for POS
tags.

If it is difficult, then what approaches do we have?

Before discussing the tagging approaches, let us literate ourselves with the required knowledge
about the words, sentences, and different types of POS tags.

Word Classes

In grammar, a part of speech or part-of-speech (POS) is known as word class or grammatical


category, which is a category of words that have similar grammatical properties.

The English language has four major word classes: Nouns, Verbs, Adjectives, and Adverbs.
Commonly listed English parts of speech are nouns, verbs, adjectives, adverbs, pronouns,
prepositions, conjunction, interjection, numeral, article, and determiners.

These can be further categorized into open and closed classes.

Closed Class

Closed classes are those with a relatively fixed/number of words, and we rarely add new words to
these POS, such as prepositions. Closed class words are generally functional words like of, it,
and, or you, which tend to be very short, occur frequently, and often have structuring uses in
grammar.

Example of closed class

Determiners: a, an, the Pronouns: she, he, I, others Prepositions: on, under, over, near, by, at,
from, to, with

Open Class
Open Classes are mostly content-bearing, i.e., they refer to objects, actions, and features; it's
called open classes since new words are added all the time.

By contrast, nouns and verbs, adjectives, and adverbs belong to open classes; new nouns and
verbs like iPhone or to fax are continually being created or borrowed.

Example of open class

Nouns: computer, board, peace, school Verbs: say, walk, run, belong Adjectives: clean, quick,
rapid, enormous Adverbs: quickly, softly, enormously, cheerfully

Tagset

The problem is (as discussed above) many words belong to more than one word class.

And to do POS tagging, a standard set needs to be chosen. We Could pick very simple/coarse
tagsets such as Noun (NN), Verb (VB), Adjective (JJ), Adverb (RB), etc.

But to make tags more dis-ambiguous, the commonly used set is finer-grained, University of
Pennsylvania’s “UPenn TreeBank tagset”, having a total of 45 tags.

Tag Description Example Tag Description Example CC Coordin. Conjunction


and, but, or SYM Symbol +%, & CD Cardinal number one, two, three TO "to" to
DT Determiner a, the UH Interjection ah, oops EX Existential 'there' there VB
Verb, base form eat FW Foreign word mea culpa VBD Verb, past tense ate
Tag Description Example Tag Description Example IN Preposition/sub-conj of,
in, by VBG Verb, gerund eating JJ Adjective yellow VBN Verb, past participle
eaten JJR Adj., comparative bigger VBP Verb, non-3 sg pres eat JJS Adj.,
superlative wildest VBZ Verb, 3 sg pres eats LS List item marker 1, 2, One WDT
Wh-determiner which, that MD Modal can, should WP Wh-pronoun what, who NN
Noun, sing. or mass llama WP$ Possessive wh- whose NNS Noun, plural llamas
WRB Wh-adverb how, where NNP Proper noun, singular IBM $ Dollar sign $
NNPS Proper noun, plural Carolinas # Pound sign #
PDT Predeterminer all, both "، Left quote ( or ) POS Possessive ending 's "
Right quote ( or ) PRP Personal pronoun I, you, he ( Left parenthesis ([,(,{,<)
PRP$ Possessive pronoun your, one's ) Right parenthesis (],),},>) RB Adverb
quickly, never , Comma
RBR Adverb, comparative faster . Sentence-final punc (. ! ?) RBS Adverb,
superlative fastest : Mid-sentence punc (: ; ...) RP Particle up, off

Parts of Speech Tagging in NLP

Part-of-speech tagging is the process of assigning a part of speech to each word in a text. The
input is a sequence x1,x2,...,xnx1,x2,...,xn of (tokenized) words, and the output is a sequence
y1,y2,...,yny1,y2,...,yn of POS tags, each output yiyi corresponding exactly to one input xixi.

Tagging is a disambiguation task; words are ambiguous i.e. have more than one a possible part of
speech, and the goal is to find the correct tag for the situation.

For example, a book can be a verb (book that flight) or a noun (hand me that book). The goal

of POS tagging is to resolve these ambiguities, choosing the proper tag for the context. POS

tagging Algorithms Accuracy:

The accuracy of existing State of the Art algorithms of part-of-speech tagging is extremely high.
The accuracy can be as high as ~ 97%, which is also about the human performance on this task, at
least for English.
We’ll discuss algorithms/techniques for this task in the upcoming sections, but first, let’s explore
the task. Exactly how hard is it?

Let's consider one of the popular electronic collections of text samples, Brown Corpus. It is a
general language corpus containing 500 samples of English, totaling roughly one million words.

In Brown Corpus :

85-86% words are unambiguous - have only 1 POS tag

but,

14-15% words are ambiguous - have 2 or more POS tags

Particularly ambiguous common words include that, back, down, put, and set.

The word back itself can have 6 different parts of speech (JJ, NN, VBP, VB, RP, RB) depending
on the context.

Nonetheless, many words are easy to disambiguate because their different tags aren’t equally
likely. For example, "a" can be a determiner or the letter "a", but the determiner sense is much
more likely.

This idea suggests a useful baseline, i.e., given an ambiguous word, choose the tag which is most
frequent in the corpus.

This is a key concept in the Frequent Class tagging approach.

Let’s explore some common baseline and more sophisticated POS tagging
techniques. Rule-Based Tagging

Rule-based tagging is the oldest tagging approach where we use contextual information to assign
tags to unknown or ambiguous words.

The rule-based approach uses a dictionary to get possible tags for tagging each word. If the word
has more than one possible tag, then rule-based taggers use hand-written rules to identify the
correct tag.

Since rules are usually built manually, therefore they are also called Knowledge-driven taggers.
We have a limited number of rules, approximately around 1000 for the English language.

One of example of a rule is as follows:

Sample Rule: If an ambiguous word “X” is preceded by a determiner and followed by a noun,
tag it as an adjective;
A nice car: nice is an ADJECTIVE here.

Limitations/Disadvantages of Rule-Based Approach:

∙ High development cost and high time complexity when applying to a large corpus of text ∙
Defining a set of rules manually is an extremely cumbersome process and is not scalable at
all

Stochastic POS Tagging

Stochastic POS Tagger uses probabilistic and statistical information from the corpus of labeled
text (where we know the actual tags of words in the corpus) to assign a POS tag to each word in a
sentence.

This tagger can use techniques like Word frequency measurements and Tag Sequence
Probabilities. It can either use one of these approaches or a combination of both. Let’s discuss
these techniques in detail.

Word Frequency Measurements

The tag encountered most frequently in the corpus is the one assigned to the ambiguous
words(words having 2 or more possible POS tags).

Let’s understand this approach using some example sentences :

Ambiguous Word = “play”

Sentence 1 : I play cricket every day. POS tag of play = VERB

Sentence 2 : I want to perform a play. POS tag of play = NOUN


The word frequency method will now check the most frequently used POS tag for “play”. Let’s
say this frequent POS tag happens to be VERB; then we assign the POS tag of "play” = VERB

The main drawback of this approach is that it can yield invalid sequences of

tags. Tag Sequence Probabilities

In this method, the best tag for a given word is determined by the probability that it occurs with
“n” previous tags.

Simply put, assume we have a new sequence of 4 words, w1 w2 w3 w4w1 w2 w3 w4 And we


need to identify the POS tag of w4w4.

If n = 3, we will consider the POS tags of 3 words prior to w4 in the labeled corpus of text
Let’s say the POS tags for

w1w1 = NOUN, w2w2 = VERB , w3w3 = DETERMINER

In short, N, V, D: NVD

Then in the labeled corpus of text, we will search for this NVD sequence.

Let’s say we found 100 such NVD sequences. Out of these -

10 sequences have the POS of the next word is NOUN 90 sequences have the POS of the next
word is VERB

Then the POS of the word w4w4 = VERB

The main drawback of this technique is that sometimes the predicted sequence is not
Grammatically correct.

PROBABILISTIC PARSING :

is using dynamic programming algorithms to compute the most likely parse(s) of a given
sentence, given a statistical model of the syntactic structure of a language. Research at
Stanford has focused on improving the statistical models used as well as the algorithms.

A probabilistic context free grammar is a context free grammar with probabilities attached to
the rules.

Model Parameters
The model parameters are the probabilities assigned to grammar rules.
Computing Probabilities
We discuss how the model assigns probabilities to strings and to analyses of strings.
Exploiting Probabilities in Parsing
We discuss how to find the most probable parse given the model.
Estimating Parameters
We sketch how rule probabilities are estimated from a syntactically annotated corpus.
Important Probabilities
Rule Probabilities
Probability of an expansion given the category being expanded. P(γ | A), the probability
that the sequence of grammar symbols γ will expand category A. Given the definition of a
conditional probability, this means the probabilities of all the expansions of a single
catgeory must add to 1. For example:
S -> NP VP, .8 | Aux NP VP, .2
Given that S is being expanded, the probability of the NP VP rule is .8; the probability of
the Aux NP VP expansion is .2. The rule probabilities are the parameters of a PCFG model. Tree
Probability
The probability of a tree given all the rules used to construct it. This will be the product
of the probabilities of all the rules used to construct it. We motivate this below.

For now an example. Given this grammar:

the sentence Book the dinner flight


has at least two trees:
P(Tleft) = .05 * .20 * .20 * .20 * .75 * .30 * .60 * .10 * .40 = 2.2 × 10-6
P(Tright) = .05 * .10 * .20 * .15 * .75 * .75 * .30 * .60 * .10 * .40 = 6.1 × 10-7

Therefore, the more probable of the two parses is Tleft, which is the intuitive

result. *****

Common questions

Powered by AI

Parse trees offer several benefits in understanding grammatical structures as they provide a clear, hierarchical visualization of the derivation process from a start symbol, showing how sentences are structured according to grammar rules by explicitly mapping non-terminals to terminals. However, they have limitations in that constructing parse trees can be computationally intensive, particularly for complex sentences with ambiguous grammar. Additionally, they do not inherently capture semantic nuances, unlike dependency-based models that emphasize syntactic dependencies among words .

Constituency grammar and dependency grammar differ in how they represent sentence structures. Constituency grammar is based on constituency relations, where the sentence is divided into parts such as noun phrases (NP) and verb phrases (VP), and these parts have internal phrasal nodes . Dependency grammar, introduced by Lucien Tesniere, focuses on the dependency relations between words, where the verb is central, and lacks phrasal nodes. Each word is linked directly to the verb, showing a hierarchy of dependencies .

Norvig’s approach to spelling correction employs traditional methods such as edit distance through operations like insertion, deletion, swap, and substitution, using probability-based candidate ranking on a corpus . In contrast, large language model approaches leverage context with deep learning to correct errors. While Norvig’s approach can quickly identify misspellings and suggest corrections through straightforward probabilistic methods, large language models provide more nuanced corrections by considering context, but they can be slower and resource-intensive due to the complex computations involved in understanding and generating language .

Finite automata are closely related to regular grammars in their ability to define regular languages. A finite automaton can be implemented as, and is equivalent to, regular expressions, which are another formalism for characterizing regular languages. Additionally, regular grammars, which may be right-regular or left-regular, are another method to describe regular languages. Together, these three—finite automata, regular expressions, and regular grammars—constitute different approaches for describing and recognizing regular languages in computational theory .

The transition function δ is critical in Finite State Automata (FSA) because it defines how the automaton progresses from one state to another based on input symbols. It captures the rules of state transitions and dictates the behavior of the FSA by specifying the next state for each pair of current state and input symbol. This function is essential for the automaton to process inputs and reach acceptance or rejection for input sequences. The effectiveness and correctness of an FSA in recognizing languages largely depend on the accurate definition of δ, which implicitly embeds the language's syntax and processing logic .

One of the main challenges for spelling correctors is distinguishing between real-word errors where the misspelling forms another valid word (e.g., 'form' vs. 'from'). Contextual information is crucial in resolving these issues as it can help determine the intended meaning based on surrounding words. For instance, the context can clarify whether 'lead' should be corrected to 'led' or 'lead' based on its usage in a sentence. Utilizing larger context through models or more advanced algorithms can improve the accuracy in detecting and correcting these types of errors, although it remains computationally challenging .

Context-Free Grammar (CFG) is a more expressive language representation than Regular Grammar. Regular grammars, which can be fully described by finite automata and regular expressions, have constraints that make them unsuitable for representing nested structures or dependencies, a limitation CFG overcomes. CFGs can generate languages that require matching symbols, such as balanced parentheses, which cannot be represented by regular grammars. Thus, CFGs represent a superset of regular grammars, providing the added capability to describe a wider variety of syntactical structures in computer languages .

Using dynamic programming in spell checking algorithms greatly enhances efficiency compared to brute-force methods. Dynamic programming helps in calculating the least number of edits required to transform one word to another, thereby quickly suggesting the most likely correct word. This efficiency is achieved by avoiding redundant calculations through the use of memoization or tabulation, which is not the case in brute-force methods that explore every possible edit path. As such, dynamic programming provides a systematic way to solve the edit distance problem, making it more suitable for large-scale applications where speed is critical .

Part-of-speech (POS) tagging contributes significantly to tasks such as speech recognition and machine translation by providing essential grammatical categorization for each word in the text. In speech recognition, POS tags help disambiguate words based on their usage, improving the understanding of spoken language inputs. In machine translation, they enable accurate translation of words by recognizing their grammatical roles, which is crucial for maintaining semantic meaning across languages. POS tagging lays the foundation for these NLP applications by enhancing syntactic and semantic parsing, ultimately leading to more accurate and natural language processing .

The properties of regular sets are important because they ensure that operations such as union, intersection, complement, difference, reversal, closure, and concatenation will yield regular sets. This consistency means that regardless of the operations performed, the resulting set remains within the realm of regular languages, which can be represented using finite state automata (FSA) or regular expressions. This contributes to computational theory by providing a stable framework to work with these languages, ensuring that they can be effectively described, analyzed, and implemented in computational models .

You might also like