0% found this document useful (0 votes)
11 views7 pages

Unit 2 NLP

The document discusses regular expressions (RE) as powerful tools for searching and matching strings in text, detailing operations like concatenation, union, and Kleene star. It also covers morphological parsing in NLP, explaining how words are analyzed into morphemes, and introduces finite state automata (FSA) for pattern recognition. Additionally, it addresses spelling error detection and correction techniques in NLP, providing methods and examples for improving text quality.

Uploaded by

anikethbhosale11
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
11 views7 pages

Unit 2 NLP

The document discusses regular expressions (RE) as powerful tools for searching and matching strings in text, detailing operations like concatenation, union, and Kleene star. It also covers morphological parsing in NLP, explaining how words are analyzed into morphemes, and introduces finite state automata (FSA) for pattern recognition. Additionally, it addresses spelling error detection and correction techniques in NLP, providing methods and examples for improving text quality.

Uploaded by

anikethbhosale11
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Unit-2

Regular Expressions
Regular expressions (RE) are specialized patterns for searching and matching strings within
text. They provide a concise and powerful language to specify text search criteria. REs include
operations such as:
 Concatenation (e.g., a.b means character a followed by b)
 Union or alternation (e.g., a+b means a or b)
 Kleene star (e.g., a* meaning zero or more repetitions of a)
Mathematically, basic regular expressions include empty string (ε), empty language (φ), and
can be combined using union (+), concatenation (.), and Kleene star (*).
Regular expressions (REs) provide a formal and flexible way to specify search patterns in text.
In NLP, they are widely used for:
Text searching and pattern matching: Quickly find substrings matching specific syntactic
patterns.
Tokenization: Defining rules to split running text into words, sentences, or other tokens.
Information extraction: Detecting structured elements such as dates, numbers, emails, or
domain-specific entities.
Preprocessing and normalization: Identifying and transforming text snippets (e.g., removing
punctuation or matching variants).

Regular set to represent the following patterns


(i) (a+b): union of characters 'a' and 'b'
(ii) a.b : concatenation of characters 'a' followed by 'b'
(iii) a* : Kleene star representing zero or more repetitions of character 'a'

The regular sets corresponding to each given pattern:


(i) (a + b)
Regular set: {a, b}
This represents the union of the single-character strings a and b.
(ii) a.b
Regular set: {ab}
This represents the concatenation of 'a' followed by 'b', forming the string ab.
(iii) a*
Regular set: {ε, a, aa, aaa, …}
Unit-2

This represents zero or more repetitions of the character 'a', including the empty string ε.

Write the regular expression for the following languages


(i) The language accepts all the strings containing any numbers a’s and b’s
(ii) The language accepts all the strings which are starting with 1’s and end with 0’s
(iii) The language starts with “a” but not having consecutive b’s

Regular Expressions for the Given Languages


(i) Language accepting all strings containing any numbers, a's, and b's:
[0-9ab]*
(ii) language accepting all strings starting with 1's and ending with 0's:
 Strings start with one or more '1's
 End with one or more '0's
 Middle may be any sequence of '0's and '1's

Regular Expression: 1+[01]*0+


Where:
1+ means one or more '1's, [01]* means zero or more '0' or '1', and 0+ means one or
more '0's
(iii) Language starting with "a" but not having consecutive 'b's:
a(a|ba)*

Morphological parsing in the context of natural language processing like prefixes,


suffixes.
Morphological parsing in Natural Language Processing (NLP) is the computational process of
analyzing the structure of words to identify their smallest meaningful components called
morphemes. These components include roots (base forms), prefixes (affixes added to the
beginning of a word), and suffixes (affixes added to the end of a word).
The goal of morphological parsing is to break down words into these constituent morphemes
to understand their formation and meaning. For example, the word “unhappiness” can be parsed
into:
Prefix: "un-"
Root: "happy"
Suffix: "-ness"
This analysis enables NLP systems to recognize the relationship between words, such as
identifying that "happy" is related to "unhappy" and "happiness" despite different surface
forms.
Unit-2

Morphological parsing helps in several NLP tasks including text normalization, spelling
correction, machine translation, and information retrieval.

How Morphological Parsing Works


It uses morphological rules that describe how morphemes combine in a particular language.
Often implemented using finite-state transducers (FSTs) which are state machines that map
surface words to their morphemes.
Handles orthographic rules, such as changing “y” to “ies” for pluralization in English (e.g.,
“country” to “countries”).

Role of Prefixes and Suffixes


Prefixes attach to the front of a root word and modify the meaning, e.g., "un-" in "undo," "pre-
" in "preview."
Suffixes attach to the end of the root, often indicating grammatical properties, e.g., "-ed" for
past tense (walked), "-s" for plural (cats), "-ness" for nouns (happiness).

Finite State Automaton (FSA)


A Finite State Automaton (FSA), also known as a finite-state machine or finite automaton, is
a mathematical computational model used to recognize patterns and accept or reject input
strings based on a finite number of states.
Components of FSA
Unit-2

Description of Components:
States: Represent possible conditions or configurations the automaton can be in, depicted as
circles.
Input Alphabet: Symbols the automaton reads from input sequentially.
Start State: The state where computation begins, indicated by an arrow with no origin.
Transition Function: Defines how the FSA moves between states on input symbols.
Final/Accepting State(s): If the FSA ends in one after consuming all input, the input string is
accepted.

Constructing FSA for Regular Expression /abb|acb/


Unit-2
Unit-2

Spelling Error Detection and Correction in NLP


Spelling error detection identifies misspelled or incorrectly typed words in text, while spelling
correction suggests possible correct versions. Together, these processes improve text quality,
essential for downstream NLP tasks.
Common Techniques for Spelling Error Detection
 Dictionary Lookup: Compare each word against a known dictionary; words not found
are flagged as errors.
 N-gram Analysis: Use character n-grams to detect unlikely or rare sequences
suggestive of errors.
 Statistical Models: Identify words with low probability or unusual context based on
language corpora.
Common Techniques for Spelling Error Correction
 Edit Distance Based Methods: Calculate minimum edits (insertions, deletions,
substitutions, transpositions) needed to transform a misspelled word into dictionary
words. Words with minimal edit distance are correction candidates.
 Noisy Channel Model: Treat the observed misspelled word as a distorted form of a
true word. Use Bayesian inference to find the most probable original word given the
typo.
 Probabilistic Models: Estimate likelihood of corrections based on language and error
statistics.
 Neural Networks: Use deep learning models (e.g., sequence-to-sequence, contextual
models) to learn error patterns and contextual corrections.

Example Approach: Peter Norvig’s Spell Correction Algorithm
Generate candidate corrections by applying edits at an edit distance of one or two:
 Deletions (remove a character),
 Transpositions (swap adjacent characters),
Unit-2

 Replacements (replace a character),


 Insertions (add a character).
 Retain candidates that appear in a dictionary.
 Choose the candidate with the highest occurrence probability from a corpus.

Question Bank
1. Explain the role of regular expressions in Natural Language Processing (NLP). Write
the regular expression for the following languages
(i) The language accepts all the strings containing any numbers a’s and b’s
(ii) The language accepts all the strings which are starting with 1’s and end with 0’s
(iii) The language starts with “a” but not having consecutive b’s
2. Write regular set to represent the following patterns
(i) (a+b): union of characters 'a' and 'b'
(ii) a.b : concatenation of characters 'a' followed by 'b'
(iii) a* : Kleene star representing zero or more repetitions of character 'a'.
3. With a suitable examples, explain the morphological parsing in the context of natural
language processing like prefixes, suffixes.
4. Define a Finite State Automaton (FSA) and explain its components in detail. Construct
the transition diagram and develop the state transition table for the regular expression
/abb|acb/. Illustrate how the automaton processes input strings according to this
expression.
5. Construct the transition diagram and develop the state transition table for the regular
expression /(a|b)^*baa$/. Include details on states, transitions, and final state
conditions.
6. State spelling error detection and correction in Natural Language Processing. Explain
the common techniques used for detecting spelling errors and providing corrections.
Illustrate your answer with an example algorithm or approach commonly used in NLP
for this task.
7. Explain two step morphological parsing with a suitable examples.

You might also like