NLP Word Level Analysis Techniques

Uploaded by

deekshitha1325

NATURAL LANGUAGE PROCESSING (NLP)
MODULE 2
Word level Analysis

1. Regular Expressions
2. Finite State Automata
3. Morphological Parsing
1. Regular Expressions

• A regular expression (RE), or regex for short, is a pattern-matching
standard for string parsing and replacement.
• An RE helps us match or find strings or sets of strings, using a
specialized syntax held in a pattern.
• Regular expressions are used to search text in UNIX tools as well as in MS
Word, in a broadly similar way.
SOME SIMPLE REGULAR EXPRESSIONS
A Regular Expression is an algebraic formula whose value is a pattern consisting
of a set of strings called the Language of the expression.
CHARACTER CLASSES
• Regular expressions are case-sensitive.
• The pattern /s/ matches only the lower-case ‘s’.
• The pattern /S/ matches only the upper-case ‘S’.
• To match either case, use a disjunction of the characters ‘s’ and ‘S’,
written as a character class in square brackets.
• The pattern /[sS]/ matches any string containing either ‘s’ or ‘S’.
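The case-sensitivity point above can be tried out with a minimal sketch in Python's standard `re` module. The sample text and variable names are assumptions made for illustration, and the slash-delimited patterns from the slides are written here as Python raw strings.

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = "Sam sells Seashells"

# /s/ matches lower-case 's' only.
lower_only = re.findall(r"s", text)

# /S/ matches upper-case 'S' only.
upper_only = re.findall(r"S", text)

# /[sS]/ is a character class: it matches either 's' or 'S'.
either_case = re.findall(r"[sS]", text)

print(len(lower_only), len(upper_only), len(either_case))  # prints: 4 2 6
```

The same effect can be obtained with the `re.IGNORECASE` flag, but the character class keeps the disjunction explicit in the pattern itself.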

EXAMPLES
ANCHOR SYMBOLS
SOME SPECIAL CHARACTERS
PARTS OF REGULAR EXPRESSION
2. FINITE STATE AUTOMATA (FSA)
3. MORPHOLOGICAL PARSING
Example:
Hand-s-ful
[Figure: Generating or parsing with FST lexicon and rules]
SPELLING ERROR DETECTION AND CORRECTION
SPELLING DETECTION
Over 80% of the typing errors were single-error misspellings, of four types:

1. Substitution of a single letter: a wrong letter replaces the right one.
E.g. “error” → “errpr”

2. Omission of a single letter: a single character is omitted (deleted).
E.g. “concept” → “concpt”

3. Insertion of a single letter: an extra character is present.
E.g. “error” → “errorn”

4. Transposition (reversal) of two adjacent letters: the order of two adjacent
characters is reversed.
E.g. “are” → “aer”
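The four single-error types can be told apart mechanically. The following is a minimal sketch, not a method given in the slides: the function name `classify_single_error` and its return labels are assumptions chosen for illustration.

```python
def classify_single_error(correct, typed):
    """Return the single-error type that turns `correct` into `typed`,
    or None if the two strings are equal or differ by more than one error."""
    if correct == typed:
        return None
    lc, lt = len(correct), len(typed)
    if lc == lt:
        # Same length: either one substituted letter or two swapped neighbours.
        diffs = [i for i in range(lc) if correct[i] != typed[i]]
        if len(diffs) == 1:
            return "substitution"
        if (len(diffs) == 2 and diffs[1] == diffs[0] + 1
                and correct[diffs[0]] == typed[diffs[1]]
                and correct[diffs[1]] == typed[diffs[0]]):
            return "transposition"
        return None
    if lc == lt + 1:
        # Typed word is one letter shorter: try deleting each letter of `correct`.
        if any(correct[:i] + correct[i + 1:] == typed for i in range(lc)):
            return "omission"
    if lt == lc + 1:
        # Typed word is one letter longer: try deleting each letter of `typed`.
        if any(typed[:i] + typed[i + 1:] == correct for i in range(lt)):
            return "insertion"
    return None


for pair in [("error", "errpr"), ("concept", "concpt"),
             ("error", "errorn"), ("are", "aer")]:
    print(pair, classify_single_error(*pair))
```

Run on the four examples from the list, the sketch labels them substitution, omission, insertion and transposition respectively.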
• Optical Character Recognition (OCR) and other automatic reading devices
introduce errors of substitution, deletion and insertion, but not reversal.
• OCR errors are grouped into 5 classes:
1. Substitution: caused by visual similarity (c → e, 1 → l, r → n)
2. Multi-substitution (or framing): e.g. m → rn
3. Space deletion
4. Space insertion
5. Failure: the OCR algorithm fails to select a letter with sufficient accuracy.
Spelling errors are mainly phonetic: the misspelt word is pronounced like the intended one.

Two categories of errors:

1. Non-word error: the resulting word does not appear in the
lexicon or is not a valid orthographic word form.
Solution: n-gram analysis and dictionary lookup.
2. Real-word error: the error results in an actual word of the language.
It occurs due to typographical mistakes or spelling errors.
E.g. “piece” for “peace”, “meat” for “meet”

Problems:
Real-word errors may cause local syntactic errors, global syntactic errors,
semantic errors, or errors at the discourse or pragmatic levels.
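The difference between the two categories shows up clearly in a dictionary-lookup sketch. The tiny `LEXICON` set and the function `non_word_errors` below are assumptions made for illustration; a real system would load a full word list.

```python
# Hypothetical toy lexicon; a real checker would use a much larger word list.
LEXICON = {"a", "piece", "peace", "of", "meat", "meet", "cake", "concept"}


def non_word_errors(tokens, lexicon=LEXICON):
    """Return the tokens that are not found in the lexicon (non-word errors)."""
    return [t for t in tokens if t.lower() not in lexicon]


# "concpt" is not in the lexicon, so lookup flags it.
print(non_word_errors(["a", "concpt"]))

# "peace" is a valid word, so the real-word error in
# "a peace of cake" passes dictionary lookup undetected.
print(non_word_errors(["a", "peace", "of", "cake"]))
```

Dictionary lookup alone therefore catches only non-word errors; real-word errors need the context of the word to be detected.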
SPELLING CORRECTION
Spelling correction consists of detecting and correcting errors.
Error detection is the process of finding misspelt words.
Error correction is the process of suggesting correct words for a misspelt one.
These problems are addressed in two ways:
1. Isolated-error detection and correction:
Each word is checked separately, independent of its context.
Problems:
a. The strategy requires a lexicon containing all correct words.
b. Some languages are highly productive, so listing all their correct words is impractical.
c. The strategy fails when a spelling error produces a word that itself belongs
to the lexicon (a real-word error).
d. The larger the lexicon, the more likely it is that an error goes undetected.
2. Context-dependent error detection and correction:
The context of the word is used to detect and correct errors.
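Isolated-word correction can be sketched with Python's standard `difflib` module. This is only an illustration under stated assumptions: the toy `LEXICON` and the function `suggest` are hypothetical, and `difflib.get_close_matches` ranks candidates by a sequence-similarity ratio, which is not one of the specific techniques named in these slides.

```python
import difflib

# Hypothetical toy lexicon used only for this sketch.
LEXICON = ["are", "concept", "error", "meat", "meet", "peace", "piece"]


def suggest(word, lexicon=LEXICON, n=3):
    """Suggest up to `n` lexicon words for `word`, ignoring its context.

    A word found in the lexicon is assumed correct and gets no suggestions,
    which is exactly why real-word errors slip through this strategy.
    """
    if word in lexicon:
        return []
    return difflib.get_close_matches(word, lexicon, n=n, cutoff=0.6)


print(suggest("concpt"))  # a non-word error: close lexicon words are proposed
print(suggest("peace"))   # a valid word: nothing is flagged, even if "piece" was meant
```

The second call illustrates problem (c) above: because “peace” is in the lexicon, an isolated-word checker has no reason to question it.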
Categories of spelling correction algorithms:
1. Minimum edit distance: the minimum number of operations (insertions,
deletions or substitutions) required to transform one string into another.
2. Similarity key techniques.
3. N gram based techniques.
4. Neural Nets
5. Rule based techniques

MINIMUM EDIT DISTANCE ALGORITHM

• The minimum edit distance is the number of insertions, deletions and
substitutions required to change one string into another.
• When we talk about distance between two strings, we are talking of the
minimum edit distance.
• Edit Distance between two strings can be represented as a binary
function, ed, which maps 2 strings to their edit distance. ed is symmetric.
For any 2 strings, s and t,
ed(s,t)=ed(t,s)
• Edit distance can be viewed as a string alignment problem.
• By aligning 2 strings, we can measure the degree to which they match.
• Two alignments between tutor and tumour are shown here. The first has a
distance of two, which is the minimum; the second has a distance of three.

Alignment 1 (substitution, insertion): distance = 2 (minimum)
T U T O - R
T U M O U R

Alignment 2 (deletion, insertion, insertion): distance = 3
T U T - O - R
T U - M O U R
• The Levenshtein distance between 2 sequences is obtained by assigning a
unit cost to each operation.
• The problem can be solved with a dynamic programming algorithm,
which uses a table-driven approach to solve problems.
• The dynamic programming algorithm is implemented by creating an edit
distance matrix.
• The matrix has one row for each symbol in the source string and one
column for each symbol in the target string.
Computing minimum edit distance
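A minimal sketch of the table-driven computation, assuming unit cost for insertion, deletion and substitution (the Levenshtein distance). The function name `edit_distance` is hypothetical, and the matrix here carries one extra row and column (index 0) for the empty prefix, which is an implementation assumption rather than something stated on the slides.

```python
def edit_distance(source, target):
    """Minimum number of insertions, deletions and substitutions
    needed to turn `source` into `target` (unit cost per operation)."""
    n, m = len(source), len(target)
    # dist[i][j] = distance between source[:i] and target[:j].
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i              # delete all i symbols of the source prefix
    for j in range(1, m + 1):
        dist[0][j] = j              # insert all j symbols of the target prefix
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub_cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,             # deletion
                dist[i][j - 1] + 1,             # insertion
                dist[i - 1][j - 1] + sub_cost,  # substitution or match
            )
    return dist[n][m]


print(edit_distance("tutor", "tumour"))  # prints: 2
print(edit_distance("tumour", "tutor"))  # prints: 2 (ed is symmetric)
```

Each cell is filled from its three neighbours (above, left, diagonal), so the whole table takes time proportional to the product of the two string lengths.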
WORDS AND WORD CLASSES
• Words are classified into categories called parts of speech.
• These are sometimes called word classes or lexical categories.
• These lexical categories are usually defined by their syntactic and
morphological behaviour.
• The most common lexical categories are nouns and verbs. Other lexical
categories include adjectives, adverbs, prepositions, and conjunctions.
• Word classes are further categorized as open and closed word classes.
1. Open word classes constantly acquire new members.
Nouns, verbs (except auxiliary verbs), adjectives, adverbs, and
interjections are open word classes.
2. Closed word classes do not acquire new members (or do so only rarely).
Prepositions, auxiliary verbs, determiners, conjunctions, and particles are
closed word classes.
Parts of speech example
POS (Part-of-Speech) Tagging
