NATURAL LANGUAGE PROCESSING
(NLP)
MODULE 2
Word level Analysis
1 Regular Expression
2. Finite state Automata
3. Morphological parsing
1. Regular Expressions
• A Regular Expression (RE) OR regexes for short, are a pattern
matching standard for string parsing and replacement.
• RE helps us to match or find other strings or sets of strings, using a
specialized syntax held in a pattern.
• Regular expressions are used to search texts in UNIX as well as in MS
WORD in identical way.
SOME SIMPLE REGULAR EXPRESSIONS
A Regular Expression is an algebraic formula whose value is a pattern consisting
of a set of strings called the Language of the expression.
CHARACTER CLASSES
• Regular expressions are case-sensitive.
• The pattern /s/ matches lower case ‘s’
• The pattern /S/ matches upper case ‘S’
• This problem can be solved by using the disjunction of the word ‘s’ and
‘S’
• The pattern /[sS]/ will match the string containing either ‘s’ or ‘S’
EXAMPLES
ANCHOR SYMBOLS
SOME SPECIAL CHARACTERS
PARTS OF REGULAR EXPRESSION
2. FINITE STATE AUTOMATA (FSA)
3. MORPHOLOGICAL PARSING
Example:
Hand-s-ful
Generating or
parsing with
FST lexicon and
rules
SPELLING ERROR
DETECTION AND CORRECTION
SPELLING DETECTION
Over 80% of the typing errors were single-error mis-spellings:
1. Substitution of a single letter: when a wrong letter replaces a right one.
Eg. “Error” ”Errpr”
2. Omission of a single letter: when a single character is omitted (deleted)
Eg. “concept” “concpt”
3. Insertion of a single letter: presence of a extra character
Eg. “error” “errorn”
4. Transposition/ Reversal of two adjacent letters: sequence of characters is
reversed
Eg. “are” “aer”
• Optical Character Recognition (OCR) and other automatic reading devices
introduces errors of substitution, deletion and insertion but not reversal.
• OCR errors are grouped into 5 classes:
1. Substitution : caused due to visual similarity (c e, 1 l , r n)
2. Multi-substitution (or framing): eg: m rn
3. Space deletion
4. Space insertion
5. Failure : OCR algorithm fails to select a letter with sufficient accuracy.
Spelling error: mainly phonetic
2 categories of errors:
1. Non-word error: when a resultant word does not appear in a
lexicon or is not a valid orthographic word form.
Solution : n-gram analysis and dictionary lookup
2. Real-world error: Occurs due to typographical mistakes or
spelling errors
eg. Piece peace , meat meet
Problems:
causes local syntactic errors, global syntactic errors, semantic
errors, errors at discourse or pragmatic levels
SPELLING CORRECTION
Spelling correction consists of detecting and correcting errors.
Error detection is the process of finding misspelt words.
Error Correction is the process of suggesting correct words to a misspelled one.
These problems are addressed in two ways:
1. Isolated error detection and correction:
Each word is checked separately independent of its context.
Problems:
a. The strategy requires existence of a lexicon containing all correct words.
b. Some languages are highly productive.
c. The strategy fails when spelling around produces a word and belongs to the
lexicon.
d. The larger the lexicon the more likely it is that an error goes undetected.
2. Context-dependent Error detection and correction:
The context of the word is utilised to detect and correct errors.
Categories of spelling correction algorithm.
1. Minimum edit distance : Minimum number of operations (insertions ,
deletions or substitutions) here to transform one string into another.
2. Similarity key techniques.
3. N gram based techniques.
4. Neural Nets
5. Rule based techniques
MINIMUM EDIT DISTANCE ALGORITHM
• The minimum edit distance is the number of insertions, deletions and
substitutions required to change one string into another.
• When we talk about distance between two strings, we are talking of the
minimum edit distance.
• Edit Distance between two strings can be represented as a binary
function, ed, which maps 2 strings to their edit distance. ed is symmetric.
For any 2 strings, s and t,
ed(s,t)=ed(t,s)
• Edit distance can be viewed as a string alignment problem.
• By aligning 2 strings, we can measure the degree to which they match.
• The alignment shown here between tutor and tumor has a distance of
two.
SUBSTITUTION INSERTION
DELETION INSERTION INSERTION
T U T O - R T U T - O - R
T U M O U R T U - M O U R
MINIMUM DISTANCE = 2 MINIMUM DISTANCE = 3
• The Levensthein distance between 2 sequences is obtained by assigning a
unit cost to each operation.
• The problem can be solved by using dynamic programming algorithms.
Which uses a table-driven approach to solve problems.
• The dynamic programming algorithm is implemented by creating an edit
distance matrix.
• The matrix has one row for each symbol in the source string and one
column for each symbol in the target string.
Computing minimum edit distance
WORDS AND WORD CLASSES
• Words are classified into categories called part of speech.
• These are sometimes called word classes or lexical categories.
• These lexical categories are usually defined by their syntactic and
morphological behaviors.
• The most common lexical categories are nouns and verbs. Lexical
categories include adjectives, adverbs, prepositions, and conjunctions.
• Word classes are further categorized as open and closed word classes.
1. Open Word classes constantly acquire new members.
Nouns, verbs (accept auxiliary verbs), adjectives, adverbs, and
interjections are open word classes
2. Closed word classes do not acquire new members.
• Prepositions, auxiliary verbs, delimiters, conjunction, and participles are
closed word classes.
Parts of speech example
POS( Part-of-Speech) Tagging