0% found this document useful (0 votes)
13 views27 pages

Spelling Error Detection and Correction

Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
13 views27 pages

Spelling Error Detection and Correction

Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd

Detection and Correction of

Spelling Errors
 The detection and correction of spelling errors is
an integral part of modern word-processors.
 The very same algorithms are also important in
applications in which even the individual letters
aren’t guaranteed to be accurately identified:
optical character recognition (OCR) and on-
line handwriting recognition.
 Optical character recognition is the term used
for automatic recognition of machine or hand-
printed characters.
 On-line handwriting recognition is the
recognition of human printed or cursive
handwriting as the user is writing
 Spelling correction, breaks the field down into
three increasingly broader problems:
 1. non-word error detection: detecting spelling
errors which result in non-words (like graffe for
giraffe).
 2. isolated-word error correction: correcting
spelling errors which result in non-words, for
example correcting graffe to giraffe, but looking
only at the word in isolation.
 3. context-dependent error detection and
correction: Using the context to help detect and
correct spelling errors even if they accidentally
result in an actual word of English (real-word
errors).
 This can happen from typographical errors
(insertion, deletion, transposition) which
accidently produce a real word (e.g. there for
three), or
 because the writer substituted the wrong spelling
of a homophone or near-homophone (e.g.
dessert for desert, or piece for peace).
Minimum Edit Distance

Most slides adapted from slides by Dan Jurafsky, Stanford University


 String distance: The distance between two
strings is a measure of how alike two strings are
to each other
How similar are two strings?

 Spell correction
The user typed • Computational Biology
“graffe” • Align two sequences of nucleotides
Which is closest? AGGCTATCACCTGACCTCCAGGCCGATGCCC
 graf TAGCTATCACGACCGCGGTCGATTTGCCCGAC
 graft • Resulting alignment:
 grail -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
 giraffe TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

• Also for Machine Translation, Information Extraction, Speech


Recognition
Edit Distance
 The minimum edit distance between two strings
 Is the minimum number of editing operations
 Insertion
 Deletion
 Substitution

 Needed to transform one into the other


Minimum Edit Distance
 Two strings and their alignment:
Minimum Edit Distance

 If each operation has cost of 1


 Distance between these is 5
 If substitutions cost 2 (Levenshtein)
 Distance between them is 8
How to find the Min Edit Distance?
 Searching for a path (sequence of edits) from the
start string to the final string:
 Initial state: the word we’re transforming
 Operators: insert, delete, substitute
 Goal state: the word we’re trying to get to
 Path cost: what we want to minimize: the number of
edits
Minimum Edit Distance as Search
 But the space of all edit sequences is huge!
 We can’t afford to navigate naïvely
 Lots of distinct paths wind up at the same state.
 We don’t have to keep track of all of them

 Instead, we use dynamic programming, in a


way very similar to the LCS problem…
Defining Edit Distance Parameters
 For two strings
 X of length n
 Y of length m

 We define D(i,j)
 the edit distance between X[1..i] and Y[1..j]
 i.e., the first i characters of X and the first j characters of Y
 The edit distance between X and Y is thus D(n,m)
Dynamic Programming for Edit Distance
 Dynamic programming: A tabular computation of
D(n,m)
 Solving problems by combining solutions to
subproblems.
 Bottom-up
 We compute D(i,j) for small i,j
 And compute larger D(i,j) based on previously
computed smaller values
 i.e., compute D(i,j) for all i (0 < i < n) and j (0 < j < m)
Defining Min Edit Distance
(Levenshtein)
 Initialization
D(i,0) = i
D(0,j) = j
 Recurrence Relation:
For each i = 1…M
For each j = 1…N
D(i-1,j) + 1
D(i,j)= min D(i,j-1) + 1
D(i-1,j-1) + 2; if X(i) ≠ Y(j)

0; if X(i) = Y(j)
 Termination:
D(N,M) is distance
The Edit Distance Table

N 9
O 8
I 7

T 6
N 5
E 4
T 3
N 2
I 1
# 0 1 2 3 4 5 6 7 8 9
# E X E C U T I O N
The Edit Distance Table
N 9
O 8
I 7
T 6
N 5
E 4
T 3
N 2
I 1
# 0 1 2 3 4 5 6 7 8 9
# E X E C U T I O N
Edit Distance

N 9
O 8
I 7

T 6
N 5
E 4
T 3
N 2
I 1
# 0 1 2 3 4 5 6 7 8 9
# E X E C U T I O N
The Edit Distance Table
N 9 8 9 10 11 12 11 10 9 8
O 8 7 8 9 10 11 10 9 8 9
I 7 6 7 8 9 10 9 8 9 10
T 6 5 6 7 8 9 8 9 10 11
N 5 4 5 6 7 8 9 10 11 10
E 4 3 4 5 6 7 8 9 10 9
T 3 4 5 6 7 8 7 8 9 8
N 2 3 4 5 6 7 8 7 8 7
I 1 2 3 4 5 6 7 6 7 8
# 0 1 2 3 4 5 6 7 8 9
# E X E C U T I O N
Computing alignments
 Edit distance isn’t sufficient
 We often need to align each character of the two
strings to each other
 We do this by keeping a “backtrace”
 Every time we enter a cell, remember where we
came from
 When we reach the end,
 Trace back the path from the upper right corner to
read off the alignment
Edit Distance

N 9
O 8
I 7

T 6
N 5
E 4
T 3
N 2
I 1
# 0 1 2 3 4 5 6 7 8 9
# E X E C U T I O N
MinEdit with Backtrace
Adding Backtrace to Minimum Edit Distance

 Base conditions: Termination:


D(i,0) = i D(0,j) = j D(N,M) is distance
 Recurrence Relation:
For each i = 1…M
For each j = 1…N
D(i-1,j) + 1deletion
D(i,j)= min D(i,j-1) + 1 insertion
D(i-1,j-1) + 2; if X(i) ≠ Y(j) substitution
0; if X(i) = Y(j)
LEFT
insertion
ptr(i,j)= DOWN
deletion
DIAG
substitution
Performance
 Time:

O(nm)
 Space:

O(nm)
 Can be improved to O(n+m) using Hirschberg’s
algorithm

 Backtrace:

O(n+m)
Types of Edit Distance Metrics
 The Levenshtein distance allows deletion,
insertion and substitution.
 The Longest common subsequence (LCS)
distance allows only insertion and deletion, not
substitution.
 The Hamming distance allows only substitution;
hence, it only applies to strings of the same length.
 The Damerau–Levenshtein distance allows
insertion, deletion, substitution, and the
transposition of two adjacent characters.
 The Jaro distance allows only transposition.
Common Algorithm

This doesn’t include transposition – how could we add this operation?

From [Link]

You might also like