Detection and Correction of
Spelling Errors
The detection and correction of spelling errors is
an integral part of modern word-processors.
The very same algorithms are also important in
applications in which even the individual letters
aren’t guaranteed to be accurately identified:
optical character recognition (OCR) and on-
line handwriting recognition.
Optical character recognition is the term used
for automatic recognition of machine or hand-
printed characters.
On-line handwriting recognition is the
recognition of human printed or cursive
handwriting as the user is writing
Spelling correction, breaks the field down into
three increasingly broader problems:
1. non-word error detection: detecting spelling
errors which result in non-words (like graffe for
giraffe).
2. isolated-word error correction: correcting
spelling errors which result in non-words, for
example correcting graffe to giraffe, but looking
only at the word in isolation.
3. context-dependent error detection and
correction: Using the context to help detect and
correct spelling errors even if they accidentally
result in an actual word of English (real-word
errors).
This can happen from typographical errors
(insertion, deletion, transposition) which
accidently produce a real word (e.g. there for
three), or
because the writer substituted the wrong spelling
of a homophone or near-homophone (e.g.
dessert for desert, or piece for peace).
Minimum Edit Distance
Most slides adapted from slides by Dan Jurafsky, Stanford University
String distance: The distance between two
strings is a measure of how alike two strings are
to each other
How similar are two strings?
Spell correction
The user typed • Computational Biology
“graffe” • Align two sequences of nucleotides
Which is closest? AGGCTATCACCTGACCTCCAGGCCGATGCCC
graf TAGCTATCACGACCGCGGTCGATTTGCCCGAC
graft • Resulting alignment:
grail -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
giraffe TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
• Also for Machine Translation, Information Extraction, Speech
Recognition
Edit Distance
The minimum edit distance between two strings
Is the minimum number of editing operations
Insertion
Deletion
Substitution
Needed to transform one into the other
Minimum Edit Distance
Two strings and their alignment:
Minimum Edit Distance
If each operation has cost of 1
Distance between these is 5
If substitutions cost 2 (Levenshtein)
Distance between them is 8
How to find the Min Edit Distance?
Searching for a path (sequence of edits) from the
start string to the final string:
Initial state: the word we’re transforming
Operators: insert, delete, substitute
Goal state: the word we’re trying to get to
Path cost: what we want to minimize: the number of
edits
Minimum Edit Distance as Search
But the space of all edit sequences is huge!
We can’t afford to navigate naïvely
Lots of distinct paths wind up at the same state.
We don’t have to keep track of all of them
Instead, we use dynamic programming, in a
way very similar to the LCS problem…
Defining Edit Distance Parameters
For two strings
X of length n
Y of length m
We define D(i,j)
the edit distance between X[1..i] and Y[1..j]
i.e., the first i characters of X and the first j characters of Y
The edit distance between X and Y is thus D(n,m)
Dynamic Programming for Edit Distance
Dynamic programming: A tabular computation of
D(n,m)
Solving problems by combining solutions to
subproblems.
Bottom-up
We compute D(i,j) for small i,j
And compute larger D(i,j) based on previously
computed smaller values
i.e., compute D(i,j) for all i (0 < i < n) and j (0 < j < m)
Defining Min Edit Distance
(Levenshtein)
Initialization
D(i,0) = i
D(0,j) = j
Recurrence Relation:
For each i = 1…M
For each j = 1…N
D(i-1,j) + 1
D(i,j)= min D(i,j-1) + 1
D(i-1,j-1) + 2; if X(i) ≠ Y(j)
0; if X(i) = Y(j)
Termination:
D(N,M) is distance
The Edit Distance Table
N 9
O 8
I 7
T 6
N 5
E 4
T 3
N 2
I 1
# 0 1 2 3 4 5 6 7 8 9
# E X E C U T I O N
The Edit Distance Table
N 9
O 8
I 7
T 6
N 5
E 4
T 3
N 2
I 1
# 0 1 2 3 4 5 6 7 8 9
# E X E C U T I O N
Edit Distance
N 9
O 8
I 7
T 6
N 5
E 4
T 3
N 2
I 1
# 0 1 2 3 4 5 6 7 8 9
# E X E C U T I O N
The Edit Distance Table
N 9 8 9 10 11 12 11 10 9 8
O 8 7 8 9 10 11 10 9 8 9
I 7 6 7 8 9 10 9 8 9 10
T 6 5 6 7 8 9 8 9 10 11
N 5 4 5 6 7 8 9 10 11 10
E 4 3 4 5 6 7 8 9 10 9
T 3 4 5 6 7 8 7 8 9 8
N 2 3 4 5 6 7 8 7 8 7
I 1 2 3 4 5 6 7 6 7 8
# 0 1 2 3 4 5 6 7 8 9
# E X E C U T I O N
Computing alignments
Edit distance isn’t sufficient
We often need to align each character of the two
strings to each other
We do this by keeping a “backtrace”
Every time we enter a cell, remember where we
came from
When we reach the end,
Trace back the path from the upper right corner to
read off the alignment
Edit Distance
N 9
O 8
I 7
T 6
N 5
E 4
T 3
N 2
I 1
# 0 1 2 3 4 5 6 7 8 9
# E X E C U T I O N
MinEdit with Backtrace
Adding Backtrace to Minimum Edit Distance
Base conditions: Termination:
D(i,0) = i D(0,j) = j D(N,M) is distance
Recurrence Relation:
For each i = 1…M
For each j = 1…N
D(i-1,j) + 1deletion
D(i,j)= min D(i,j-1) + 1 insertion
D(i-1,j-1) + 2; if X(i) ≠ Y(j) substitution
0; if X(i) = Y(j)
LEFT
insertion
ptr(i,j)= DOWN
deletion
DIAG
substitution
Performance
Time:
O(nm)
Space:
O(nm)
Can be improved to O(n+m) using Hirschberg’s
algorithm
Backtrace:
O(n+m)
Types of Edit Distance Metrics
The Levenshtein distance allows deletion,
insertion and substitution.
The Longest common subsequence (LCS)
distance allows only insertion and deletion, not
substitution.
The Hamming distance allows only substitution;
hence, it only applies to strings of the same length.
The Damerau–Levenshtein distance allows
insertion, deletion, substitution, and the
transposition of two adjacent characters.
The Jaro distance allows only transposition.
Common Algorithm
This doesn’t include transposition – how could we add this operation?
From [Link]