Edit Distance
Most slides adapted from slides by Dan Jurafsky, Stanford University
How similar are two strings?
Spell correction
The user typed • Computational Biology
“graffe” • Align two sequences of nucleotides
Which is closest? AGGCTATCACCTGACCTCCAGGCCGATGCCC
graf TAGCTATCACGACCGCGGTCGATTTGCCCGAC
graft • Resulting alignment:
grail -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
giraffe TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
• Also for Machine Translation, Information Extraction, Speech
Recognition
Edit Distance
The minimum edit distance between two strings
Is the minimum number of editing operations
Insertion
Deletion
Substitution
Needed to transform one into the other
Minimum Edit Distance
Two strings and their alignment:
Minimum Edit Distance
If each operation has cost of 1
Distance between these is 5
If substitutions cost 2 (Levenshtein)
Distance between them is 8
Alignment in Computational Biology
Given a sequence of bases
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC
An alignment:
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
Given two sequences, align each letter to a letter or
gap
Uses of Edit Distance in NLP
Evaluating Machine Translation and speech
recognition
R Spokesman confirms senior government adviser was appointed
H Spokesman said the senior adviser was appointed
S I D I
How to find the Min Edit Distance?
Searching for a path (sequence of edits) from the
start string to the final string:
Initial state: the word we’re transforming
Operators: insert, delete, substitute
Goal state: the word we’re trying to get to
Path cost: what we want to minimize: the number of
edits
Minimum Edit Distance as Search
But the space of all edit sequences is huge!
We can’t afford to navigate naïvely
Lots of distinct paths wind up at the same state.
We don’t have to keep track of all of them
Instead, we use dynamic programming, in a
way very similar to the LCS problem…
Defining Edit Distance Parameters
For two strings
X of length n
Y of length m
We define D(i,j)
the edit distance between X[1..i] and Y[1..j]
i.e., the first i characters of X and the first j characters of Y
The edit distance between X and Y is thus D(n,m)
Dynamic Programming for Edit Distance
Dynamic programming: A tabular computation of
D(n,m)
Solving problems by combining solutions to
subproblems.
Bottom-up
We compute D(i,j) for small i,j
And compute larger D(i,j) based on previously
computed smaller values
i.e., compute D(i,j) for all i (0 < i < n) and j (0 < j < m)
Defining Min Edit Distance
(Levenshtein)
Initialization
D(i,0) = i
D(0,j) = j
Recurrence Relation:
For each i = 1…M
For each j = 1…N
D(i-1,j) + 1
D(i,j)= min D(i,j-1) + 1
D(i-1,j-1) + 2; if X(i) ≠ Y(j)
0; if X(i) = Y(j)
Termination:
D(N,M) is distance
The Edit Distance Table
N 9
O 8
I 7
T 6
N 5
E 4
T 3
N 2
I 1
# 0 1 2 3 4 5 6 7 8 9
# E X E C U T I O N
The Edit Distance Table
N 9
O 8
I 7
T 6
N 5
E 4
T 3
N 2
I 1
# 0 1 2 3 4 5 6 7 8 9
# E X E C U T I O N
Edit Distance
N 9
O 8
I 7
T 6
N 5
E 4
T 3
N 2
I 1
# 0 1 2 3 4 5 6 7 8 9
# E X E C U T I O N
The Edit Distance Table
Computing alignments
Edit distance isn’t sufficient
We often need to align each character of the two
strings to each other
We do this by keeping a “backtrace”
Every time we enter a cell, remember where we
came from
When we reach the end,
Trace back the path from the upper right corner to
read off the alignment
Edit Distance
N 9
O 8
I 7
T 6
N 5
E 4
T 3
N 2
I 1
# 0 1 2 3 4 5 6 7 8 9
# E X E C U T I O N
MinEdit with Backtrace
Adding Backtrace to Minimum Edit Distance
Base conditions: Termination:
D(i,0) = i D(0,j) = j D(N,M) is distance
Recurrence Relation:
For each i = 1…M
For each j = 1…N
D(i-1,j) + 1deletion
D(i,j)= min D(i,j-1) + 1 insertion
D(i-1,j-1) + 2; if X(i) ≠ Y(j) substitution
0; if X(i) = Y(j)
LEFT
insertion
ptr(i,j)= DOWN
deletion
DIAG
substitution
Performance
Time:
O(nm)
Space:
O(nm)
Can be improved to O(n+m) using Hirschberg’s
algorithm
Backtrace:
O(n+m)
Types of Edit Distance Metrics
The Levenshtein distance allows deletion,
insertion and substitution.
The Longest common subsequence (LCS)
distance allows only insertion and deletion, not
substitution.
The Hamming distance allows only substitution;
hence, it only applies to strings of the same length.
The Damerau–Levenshtein distance allows
insertion, deletion, substitution, and the
transposition of two adjacent characters.
The Jaro distance allows only transposition.
Common Algorithm
This doesn’t include transposition – how could we add this operation?
From [Link]