CSC 301 – Design and Analysis of Algorithms
Dynamic Programming – Edit Distance
1
What is Edit Distance?
Given: Two strings 𝑥 and 𝑦.
Goal: Transform 𝑥 into 𝑦 using edit operations (insertions, deletions, substitutions).
Example: 𝑥 = 𝑘𝑖𝑡𝑡𝑒𝑛 and 𝑦 = 𝑠𝑖𝑡𝑡𝑖𝑛𝑔
1st step: kitten →sitten (substitution)
k i t t e n -
2nd step: sitten→sittin (substitution)
s i t t i n g
3rd step: sittin→sitting (insertion) S S I
Can we do better? Edit Distance = 3
Answer here is no (obviously)
What about: 𝑥 = darladidirladada
𝑦 = marmelladara 2
Tough…
Why do we care?
A lot of applications depend on the similarity of two strings.
• Computational Biology:
• Align two sequences of nucleotides
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC Animal species from the same family
• Resulting alignment: are bound to have more similar DNAs
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
• Spelling Correction
• The user typed “Graffe”
• Which is the closest?
• graf
• graft
• grail
3
• Giraffe
• Also for Machine Translation, Information Extraction, Speech Recognition
Edit Distance
Levenshtein (1965) introduced edit distance between two strings as the minimum
number of elementary operations (insertions, deletions, and substitutions) to
transform one string into the other. Matches are not included in the count.
Example: transform vintner to writers
vintner replace v with w → wintner If each operation has cost of 1
wintner insert r after w → wrintner Edit Distance = 5
wrintner match i → wrintner
wrintner delete n → writner If substitutions cost 2 (Levenshtein)
writner match t → writner Edit Distance = 6
writner delete n → writer
writer match e → writer
4
writer match r → writer
writer insert s → writers
Minimum Edit Distance
But the space of all edit sequences is huge!
• We can’t afford to navigate naïvely
• Lots of distinct path wind up at the same state.
• We don’t have to keep track of all of them
• Dynamic programming: A bottom-up tabular computation of D(n,m)
• We compute D(i,j) for small i,j
• And compute larger D(i,j) based on previously computed smaller values
• i.e., compute D(i,j) for all i (0 < i < n) and j (0 < j < m)
• use a traceback algorithm to find the optimal alignment
Notice that we can solve D(i,j) for all combination of lengths of prefixes of X and Y.
5
Edit Distance: Dynamic Programming Algorithm
EditDistance(x, y)
for i from 0 to n
d[i][0] = i;
for j from 0 to m
d[0][j] = j;
for i from 1 to n
for j from 1 to m
if x[i-1] == y[j-1]
cost = 0
else
cost = 1
d[i][j] = min( d[i-1][j] + 1, // deletion
d[i][j-1] + 1, // insertion
d[i-1][j-1] + cost) // substitution
return d[n][m] 6
Edit Distance: Example
• Given two strings: EXECUTION and INTENTION, determine the minimum edit
distance using dynamic programming.
# E X E C U T I O N
# 0 1 2 3 4 5 6 7 8 9
I 1
N 2
T 3
d[i][j] = min( d[i-1][j] + 1, // deletion
E 4 d[i][j-1] + 1, // insertion
d[i-1][j-1] + cost) // substitution
N 5
T 6
I 7
O 8
N 9 7
Edit Distance: Example
• Given two strings: EXECUTION and INTENTION, determine the minimum edit
distance using dynamic programming.
# E X E C U T I O N
# 0 1 2 3 4 5 6 7 8 9
I 1
N 2
T 3
E 4
N 5
T 6
I 7
O 8
N 9 8
Edit Distance: Example
• Given two strings: EXECUTION and INTENTION, determine the minimum edit
distance using dynamic programming.
# E X E C U T I O N
# 0 1 2 3 4 5 6 7 8 9
I 1 1 2 3 4 5 6 6 7 8
N 2 2 2 3 4 5 6 7 7 7
T 3 3 3 3 4 5 5 6 7 8
E 4 3 4 3 4 5 6 6 7 8
N 5 4 4 4 4 5 6 7 7 7
T 6 5 5 5 5 5 5 6 7 8
I 7 6 6 6 6 6 6 5 6 7
O 8 7 7 7 7 7 7 6 5 6
N 9 8 8 8 8 8 8 7 6 5 9
Computing Alignments
• Edit distance isn’t sufficient
• We often need to align each character of the two strings to each other
• We do this by keeping a “backtrace”
• Every time we enter a cell, remember where we came from
• When we reach the end,
• Trace back the path from the upper right corner to read off the alignment
10
Edit Distance: Example
• Given two strings: EXECUTION and INTENTION, determine the minimum edit
distance using dynamic programming.
# E X E C U T I O N
# 0 1 2 3 4 5 6 7 8 9
I 1 1 2 3 4 5 6 6 7 8
N 2 2 2 3 4 5 6 7 7 7
T 3 3 3 3 4 5 5 6 7 8
E 4 3 4 3 4 5 6 6 7 8
N 5 4 4 4 4 5 6 7 7 7
T 6 5 5 5 5 5 5 6 7 8
I 7 6 6 6 6 6 6 5 6 7
O 8 7 7 7 7 7 7 6 5 6
N 9 8 8 8 8 8 8 7 6 5 11
Edit Distance: Example
• Given two strings: EXECUTION and INTENTION, determine the minimum edit
distance using dynamic programming.
# E X E C U T I O N
# 0 1 2 3 4 5 6 7 8 9
I 1 1 2 3 4 5 6 6 7 8
N 2 2 2 3 4 5 6 7 7 7
T 3 3 3 3 4 5 5 6 7 8
E 4 3 4 3 4 5 6 6 7 8
N 5 4 4 4 4 5 6 7 7 7
T 6 5 5 5 5 5 5 6 7 8
I 7 6 6 6 6 6 6 5 6 7
O 8 7 7 7 7 7 7 6 5 6
N 9 8 8 8 8 8 8 7 6 5 12
Edit Distance: Example
- E X E C U T I O N
I N T E - N T I O N
I S S D S
# E X E C U T I O N
# 0 1 2 3 4 5 6 7 8 9
I 1 1 2 3 4 5 6 6 7 8
N 2 2 2 3 4 5 6 7 7 7
T 3 3 3 3 4 5 5 6 7 8
E 4 3 4 3 4 5 6 6 7 8
N 5 4 4 4 4 5 6 7 7 7
T 6 5 5 5 5 5 5 6 7 8
I 7 6 6 6 6 6 6 5 6 7
O 8 7 7 7 7 7 7 6 5 6
N 9 8 8 8 8 8 8 7 6 5 13
Edit Distance Performance
• Time:
O(nm)
• Space:
O(nm)
• Can be improved to O(n+m) using Hirschberg’s algorithm
• Backtrace:
O(n+m)
14
Some Interesting Properties
• Triangle Inequality: for any three strings x,y,z of arbitrary lengths:
ED(x,y) ≤ ED(x,z) + ED(z,y)
• Splitting Inequality: Let x, y be strings of lengths n and m respectively. For all i,j:
ED(x,y) ≤ ED(x[1..i],y[1..j])+ED(x[i+1..n],y[j+1..m])
• Upper and Lower Bounds: Let x, y be strings of lengths n, m (n ≤ m). Then:
• ED(x,y) ≤ m
• ED(x,y) ≥ m-n
• ED(x,y)=0 iff x=y
• if m=n, ED(x,y) ≤ HD(x,y)
• ED(x,y) ≥ number of characters (not counting duplicates) found in x but not in y
15
Types of Edit Distance Metrics
• The Levenshtein distance allows deletion, insertion and substitution.
• The Longest common subsequence (LCS) distance allows only insertion and deletion,
not substitution.
• The Hamming distance allows only substitution; hence, it only applies to strings of
the same length.
• The Damerau–Levenshtein distance allows insertion, deletion, substitution, and the
transposition of two adjacent characters.
• The Jaro distance allows only transposition.
16
THANK YOU
17