Edit Distance
Edit Distance or Levenshtein Distance is minimum number
of single character edits (insertion, deletion, or substitution)
required to transform from one string to another string
• Edit distance is very useful in bioinformatics
• It is used to compare DNA, RNA – to identify genetic similarity
F O C S
S T O P S
F S O C S
Substitution
T O P S
F S O C S Substitution
T O P S
F S T O C S Substitution
Insertion
O P S
F S T O C S Substitution
Insertion
O P S
F S T O C S Substitution
Insertion
Match
P S
F S T O C S Substitution
Insertion
Match
P S
F S T O C P S Substitution
Insertion
Substitution
Match
S
F S T O C P S Substitution
Insertion
Match
Match
Substitution
F S T O C P S Substitution
Insertion
Edit Distance or Levenshtein Distance is minimum number of single character edits Match
(insertion, deletion, or substitution) required to transform from one string to
another string
Substitution
Match
F S T O C P S
Substitution
Insertion
Edit Distance or Levenshtein Distance is minimum number of single character edits
(insertion, deletion, or substitution) required to transform from one string to
another string
Substitution
Edit Distance = 3
How would you solve the Edit Distance
Problem?
Let us assume that we want to find the edit distance between
A[1…n] and B[1…m]
• Let EDIT(i,j) be the edit distance between A[1..i] and B[1…j]
• We want to find EDIT(n,m)
Look at the last step of edit distance in our example.
F O C S
S T O P S
F O C S
S T O P
F O C S
S T O P
Match
EDIT(𝑛 − 1, 𝑚 − 1)
if the last step is a match:
EDIT(𝑛, 𝑚) =
F O C S
S T O P
Match
if the last step is a match:
EDIT(𝑛, 𝑚) = EDIT(𝑛 − 1, 𝑚 − 1)
F O C D
S T O P S
F O C D S
S T O P
F O C D S
S T O P
EDIT(𝑛 − 1, 𝑚 − 1) Substitution
if the last step is a substitution:
EDIT(𝑛, 𝑚) =
F O C D S
S T O P
if the last step is a substitution:
EDIT(𝑛, 𝑚) Substitution
= 1+EDIT(𝑛 − 1, 𝑚 − 1)
F O C P S
S T O P
F O C P S
S T O P
EDIT(𝑛 − 1, 𝑚) Deletion
if the last step is a deletion:
EDIT(𝑛, 𝑚) =
F O C P S
S T O P
Deletion
if the last step is a deletion:
EDIT(𝑛, 𝑚) = 1+EDIT(𝑛 − 1, 𝑚)
F O C P
S T O P S
F O C P S
S T O P
EDIT(𝑛, 𝑚 − 1) Insertion
if the last step is an insertion:
EDIT(𝑛, 𝑚) =
F O C P S
S T O P
Insertion
if the last step is an insertion :
EDIT(𝑛, 𝑚) = 1+EDIT(𝑛, 𝑚 − 1)
if the last step is a match:
EDIT(𝑛, 𝑚) = EDIT(𝑛 − 1,−𝑚1,−𝑚1)− 1)
EDIT(𝑛
if the last step is a substitution :
EDIT(𝑛, 𝑚) = 1+ 1+ EDIT(𝑛
EDIT(𝑛 −−1, 1,
𝑚𝑚 −−
1)1)
if the last step is a deletion:
EDIT(𝑛, 𝑚) = 1+EDIT(𝑛
1+ EDIT(𝑛−−1,1,𝑚)
𝑚)
if the last step is an insertion:
EDIT(𝑛, 𝑚) = 1+ 1+EDIT(𝑛,
EDIT(𝑛,𝑚𝑚−−1)
1)
But we donot know what is the last step
• But it is one of the above four possibilities
• Try all possibilities
EDIT(𝑛 − 1, 𝑚 − 1) if 𝐴 𝑛 = 𝐴[𝑚]
1+ EDIT(𝑛 − 1, 𝑚 − 1) if 𝐴 𝑛 ≠ 𝐴[𝑚]
EDIT(𝑛, 𝑚) min
1+ EDIT(𝑛 − 1, 𝑚)
1+ EDIT(𝑛, 𝑚 − 1)
def EDIT(𝑛, 𝑚)
What is the base case?
EDIT(𝑛 − 1, 𝑚 − 1) if 𝐴 𝑛 = 𝐴[𝑚]
1+ EDIT(𝑛 − 1, 𝑚 − 1) if 𝐴 𝑛 ≠ 𝐴[𝑚]
return min
1+ EDIT(𝑛 − 1, 𝑚)
1+ EDIT(𝑛, 𝑚 − 1)
def EDIT(𝑛, 𝑚)
if 𝑛 = 0:
EDIT(𝑛 − 1, 𝑚 − 1) if 𝐴 𝑛 = 𝐴[𝑚]
1+ EDIT(𝑛 − 1, 𝑚 − 1) if 𝐴 𝑛 ≠ 𝐴[𝑚]
return min
1+ EDIT(𝑛 − 1, 𝑚)
1+ EDIT(𝑛, 𝑚 − 1)
def EDIT(𝑛, 𝑚)
if 𝑛 = 0:
return 𝑚
EDIT(𝑛 − 1, 𝑚 − 1) if 𝐴 𝑛 = 𝐴[𝑚]
1+ EDIT(𝑛 − 1, 𝑚 − 1) if 𝐴 𝑛 ≠ 𝐴[𝑚]
return min
1+ EDIT(𝑛 − 1, 𝑚)
1+ EDIT(𝑛, 𝑚 − 1)
What is the running time of
def EDIT(𝑛, 𝑚)
this algorithm?
if 𝑛 = 0:
return 𝑚
if 𝑚 = 0:
• Exponential.
return 𝑛 • For example, EDIT(2,3) may be called exponential
number of times.
EDIT(𝑛 − 1, 𝑚 − 1) if 𝐴 𝑛 = 𝐴[𝑚]
1+ EDIT(𝑛 − 1, 𝑚 − 1) if 𝐴 𝑛 ≠ 𝐴[𝑚]
return min
1+ EDIT(𝑛 − 1, 𝑚)
1+ EDIT(𝑛, 𝑚 − 1)
def EDIT(𝑛, 𝑚)
if 𝑛 = 0:
return 𝑚
• We will store EDIT(i,j) in edit[i,j]
if 𝑚 = 0: • Initially, edit[i,j] is NULL for all
return 𝑛
possible (i,j) pair
EDIT(𝑛 − 1, 𝑚 − 1) • for any (0,j), we can set edit[0,j]
:= j
1+ EDIT(𝑛 − 1, 𝑚 − 1)
• for any (i,0), we can set edit[i,0]
return min
1+ EDIT(𝑛 − 1, 𝑚)
:=i
• We now modify the algorithm
1+ EDIT(𝑛, 𝑚 − 1) using memoization
def EDIT(𝑛, 𝑚)
if edit[𝑛, 𝑚] is null
• We will store EDIT(i,j) in edit[i,j]
EDIT(𝑛 − 1, 𝑚 − 1)
• Initially, edit[i,j] is NULL for all
1+ EDIT(𝑛 − 1, 𝑚 − 1)
possible (i,j) pair
edit[𝑛, 𝑚] = min • for any (0,j), we can set edit[0,j]
1+ EDIT(𝑛 − 1, 𝑚) := j
• for any (i,0), we can set edit[i,0]
1+ EDIT(𝑛, 𝑚 − 1)
:=i
• We now modify the algorithm
using memoization
def EDIT(𝑛, 𝑚)
if edit[𝑛, 𝑚] is null
• We will store EDIT(i,j) in edit[i,j]
EDIT(𝑛 − 1, 𝑚 − 1)
• Initially, edit[i,j] is NULL for all
1+ EDIT(𝑛 − 1, 𝑚 − 1)
possible (i,j) pair
edit[𝑛, 𝑚] = min • for any (0,j), we can set edit[0,j]
1+ EDIT(𝑛 − 1, 𝑚) := j
• for any (i,0), we can set edit[i,0]
1+ EDIT(𝑛, 𝑚 − 1)
:=i
• We now modify the algorithm
return edit[𝑛, 𝑚]
using memoization
Home Work
• Show that the running time of the algorithm is O(𝑛𝑚)
• Prove the following lemma for correctness:
edit[n,m] correctly stores the edit distance between
A[1…n] and B[1…m] using induction
on n and m
• Remember that every dynamic programming algorithm can be
represented using an implicit DAG
• This DAG also gives us the execution order of the non-
recursive algorithm
• What is the DAG for this problem?
3,3
2,2 2,3
3,2 3,3
1,1 1,2 1,3
2,1 2,2 2,3
3,1 3,2 3,3
0,0 0,1 0,2 0,3
1,0 1,1 1,2 1,3
2,0 2,1 2,2 2,3
3,0 3,1 3,2 3,3
0,0 0,1 0,2 0,3 base case
1,0 1,1 1,2 1,3
2,0 2,1 2,2 2,3
3,0 3,1 3,2 3,3
0,0 0,1 0,2 0,3 base case
1,0 1,1 1,2 1,3
2,0 2,1 2,2 2,3
3,0 3,1 3,2 3,3
0,0 0,1 0,2 0,3 base case
1,0 1,1 1,2 1,3
2,0 2,1 2,2 2,3
3,0 3,1 3,2 3,3
base case
0,0 0,1 0,2 0,3 base case
1,0 1,1 1,2 1,3 Solve second row
2,0 2,1 2,2 2,3
3,0 3,1 3,2 3,3
base case
0,0 0,1 0,2 0,3 base case
1,0 1,1 1,2 1,3 Solve second row
2,0 2,1 2,2 2,3 Solve third row
3,0 3,1 3,2 3,3 Solve fourth row
base case
def EDIT(𝑛, 𝑚) def EditDistance(𝑛, 𝑚)
if 𝑛 = 0:
return 𝑚
if 𝑚 = 0:
return 𝑛
EDIT(𝑛 − 1, 𝑚 − 1)
1+ EDIT(𝑛 − 1, 𝑚 − 1)
return min
1+ EDIT(𝑛 − 1, 𝑚)
1+ EDIT(𝑛, 𝑚 − 1)
def EDIT(𝑛, 𝑚) def EditDistance(𝑛, 𝑚)
if 𝑛 = 0: for 𝑖 = 0 to 𝑚:
return 𝑚 e𝑑𝑖𝑡 0, 𝑖 = 𝑖
if 𝑚 = 0:
return 𝑛
EDIT(𝑛 − 1, 𝑚 − 1)
1+ EDIT(𝑛 − 1, 𝑚 − 1)
return min
1+ EDIT(𝑛 − 1, 𝑚)
1+ EDIT(𝑛, 𝑚 − 1)
def EDIT(𝑛, 𝑚) def EditDistance(𝑛, 𝑚)
if 𝑛 = 0: for 𝑖 = 0 to 𝑚:
return 𝑚 e𝑑𝑖𝑡 0, 𝑖 = 𝑖
if 𝑚 = 0: for 𝑖 = 0 to 𝑛:
return 𝑛 e𝑑𝑖𝑡 𝑖, 0 = 𝑖
for 𝑖 = 1 to 𝑛:
EDIT(𝑛 − 1, 𝑚 − 1) for 𝑗 = 1 to 𝑚:
e𝑑𝑖𝑡 𝑖, 𝑗 = ∞
1+ EDIT(𝑛 − 1, 𝑚 − 1)
return min
1+ EDIT(𝑛 − 1, 𝑚)
1+ EDIT(𝑛, 𝑚 − 1)
def EDIT(𝑛, 𝑚) def EditDistance(𝑛, 𝑚)
if 𝑛 = 0: for 𝑖 = 0 to 𝑚:
return 𝑚 e𝑑𝑖𝑡 0, 𝑖 = 𝑖
if 𝑚 = 0: for 𝑖 = 0 to 𝑛:
return 𝑛 e𝑑𝑖𝑡 𝑖, 0 = 𝑖
for 𝑖 = 1 to 𝑛:
EDIT(𝑛 − 1, 𝑚 − 1) for 𝑗 = 1 to 𝑚:
e𝑑𝑖𝑡 𝑖, 𝑗 = ∞
if 𝐴 𝑖 = 𝐵[𝑗]:
1+ EDIT(𝑛 − 1, 𝑚 − 1) e𝑑𝑖𝑡 𝑖, 𝑗 = e𝑑𝑖𝑡 𝑖 − 1, 𝑗 − 1
return min
1+ EDIT(𝑛 − 1, 𝑚)
1+ EDIT(𝑛, 𝑚 − 1)
def EDIT(𝑛, 𝑚) def EditDistance(𝑛, 𝑚)
if 𝑛 = 0: for 𝑖 = 0 to 𝑚:
return 𝑚 e𝑑𝑖𝑡 0, 𝑖 = 𝑖
if 𝑚 = 0: for 𝑖 = 0 to 𝑛:
return 𝑛 e𝑑𝑖𝑡 𝑖, 0 = 𝑖
for 𝑖 = 1 to 𝑛:
EDIT(𝑛 − 1, 𝑚 − 1) for 𝑗 = 1 to 𝑚:
e𝑑𝑖𝑡 𝑖, 𝑗 = ∞
if 𝐴 𝑖 = 𝐵[𝑗]:
1+ EDIT(𝑛 − 1, 𝑚 − 1) e𝑑𝑖𝑡 𝑖, 𝑗 = e𝑑𝑖𝑡(𝑖 − 1, 𝑗 − 1)
if 𝐴 𝑖 ≠ 𝐵[𝑗]:
return min
e𝑑𝑖𝑡 𝑖, 𝑗 = e𝑑𝑖𝑡(𝑖 − 1, 𝑗 − 1) + 1
1+ EDIT(𝑛 − 1, 𝑚)
1+ EDIT(𝑛, 𝑚 − 1)
def EDIT(𝑛, 𝑚) def EditDistance(𝑛, 𝑚)
if 𝑛 = 0: for 𝑖 = 0 to 𝑚:
return 𝑚 e𝑑𝑖𝑡 0, 𝑖 = 𝑖
if 𝑚 = 0: for 𝑖 = 0 to 𝑛:
return 𝑛 e𝑑𝑖𝑡 𝑖, 0 = 𝑖
for 𝑖 = 1 to 𝑛:
EDIT(𝑛 − 1, 𝑚 − 1) for 𝑗 = 1 to 𝑚:
e𝑑𝑖𝑡 𝑖, 𝑗 = ∞
if 𝐴 𝑖 = 𝐵[𝑗]:
1+ EDIT(𝑛 − 1, 𝑚 − 1) e𝑑𝑖𝑡 𝑖, 𝑗 = e𝑑𝑖𝑡(𝑖 − 1, 𝑗 − 1)
if 𝐴 𝑖 ≠ 𝐵[𝑗]:
return min
e𝑑𝑖𝑡 𝑖, 𝑗 = e𝑑𝑖𝑡(𝑖 − 1, 𝑗 − 1) + 1
1+ EDIT(𝑛 − 1, 𝑚)
e𝑑𝑖𝑡 𝑖, 𝑗 = min{e𝑑𝑖𝑡 𝑖, 𝑗 , e𝑑𝑖𝑡(𝑖 − 1, 𝑗) + 1}
1+ EDIT(𝑛, 𝑚 − 1) e𝑑𝑖𝑡 𝑖, 𝑗 = min{e𝑑𝑖𝑡 𝑖, 𝑗 , e𝑑𝑖𝑡(𝑖, 𝑗 − 1) + 1}
def EditDistance(𝑛, 𝑚)
for 𝑖 = 0 to 𝑚:
e𝑑𝑖𝑡 0, 𝑖 = 𝑖
for 𝑖 = 0 to 𝑛:
e𝑑𝑖𝑡 𝑖, 0 = 𝑖
for 𝑖 = 1 to 𝑛:
for 𝑗 = 1 to 𝑚:
• It is easy to see that the running
e𝑑𝑖𝑡 𝑖, 𝑗 = ∞
if 𝐴 𝑖 = 𝐵[𝑗]:
time of this algorithm is O(nm)
e𝑑𝑖𝑡 𝑖, 𝑗 = e𝑑𝑖𝑡(𝑖 − 1, 𝑗 − 1) • Prove its correctness too
if 𝐴 𝑖 ≠ 𝐵[𝑗]:
e𝑑𝑖𝑡 𝑖, 𝑗 = e𝑑𝑖𝑡(𝑖 − 1, 𝑗 − 1) + 1
e𝑑𝑖𝑡 𝑖, 𝑗 = min{e𝑑𝑖𝑡 𝑖, 𝑗 , e𝑑𝑖𝑡(𝑖 − 1, 𝑗) + 1}
e𝑑𝑖𝑡 𝑖, 𝑗 = min{e𝑑𝑖𝑡 𝑖, 𝑗 , e𝑑𝑖𝑡(𝑖, 𝑗 − 1) + 1}
Some Recent Progress in the edit distance problem
• Backurs and Indyk (STOC 2015) showed that Edit Distance
2
cannot be solved in o(𝑛 ) time (when |A| = |B| = n) if some
widely believed conjectures are true.
• The paper can be read by even an undergraduate. If you are
motivated to do research, you can read it here.
• Edit distance in streaming and dynamic setting are being
studied recently.