0% found this document useful (0 votes)
21 views23 pages

Understanding Edit Distance Metrics

The document discusses the concept of edit distance, which measures the similarity between two strings by calculating the minimum number of editing operations (insertion, deletion, substitution) needed to transform one string into another. It highlights applications in spell correction, computational biology, machine translation, and speech recognition, and explains how dynamic programming can be used to efficiently compute edit distances. Additionally, it introduces various types of edit distance metrics, including Levenshtein, Hamming, and Damerau-Levenshtein distances.

Uploaded by

gehaco7189
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
21 views23 pages

Understanding Edit Distance Metrics

The document discusses the concept of edit distance, which measures the similarity between two strings by calculating the minimum number of editing operations (insertion, deletion, substitution) needed to transform one string into another. It highlights applications in spell correction, computational biology, machine translation, and speech recognition, and explains how dynamic programming can be used to efficiently compute edit distances. Additionally, it introduces various types of edit distance metrics, including Levenshtein, Hamming, and Damerau-Levenshtein distances.

Uploaded by

gehaco7189
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd

Edit Distance

Most slides adapted from slides by Dan Jurafsky, Stanford University


How similar are two strings?

 Spell correction
The user typed • Computational Biology
“graffe” • Align two sequences of nucleotides
Which is closest? AGGCTATCACCTGACCTCCAGGCCGATGCCC
 graf TAGCTATCACGACCGCGGTCGATTTGCCCGAC
 graft • Resulting alignment:
 grail -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
 giraffe TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

• Also for Machine Translation, Information Extraction, Speech


Recognition
Edit Distance
 The minimum edit distance between two strings
 Is the minimum number of editing operations
 Insertion
 Deletion
 Substitution

 Needed to transform one into the other


Minimum Edit Distance
 Two strings and their alignment:
Minimum Edit Distance

 If each operation has cost of 1


 Distance between these is 5
 If substitutions cost 2 (Levenshtein)
 Distance between them is 8
Alignment in Computational Biology

 Given a sequence of bases


AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC
 An alignment:
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

 Given two sequences, align each letter to a letter or


gap
Uses of Edit Distance in NLP

 Evaluating Machine Translation and speech


recognition
R Spokesman confirms senior government adviser was appointed
H Spokesman said the senior adviser was appointed
S I D I
How to find the Min Edit Distance?
 Searching for a path (sequence of edits) from the
start string to the final string:
 Initial state: the word we’re transforming
 Operators: insert, delete, substitute
 Goal state: the word we’re trying to get to
 Path cost: what we want to minimize: the number of
edits
Minimum Edit Distance as Search
 But the space of all edit sequences is huge!
 We can’t afford to navigate naïvely
 Lots of distinct paths wind up at the same state.
 We don’t have to keep track of all of them

 Instead, we use dynamic programming, in a


way very similar to the LCS problem…
Defining Edit Distance Parameters
 For two strings
 X of length n
 Y of length m

 We define D(i,j)
 the edit distance between X[1..i] and Y[1..j]
 i.e., the first i characters of X and the first j characters of Y
 The edit distance between X and Y is thus D(n,m)
Dynamic Programming for Edit Distance
 Dynamic programming: A tabular computation of
D(n,m)
 Solving problems by combining solutions to
subproblems.
 Bottom-up
 We compute D(i,j) for small i,j
 And compute larger D(i,j) based on previously
computed smaller values
 i.e., compute D(i,j) for all i (0 < i < n) and j (0 < j < m)
Defining Min Edit Distance
(Levenshtein)
 Initialization
D(i,0) = i
D(0,j) = j
 Recurrence Relation:
For each i = 1…M
For each j = 1…N
D(i-1,j) + 1
D(i,j)= min D(i,j-1) + 1
D(i-1,j-1) + 2; if X(i) ≠ Y(j)

0; if X(i) = Y(j)
 Termination:
D(N,M) is distance
The Edit Distance Table

N 9
O 8
I 7

T 6
N 5
E 4
T 3
N 2
I 1
# 0 1 2 3 4 5 6 7 8 9
# E X E C U T I O N
The Edit Distance Table
N 9
O 8
I 7
T 6
N 5
E 4
T 3
N 2
I 1
# 0 1 2 3 4 5 6 7 8 9
# E X E C U T I O N
Edit Distance

N 9
O 8
I 7

T 6
N 5
E 4
T 3
N 2
I 1
# 0 1 2 3 4 5 6 7 8 9
# E X E C U T I O N
The Edit Distance Table
Computing alignments
 Edit distance isn’t sufficient
 We often need to align each character of the two
strings to each other
 We do this by keeping a “backtrace”
 Every time we enter a cell, remember where we
came from
 When we reach the end,
 Trace back the path from the upper right corner to
read off the alignment
Edit Distance

N 9
O 8
I 7

T 6
N 5
E 4
T 3
N 2
I 1
# 0 1 2 3 4 5 6 7 8 9
# E X E C U T I O N
MinEdit with Backtrace
Adding Backtrace to Minimum Edit Distance

 Base conditions: Termination:


D(i,0) = i D(0,j) = j D(N,M) is distance
 Recurrence Relation:
For each i = 1…M
For each j = 1…N
D(i-1,j) + 1deletion
D(i,j)= min D(i,j-1) + 1 insertion
D(i-1,j-1) + 2; if X(i) ≠ Y(j) substitution
0; if X(i) = Y(j)
LEFT
insertion
ptr(i,j)= DOWN
deletion
DIAG
substitution
Performance
 Time:

O(nm)
 Space:

O(nm)
 Can be improved to O(n+m) using Hirschberg’s
algorithm

 Backtrace:

O(n+m)
Types of Edit Distance Metrics
 The Levenshtein distance allows deletion,
insertion and substitution.
 The Longest common subsequence (LCS)
distance allows only insertion and deletion, not
substitution.
 The Hamming distance allows only substitution;
hence, it only applies to strings of the same length.
 The Damerau–Levenshtein distance allows
insertion, deletion, substitution, and the
transposition of two adjacent characters.
 The Jaro distance allows only transposition.
Common Algorithm

This doesn’t include transposition – how could we add this operation?

From [Link]

You might also like