String Matching?
String Matching consists of finding all of the occurrences of
given string in a text
Some of the Applications are
• Finding patterns in documents formed using a large
alphabet
– Word processing
– Web searching
– Desktop search (Google, MSN)
• Matching strings of bytes containing
– Graphical data
– Machine code
• grep in unix
– grep searches for lines matching a pattern.
Pattern Matching
• Given a text string T[0..n-1] and a pattern
P[0..m-1], find all occurrences of the pattern
within the text.
• Example: T = ababcabdabcaabc and P = abc, the
occurrences are:
– first occurrence starts at T[3]
– second occurrence starts at T[9]
– third occurrence starts at T[13]
Let Σ denotes the set of alphabet .
• Given:
A string of alphabets T[1..n] of size “n” and a pattern P[1..m] of
size “m”
where, m<n.
• To Find:
Whether the pattern P occurs in text T or not. If it does, then give
the first occurrence of P in T.
The alphabets of both T and P are drawn from finite set Σ.
ALGORITM USED FOR STRING MATCHING
• The NAIVE STRING-MATCHER ALGORITHM
• The KNUTH-MORRIS-PRATT ALGORITHM
• The RABIN-KARP ALGORITHM
• STRING MATCHING WITH FINITE AUTOMATA ALGORITHM
The NAIVE-STRING-MATCHER algorithm
The naive algorithm finds all valid shifts using a loop that checks
the condition P[1 . . m] = T[s + 1 . . s + m] for each of the n - m + 1
possible values of s.
NAIVE-STRING-MATCHER(T, P)
1n length[T]
2m length[P]
3 for s 0 to n – m
4 do if P[1 . . m] = T[s + 1 . . s + m]
5 then print "Pattern occurs with shift" s
NAÏVE APPROACH
T: a b c a b d a a b c d e
P: a b d
Example ( Step – 1 )
T: a b c a b d a a b c d e
P: a b d
Mismatch after 3 Comparisons
Example ( Step – 2 )
T: a b c a b d a a b c d e
a b d
P:
Mismatch after 1 Comparison
Example ( Step – 3 )
T: a b c a b d a a b c d e
a b d
P:
Mismatch after 1 Comparison
Example ( Step – 4 )
T: a b c a b d a a b c d e
a b d
P:
Match found after 8 Comparisons
Thus, after 8 comparisons the
substring P is found in T.
Naive (char* T,char*P)
{
int n=strlen(T);
int m=strlen(P);
int s;
for(s=0;s< n-m+1; s++)
{
int j;
for(j=0;j<m;j++)
If(T(s+j)!=P[j])
break;
If(j=m)
Printf(“%d”, “pattern is found at shift ”+s);
}
}
s
T:
a b c a b d a a b c d e
j Naive (char* T,char*P)
P: n= 12
m=3 {
a b d s=0, j=0
Int n=strlen(T);
T[0]=P[0]
j=1 Int m= strlen(P);
T[0+1]=P[1] Int s;
j=2
for(s=0;s<n-m+1;s++)
T[0+2]!=P[2]
s=1, j=0 {
T[1+0]!=P[0] int j;
s=2, J=0
for(j=0;j<m;j++)
T[2+0]!=P[0]
s=3, J=0 If(T(s+j)!=P[j])
T[3+0]=P[0] break;
J=1
If(j=m)
T[3+1]=P[1]
J=2 Printf(“%d”, “pattern is found at
T[3+2]=P[2] shift ”+s);
}
}
The Knuth-Morris-Pratt (KMP)Algorithm
Knuth-Morris and Pratt introduce a linear time algorithm for the
string matching problem. A matching time of O (n) is achieved
by avoiding comparison with an element of 'S' that have
previously been involved in comparison with some element of
the pattern ‘P' to be matched. i.e., backtracking on the string 'S'
never occurs
Components of KMP Algorithm:
1. The Prefix Function (Π): The Prefix Function, Π for a pattern encapsulates
knowledge about how the pattern matches against the shift of itself. This
information can be used to avoid a useless shift of the pattern ‘P.' In other words,
this enables avoiding backtracking of the string 'S.‘
2. The KMP Matcher: With string 'S,' pattern ‘P' and prefix function 'Π' as
inputs, find the occurrence of ‘P' in 'S' and returns the number of shifts of ‘P'
after which occurrences are found.
Π table generates longest prefix that same as suffix
P a b c d a b c a b f
Π 0 0 0 0 1 2 3 1 2 0
P a b c d e a b f a b c
Π 0 0 0 0 0 1 2 0 1 2 3
P a a a a b a a c d
Π 0 1 2 3 0 1 2 0 0
P a b a b a b a b c a
Π 0 0 1 2 3 4 5 6 0 1
The Prefix Function (Π)
Following pseudo code compute the prefix function,
Π:
COMPUTE- PREFIX- FUNCTION (P)
1. m ←length [P] //'p' pattern to be matched
2. Π [1] ← 0
3. k ← 0
4. for q ← 2 to m
5. do while k > 0 and P [k + 1] ≠ P [q]
6. do k ← Π [k]
7. If P [k + 1] = P [q]
8. then k← k + 1
9. Π [q] ← k
10. Return Π
Running Time Analysis:
In the above pseudo code for calculating the prefix function, the for loop
from step 4 to step 10 runs 'm' times. Step1 to Step3 take constant time.
Hence the running time of computing prefix function is O (m).
Example: Compute Π for the pattern ‘P' below:
Solution: Initially: m = length [P] = 7 Π [1] = 0 k=0
1. m ←length [P] //'p' pattern
to be matched
2. Π [1] ← 0
3. k ← 0
4. for q ← 2 to m
5. do while k > 0 and
P [k + 1] ≠ P [q]
6. do k ← Π [k]
7. If P [k + 1] = P [q]
8. then k← k + 1
9. Π [q] ← k
10. Return Π
1. m ←length [P] //'p' pattern to be
matched
2. Π [1] ← 0
3. k ← 0
4. for q ← 2 to m
5. do while k > 0 and
P [k + 1] ≠ P [q]
6. do k ← Π [k]
7. If P [k + 1] = P [q]
8. then k← k + 1
9. Π [q] ← k
10. Return Π
After iteration 6 times, the prefix function computation is complete:
The KMP Matcher with the pattern 'p,' the string 'S' and prefix
function 'Π' as input, finds a match of p in S. Following pseudo
code compute the matching component of KMP algorithm:
KMP-MATCHER (T, P)
1. n ← length [T]
2. m ← length [P]
3. Π← COMPUTE-PREFIX-FUNCTION (P)
4. q ← 0 // numbers of characters matched
5. for i ← 1 to n // scan S from left to right
6. do while q > 0 and P [q + 1] ≠ T [i]
7. do q ← Π [q] // next character does not match
8. If P [q + 1] = T [i]
9. then q ← q + 1 // next character matches
[Link] q = m // is all of p matched?
11. then print "Pattern occurs with shift" i – m+1
12. q ← Π [q] // look for the next match
Running Time Analysis:
The for loop beginning in step 5 runs 'n' times, i.e., as long as the
length of the string 'S.' Since step 1 to step 4 take constant times,
the running time is dominated by this for the loop. Thus running
time of the matching function is O (n).
Example: Given a string 'T' and pattern 'P' as follows:
i
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
a b a b c a b c a b a b a b d
j
0 1 2 3 4 5
1. Take two variables i and j
2. Initialize i=1 and j=0
a b a b d
3. compare i and j+1
Π 0 0 1 2 0 If match, move i and j both to
right
else initialize j to its Π value.
If i and j+1 mismatch and j=0
then increase i by 1.
Let us execute the KMP Algorithm to find whether 'P' occurs in 'T.'
For ‘P' the prefix function, ? was computed previously and is as
follows:
Solution:
Initially: n = size of T = 15 m = size of P = 7
1. Take two variables i and q
2. Initialize i=1 and q=0
3. compare i and q+1
If match, move i and q both to right
else initialize q to its Π value.
If i and q+1 mismatch and q=0 then
increase i by 1.
4. q ← 0 // numbers of characters matched
5. for i ← 1 to n // scan S from left to right
6. do while q > 0 and P [q + 1] ≠ T [i]
7. do q ← Π [q] // next character does
not match
8. If P [q + 1] = T [i]
9. then q ← q + 1 // next character
matches
10. If q = m // is all of p matched?
11. then print "Pattern occurs with shift“
i-m
12. q ← Π [q] // look for the next match
Rabin-Karp Algorithm
Rabin-Karp Algorithm is a string searching algorithm created
by Richard M. Karp and Michael O. Rabin that uses hashing to
find any one of a set of pattern strings in a text.
In Rabin-Karp algorithm, we'll generate a hash of
our pattern that we are looking for & check if the rolling hash
of our text matches the pattern or not.
If it doesn't match, we can guarantee that the pattern doesn't
exist in the text. However, if it does match, the pattern can be
present in the text.
Complexity:
The running time of RABIN-KARP-MATCHER in the worst case scenario O ((n-
m+1) m but it has a good average case running time. If the expected number of
strong shifts is small O (1) and prime q is chosen to be quite large, then the
Rabin-Karp algorithm can be expected to run in time O (n+m) plus the time to
require to process spurious hits.
Rabin-karp(T,P)
n=length(T)
m=length(P)
H(P) =hash[P[1...m]] i.e. Pmod q
H(T) =hash[T[1...m]] i.e. Tmod q
For s=0 to n-m
if (h(P)=h(T))
if(P[0.......m-1] = T[s+0,s+1,.....S+m-1])
print “Pattern found with shipt”+s
if (s<n-m)
h(T)=hash(T(s+1,..........s+m))
String Matching with Finite Automata
This string matching algorithm is very efficient to
examine pattern by examine each text character
exactly once. Hence the time complexity is O(n).
A finite automaton is a quintuple (Q, , , s, F):
Q: the finite set of states
: the finite input alphabet
: the “transition function” from Qx to Q
s Q: the start state
F Q: the set of final (accepting) states
How it works
A finite automaton accepts
strings in a specific language. It
begins in state q0 and reads
characters one at a time from
the input string. It makes
transitions based on these
characters, and if when it
reaches the end of the tape it
is in one of the accept states,
that string is accepted by the
language.
The Suffix Function
In order to properly
search for the string, the P = abcdabc
Prefixes are:
program must define a a, ab, abc, abcd, abcda, abcdab
suffix function () which
Suffixes are:
checks to see how much c,bc,abc, dabc, cdabc, bcdabc
of what it is reading
Suffix function finds the longest
matches the search substring in the pattern where prefix is
equal to suffix. Here abc
string at any given
moment.
String-Matching Automata
• For any pattern P of length m, we can define its
string matching automata:
Q = {0,…,m} (states)
q0 = 0 (start state)
F = {m} (accepting state)
(q,a) = (Pqa)
The transition function chooses the next state to
maintain the invariant:
(Ti) = (Ti)
After scanning in i characters, the state number is the
longest prefix of P that is also a suffix of Ti.
Finite-Automaton-Matcher
The simple loop
structure implies a
running time for a
string of length n is
O(n).
However: this is only
the running time for
the actual string
matching. It does not
include the time it
takes to compute the
transition function.
b,c
b,c