0% found this document useful (0 votes)
25 views38 pages

String Matching Algorithms Explained

String matching involves finding occurrences of a pattern within a text, with applications in various fields like web searching and document processing. Several algorithms are used for this purpose, including the Naive String Matcher, Knuth-Morris-Pratt, Rabin-Karp, and string matching with finite automata, each with different efficiencies and methods of operation. The document provides detailed explanations of these algorithms, including their implementations and complexities.

Uploaded by

shreyathakur4637
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
25 views38 pages

String Matching Algorithms Explained

String matching involves finding occurrences of a pattern within a text, with applications in various fields like web searching and document processing. Several algorithms are used for this purpose, including the Naive String Matcher, Knuth-Morris-Pratt, Rabin-Karp, and string matching with finite automata, each with different efficiencies and methods of operation. The document provides detailed explanations of these algorithms, including their implementations and complexities.

Uploaded by

shreyathakur4637
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

String Matching?

String Matching consists of finding all of the occurrences of


given string in a text
Some of the Applications are
• Finding patterns in documents formed using a large
alphabet
– Word processing
– Web searching
– Desktop search (Google, MSN)
• Matching strings of bytes containing
– Graphical data
– Machine code
• grep in unix
– grep searches for lines matching a pattern.
Pattern Matching
• Given a text string T[0..n-1] and a pattern
P[0..m-1], find all occurrences of the pattern
within the text.

• Example: T = ababcabdabcaabc and P = abc, the


occurrences are:
– first occurrence starts at T[3]
– second occurrence starts at T[9]
– third occurrence starts at T[13]
Let Σ denotes the set of alphabet .

• Given:

A string of alphabets T[1..n] of size “n” and a pattern P[1..m] of


size “m”
where, m<n.

• To Find:

Whether the pattern P occurs in text T or not. If it does, then give


the first occurrence of P in T.

The alphabets of both T and P are drawn from finite set Σ.


ALGORITM USED FOR STRING MATCHING

• The NAIVE STRING-MATCHER ALGORITHM


• The KNUTH-MORRIS-PRATT ALGORITHM
• The RABIN-KARP ALGORITHM
• STRING MATCHING WITH FINITE AUTOMATA ALGORITHM
The NAIVE-STRING-MATCHER algorithm
The naive algorithm finds all valid shifts using a loop that checks
the condition P[1 . . m] = T[s + 1 . . s + m] for each of the n - m + 1
possible values of s.

NAIVE-STRING-MATCHER(T, P)

1n length[T]

2m length[P]

3 for s 0 to n – m

4 do if P[1 . . m] = T[s + 1 . . s + m]

5 then print "Pattern occurs with shift" s


NAÏVE APPROACH

T: a b c a b d a a b c d e

P: a b d
Example ( Step – 1 )

T: a b c a b d a a b c d e

P: a b d

Mismatch after 3 Comparisons


Example ( Step – 2 )

T: a b c a b d a a b c d e

a b d
P:

Mismatch after 1 Comparison


Example ( Step – 3 )

T: a b c a b d a a b c d e

a b d
P:

Mismatch after 1 Comparison


Example ( Step – 4 )

T: a b c a b d a a b c d e

a b d
P:

Match found after 8 Comparisons

Thus, after 8 comparisons the


substring P is found in T.
Naive (char* T,char*P)
{
int n=strlen(T);
int m=strlen(P);
int s;
for(s=0;s< n-m+1; s++)
{
int j;
for(j=0;j<m;j++)
If(T(s+j)!=P[j])
break;
If(j=m)
Printf(“%d”, “pattern is found at shift ”+s);
}
}
s
T:
a b c a b d a a b c d e
j Naive (char* T,char*P)
P: n= 12
m=3 {
a b d s=0, j=0
Int n=strlen(T);
T[0]=P[0]
j=1 Int m= strlen(P);
T[0+1]=P[1] Int s;
j=2
for(s=0;s<n-m+1;s++)
T[0+2]!=P[2]
s=1, j=0 {
T[1+0]!=P[0] int j;
s=2, J=0
for(j=0;j<m;j++)
T[2+0]!=P[0]
s=3, J=0 If(T(s+j)!=P[j])
T[3+0]=P[0] break;
J=1
If(j=m)
T[3+1]=P[1]
J=2 Printf(“%d”, “pattern is found at
T[3+2]=P[2] shift ”+s);
}
}
The Knuth-Morris-Pratt (KMP)Algorithm
Knuth-Morris and Pratt introduce a linear time algorithm for the
string matching problem. A matching time of O (n) is achieved
by avoiding comparison with an element of 'S' that have
previously been involved in comparison with some element of
the pattern ‘P' to be matched. i.e., backtracking on the string 'S'
never occurs
Components of KMP Algorithm:

1. The Prefix Function (Π): The Prefix Function, Π for a pattern encapsulates
knowledge about how the pattern matches against the shift of itself. This
information can be used to avoid a useless shift of the pattern ‘P.' In other words,
this enables avoiding backtracking of the string 'S.‘

2. The KMP Matcher: With string 'S,' pattern ‘P' and prefix function 'Π' as
inputs, find the occurrence of ‘P' in 'S' and returns the number of shifts of ‘P'
after which occurrences are found.
Π table generates longest prefix that same as suffix
P a b c d a b c a b f
Π 0 0 0 0 1 2 3 1 2 0

P a b c d e a b f a b c
Π 0 0 0 0 0 1 2 0 1 2 3

P a a a a b a a c d
Π 0 1 2 3 0 1 2 0 0

P a b a b a b a b c a
Π 0 0 1 2 3 4 5 6 0 1
The Prefix Function (Π)
Following pseudo code compute the prefix function,
Π:

COMPUTE- PREFIX- FUNCTION (P)


1. m ←length [P] //'p' pattern to be matched
2. Π [1] ← 0
3. k ← 0
4. for q ← 2 to m
5. do while k > 0 and P [k + 1] ≠ P [q]
6. do k ← Π [k]
7. If P [k + 1] = P [q]
8. then k← k + 1
9. Π [q] ← k
10. Return Π
Running Time Analysis:

In the above pseudo code for calculating the prefix function, the for loop
from step 4 to step 10 runs 'm' times. Step1 to Step3 take constant time.
Hence the running time of computing prefix function is O (m).
Example: Compute Π for the pattern ‘P' below:

Solution: Initially: m = length [P] = 7 Π [1] = 0 k=0

1. m ←length [P] //'p' pattern


to be matched
2. Π [1] ← 0
3. k ← 0
4. for q ← 2 to m
5. do while k > 0 and
P [k + 1] ≠ P [q]
6. do k ← Π [k]
7. If P [k + 1] = P [q]
8. then k← k + 1
9. Π [q] ← k
10. Return Π
1. m ←length [P] //'p' pattern to be
matched
2. Π [1] ← 0
3. k ← 0
4. for q ← 2 to m
5. do while k > 0 and
P [k + 1] ≠ P [q]
6. do k ← Π [k]
7. If P [k + 1] = P [q]
8. then k← k + 1
9. Π [q] ← k
10. Return Π

After iteration 6 times, the prefix function computation is complete:


The KMP Matcher with the pattern 'p,' the string 'S' and prefix
function 'Π' as input, finds a match of p in S. Following pseudo
code compute the matching component of KMP algorithm:

KMP-MATCHER (T, P)
1. n ← length [T]
2. m ← length [P]
3. Π← COMPUTE-PREFIX-FUNCTION (P)
4. q ← 0 // numbers of characters matched
5. for i ← 1 to n // scan S from left to right
6. do while q > 0 and P [q + 1] ≠ T [i]
7. do q ← Π [q] // next character does not match
8. If P [q + 1] = T [i]
9. then q ← q + 1 // next character matches
[Link] q = m // is all of p matched?
11. then print "Pattern occurs with shift" i – m+1
12. q ← Π [q] // look for the next match
Running Time Analysis:

The for loop beginning in step 5 runs 'n' times, i.e., as long as the
length of the string 'S.' Since step 1 to step 4 take constant times,
the running time is dominated by this for the loop. Thus running
time of the matching function is O (n).

Example: Given a string 'T' and pattern 'P' as follows:


i
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
a b a b c a b c a b a b a b d
j
0 1 2 3 4 5
1. Take two variables i and j
2. Initialize i=1 and j=0
a b a b d
3. compare i and j+1
Π 0 0 1 2 0 If match, move i and j both to
right
else initialize j to its Π value.
If i and j+1 mismatch and j=0
then increase i by 1.
Let us execute the KMP Algorithm to find whether 'P' occurs in 'T.'
For ‘P' the prefix function, ? was computed previously and is as
follows:

Solution:
Initially: n = size of T = 15 m = size of P = 7
1. Take two variables i and q
2. Initialize i=1 and q=0
3. compare i and q+1
If match, move i and q both to right
else initialize q to its Π value.
If i and q+1 mismatch and q=0 then
increase i by 1.

4. q ← 0 // numbers of characters matched


5. for i ← 1 to n // scan S from left to right
6. do while q > 0 and P [q + 1] ≠ T [i]
7. do q ← Π [q] // next character does
not match
8. If P [q + 1] = T [i]
9. then q ← q + 1 // next character
matches
10. If q = m // is all of p matched?
11. then print "Pattern occurs with shift“
i-m
12. q ← Π [q] // look for the next match
Rabin-Karp Algorithm
Rabin-Karp Algorithm is a string searching algorithm created
by Richard M. Karp and Michael O. Rabin that uses hashing to
find any one of a set of pattern strings in a text.
In Rabin-Karp algorithm, we'll generate a hash of
our pattern that we are looking for & check if the rolling hash
of our text matches the pattern or not.

If it doesn't match, we can guarantee that the pattern doesn't


exist in the text. However, if it does match, the pattern can be
present in the text.
Complexity:
The running time of RABIN-KARP-MATCHER in the worst case scenario O ((n-
m+1) m but it has a good average case running time. If the expected number of
strong shifts is small O (1) and prime q is chosen to be quite large, then the
Rabin-Karp algorithm can be expected to run in time O (n+m) plus the time to
require to process spurious hits.
Rabin-karp(T,P)
n=length(T)
m=length(P)
H(P) =hash[P[1...m]] i.e. Pmod q
H(T) =hash[T[1...m]] i.e. Tmod q
For s=0 to n-m
if (h(P)=h(T))
if(P[0.......m-1] = T[s+0,s+1,.....S+m-1])
print “Pattern found with shipt”+s
if (s<n-m)
h(T)=hash(T(s+1,..........s+m))
String Matching with Finite Automata

This string matching algorithm is very efficient to


examine pattern by examine each text character
exactly once. Hence the time complexity is O(n).

A finite automaton is a quintuple (Q, , , s, F):


Q: the finite set of states
: the finite input alphabet
: the “transition function” from Qx to Q
s Q: the start state
F  Q: the set of final (accepting) states
How it works

A finite automaton accepts


strings in a specific language. It
begins in state q0 and reads
characters one at a time from
the input string. It makes
transitions based on these
characters, and if when it
reaches the end of the tape it
is in one of the accept states,
that string is accepted by the
language.
The Suffix Function

In order to properly
search for the string, the P = abcdabc
Prefixes are:
program must define a a, ab, abc, abcd, abcda, abcdab
suffix function () which
Suffixes are:
checks to see how much c,bc,abc, dabc, cdabc, bcdabc
of what it is reading
Suffix function finds the longest
matches the search substring in the pattern where prefix is
equal to suffix. Here abc
string at any given
moment.
String-Matching Automata
• For any pattern P of length m, we can define its
string matching automata:
Q = {0,…,m} (states)
q0 = 0 (start state)
F = {m} (accepting state)
(q,a) = (Pqa)
The transition function chooses the next state to
maintain the invariant:
(Ti) = (Ti)
After scanning in i characters, the state number is the
longest prefix of P that is also a suffix of Ti.
Finite-Automaton-Matcher
The simple loop
structure implies a
running time for a
string of length n is
O(n).
However: this is only
the running time for
the actual string
matching. It does not
include the time it
takes to compute the
transition function.
b,c

b,c

You might also like