Suffix Array
Suffix Array
▪ A suffix array for a text T is an array of all suffixes of T arranged in sorted order
▪ It is an array of integers that represents the starting indices of all the suffixes of a
string, arranged in lexicographically sorted order
▪ It is a data structure used in string processing, especially for fast searching, pattern
matching and text indexing
▪ Applications:
▪ Solving the DNA sequencing problem in bioinformatics
▪ Finding a substring in a string
▪ Finding number of occurrences of a substring in a string
▪ Finding longest repeated substring in a string
▪ Comparing Two Substrings of a String
2
Suffix Array
▪ Example: Text string S = banana
3
SUFFIX ARRAY EXAMPLE
▪ In a suffix array, all suffixes of S are in the non-decreasing lexical order.
▪ For example, S=“ATCACATCATCA”
i 0 1 2 3 4 5 6 7 8 9 10 11
A 11 3 8 0 5 10 2 7 4 9 1 6
3 ATCACATCATCA S(0) 0 A S(11)
10 TCACATCATCA S(1) 1 ACATCATCA S(3)
6 CACATCATCA S(2) 2 ATCA S(8)
1 ACATCATCA S(3) 3 ATCACATCATCA S(0)
8 CATCATCA S(4) 4 ATCATCA S(5)
4 ATCATCA S(5) 5 CA S(10)
11 TCATCA S(6) 6 CACATCATCA S(2)
7 CATCA S(7) 7 CATCA S(7)
2 ATCA S(8) 8 CATCATCA S(4)
9 TCA S(9) 9 TCA S(9)
5 CA S(10) 10 TCACATCATCA S(1)
0 A S(11) 11 TCATCA S(6) 4
Search in Suffix Array
▪ A binary search of the suffix array would be enough to determine if the pattern P is in
the text
▪ O(m log n) where the log n comparisons for binary search on the suffix array, and m
time for each comparison during the binary search involves comparing the pattern
(of length m) with a suffix from the array
5
Linear Time Suffix Array Construction
Skew-algorithm
▪ Step 1:
SA≠ 0 = sort the suffixes starting at position i ≠ 0 mod 3.
▪ Step 2:
SA= 0 = sort the suffixes starting at position i = 0 mod 3.
▪ Step 3:
SA = merge SA= 0 and SA≠ 0 .
6
Construction of Suffix Array in Linear Time
Radix sort sorts keys by processing them character by character (usually from rightmost to
leftmost
"caa", "cab", "aba", "aac"
7
Construction of Suffix Array in Linear Time
8
Construction of Suffix Array in Linear Time
9
Construction of Suffix Array in Linear Time
10
Construction of Suffix Array in Linear Time
S = ABRACADABRA
11
Construction of Suffix Array in Linear Time
12
Time Complexity Analysis
▪ Step1: O(n) + T(2n/3)
▪ Step2: O(n)
▪ Step3: O(n)
▪ T(n) = O(n) + T(2n/3) = O(n)