## Simple linear work suffix array construction (2003)

Citations: 150 (6 self)

### BibTeX

@INPROCEEDINGS{Kärkkäinen03simplelinear,
    author = {Juha Kärkkäinen and Peter Sanders and Stefan Burkhardt},
    title = {Simple linear work suffix array construction},
    booktitle = {ICALP 2003, LNCS 2719},
    year = {2003},
    pages = {943--955},
    publisher = {Springer}
}

### Abstract

Suffix trees and suffix arrays are widely used and largely interchangeable index structures on strings and sequences. Practitioners prefer suffix arrays due to their simplicity and space efficiency while theoreticians use suffix trees due to linear-time construction algorithms and more explicit structure. We narrow this gap between theory and practice with a simple linear-time construction algorithm for suffix arrays. The simplicity is demonstrated with a C++ implementation of 50 effective lines of code. The algorithm is called DC3, which stems from the central underlying concept of difference cover. This view leads to a generalized algorithm, DC, that allows a space-efficient implementation and, moreover, supports the choice of a space–time tradeoff. For any v ∈ [1, √n], it runs in O(vn) time using O(n/√v) space in addition to the input string and the suffix array. We also present variants of the algorithm for several parallel and hierarchical memory models of computation. The algorithms for BSP and EREW-PRAM models are asymptotically faster than all previous suffix tree or array construction algorithms.
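As a point of reference for two terms used in the abstract, the following Python sketch builds a suffix array by sorting the suffixes directly and checks the difference cover property for a sample set. It is not the DC3 algorithm and does not run in linear time. The function names `naive_suffix_array` and `is_difference_cover` are hypothetical, and the definition assumed here is the standard one: a set D is a difference cover modulo v if every residue modulo v can be written as a difference of two elements of D.

```python
def naive_suffix_array(text):
    """Return the start positions of all suffixes of text in lexicographic order.

    Assumption: this is a plain comparison sort of whole suffixes, which can
    cost O(n^2 log n) time. It only shows what a suffix array contains; it is
    not the linear-time construction described in the abstract.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def is_difference_cover(sample, v):
    """Check whether sample is a difference cover modulo v.

    Assumed definition: every residue 0..v-1 equals (i - j) mod v for some
    i and j in sample.
    """
    differences = {(i - j) % v for i in sample for j in sample}
    return differences == set(range(v))


if __name__ == "__main__":
    print(naive_suffix_array("banana"))       # suffix start positions, sorted
    print(is_difference_cover({1, 2}, 3))     # sample with v = 3
    print(is_difference_cover({1, 2, 4}, 7))  # a larger cover, v = 7
```

For the string "banana" the sketch returns `[5, 3, 1, 0, 4, 2]`, and the set {1, 2} passes the check modulo 3. A linear-time construction has to produce the same array without comparing whole suffixes.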

### Citations

8543 | Introduction to Algorithms - Cormen, Leiserson, et al. - 1990
Citation Context ... [Table; columns: model of computation, complexity, alphabet] External Memory [38] (D disks, block size B, fast memory of size M): O((n/(DB)) log_{M/B}(n/B)) I/Os, O(n log_{M/B}(n/B)) internal work, integer. Cache Oblivious =-=[15]-=-: O((n/B) log_{M/B}(n/B)) cache faults, general. BSP [37] (P processors, h-relation in time L + gh): O(n log n/P + L log² P + gn log n/(P log(n/P))) time, general; for P = O(n^(1−ε)) processors: O(n/P + L log² P + gn/... |

1132 | A bridging model for parallel computation - Valiant - 1990
Citation Context ...ger [9]; O((n/(DB)) log_{M/B}(n/B)) I/Os, O(n log_{M/B}(n/B)) internal work, integer [14], skew. Cache Oblivious [15] (M/B cache blocks of size B): O((n/B) log_{M/B}(n/B) log_2 n) cache faults, general [9]; O((n/B) log_{M/B}(n/B)) cache faults, general [14], skew. BSP =-=[37]-=- (P processors, h-relation in time L + gh): O(n log n/P + (L + gn/P) log³ n log P / log(n/P)) time, general [12]; O(n log n/P + L log² P + gn log n/(P log(n/... |

645 | Suffix arrays: a new method for on-line string searches - Manber, Myers - 1993
Citation Context ...or integer alphabets, i.e., when characters are polynomially bounded integers. There are also efficient construction algorithms for many advanced models of computation (see Table 1). The suffix array =-=[18,31]-=- is a lexicographically sorted array of the suffixes of a string. For several applications, the suffix array is a simpler and more compact alternative to the suffix tree [2,6,18,31]. The suffix array ... |

636 | An Introduction to Parallel Algorithms - JáJá - 1992
Citation Context ...) log³ n log P / log(n/P)) time, general [12]; O(n log n/P + L log² P + gn log n/(P log(n/P))) time, general, skew; for P = O(n^(1−ε)) processors: O(n/P + L log² P + gn/P) time, integer, skew. EREW-PRAM =-=[25]-=-: O(log⁴ n) time, O(n log n) work, general [12]; O(log² n) time, O(n log n) work, general, skew. arbitrary-CRCW-PRAM [25]: O(log n) time, O(n) work (rand.), constant [13]. priority-CRCW-PRAM [25]: O(l... |

566 | A Block-Sorting Lossless Data Compression Algorithm - Burrows, Wheeler - 1994
Citation Context ...e 1). The suffix array [18,31] is a lexicographically sorted array of the suffixes of a string. For several applications, the suffix array is a simpler and more compact alternative to the suffix tree =-=[2,6,18,31]-=-. The suffix array can be constructed in linear time by a lexicographic traversal of the suffix tree, but such a construction loses some of the advantage that the suffix array has over the suffix tree... |

549 | A space-economical suffix tree construction algorithm - McCreight - 1976
Citation Context ...onal biology [21] and elsewhere [20]. One of the important properties of the suffix tree is that it can be constructed in linear time in the length of the string. The classical linear time algorithms =-=[32,36,39]-=- require a constant alphabet size, but Farach’s algorithm [11,14] works also for integer alphabets, i.e., when characters are polynomially bounded integers. There are also efficient construction algor... |

329 | Fast Algorithms for Finding Nearest Common Ancestors - Harel, Tarjan
Citation Context ...cp(SA[i], SA[i + 1]). A well-known property of lcps is that for any 0 ≤ i < j < n, lcp(i, j) = min_{i ≤ k < j} LCP[k]. Thus, if we preprocess LCP in linear time to answer range minimum queries in constant time =-=[3,4,24]-=-, we can find the longest common prefix of any two suffixes in constant time. We will show how the LCP array can be computed from the LCP^12 array corresponding to SA^12 in linear time. Let j = SA[i] ... |

328 | On-line construction of suffix trees - Ukkonen - 1995
Citation Context ...onal biology [21] and elsewhere [20]. One of the important properties of the suffix tree is that it can be constructed in linear time in the length of the string. The classical linear time algorithms =-=[32,36,39]-=- require a constant alphabet size, but Farach’s algorithm [11,14] works also for integer alphabets, i.e., when characters are polynomially bounded integers. There are also efficient construction algor... |

284 | Parallel merge sort - Cole - 1988
Citation Context ...sorting with O((n/B) log_{M/B}(n/B)) cache faults and o(n log n) work. The result is an immediate corollary of the optimal comparison based sorting algorithm [15]. EREW PRAM: We can use Cole’s merge sort =-=[8]-=- for sorting and merging. Lexicographic naming can be implemented using linear work and O(log P) time using prefix sums. After Θ(log P) levels of recursion, the problem size has reduced so far that ... |

236 | Algorithms for parallel memory I: Two-level memories - Vitter, Shriver - 1994
Citation Context ...ffix tree, which can then be transformed into a suffix array. [Table 1; columns: model of computation, complexity, alphabet, source] RAM: O(n log n) time, general [31,30,5]; O(n) time, integer [11,28,29], skew. External Memory =-=[38]-=- (D disks, block size B, fast memory of size M): O((n/(DB)) log_{M/B}(n/B) log_2 n) I/Os, O(n log_{M/B}(n/B) log_2 n) internal work, integer [9]; ... Cache Oblivious [15] (M/B cache blocks of size B): ... |

184 | The LCA problem revisited - Bender, Farach-Colton - 2000
Citation Context ...cp(SA[i], SA[i + 1]). A well-known property of lcps is that for any 0 ≤ i < j < n, lcp(i, j) = min_{i ≤ k < j} LCP[k]. Thus, if we preprocess LCP in linear time to answer range minimum queries in constant time =-=[3,4,24]-=-, we can find the longest common prefix of any two suffixes in constant time. We will show how the LCP array can be computed from the LCP^12 array corresponding to SA^12 in linear time. Let j = SA[i] ... |

147 | Algorithms on Strings - Gusfield - 1997 |

129 | Optimal suffix tree construction with large alphabets - Farach - 1997
Citation Context ...s of the suffix tree is that it can be constructed in linear time in the length of the string. The classical linear time algorithms [32,36,39] require a constant alphabet size, but Farach’s algorithm =-=[11,14]-=- works also for integer alphabets, i.e., when characters are polynomially bounded integers. There are also efficient construction algorithms for many advanced models of computation (see Table 1). The ... |

127 | Replacing suffix trees with enhanced suffix arrays - Abouelhoda, Kurtz, et al. - 2004 |

120 | The string B-tree: a new data structure for string search in external memory and its applications - Ferragina, Grossi - 1999 |

79 | Linear-time longest-common-prefix computation in suffix arrays and its applications - Kasai, Lee, et al.
Citation Context ...thm is much simpler than the best previous algorithm. In many applications, the suffix array needs to be augmented with additional data, the most important being the longest common prefix (lcp) array =-=[1,2,26, 27,31]-=-. In particular, the suffix tree can be constructed easily from the suffix and lcp arrays [11,13,14]. There is a linear time algorithm for computing the lcp array from the suffix array [27], but it do... |

76 | Nearest common ancestors: A survey and a new algorithm for a distributed environment. Theory Comput - Alstrup, Gavoille, et al. - 2004
Citation Context ...cp(SA[i], SA[i + 1]). A well-known property of lcps is that for any 0 ≤ i < j < n, lcp(i, j) = min_{i ≤ k < j} LCP[k]. Thus, if we preprocess LCP in linear time to answer range minimum queries in constant time =-=[3,4,24]-=-, we can find the longest common prefix of any two suffixes in constant time. We will show how the LCP array can be computed from the LCP^12 array corresponding to SA^12 in linear time. Let j = SA[i] ... |

76 | Linear pattern matching algorithm - Weiner - 1973
Citation Context ...al memory with parallel disks, cache oblivious, and parallel. The adaptations for BSP and EREW-PRAM are asymptotically faster than the best previously known algorithms. 1 Introduction. The suffix tree =-=[39]-=- of a string is a compact trie of all the suffixes of the string. It is a powerful data structure with numerous applications in computational biology [21] and elsewhere [20]. One of the important prop... |

72 | Space efficient linear time construction of suffix arrays - Ko, Aluru - 2003
Citation Context ... integer alphabets. Independently of and in parallel with the present work, two other direct linear time suffix array construction algorithms have been introduced by Kim et al. [28], and Ko and Aluru =-=[29]-=-. The two algorithms are quite different from ours (and each other). The skew algorithm. Farach’s linear-time suffix tree construction algorithm [11] as well as some parallel and external algorithms [... |

64 | Communication-efficient parallel sorting - Goodrich - 1999
Citation Context ...r. We get an overall execution time of O(n log n/P + log² P). BSP: For the case of many processors, we proceed as for the EREW-PRAM algorithm using the optimal comparison based sorting algorithm =-=[19]-=- that takes time O(n log n/P + (gn/P + L) log n / log(n/P)). For the case of few processors, we can use a linear work sorting algorithm based on radix sort [7] and a linear work merging algorithm [17].... |

61 | Optimal and sublogarithmic time randomized parallel sorting algorithms - Rajasekaran, Reif - 1989
Citation Context ...n of the skew algorithm. Then we can afford to switch to a comparison based algorithm without increasing the overall amount of internal work. CRCW PRAM: We employ the stable integer sorting algorithm =-=[35]-=- that works in O(log n) time using linear work for keys with O(log log n) bits. This algorithm can be used for the first Θ(log log log n) iterations. Then we can afford to switch to the algorithm [22]... |

55 | A space-economical suffix tree construction algorithm - McCreight - 1976 |

54 | Linear-time construction of suffix arrays - Kim, Sim, et al. - 2003
Citation Context ...x tree construction for integer alphabets. Independently of and in parallel with the present work, two other direct linear time suffix array construction algorithms have been introduced by Kim et al. =-=[28]-=-, and Ko and Aluru [29]. The two algorithms are quite different from ours (and each other). The skew algorithm. Farach’s linear-time suffix tree construction algorithm [11] as well as some parallel an... |

51 | Deterministic distribution sort in shared and distributed memory multiprocessors - Nodine, Vitter - 1993
Citation Context ...nstant Proof. External Memory: Sorting tuples and lexicographic naming is easily reduced to external memory integer sorting. I/O optimal deterministic parallel disk sorting algorithms are well known =-=[34,33]-=-. We have to make a few remarks regarding internal work however. To achieve optimal internal work for all values of n, M, and B, we can use radix sort where the most significant digit has ⌊log M⌋ − 1 b... |

50 | Breaking a time-and-space barrier in constructing full-text indices - Hon, Sadakane, et al. - 2003 |

45 | Faster suffix sorting - Larsson, Sadakane - 1999
Citation Context ...a construction loses some of the advantage that the suffix array has over the suffix tree. The fastest direct suffix array construction algorithms that do not use suffix trees require O(n log n) time =-=[5,30,31]-=-. Also under other models of computation, direct ⋆ Partially supported by the Future and Emerging Technologies programme of the EU under contract number IST-1999-14186(ALCOM-FT). J.C.M. Baeten et al. ... |

44 | On the sorting-complexity of suffix tree construction - Farach-Colton, Ferragina, et al.
Citation Context ...s of the suffix tree is that it can be constructed in linear time in the length of the string. The classical linear time algorithms [32,36,39] require a constant alphabet size, but Farach’s algorithm =-=[11,14]-=- works also for integer alphabets, i.e., when characters are polynomially bounded integers. There are also efficient construction algorithms for many advanced models of computation (see Table 1). The ... |

44 | The enhanced suffix array and its applications to genome analysis - Abouelhoda, Kurtz, et al. - 2002
Citation Context ...thm is much simpler than the best previous algorithm. In many applications, the suffix array needs to be augmented with additional data, the most important being the longest common prefix (lcp) array =-=[1,2,26, 27,31]-=-. In particular, the suffix tree can be constructed easily from the suffix and lcp arrays [11,13,14]. There is a linear time algorithm for computing the lcp array from the suffix array [27], but it do... |

41 | New indices for text: Pat trees and pat arrays - Gonnet, Baeza-Yates, et al. - 1992 |

39 | Suffix cactus: a cross between suffix tree and suffix array - Kärkkäinen - 1995
Citation Context ...thm is much simpler than the best previous algorithm. In many applications, the suffix array needs to be augmented with additional data, the most important being the longest common prefix (lcp) array =-=[1,2,26, 27,31]-=-. In particular, the suffix tree can be constructed easily from the suffix and lcp arrays [11,13,14]. There is a linear time algorithm for computing the lcp array from the suffix array [27], but it do... |

39 | New indices for text - Gonnet, Baeza-Yates, et al. - 1992
Citation Context ...or integer alphabets, i.e., when characters are polynomially bounded integers. There are also efficient construction algorithms for many advanced models of computation (see Table 1). The suffix array =-=[18,31]-=- is a lexicographically sorted array of the suffixes of a string. For several applications, the suffix array is a simpler and more compact alternative to the suffix tree [2,6,18,31]. The suffix array ... |

38 | Optimal Exact String Matching Based on Suffix Arrays - Abouelhoda, Ohlebusch, et al. - 2002
Citation Context ...e 1). The suffix array [18,31] is a lexicographically sorted array of the suffixes of a string. For several applications, the suffix array is a simpler and more compact alternative to the suffix tree =-=[2,6,18,31]-=-. The suffix array can be constructed in linear time by a lexicographic traversal of the suffix tree, but such a construction loses some of the advantage that the suffix array has over the suffix tree... |

32 | Overcoming the memory bottleneck in suffix tree construction - Farach, Ferragina, et al. - 1998
Citation Context ...]. The two algorithms are quite different from ours (and each other). The skew algorithm. Farach’s linear-time suffix tree construction algorithm [11] as well as some parallel and external algorithms =-=[12,13,14]-=- are based on the following divide-and-conquer approach: 1. Construct the suffix tree of the suffixes starting at odd positions. This is done by reduction to the suffix tree construction of a string o... |

32 | On-line construction of suffix trees - Ukkonen - 1995 |

31 | Suffix trees on words - Andersson, Larsson, et al. - 1999 |

30 | Better external memory suffix array construction - Dementiev, Kärkkäinen, et al. - 2008 |

25 | Fast lightweight suffix array construction and checking - Burkhardt, Kärkkäinen - 2003
Citation Context ...a construction loses some of the advantage that the suffix array has over the suffix tree. The fastest direct suffix array construction algorithms that do not use suffix trees require O(n log n) time =-=[5,30,31]-=-. Also under other models of computation, direct ⋆ Partially supported by the Future and Emerging Technologies programme of the EU under contract number IST-1999-14186(ALCOM-FT). J.C.M. Baeten et al. ... |

25 | Waste makes haste: Tight bounds for loose parallel sorting - Hagerup, Raman - 1992
Citation Context ...[35] that works in O(log n) time using linear work for keys with O(log log n) bits. This algorithm can be used for the first Θ(log log log n) iterations. Then we can afford to switch to the algorithm =-=[22]-=- that works for polynomial size keys at the price of being inefficient by a factor O(log log n). Lexicographic naming can be implemented by computing prefix sums using linear work and logarithmic time... |

24 | Fast algorithms for nearest common ancestors - Harel, Tarjan - 1984 |

22 | Optimal merging and sorting - Hagerup, Rüb - 1989
Citation Context ...g log n). Lexicographic naming can be implemented by computing prefix sums using linear work and logarithmic time. Comparison based merging can be implemented with linear work and O(log n) time using =-=[23]-=-. ⊓⊔ The resulting algorithms are simple except that they may use complicated subroutines for sorting to obtain theoretically optimal results. There are usually much simpler implementations of sorting... |

22 | Asynchronous parallel disk sorting - Dementiev, Sanders - 2003
Citation Context ...an use a linear work sorting algorithm based on radix sort [7] and a linear work merging algorithm [17]. The integer sorting algorithm remains applicable at least during the first Θ(log log n) levels of recursion of the skew algorithm. Then we can afford to switch t... (Footnote 2: Simpler randomized algorithms with favorable constant factors are also available =-=[10]-=-.) |

22 | Optimal suffix tree construction with large alphabets - Farach - 1997 |

21 | Theoretical and experimental study on the construction of suffix arrays in external memory - Crauser, Ferragina
Citation Context ... J.C.M. Baeten et al. (Eds.): ICALP 2003, LNCS 2719, pp. 943–955, 2003. © Springer-Verlag Berlin Heidelberg 2003. ... algorithms cannot match suffix tree based algorithms =-=[9,16]-=-. The existence of an I/O-optimal direct algorithm is mentioned as an important open problem in [9]. We introduce the skew algorithm, the first linear-time direct suffix array construction algorithm f... |

21 | Constructing compressed suffix arrays with large alphabets - Hon, Lam, et al. - 2003 |

20 | Greed sort: An optimal sorting algorithm for multiple disks - Nodine, Vitter - 1995
Citation Context ...nstant Proof. External Memory: Sorting tuples and lexicographic naming is easily reduced to external memory integer sorting. I/O optimal deterministic parallel disk sorting algorithms are well known =-=[34,33]-=-. We have to make a few remarks regarding internal work however. To achieve optimal internal work for all values of n, M, and B, we can use radix sort where the most significant digit has ⌊log M⌋ − 1 b... |

17 | Suffix trees and their applications in string algorithms. Rapporto di Ricerca CS-96-14, Università “Ca’ Foscari” di Venezia - Grossi, Italiano - 1996
Citation Context ...troduction The suffix tree [39] of a string is a compact trie of all the suffixes of the string. It is a powerful data structure with numerous applications in computational biology [21] and elsewhere =-=[20]-=-. One of the important properties of the suffix tree is that it can be constructed in linear time in the length of the string. The classical linear time algorithms [32,36,39] require a constant alphab... |

14 | Optimal logarithmic time randomized suffix tree construction - Farach-Colton, Muthukrishnan - 1996
Citation Context ...]. The two algorithms are quite different from ours (and each other). The skew algorithm. Farach’s linear-time suffix tree construction algorithm [11] as well as some parallel and external algorithms =-=[12,13,14]-=- are based on the following divide-and-conquer approach: 1. Construct the suffix tree of the suffixes starting at odd positions. This is done by reduction to the suffix tree construction of a string o... |

14 | Fast BWT in small space by blockwise suffix sorting - Kärkkäinen - 2007 |

10 | A note on coarse grained parallel integer sorting - Chan, Dehne - 1999
Citation Context ...timal comparison based sorting algorithm [19] that takes time O(n log n/P + (gn/P + L) log n / log(n/P)). For the case of few processors, we can use a linear work sorting algorithm based on radix sort =-=[7]-=- and a linear work merging algorithm [17]. The integer sorting... (Footnote 2: Simpler randomized algorithms with favorable constant factors are also available [10].) |

10 | Distributed and paged suffix trees for large genetic databases - Clifford, Sergot - 2003 |
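
Several citation contexts above quote the property that the longest common prefix of two suffixes equals the minimum of the LCP array over the range between their ranks. The Python sketch below checks that property by brute force on small strings. It assumes that i and j in lcp(i, j) are ranks in the suffix array, and all function names are hypothetical. It uses neither the constant-time range minimum structures nor the linear-time LCP computation that the contexts mention.

```python
def suffix_array(text):
    """Start positions of the suffixes of text in lexicographic order (naive sort)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_of_suffixes(text, a, b):
    """Length of the longest common prefix of the suffixes starting at a and b."""
    n = len(text)
    k = 0
    while a + k < n and b + k < n and text[a + k] == text[b + k]:
        k += 1
    return k


def lcp_array(text, sa):
    """LCP[i] = lcp of the suffixes at ranks i and i + 1 in the suffix array."""
    return [lcp_of_suffixes(text, sa[i], sa[i + 1]) for i in range(len(sa) - 1)]


def range_min_property_holds(text):
    """Check lcp(rank i, rank j) == min(LCP[i..j-1]) for every pair i < j."""
    sa = suffix_array(text)
    lcp = lcp_array(text, sa)
    for i in range(len(sa)):
        for j in range(i + 1, len(sa)):
            direct = lcp_of_suffixes(text, sa[i], sa[j])
            if direct != min(lcp[i:j]):  # linear scan stands in for a range minimum query
                return False
    return True


if __name__ == "__main__":
    sa = suffix_array("banana")
    print(sa, lcp_array("banana", sa))
    print(range_min_property_holds("mississippi"))
```

For "banana" the LCP array is `[1, 3, 0, 0, 2]`. Replacing the linear scan `min(lcp[i:j])` with a preprocessed range minimum structure is what would make each query constant time.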