## Engineering a lightweight suffix array construction algorithm (Extended Abstract)

Citations: 59 (4 self)

### BibTeX

@MISC{Manzini_engineeringa,

author = {Giovanni Manzini and Paolo Ferragina},

title = {Engineering a lightweight suffix array construction algorithm (Extended Abstract)},

year = {2002}

}

### Abstract

In this paper we consider the problem of computing the suffix array of a text T[1, n]. This problem consists of sorting the suffixes of T in lexicographic order. The suffix array [16] (or PAT array [9]) is a simple, easy to code, and elegant data structure used for several fundamental string matching problems involving both linguistic texts and biological data [4, 11]. Recently, interest in this data structure has been revitalized by its use as a building block for three novel applications: (1) the Burrows-Wheeler compression algorithm [3], which is a provably [17] and practically [20] effective compression tool; (2) the construction of succinct [10, 19] and compressed [7, 8] indexes; the latter can store both the input text and its full-text index using roughly the same space used by traditional compressors for the text alone; and (3) algorithms for clustering and ranking the answers to user queries in web-search engines [22]. In all these applications the construction of the suffix array is the computational bottleneck in both time and space. This motivated our interest in designing yet another suffix array construction algorithm that is fast and "lightweight" in the sense that it uses small space...
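To make the problem statement concrete, here is a minimal Python sketch of the definition itself: it sorts the suffix start positions by the suffix that begins there. This is a naive baseline for illustration only, not the lightweight algorithm the paper proposes; the function name `naive_suffix_array` is hypothetical, and positions are 1-based to match the T[1, n] notation.

```python
def naive_suffix_array(text: str) -> list[int]:
    """Return the 1-based suffix array of `text` by sorting its suffixes directly.

    Illustrative baseline only: every key is a full copy of a suffix, so this
    needs O(n^2) extra space and O(n^2 log n) time in the worst case.
    """
    n = len(text)
    # Position i (1-based) stands for the suffix text[i-1:].
    return sorted(range(1, n + 1), key=lambda i: text[i - 1:])


if __name__ == "__main__":
    # Suffixes of "babcc": babcc(1), abcc(2), bcc(3), cc(4), c(5)
    print(naive_suffix_array("babcc"))  # [2, 1, 3, 5, 4]
```

For T = babcc the result is [2, 1, 3, 5, 4], which agrees with the example quoted from the paper in the citation contexts below. The quadratic space used by the sort keys is exactly the kind of overhead a lightweight construction has to avoid.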

### Citations

916 | Algorithms on Strings, Trees, and Sequences - Gusfield - 1997 |

657 | Suffix arrays: A new method for on-line string searches - Manber, Myers - 1993 |

579 | A block sorting lossless data compression algorithm - Burrows, Wheeler - 1994 |

335 | Text Algorithms - Crochemore, Rytter - 1994 |

252 | Grouper: a dynamic clustering interface to Web search results. Computer Networks: The International Journal of Computer and Telecommunications Networking, Volume 31, Issue 11-16 - Zamir, Etzioni - 1999 |

196 | High-order entropy-compressed text indexes - Grossi, Gupta, et al. - 2003 |

190 | Compressed suffix arrays and suffix trees with applications to text indexing and string matching - Grossi, Vitter |

186 | Opportunistic data structures with applications - Ferragina, Manzini - 2000 |

153 | Simple linear work suffix array construction - Kärkkäinen, Sanders - 2003 |

149 | Fast Algorithms for Sorting and Searching Strings - Bentley, Sedgewick - 1997
Citation Context ...he largest average lcp. 2 Definitions and Previous Results. Let T[1, n] denote a text over the alphabet Σ. The suffix array [16] (or PAT array [9]) for T is an array SA[1, n] such that T[SA[1], n], T[SA[2], n], etc. is the list of suffixes of T sorted in lexicographic order. For example, for T = babcc then SA = [2, 1, 3, 5, 4] since T[2, 5] = abcc is the suffix with lower lexicographic rank, followed by... |

131 | An analysis of the Burrows-Wheeler transform - Manzini - 2001 |

121 | The string B-tree: a new data structure for string search in external memory and its applications - Ferragina, Grossi - 1999
Citation Context ...til we reach a leaf ℓ. Then we compare s_i with the string associated to leaf ℓ and we determine the length of their common prefix. Finally, we update the trie adding the leaf corresponding to s_i (see [6] for details and for the merits of blind tries with respect to standard compacted tries). Obviously in the construction of the trie we ignore the first L characters of each suffix because they are ide... |

119 | Reducing the space requirement of suffix trees - Kurtz - 1999
Citation Context ...ry”. R. Möhring and R. Raman (Eds.): ESA 2002, LNCS 2461, pp. 698–710, 2002. © Springer-Verlag Berlin Heidelberg 2002 ... text structure [14]). This makes their use impractical even for moderately large texts. For this reason, suffix arrays are usually built using algorithms which run in O(n log n) time but have a smaller space occupancy. ... |

73 | Rapid identification of repeated patterns in strings, trees and arrays - Karp, Miller, et al. - 1972
Citation Context ...fferent formats; they also display a wide range of sizes and of average lcp’s. 2.1 The Larsson-Sadakane qsufsort Algorithm. The qsufsort algorithm [15] is based on the doubling technique introduced in [13] and first used for the construction of the suffix array in [16]. Given two strings v, w and t > 0 we write v <_t w if the length-t prefix of v is lexicographically smaller than the length-t prefix of w. Sim... |

73 | Space-efficient linear time construction of suffix arrays, Accepted to Symp. Combinatorial Pattern Matching - Ko, Aluru - 2003 |

66 | An experimental study of an opportunistic index - Ferragina, Manzini |

62 | Compressed text databases with efficient query algorithms based on the compressed suffix array - Sadakane - 2000 |

59 | Engineering a sort function - Bentley, McIlroy - 1993
Citation Context ...he one with the largest average lcp. 2 Definitions and Previous Results. Let T[1, n] denote a text over the alphabet Σ. The suffix array [16] (or PAT array [9]) for T is an array SA[1, n] such that T[SA[1], n], T[SA[2], n], etc. is the list of suffixes of T sorted in lexicographic order. For example, for T = babcc then SA = [2, 1, 3, 5, 4] since T[2, 5] = abcc is the suffix with lower lexicographic rank... |

54 | Linear-time construction of suffix arrays - Kim, Sim, et al. - 2003 |

52 | Breaking a time-and-space barrier in constructing full-text indices - Hon, Sadakane, et al. |

47 | Faster suffix sorting - Larsson, Sadakane
Citation Context ...rrays are usually built using algorithms which run in O(n log n) time but have a smaller space occupancy. Among these algorithms the current “leader” is the qsufsort algorithm by Larsson and Sadakane [15]. qsufsort uses 8n bytes and despite the O(n log n) worst case bound it is faster than the algorithms based on suffix tree construction. Unfortunately, the size of our documents has grown much more ... |

45 | On the sorting-complexity of suffix tree construction - Farach-Colton, Ferragina, et al.
Citation Context ...store each integer in a four byte word; this yields a total space occupancy of 4n bytes. For what concerns the cost of constructing the suffix array, the theoretically best algorithms run in Θ(n) time [5]. These algorithms work by first building the suffix tree and then obtaining the sorted suffixes via an in-order traversal of the tree. However, suffix tree construction algorithms are both complex an... |

40 | New indices for text - Gonnet, Baeza-Yates, et al. - 1992 |

30 | On the performance of BWT sorting algorithms - Seward - 2000
Citation Context ... documents has grown much more quickly than the main memory of our computers. Thus, it is desirable to build a suffix array using as small space as possible. Recently, Itoh and Tanaka [12] and Seward [21] have proposed two new algorithms which only use 5n bytes. From the theoretical point of view these algorithms have a Θ(n² log n) worst case complexity. In practice they are faster than qsufsort w... |

26 | Fast lightweight suffix array construction and checking, Accepted to Symp. Combinatorial Pattern Matching - Burkhardt, Kärkkäinen - 2003 |

17 | An efficient method for in memory construction of suffix arrays - Itoh, Tanaka - 1999
Citation Context ... the size of our documents has grown much more quickly than the main memory of our computers. Thus, it is desirable to build a suffix array using as small space as possible. Recently, Itoh and Tanaka [12] and Seward [21] have proposed two new algorithms which only use 5n bytes. From the theoretical point of view these algorithms have a Θ(n² log n) worst case complexity. In practice they are faster... |

16 | Engineering radix sort - McIlroy, Bostic, et al. - 1993
Citation Context ...a Type A suffix we move it to the first empty position of bucket B_{T[i−1]}. Type B suffixes are sorted using textbook string sorting algorithms: in their implementation the authors use MSD radix sort [18] for sorting large groups of suffixes, Bentley-Sedgewick multikey quicksort for medium size groups, and insertion sort for small groups. Summing up, two-stage can be considered an “advanced” direct co... |

15 | Compressed suffix arrays and suffix trees with applications to text indexing and string matching - Grossi, Vitter - 2000 |

9 | The BZIP2 home page - Seward - 1997 |

8 | Compressed text databases with efficient query algorithms based on the compressed suffix array - Sadakane - 2000 |

7 | Faster suffix sorting - Larsson, Sadakane - 1999 |

6 | Rapid identification of repeated patterns in strings, arrays and trees - Karp, Miller, et al. - 1972 |

5 | Reducing the Space Requirement of Suffix Trees. Software: Practice and Experience - Kurtz - 1999 |

4 | Improving suffix-array construction algorithms with applications - Kao |

3 | On the sorting-complexity of suffix tree construction - Farach-Colton, Ferragina, et al. - 2000 |

3 | An efficient method for in memory construction of suffix arrays - Itoh, Tanaka - 1999 |

1 | Mkvtree package (available upon request) - Kurtz |

1 | Lightweight suffix sorting home page. http://www.mfn.unipmn.it/~manzini/lightweight - Manzini, Ferragina |