## Breaking a Time-and-Space Barrier in Constructing Full-Text Indices

Citations: 51 (3 self)

### BibTeX

@MISC{Hon_breakinga,
    author = {Wing-kai Hon and Kunihiko Sadakane and Wing-kin Sung},
    title = {Breaking a Time-and-Space Barrier in Constructing Full-Text Indices},
    year = {}
}

### Abstract

Suffix trees and suffix arrays are the most prominent full-text indices, and their construction algorithms are well studied. It has been open for a long time whether these indices can be constructed in both o(n log n) time and o(n log n)-bit working space, where n denotes the length of the text. In the literature, the fastest algorithm runs in O(n) time, while it requires O(n log n)-bit working space. On the other hand, the most space-efficient algorithm requires O(n)-bit working space while it runs in O(n log n) time. This paper breaks the long-standing time-and-space barrier under the unit-cost word RAM. We give an algorithm for constructing the suffix array which takes O(n) time and O(n)-bit working space, for texts with constant-size alphabets. Note that both the time and the space bounds are optimal. For constructing the suffix tree, our algorithm requires O(n log^ε n) time and O(n)-bit working space for any 0 < ε < 1. Apart from that, our algorithm can also be adopted to build other existing full-text indices, such as
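
For readers unfamiliar with the objects the abstract discusses, the following is a minimal illustrative sketch of a suffix array, a pattern search over it, and the bit count of storing it as a plain integer array. It is not the paper's algorithm: the construction below is the naive sort-based one, and the names `suffix_array`, `find_occurrences`, and `plain_array_bits` are hypothetical helpers introduced only for this example.

```python
def suffix_array(text: str) -> list[int]:
    """Naive construction (NOT the paper's method): sort the start
    positions of all suffixes by the suffix that begins there."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, sa: list[int], pattern: str) -> list[int]:
    """Return all start positions of `pattern` in `text`, found with two
    binary searches over the suffix array `sa`."""
    m = len(pattern)
    # Lower bound: first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Upper bound: first suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


def plain_array_bits(n: int) -> int:
    """Bits needed to store n positions, each in ceil(log2 n) bits.
    This is the n log n-bit footprint of an uncompressed suffix array."""
    return n * max(1, (n - 1).bit_length())


if __name__ == "__main__":
    text = "banana$"  # '$' is an assumed end marker, smaller than any letter
    sa = suffix_array(text)
    print(sa)
    print(find_occurrences(text, sa, "ana"))
    print(plain_array_bits(len(text)))
```

The sketch only shows why a plain suffix array occupies on the order of n log n bits, which is the space side of the barrier described in the abstract; the paper's construction works with compressed representations instead.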

### Citations

667 | Suffix arrays: a new method for on-line string searches
- Manber, Myers
- 1993
Citation Context ...se texts, inverted index is not suitable. In this case, we need full-text indices, that is, indexing data structures which make no assumption on the word boundary. Suffix trees [19] and suffix arrays =-=[18]-=- are two fundamental full-text indices in the ∗ Preliminary version appears in the Proceedings of the 44th Symposium on Foundations of Computer Science, pages 251–260, 2003. † Department of Computer S... |

591 | A Block-sorting Lossless Data Compression Algorithm
- Burrows, Wheeler
- 1994
Citation Context ...ance of our algorithms for constructing these indices are summarized in Table 1. Another application of our algorithm is that, it can act as a time and space efficient algorithm for the block sorting =-=[2]-=-, which is a widely used process in various compression schemes, such as bzip2 [25]. We also study the general case where the alphabet size is not constant. Let Σ be the alphabet, and |Σ| denote its si... |

565 | A space-economical suffix tree construction algorithm
- McCreight
- 1976
Citation Context ...ences or Chinese/Japanese texts, inverted index is not suitable. In this case, we need full-text indices, that is, indexing data structures which make no assumption on the word boundary. Suffix trees =-=[19]-=- and suffix arrays [18] are two fundamental full-text indices in the ∗ Preliminary version appears in the Proceedings of the 44th Symposium on Foundations of Computer Science, pages 251–260, 2003. † D... |

439 | Linear pattern matching algorithms
- Weiner
- 1973
Citation Context ...en n is large. The construction algorithms for both of them are either too slow, or require too much working space. For instance, when we optimize the construction time, based on the work from Weiner =-=[28]-=-, McCreight [19], Ukkonen [27], and Farach [4], a suffix tree and a suffix array can be built in O(n) time. However, the working space required is Ω(n log n) bits. On the other hand, when we optimize ... |

421 | Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology
- Gusfield
- 1997
Citation Context ...ol of Computing, National University of Singapore, Singapore. (Email: ksung@comp.nus.edu.sg) literature, which find numerous applications in areas including data mining [26] and biological research =-=[8]-=-. For the rest of the full-text indices, almost all of them are originated from these two data structures. Suffix trees and suffix arrays are very useful since they allow us to perform pattern searchi... |

339 | On-line construction of suffix trees
- Ukkonen
- 1995
Citation Context ...n algorithms for both of them are either too slow, or require too much working space. For instance, when we optimize the construction time, based on the work from Weiner [28], McCreight [19], Ukkonen =-=[27]-=-, and Farach [4], a suffix tree and a suffix array can be built in O(n) time. However, the working space required is Ω(n log n) bits. On the other hand, when we optimize the construction working space... |

199 | Succinct Indexable Dictionaries with Applications to Encoding k-ary Trees and Multisets
- Raman, Raman, et al.
- 2002
Citation Context ...nk of ρ(c, r) in the set of all Ψ′ values. The following theorem shows a data structure that is useful for finding such rank. Note that in contrast to the previous data structures for the rank query =-=[12, 23]-=-, our data structure requires either less space for storage, or less time in the construction; the drawback is a blow-up in query time. The proof of this theorem will be deferred to Section 3.1. Theor... |

192 | Compressed suffix arrays and suffix trees with applications to text indexing and string matching (extended abstract)
- Grossi, Vitter
- 2000
Citation Context ...e required is Ω(n log n) bits. On the other hand, when we optimize the construction working space, based on the recent work by Lam et al. [17], we can first build the Compressed Suffix Array (CSA) of =-=[6]-=- and then convert it to the suffix tree and the suffix array. Although such approach reduces the working space to O(n) bits, the execution time is increased to O(n log n). Another solution is to rely ... |

187 | Opportunistic data structures with applications
- Ferragina, Manzini
- 2000
Citation Context ...ace. This completes the proof. ⊓⊔ Once the Burrows-Wheeler transformation is completed, FM-index can be created by encoding the transformed text W using Move-to-Front encoding and Run-Length encoding =-=[5]-=-. When the alphabet size is small, precisely, when |Σ| log |Σ| = O(log n), Move-to-Front encoding and Run-Length encoding can be done in O(n) time based on a pre-computed table of o(n) bits. In summar... |

181 | Space-efficient static trees and graphs
- Jacobson
- 1989
Citation Context ...Moreover, an auxiliary data structure of O(n / log log n) bits is constructed in O(n) time to enable constant-time rank and select queries, and thus supporting the retrieval of any q_i in constant time =-=[12, 21]-=-. Then, the total size is n(log |Σ| + 2) + o(n) bits. Since q_i and r_i can be retrieved in constant time, so can Ψ′[i] = |Σ|q_i + r_i. This gives the following lemma. Lemma 1 The Ψ′ function can be en... |

153 | Simple linear work suffix array construction - Kärkkäinen, Sanders - 2003 |

151 | Data Structures and Algorithms 1: Sorting and Searching
- Mehlhorn
- 1984
Citation Context ...g x), a data structure of size O(xb) bits supporting O(1)-time existential query can be constructed in O(x log x) time and O(xb)-bit working space. The second one is derived from adapting a result in =-=[20, 29]-=- based on Lemma 7. 3 The additional O(|∆|) bits are due to the dummy group Gm. Lemma 8 Given z w-bit numbers, where w = Θ(log z), a data structure of size O(zw^2) bits supporting O(log w)-time rank... |

130 | Optimal suffix tree construction with large alphabets
- Farach
- 1997
Citation Context ...both of them are either too slow, or require too much working space. For instance, when we optimize the construction time, based on the work from Weiner [28], McCreight [19], Ukkonen [27], and Farach =-=[4]-=-, a suffix tree and a suffix array can be built in O(n) time. However, the working space required is Ω(n log n) bits. On the other hand, when we optimize the construction working space, based on the r... |

122 | Indexing compressed texts
- Ferragina, Manzini
Citation Context ...ace. This completes the proof. ⊓⊔ Once the Burrows-Wheeler transformation is completed, FM-index can be created by encoding the transformed text W using Move-to-Front encoding and Run-Length encoding =-=[5]-=-. When the alphabet size is small, precisely, when |Σ| log |Σ| = O(log n), Move-to-Front encoding and Run-Length encoding can be done in O(n) time based on a pre-computed table of o(n) bits. In summar... |

119 | Reducing the space requirement of suffix trees
- Kurtz
- 1999
Citation Context ...uppose we would like to construct a suffix array for human genome (of length approximately 3 billion). The fastest known algorithm runs in linear time. However, it requires 40 Gigabytes working space =-=[16]-=-. Such memory requirement far exceeds the capacity of ordinary computers. On the other hand, if we apply the most space-efficient algorithm, the working space required is roughly 3 Gigabytes, which is... |

117 | Information Retrieval: Algorithms and Heuristics
- Grossman, Frieder
- 2004
Citation Context ...onentially. To assist users to locate their required information, the role of indexing data structures has become more and more important. For texts with word boundary such as English, inverted index =-=[7]-=- is used since it enables fast queries and is space-efficient. However, for texts without word boundary like DNA/protein sequences or Chinese/Japanese texts, inverted index is not suitable. In this ca... |

89 | New Text Indexing Functionalities of the Compressed Suffix Arrays
- Sadakane
Citation Context ...tages. Another finding that leads to the improvement is related to the backward search algorithm, which is used to find a pattern within the text based on the Ψ function. If we apply a known method =-=[24]-=-, given the Ψ function for the text, each step of the algorithm requires O(log n) time in general. This paper presents an O(n)-bit auxiliary data structure which supports each backward search step in ... |

86 | Log-logarithmic worst-case range queries are possible in space Θ(N)
- Willard
- 1983
Citation Context ...g x), a data structure of size O(xb) bits supporting O(1)-time existential query can be constructed in O(x log x) time and O(xb)-bit working space. The second one is derived from adapting a result in =-=[20, 29]-=- based on Lemma 7. 3 The additional O(|∆|) bits are due to the dummy group Gm. Lemma 8 Given z w-bit numbers, where w = Θ(log z), a data structure of size O(zw^2) bits supporting O(log w)-time rank... |

71 | Space efficient linear time construction of suffix arrays - Ko, Aluru |

60 | Optimal bounds for the predecessor problem
- Beame, Fich
- 1999
Citation Context ...del. Firstly, we assume a unit-cost RAM with word size O(log U) bits, where n ≤ U, in which standard arithmetic and bitwise boolean operations on word-sized operands can be performed in constant time =-=[1, 9]-=-. We restrict our algorithms to be running within the main memory, in which no I/O operations are involved in the intermediate steps. For counting the working space, we do not include the space for th... |

56 | Optimal bounds for the predecessor problem and related problems
- Beame, Fich
Citation Context .... Firstly, we assume a unit-cost RAM with word size of O(log U) bits, where n ≤ U, in which standard arithmetic and bitwise boolean operations on word-sized operands can be performed in constant time =-=[1, 9]-=-. Secondly, to compare our work fairly with the other main-memory algorithms, we add the following assumptions: (1) We restrict our algorithms to be running within the main memory, in which no I/O ope... |

53 | Linear-time construction of suffix arrays - Kim, Sim, et al. - 2003 |

53 | Low Redundancy in Static Dictionaries with Constant Query Time
- Pagh
Citation Context ...ing space, under a reasonable assumption that log |Σ| = o(log n). Moreover, for the special case where log |Σ| = O((log log n)^(1−ɛ)), we can apply Pagh’s data structure for constant-time rank queries =-=[22]-=- to further improve the running time to the optimal O(n). 1.2 The Main Techniques To achieve small working space, we make use of the Ψ function [6] of CSA and the Burrows-Wheeler (BW) text [2] as our ... |

46 | Sorting and searching on the word RAM
- Hagerup
- 1998
Citation Context ...del. Firstly, we assume a unit-cost RAM with word size O(log U) bits, where n ≤ U, in which standard arithmetic and bitwise boolean operations on word-sized operands can be performed in constant time =-=[1, 9]-=-. We restrict our algorithms to be running within the main memory, in which no I/O operations are involved in the intermediate steps. For counting the working space, we do not include the space for th... |

36 | Deterministic Dictionaries
- Hagerup, Miltersen, Pagh
- 2001
Citation Context ...ent Rank Query This section is devoted to proving Theorem 1. We begin with two supporting lemmas. The first one is on perfect hash function, which is obtained by rephrasing the result of Section 4 of =-=[10]-=- as follows. Lemma 7 Given x b-bit numbers, where b = Θ(log x), a data structure of size O(xb) bits supporting O(1)-time existential query can be constructed in O(x log x) time and O(xb)-bit working s... |

27 | A space and time efficient algorithm for constructing compressed suffix arrays
- Lam, Sadakane, et al.
- 2002
Citation Context ...fix array can be built in O(n) time. However, the working space required is Ω(n log n) bits. On the other hand, when we optimize the construction working space, based on the recent work by Lam et al. =-=[17]-=-, we can first build the Compressed Suffix Array (CSA) of [6] and then convert it to the suffix tree and the suffix array. Although such approach reduces the working space to O(n) bits, the execution ... |

22 | Theoretical and experimental study on the construction of suffix arrays in external memory
- Crauser, Ferragina
Citation Context ... log n), 1 which is only a bit slower. One more advantage of suffix array is that even if this indexing structure is placed in external memory, it still can achieve good I/O performance for searching =-=[3]-=-. Despite their efficiency in searching, suffix trees and suffix arrays cannot be built easily when n is large. The construction algorithms for both of them are either too slow, or require too much wo... |

20 | Guidelines for Presentation and Comparison of Indexing Techniques
- Zobel, Moffat, et al.
- 1996
Citation Context ... array in O(n) time and O(n)-bit working space; 1 We use the notation log^c_b n = (log n / log b)^c to denote the c-th power of the base-b logarithm of n. Unless specified, we use b = 2. 2 Zobel et al. =-=[30]-=- and Crauser and Ferragina [3] both mentioned the importance of construction algorithms to the usefulness of the index. Table 1: Construction times for full-text indices. index algorithm time space ... |

18 | Efficient Discovery of Optimal Word Association Patterns in Large Text Databases
- Shimozono, Arimura, et al.
Citation Context ...ent of Computer Science, School of Computing, National University of Singapore, Singapore. (Email: ksung@comp.nus.edu.sg) literature, which find numerous applications in areas including data mining =-=[26]-=- and biological research [8]. For the rest of the full-text indices, almost all of them are originated from these two data structures. Suffix trees and suffix arrays are very useful since they allow u... |

12 | Space-Economical Algorithms for Finding Maximal Unique Matches
- Hon, Sadakane
- 2002
Citation Context ...imes for full-text indices. index algorithm time space (bits) SA opt time [18] O(n) O(n log n) CSA opt space [17] O(n log n) O(n) FM this paper O(n) O(n) ST CST opt time [4] O(n) O(n log n) opt space =-=[11]-=- O(n log n) O(n) this paper O(n log^ɛ n) O(n) The acronym ST, SA, CST, CSA, FM represent suffix tree, suffix array, compressed suffix tree, compressed suffix array, and FM-index, respectively. Note th... |

12 | Linear-time construction of compressed suffix arrays using o(n log n)-bit working space for large alphabets
- Na
- 2005
Citation Context ... ), we can apply Pagh’s data structure for constant-time rank queries [24] to further improve the running time of the suffix array construction to the optimal O(n). Remark. Very recently, Na and Park =-=[23]-=- proposed another algorithm for the construction of CSA, FM-index, and Burrows-Wheeler transform. The running time is O(n) time, which is independent of the alphabet size. The working space is increas... |

10 | The bzip2 and libbzip2 official home
- Seward
- 2002
Citation Context ...nother application of our algorithm is that, it can act as a time and space efficient algorithm for the block sorting [2], which is a widely used process in various compression schemes, such as bzip2 =-=[27]-=-. We also study the general case where the alphabet size is not constant. Let Σ be the alphabet, and |Σ| denote its size. Our algorithm can construct the suffix array and the suffix tree using O(n log... |

7 | The bzip2 and libbzip2 official home page
- Seward
- 1996
Citation Context ...Another application of our algorithm is that, it can act as a time and space efficient algorithm for the block sorting [2], which is a widely used process in various compression schemes, such as bzip2 =-=[25]-=-. We also study the general case where the alphabet size is not constant. Let Σ be the alphabet, and |Σ| denote its size. Our algorithm can construct the suffix array and the suffix tree using O(n log... |

3 | On the Construction and Application of Compressed Text Indexes
- Hon
- 2004
Citation Context ... of the input text, (2) parentheses encoding of the tree structure of the suffix tree, and (3) an Hgt array that enables efficient computation of the longest common prefix (LCP) query. It is shown in =-=[11, 12]-=- that once the CSA of the input text is computed, the CST can be constructed in O(n log^ɛ n) time and O(n log |Σ|)-bit working space, for any fixed ɛ with 0 < ɛ < 1. Once the CST is constructed, we ca... |

1 | Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching. Manuscript (Submitted for publication), 2001 - Grossi, Vitter - 1998 |

1 | Space-Economical Algorithms for Finding Maximal Unique Matches - Hon, Sadakane - 2002 |

1 | Guidelines for Presentation and Comparison of Indexing Techniques - Zobel, Moffat, et al. - 1996 |