## Advantages of backward searching — efficient secondary memory and distributed implementation of compressed suffix arrays (2004)

Citations: 17 (11 self)

### BibTeX

```bibtex
@misc{Makinen04advantagesof,
  author = {Veli Mäkinen and Gonzalo Navarro and Kunihiko Sadakane},
  title  = {Advantages of backward searching — efficient secondary memory
            and distributed implementation of compressed suffix arrays},
  year   = {2004}
}
```

### Abstract

One of the most relevant succinct suffix array proposals in the literature is the Compressed Suffix Array (CSA) of Sadakane [ISAAC 2000]. The CSA needs n(H0 + O(log log σ)) bits of space, where n is the text size, σ is the alphabet size, and H0 is the zero-order entropy of the text. The number of occurrences of a pattern of length m can be computed in O(m log n) time. Most notably, the CSA does not need the text separately available to operate. The CSA simulates a binary search over the suffix array, where the query is compared against text substrings. These are extracted from the same CSA by following irregular access patterns over the structure. Sadakane [SODA 2002] has proposed using backward searching on the CSA in a similar fashion to the FM-index of Ferragina and Manzini [FOCS 2000]. He has shown that the CSA can be searched in O(m) time whenever σ = O(polylog(n)). In this paper we consider some other consequences of backward searching applied to the CSA. The most remarkable one is that, unlike all previous proposals, we do not need any complicated sub-linear structures based on the four-Russians technique (such as constant-time rank and select queries on bit arrays). We show that sampling and compression are enough to achieve O(m log n) query time using less space than the original structure. It is also possible to trade structure space for search time. Furthermore, the regular access pattern of backward searching permits an efficient secondary memory implementation, so that the search can be done with O(m log_B n) disk accesses, where B is the disk block size. Finally, it permits a distributed implementation with optimal speedup and negligible communication effort.

### Citations

1193 | A bridging model for parallel computation
- Valiant
- 1990
Citation Context: ...munication need. Upon completing the search, it sends R(p_{m−1}p_m) to the processor responsible for p_{m−2} and so on. After m communication steps exchanging O(1) data, we have the answer. In the BSP model [24], we need m supersteps of O(log n) CPU work and O(1) communication each. In comparison, the CSA needs O(m log n) supersteps of O(1) CPU and communication each, and the basic suffix array needs O(log n...

686 | Suffix arrays: a new method for on-line string searches
- Manber, Myers
- 1993
Citation Context: ...the index. The suffix tree takes much more memory than the text. In general, it takes O(n log n) bits, while the text takes n log σ bits. A smaller constant factor is achieved by the suffix array [10]. Still, the space complexity does not change. Moreover, the searches take O(m log n) time with the suffix array (this can be improved to O(m + log n) using twice the original amount of space [10]). T...
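
The O(m log n) search mentioned in this context is the classic suffix-array binary search. A minimal sketch in Python (naive construction and our own function names, for illustration only):

```python
def build_suffix_array(text):
    # Naive O(n^2 log n) construction: sort suffix start positions by
    # the suffix they index. For illustration only; real indexes use
    # linear-time construction algorithms.
    return sorted(range(len(text)), key=lambda i: text[i:])

def count_occurrences(text, sa, pattern):
    # Classic binary search on the suffix array: two searches locate
    # the interval of suffixes that start with the pattern, giving
    # O(m log n) character comparisons overall.
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                      # first suffix >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    left = lo
    hi = len(sa)
    while lo < hi:                      # first suffix > pattern (as prefix)
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo - left
```

Each comparison inspects up to m characters, giving the O(m log n) bound; the O(m + log n) variant mentioned in the context additionally stores longest-common-prefix information.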

574 | A Space-Economical Suffix Tree Construction Algorithm
- McCreight
- 1976
Citation Context: ...ffix of T is t_i t_{i+1} ... t_n). These kinds of indexes are called full-text indexes. Optimal query time, which is O(m) as every character of P must be examined, can be achieved by using the suffix tree [25,12,23] as the index. The suffix tree takes much more memory than the text. In general, it takes O(n log n) bits, while the text takes n log σ bits. A smaller constant factor is achieved by the suffix arr...

452 | Linear Pattern Matching Algorithms
- Weiner
- 1973
Citation Context: ...ffix of T is t_i t_{i+1} ... t_n). These kinds of indexes are called full-text indexes. Optimal query time, which is O(m) as every character of P must be examined, can be achieved by using the suffix tree [25,12,23] as the index. The suffix tree takes much more memory than the text. In general, it takes O(n log n) bits, while the text takes n log σ bits. A smaller constant factor is achieved by the suffix arr...

208 | High-order entropy-compressed text indexes
- Grossi, Gupta, et al.
- 2003
Citation Context: ...e index. Existence and counting queries on the CSA take O(m log n) time. There are also other so-called succinct full-text indexes that achieve good tradeoffs between search time and space complexity [3,9,7,22,5,14,18,16,4]. Most of these are opportunistic as they take less space than the text itself, and also self-indexes as they contain enough information to reproduce the text: a self-index does not need the text to o...

195 | Managing Gigabytes
- Witten, Bell
- 1999
Citation Context: ...sing Elias encoding. We now give a simple encoding that slightly improves the space complexity of the original CSA. The differences Ψ(i) − Ψ(i − 1) can be encoded efficiently using Elias delta coding [26]. Let b(p) be the binary string representation of a number p. We use 1^{|b(r)|} 0 b(r) b(p) to encode p, where r = |b(p)|. The length of the encoding is 2|b(r)| + 1 + |b(p)| ≈ 2 log log p + 1 + log p = log p(1 + o(1)). The...
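
The encoding described in this context can be sketched directly. A hedged Python illustration (our helper names; the paper applies this code to the differences Ψ(i) − Ψ(i − 1)):

```python
def elias_encode(p):
    # Encode p >= 1 as 1^{|b(r)|} 0 b(r) b(p), where b(x) is the binary
    # representation of x and r = |b(p)|, as described in the context
    # above. The total length is about log p + 2 log log p bits.
    bp = bin(p)[2:]          # b(p)
    br = bin(len(bp))[2:]    # b(r), with r = |b(p)|
    return "1" * len(br) + "0" + br + bp

def elias_decode(bits, pos=0):
    # Inverse: the run of 1s gives |b(r)|, then b(r) gives r, then the
    # next r bits are b(p). Returns (p, position after the codeword),
    # so consecutive codewords can be decoded from a single stream.
    k = 0
    while bits[pos + k] == "1":
        k += 1
    pos += k + 1             # skip the 1-run and the terminating 0
    r = int(bits[pos:pos + k], 2)
    pos += k
    return int(bits[pos:pos + r], 2), pos + r
```

Because each codeword is self-delimiting, a sequence of encoded differences can be stored as one concatenated bit string and decoded left to right.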

193 | Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching
- Grossi, Vitter
- 2000
Citation Context: ...e index. Existence and counting queries on the CSA take O(m log n) time. There are also other so-called succinct full-text indexes that achieve good tradeoffs between search time and space complexity [3,9,7,22,5,14,18,16,4]. Most of these are opportunistic as they take less space than the text itself, and also self-indexes as they contain enough information to reproduce the text: a self-index does not need the text to o...

142 | An analysis of the Burrows-Wheeler transform
- Manzini
- 2001

98 | Compact Pat Trees
- Clark
- 1996
Citation Context: ...n of the CSA is possible: All previous proposals heavily rely on sublinear structures based on the so-called four-Russians technique [1] to support constant time rank and select queries on bit arrays [8,13,2] (rank(i) to find out how many bits are set before position i, and select(j) to find out the position of the jth bit from the beginning). We show that these structures are not needed for an efficient ...
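
The rank and select queries defined in this context can be illustrated with a simple sampled bit vector (our sketch; a true constant-time structure replaces the block scan with precomputed four-Russians table lookups):

```python
class BitVector:
    # Bit vector over a '0'/'1' string with rank/select as described in
    # the context above. Rank answers are sampled every `block` bits, so
    # each query scans at most one block.
    def __init__(self, bits, block=64):
        self.bits = bits
        self.block = block
        self.samples = [0]   # samples[j] = ones in bits[0 : j*block]
        ones = 0
        for i, b in enumerate(bits):
            ones += b == "1"
            if (i + 1) % block == 0:
                self.samples.append(ones)

    def rank(self, i):
        # Number of set bits strictly before position i.
        j = i // self.block
        return self.samples[j] + self.bits[j * self.block:i].count("1")

    def select(self, j):
        # Position of the j-th set bit (1-based), via binary search on
        # rank; returns len(bits) if fewer than j bits are set.
        lo, hi = 0, len(self.bits)
        while lo < hi:
            mid = (lo + hi) // 2
            if self.rank(mid + 1) < j:
                lo = mid + 1
            else:
                hi = mid
        return lo
```

The space/time trade-off is explicit here: denser sampling means faster rank queries but more auxiliary space, which mirrors the sampling argument the paper makes against needing the four-Russians machinery.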

73 | On-line construction of suffix-trees
- Ukkonen
- 1995
Citation Context: ...ffix of T is t_i t_{i+1} ... t_n). These kinds of indexes are called full-text indexes. Optimal query time, which is O(m) as every character of P must be examined, can be achieved by using the suffix tree [25,12,23] as the index. The suffix tree takes much more memory than the text. In general, it takes O(n log n) bits, while the text takes n log σ bits. A smaller constant factor is achieved by the suffix arr...

67 | On economical construction of the transitive closure of a directed graph
- Arlazarov, Dinic, et al.
- 1970
Citation Context: ...most important consequence of this is that a simpler implementation of the CSA is possible: All previous proposals heavily rely on sublinear structures based on the so-called four-Russians technique [1] to support constant time rank and select queries on bit arrays [8,13,2] (rank(i) to find out how many bits are set before position i, and select(j) to find out the position of the jth bit from the be...

67 | Indexing text using the Ziv–Lempel trie
- Navarro
- 2004
Citation Context: ...e index. Existence and counting queries on the CSA take O(m log n) time. There are also other so-called succinct full-text indexes that achieve good tradeoffs between search time and space complexity [3,9,7,22,5,14,18,16,4]. Most of these are opportunistic as they take less space than the text itself, and also self-indexes as they contain enough information to reproduce the text: a self-index does not need the text to o...

64 | Succinct static data structures
- Jacobson
- 1988
Citation Context: ...n of the CSA is possible: All previous proposals heavily rely on sublinear structures based on the so-called four-Russians technique [1] to support constant time rank and select queries on bit arrays [8,13,2] (rank(i) to find out how many bits are set before position i, and select(j) to find out the position of the jth bit from the beginning). We show that these structures are not needed for an efficient ...

63 | Compressed text databases with efficient query algorithms based on the compressed suffix array
- Sadakane
- 2000
Citation Context: ...e requirement of full-text indexes has raised interest in indexes that occupy the same amount of space as the text itself, or even less. For example, the Compressed Suffix Array (CSA) of Sadakane [19] takes in practice the same amount of space as the text compressed with a zero-order model. Moreover, the CSA does not need the text at all, since the text is included in the index. Existence and coun...

51 | Succinct representations of lcp information and improvements in the compressed suffix arrays
- Sadakane
- 2002
Citation Context: ...red suffix. The extraction ends at letter G, and hence the suffix does not correspond to an occurrence, and the search is continued to the left of the current point. 4 Backward Search on CSA. Sadakane [20] has proposed using backward search on the CSA. Let us review how this search proceeds. We use the notation R(X), for a string X, to denote the range of suffix array positions corresponding to suffixe...
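
The backward search introduced in this context maintains the range R(X) while prepending pattern characters one at a time. The same step can be illustrated with an FM-index-style sketch over the Burrows-Wheeler transform (our naive-rank illustration of the technique, not the CSA's Ψ-based formulation from the paper):

```python
def bwt_index(text):
    # Build the Burrows-Wheeler transform of text (terminator '$',
    # assumed absent from text) and the C array: C[c] = number of
    # characters in text+'$' strictly smaller than c.
    text += "$"
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)   # text[-1] is '$'
    C, total = {}, 0
    for c in sorted(set(text)):
        C[c] = total
        total += text.count(c)
    return bwt, C

def count(bwt, C, pattern):
    # Backward search: [lo, hi) is the suffix-array range R(X) of the
    # current pattern suffix X; prepending a character c maps it to
    # R(cX) using two rank queries (naive O(n) counts here).
    lo, hi = 0, len(bwt)
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo = C[c] + bwt[:lo].count(c)
        hi = C[c] + bwt[:hi].count(c)
        if lo >= hi:
            return 0
    return hi - lo
```

Note the regular access pattern: each of the m steps touches the structure only at the two current range boundaries, which is what makes the secondary-memory and distributed variants discussed in the paper feasible.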

46 | An alphabet-friendly FM-index
- Ferragina, Manzini, et al.

43 | When indexing equals compression: experiments with compressing suffix arrays and applications
- Grossi, Gupta, et al.
- 2004
Citation Context: ...itself, and also self-indexes as they contain enough information to reproduce the text: a self-index does not need the text to operate. Recently, several space-optimal self-indexes have been proposed [5,6,4], whose space requirement depends on the k-th order empirical entropy with constant factor one (except for the sub-linear parts). These indexes achieve good query performance in theory, but they are ...

32 | Compressed compact suffix arrays
- Mäkinen, Navarro
- 2004

27 | Repetition-based text indexes
- Kärkkäinen
- 1999

20 | Compact suffix array — a space-efficient full-text index
- Mäkinen
- 2003

19 | Constructing compressed suffix arrays with large alphabets
- Hon, Lam, et al.
- 2003
Citation Context: ...kes still O(log_B n) time, but we perform only ⌈m/ℓ⌉ of them. One obstacle to a secondary memory CSA implementation might be building such a large CSA. This issue has been addressed satisfactorily [21]. 8 A Distributed Implementation. Distributed implementations of suffix arrays face the problem that not only the suffix array, but also the text, are distributed. Hence, even if we distribute suffix a...

13 | Time-space trade-offs for compressed suffix arrays
- Rao

6 | Distributed query processing using suffix arrays
- Marín, Navarro
- 2003
Citation Context: ...ext, are distributed. Hence, even if we distribute suffix array A according to lexicographical intervals, the processor performing the local binary search will require access to remote text positions [17]. Although some heuristics have been proposed, log n remote requests for m characters each are necessary in the worst case. The original CSA does not help solve this. If array Ψ is distributed, we wil...

1 | New search algorithms and space/time tradeoffs for succinct suffix arrays
- Mäkinen, Navarro