Succinct suffix arrays based on run-length encoding (2005)
| Venue: | Nordic Journal of Computing |
| Citations: | 46 - 32 self |
BibTeX
@ARTICLE{Mäkinen05succinctsuffix,
author = {Veli Mäkinen and Gonzalo Navarro},
title = {Succinct suffix arrays based on run-length encoding},
journal = {Nordic Journal of Computing},
year = {2005},
pages = {40--66}
}
Abstract
A succinct full-text self-index is a data structure built on a text T = t1t2...tn that takes little space (ideally close to that of the compressed text), permits efficient search for the occurrences of a pattern P = p1p2...pm in T, and is able to reproduce any text substring, so the self-index replaces the text. Several remarkable self-indexes have been developed in recent years. Many of them take space proportional to nH0 or nHk bits, where Hk is the kth order empirical entropy of T. The time to count how many times P occurs in T ranges from O(m) to O(m log n). In this paper we present a new self-index, called the RLFM index (for "run-length FM-index"), that counts the occurrences of P in T in O(m) time when the alphabet size is σ = O(polylog(n)). The RLFM index requires nHk log σ + O(n) bits of space, for any k ≤ α log_σ n and any constant 0 < α < 1. Previous indexes that achieve O(m) counting time either require more than nH0 bits of space or require that σ = O(1). We also show that the RLFM index can be enhanced to locate occurrences in the text and display text substrings in time independent of σ. In addition, we prove a close relationship between the kth order entropy of the text and some regularities that show up in its suffix array and in the Burrows-Wheeler transform of T. This relationship is of independent interest and permits bounding the space occupancy of the RLFM index, as well as that of other existing compressed indexes. Finally, we present some practical considerations for implementing the RLFM index, obtaining two implementations with different space-time tradeoffs. We empirically compare our indexes against the best existing implementations and show that they are practical and competitive.
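
The two ingredients the abstract relies on, runs of equal symbols in the Burrows-Wheeler transform and FM-index-style counting in m steps, can be illustrated with a minimal Python sketch. This is not the RLFM index from the paper: the function names (`suffix_array`, `bwt`, `count_runs`, `backward_count`) are hypothetical, the suffix array is built naively, and the rank query `occ` scans the string instead of using the compact run-length structures that give the paper its O(m) bound. The sketch also assumes the text ends with a unique sentinel `$` that sorts before every other symbol.

```python
def suffix_array(text):
    # Naive construction by sorting suffixes; adequate only for small examples.
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt(text):
    # Assumes text ends with a unique sentinel smaller than all other symbols.
    # The BWT symbol for suffix i is the symbol preceding it (cyclically).
    return "".join(text[i - 1] for i in suffix_array(text))


def count_runs(s):
    # Number of maximal runs of equal consecutive symbols.
    return sum(1 for i in range(len(s)) if i == 0 or s[i] != s[i - 1])


def backward_count(bwt_str, pattern):
    # C[c] = number of symbols in the text that are strictly smaller than c.
    freq = {}
    for ch in bwt_str:
        freq[ch] = freq.get(ch, 0) + 1
    C, total = {}, 0
    for ch in sorted(freq):
        C[ch] = total
        total += freq[ch]

    def occ(c, i):
        # Occurrences of c in bwt_str[0:i]. Linear scan here; a real index
        # answers this rank query in constant time.
        return bwt_str[:i].count(c)

    # [sp, ep) is the suffix-array range of suffixes prefixed by the
    # pattern suffix processed so far; one update per pattern symbol.
    sp, ep = 0, len(bwt_str)
    for c in reversed(pattern):
        if c not in C:
            return 0
        sp = C[c] + occ(c, sp)
        ep = C[c] + occ(c, ep)
        if sp >= ep:
            return 0
    return ep - sp


if __name__ == "__main__":
    transformed = bwt("mississippi$")
    print(transformed, count_runs(transformed))
    print(backward_count(transformed, "ssi"))
```

For the text `mississippi$`, the transform is `ipssm$pissii`, which has 9 runs over 12 symbols, and the pattern `ssi` is counted twice. Texts with low kth order entropy tend to produce fewer, longer runs, which is the regularity a run-length encoded index exploits.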







