## High-order entropy-compressed text indexes (2003)

Citations: 198 (22 self)

### BibTeX

@INPROCEEDINGS{Grossi03high-orderentropy-compressed,

author = {Roberto Grossi and Ankur Gupta and Jeffrey Scott Vitter},

title = {High-order entropy-compressed text indexes},

booktitle = {},

year = {2003},

pages = {841--850}

}

### Abstract

We present a novel implementation of compressed suffix arrays exhibiting new tradeoffs between search time and space occupancy for a given text (or sequence) of $n$ symbols over an alphabet $\Sigma$, where each symbol is encoded by $\lg |\Sigma|$ bits. We show that compressed suffix arrays use just $nH_h + O(n \lg\lg n / \lg_{|\Sigma|} n)$ bits, while retaining full text indexing functionalities, such as searching any pattern sequence of length $m$ in $O(m \lg |\Sigma| + \mathrm{polylog}(n))$ time. The term $H_h \le \lg |\Sigma|$ denotes the $h$th-order empirical entropy of the text, which means that our index is nearly optimal in space apart from lower-order terms, achieving asymptotically the empirical entropy of the text (with a multiplicative constant 1). If the text is highly compressible so that $H_h = o(1)$ and the alphabet size is small, we obtain a text index with $o(m)$ search time that requires only $o(n)$ bits. Further results and tradeoffs are reported in the paper.
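
The space bound in the abstract is stated in terms of the $h$th-order empirical entropy $H_h$. As a rough illustration only, the sketch below computes $H_h$ for a small string, assuming the common definition in which each symbol is grouped by the $h$ symbols immediately before it and the zeroth-order entropies of the groups are averaged with weights proportional to group size. The function name `empirical_entropy` is hypothetical and is not taken from the paper, which may condition on contexts in the other direction.

```python
from collections import Counter, defaultdict
from math import log2


def empirical_entropy(text: str, h: int) -> float:
    """Return the h-th order empirical entropy of text, in bits per symbol.

    Assumption: the context of a symbol is the h symbols preceding it.
    """
    n = len(text)
    if n == 0:
        return 0.0
    # For every length-h context, count the symbols that follow it.
    followers = defaultdict(Counter)
    for i in range(h, n):
        followers[text[i - h:i]][text[i]] += 1
    total_bits = 0.0
    for counts in followers.values():
        size = sum(counts.values())
        # Add size * H_0 of the symbols observed in this context.
        total_bits -= sum(c * log2(c / size) for c in counts.values())
    return total_bits / n
```

On a strictly alternating string such as `"abababab"`, the order-0 value is 1 bit per symbol while the order-1 value is 0, which is the kind of gap that makes an $nH_h$-bit index attractive for compressible text.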

### Citations

848 | Managing Gigabytes: Compressing and Indexing Documents and Images
- Witten, Moffat, et al.
- 1999
Citation Context ... need to keep the text itself. However, for moderately large alphabets, these schemes lose sublinear space complexity even if the text is compressible. Large alphabets are typical of phrase searching [5, 21], for example, in which the alphabet is made up of single words and its size cannot be considered a small constant. 1.2 Our Results In this paper, we develop self-indexing data structures that retain f...

666 | Suffix arrays: A new method for on-line string searches. SIAM Journal on Computing
- Manber, Myers
- 1993
Citation Context ...rt $\mathit{rank}_0$ and $\mathit{select}_0$, we can use the fully-indexable version of their structure, called an fid, requiring $\lceil \lg \binom{n}{t} \rceil + O(n \lg\lg n / \lg n)$ bits. 2 Compressed Suffix Arrays A standard suffix array [4, 10] is an array containing the position of each of the $n$ suffixes of text $T$ in lexicographical order. In particular, $SA[i]$ is the starting position in $T$ of the $i$th suffix in lexicographical order. The si...
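
The definition of a standard suffix array quoted in this context can be made concrete with a small sketch. It is a naive illustration, not the compressed structure of the paper: it sorts suffixes directly (quadratic space in the worst case), uses 0-based positions, and the names `suffix_array` and `find_occurrences` are hypothetical.

```python
def suffix_array(text: str) -> list[int]:
    """Return the starting positions of all suffixes in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, sa: list[int], pattern: str) -> list[int]:
    """Return all start positions of pattern in text using two binary searches."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose length-m prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose length-m prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    # Matching suffixes occupy one contiguous range of the suffix array.
    return sorted(sa[start:lo])
```

For example, `find_occurrences("banana", suffix_array("banana"), "ana")` scans only the contiguous block of suffixes that begin with `ana`.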

562 | A space-economical suffix tree construction algorithm - McCreight - 1976 |

199 | Succinct Indexable Dictionaries with Applications to Encoding k-ary Trees and Multisets
- Raman, Raman, et al.
- 2002
Citation Context ...-order entropy to the succinct dictionary problem, in which $t$ keys over a bounded universe $n$ are stored in the information theoretically minimum space, $\lceil \lg \binom{n}{t} \rceil$ bits, plus lower-order terms (e.g., see [16, 17]). Our main result is a new implementation of compressed suffix arrays that exhibits several tradeoffs between occupied space and search/decompression time. In one tradeoff, we can implement the compr...

192 | Compressed suffix arrays and suffix trees with applications to text indexing and string matching (extended abstract)
- Grossi, Vitter
- 2000
Citation Context ...text, which is unusual in classical compression schemes. 1.1 Related Work A new trend in the design of advanced indexes for full-text searching of documents is represented by compressed suffix arrays [6, 18, 19, 20] and opportunistic FM-indexes [2, 3], in that they support the functionalities of suffix arrays and suffix trees, which are more powerful than classical inverted files [4]. (An efficient combination o...

187 | Opportunistic data structures with applications
- Ferragina, Manzini
- 2000
Citation Context ...on schemes. 1.1 Related Work A new trend in the design of advanced indexes for full-text searching of documents is represented by compressed suffix arrays [6, 18, 19, 20] and opportunistic FM-indexes [2, 3], in that they support the functionalities of suffix arrays and suffix trees, which are more powerful than classical inverted files [4]. (An efficient combination of inverted file compression, block a...

183 | Patricia - practical algorithm to retrieve information coded in alphanumeric - Morrison - 1968 |

181 | Space-efficient Static Trees and Graphs - Jacobson - 1989 |

156 | Self-indexing inverted files for fast text retrieval
- Moffat, Zobel
- 1996
Citation Context ... context x create a contiguous sequence of positions from the suffix array. For instance, along the fourth row of the table above for x = ba, there are 10 entries that are contiguous and in the range [14, 23]. The conditional probability that a precedes context ba is 3/10, that b precedes context ba is 7/10, while that of # preceding context ba is 0. As a result, we show in Section 3.5 that encoding each ...

66 | An experimental study of an opportunistic index - Ferragina, Manzini |

62 | Compressed text databases with efficient query algorithms based on the compressed suffix array
- Sadakane
- 2000
Citation Context ...text, which is unusual in classical compression schemes. 1.1 Related Work A new trend in the design of advanced indexes for full-text searching of documents is represented by compressed suffix arrays [6, 18, 19, 20] and opportunistic FM-indexes [2, 3], in that they support the functionalities of suffix arrays and suffix trees, which are more powerful than classical inverted files [4]. (An efficient combination o...

58 | A space-economical suffix tree construction algorithm - McCreight - 1976 |

56 | Space efficient suffix trees
- Munro, Raman, et al.
Citation Context ... where a < b < # and # is a special end-of-text symbol. Here, $\Phi_0(4) = 16$, since $SA_0[4] = 17$ and $SA_0[16] = 17 + 1 = 18$. To retrieve $SA_0[16]$, since $B_0[16] = 1$, we compute $2 \cdot SA_1[\mathit{rank}(B_0, 16)] = 2 \cdot SA_1[12] = 2 \cdot 9 = 18$. To retrieve $SA_0[4]$, since $B_0[4] = 0$, we compute $SA_0[\Phi_0(4)] - 1 = SA_0[16] - 1 = 18 - 1 = 17$. The representation of $B_k$ and $\mathit{rank}(B_k, i)$ uses the methods of [7, 13, 16]. The major hurdle re...
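
The lookup rule quoted in this context (use $SA_1$ when $B_0[i] = 1$, otherwise follow $\Phi_0$ and subtract 1) can be sketched for a single level as follows. This is a minimal illustration under stated assumptions, not the paper's data structure: text positions are 1-based as in the excerpt, array indices are 0-based, the text length is assumed even so every odd position has a successor, `rank1` is a linear scan rather than a succinct constant-time structure, and all function names are hypothetical.

```python
def suffix_positions(text: str) -> list[int]:
    """Plain suffix array holding 1-based text positions."""
    return [i + 1 for i in sorted(range(len(text)), key=lambda i: text[i:])]


def build_level(sa: list[int]):
    """Derive B, the next-level array and Phi from a 1-based suffix array."""
    if len(sa) % 2:
        raise ValueError("this sketch assumes an even text length")
    b = [1 if v % 2 == 0 else 0 for v in sa]           # 1 marks even positions
    sa_next = [v // 2 for v in sa if v % 2 == 0]        # even entries, halved
    where = {v: i for i, v in enumerate(sa)}            # inverse of sa
    # For odd sa[i], Phi points to the index holding sa[i] + 1.
    phi = [i if b[i] else where[sa[i] + 1] for i in range(len(sa))]
    return b, sa_next, phi


def rank1(b: list[int], i: int) -> int:
    """Number of 1s in b[0..i], inclusive (naive linear scan)."""
    return sum(b[: i + 1])


def lookup(i: int, b, sa_next, phi) -> int:
    """Recover sa[i] without reading the original array."""
    if b[i]:
        return 2 * sa_next[rank1(b, i) - 1]
    # phi[i] always lands on an even entry, so one step suffices.
    return lookup(phi[i], b, sa_next, phi) - 1
```

In the full scheme the same decomposition is applied recursively to the next-level array, so only the bit vectors, the $\Phi$ functions, and a small final array need to be stored.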

53 | Low Redundancy in Static Dictionaries with Constant Query Time
- Pagh
Citation Context ...-order entropy to the succinct dictionary problem, in which $t$ keys over a bounded universe $n$ are stored in the information theoretically minimum space, $\lceil \lg \binom{n}{t} \rceil$ bits, plus lower-order terms (e.g., see [16, 17]). Our main result is a new implementation of compressed suffix arrays that exhibits several tradeoffs between occupied space and search/decompression time. In one tradeoff, we can implement the compr...

51 | Succinct Representations of lcp Information and Improvements in the Compressed Suffix Arrays
- Sadakane
- 2002
Citation Context ...osition $x+j$ in the compressed suffix array (recall that $SA[x+j]$ contains the starting position of $S_j$ in the text). Decompressing one text symbol of $S_j$ at a time is inherently sequential as in [2] and [19, 20]. But steps 2–3 of our search tool require us to start decompressing from the $k$th symbol of suffix $S_j$, rather than the first, which could cost us $O(mr)$ time! Fortunately, we can overcome this problem ...

49 | Adding compression to block addressing inverted indexes. Information Retrieval
- Navarro, Moura, et al.
- 2000
Citation Context ...e more powerful than classical inverted files [4]. (An efficient combination of inverted file compression, block addressing and sequential search on word-based Huffman compressed text is described in [15].) They overcome the well-known space limitations by exploiting, in a novel way, the notion of text compressibility and the techniques developed for succinct data structures and bounded-universe dicti...

43 | New Indices for text: PAT trees and PAT arrays
- Gonnet, Baeza-Yates, et al.
- 1992
Citation Context ...ed suffix arrays [6, 18, 19, 20] and opportunistic FM-indexes [2, 3], in that they support the functionalities of suffix arrays and suffix trees, which are more powerful than classical inverted files [4]. (An efficient combination of inverted file compression, block addressing and sequential search on word-based Huffman compressed text is described in [15].) They overcome the well-known space limitat...

36 | A Suboptimal Lossy Data Compression Based on Approximate Pattern
- Luczak, Szpankowski
- 1997
Citation Context ... and Ziv have provided an encoding such that $h \approx \alpha \lg n$ (where $0 < \alpha < 1$) is sufficiently good for approximating $H$ with $H_h$; Luczak and Szpankowski prove a sufficient approximation when $h = O(\lg n)$ in [8]. In order to support fast searching, an index can be formed by preprocessing the text $T$. For any query power of the base-b logarithm of n. Unless specified, we use b = 2. 2 The standard definition c...

15 | Compressed suffix arrays and suffix trees with applications to text indexing and string matching - Grossi, Vitter - 2000 |

13 | Time-space trade-offs for compressed suffix arrays - Rao |

8 | Bit-Tree: A Data Structure for Fast File Processing - Ferguson - 1992 |

8 | Compressed text databases with efficient query algorithms based on the compressed suffix array - Sadakane - 2000 |

6 | Space efficient suffix trees - Munro, Raman, et al. - 2001 |

3 | Self-indexing inverted files for fast text retrieval - Moffat, Zobel - 1996 |

1 | Bit-Tree: a data structure for fast file processing - Ferguson - 1992 |

1 | Time-space trade-offs for compressed suffix arrays - Rao - 2002 |