## When indexing equals compression: Experiments with compressing suffix arrays and applications (2004)

Citations: 43 (5 self)

### BibTeX

@INPROCEEDINGS{Grossi04whenindexing,

author = {Roberto Grossi and Ankur Gupta and Jeffrey Scott Vitter},

title = {When indexing equals compression: Experiments with compressing suffix arrays and applications},

booktitle = {},

year = {2004},

pages = {636--645}

}

### Abstract

We report on a new and improved version of high-order entropy-compressed suffix arrays, which has theoretical performance guarantees similar to those in our earlier work [16], yet represents an improvement in practice. Our experiments indicate that the resulting text index offers state-of-the-art compression. In particular, we require roughly 20% of the original text size (without requiring a separate instance of the text) and support fast and powerful searches. To our knowledge, this is the best known method in terms of space for fast searching.

### Citations

801 | Managing Gigabytes: Compressing and Indexing Documents and Images
- Witten, Moffat, et al.
- 1999
Citation Context ...quence analysis in computational biology [18]. Inverted files do not offer as much functionality, but they provide excellent index compression, requiring about 0.15n log |Σ| bits of space in practice [24, 38]. However, inverted files also require a separate copy of the text. In terms of functionality, inverted files support efficient search only for words (or parts of words) in the text; they cannot searc...

646 | Suffix arrays: a new method for on-line string searches
- Manber, Myers
- 1993
Citation Context ...mpressed text is described in [27]. The suffix tree is a much more powerful text index (in the form of a compact trie) whose leaves store each of the n suffixes contained in the text T. Suffix trees [21, 23] allow fast, general search of patterns ∗ Dipartimento di Informatica, Università di Pisa, via F. Buonarroti 2, 56127 Pisa (grossi@di.unipi.it). Support was provided in part by the Italian MIUR projec...

549 | A space-economical suffix tree construction algorithm
- McCreight
- 1976
Citation Context ...mpressed text is described in [27]. The suffix tree is a much more powerful text index (in the form of a compact trie) whose leaves store each of the n suffixes contained in the text T. Suffix trees [21, 23] allow fast, general search of patterns ∗ Dipartimento di Informatica, Università di Pisa, via F. Buonarroti 2, 56127 Pisa (grossi@di.unipi.it). Support was provided in part by the Italian MIUR projec...

197 | An introduction to arithmetic coding
- Langdon
- 1984
Citation Context ...dicate its value as a compression mechanism dependent on the Φ transformation of the data. In the table, wave refers to the wavelet tree built on the original text; arit refers to the arithmetic code [33]; bzip2 version 1.0.2 is the Unix implementation of block sorting based on the Burrows-Wheeler transform; gzip is version 1.3.5; lha is version 1.14i [3]; vh1 is Karl Malbrain and David Scott’s implem...

193 | High-order entropy-compressed text indexes
- Grossi, Gupta, et al.
- 2003
Citation Context ...ffrey Scott Vitter ‡ Abstract We report on a new and improved version of high-order entropy-compressed suffix arrays, which has theoretical performance guarantees similar to those in our earlier work [16], yet represents an improvement in practice. Our experiments indicate that the resulting text index offers state-of-the-art compression. In particular, we require roughly 20% of the original text size...

193 | Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets
- Raman, Raman, et al.
Citation Context ...B, i) returns the position of the ith 1 in B. (Analogous definitions hold for 0s.) The ith bit in B can be computed as B[i] = rank_1(i) − rank_1(i − 1). The constant-time fully indexable dictionaries [31] support the full repertoire of rank and select for both 0s and 1s in ... 2.1 Practical Dictionaries In this section, we explore practical alternatives to dictionaries for compressed text indexing...

188 | Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching
- Grossi, Vitter
Citation Context ...ficant factor in compression. Apparently, this issue has been somehow obscured in previous literature. Now we prove our main theorem in this section, which describes how to encode the Φ function from [17]. The neighbor function Φ is nothing more than the inverse of the LF mapping from the Burrows-Wheeler transform. It encodes for each position, in terms of suffix arrays, the location of the next suffi...

184 | The lca problem revisited
- Bender, Farach-Colton
- 2000
Citation Context ...loyed a simple heuristic that introduces an arbitrarily chosen parameter S = O(log n) that represents the slowdown factor. We implement part of the lowest common ancestor simplification introduced in [7]. We use our dictionaries and sparsification of the entries, sped up with tricks to take advantage of parallelism in modern processors. Once we have this, we can use just O(1) additional words to get ...

180 | Opportunistic Data Structures with Applications
- Ferragina, Manzini
- 2000
Citation Context ...) time, but they require a copy of the text; the space cost is only n log n bits (which in some cases can be reduced about 40%). Compressed suffix arrays [17, 32, 34, 36] and opportunistic FM-indexes [11, 12] represent new trends in the design of advanced indexes for full-text searching of documents, in that they support the functionalities of suffix arrays and suffix trees, which are more powerful than c...

154 | Fractional cascading: I. A data structuring technique
- Chazelle, Guibas
- 1986
Citation Context ...ming a sequence of select operations along an upward traversal of p nodes/dictionaries through the tree. We reduce the O(p log t) cost to O(p + log t) by using an idea similar to fractional cascading [10]. Suppose a dictionary D is the child of dictionary D′ in the wavelet tree. Suppose we have just performed a binary search in A0 of D. We can predict the position in A0 of D′ to continue searching. ...

149 | A locally adaptive data compression scheme
- Bentley, Sleator, et al.
- 1986
Citation Context ...se these dictionaries (organized in a wavelet tree) to achieve a simplified “encoding” for high-order contexts, along with run-length encoding (RLE) and γ encoding. This shows that Move-to-Front (MTF) [8], arithmetic and Huffman encoding are not strictly necessary to achieve high-order compression with BWT. We then extend the wavelet tree so that its search can be sped up by fractional cascading and e...

148 | Self-indexing inverted files for fast text retrieval
- Moffat, Zobel
- 1996
Citation Context ...quence analysis in computational biology [18]. Inverted files do not offer as much functionality, but they provide excellent index compression, requiring about 0.15n log |Σ| bits of space in practice [24, 38]. However, inverted files also require a separate copy of the text. In terms of functionality, inverted files support efficient search only for words (or parts of words) in the text; they cannot searc...

140 | Succinct Representation of Balanced Parentheses and Static Trees
- Munro, Raman
Citation Context ... also retrieve any individual bit in constant time. Hence, succinct dictionaries are not only of theoretical interest but they provide the basis for space-efficient representation of trees and graphs [19, 25]. Recently, dictionaries have been shown to be crucial for text indexing data structures [16]. Specifically, the data structuring framework in [16] uses suffix arrays to transform dictionaries into hi...

117 | Reducing the space requirement of suffix trees
- Kurtz
- 1999
Citation Context ...and performs better on bible.txt and ap90-64.txt. 4 An Application: Space-Efficient Suffix Trees In this section, we apply our ideas to the implementation of a space-efficient version of suffix trees [20]. We consider more than just the problem of searching, as suffix trees are at the heart of many algorithms on strings and sequences, so their full functionality is needed [18]. From a theoretical poin...

65 | An Experimental Study of an Opportunistic Index
- Ferragina, Manzini
- 2001
Citation Context ...) time, but they require a copy of the text; the space cost is only n log n bits (which in some cases can be reduced about 40%). Compressed suffix arrays [17, 32, 34, 36] and opportunistic FM-indexes [11, 12] represent new trends in the design of advanced indexes for full-text searching of documents, in that they support the functionalities of suffix arrays and suffix trees, which are more powerful than c...

63 | Membership in constant time and almost minimum space
- Brodnik, Munro
- 1999
Citation Context ...ve, and alternative compression. In particular, such dictionaries avoid a complete sequential scan of the data when retrieving portions of it. 2 A Simple Yet Powerful Dictionary Succinct dictionaries [9, 30] are constant-time rank and select data structures occupying tiny space. They store t entries chosen from a bounded universe [0 . . . n − 1] (or any translation of it) in $\lceil \log \binom{n}{t} \rceil \le n$ bits, plus...

60 | Compressing integers for fast file access
- Williams, Zobel
- 1999

59 | Compressed text databases with efficient query algorithms based on the compressed suffix array
- Sadakane
- 2000
Citation Context ...at searching. Their search time is O(m + log n) time, but they require a copy of the text; the space cost is only n log n bits (which in some cases can be reduced about 40%). Compressed suffix arrays [17, 32, 34, 36] and opportunistic FM-indexes [11, 12] represent new trends in the design of advanced indexes for full-text searching of documents, in that they support the functionalities of suffix arrays and suffix...

50 | Succinct representations of LCP information and improvements in the compressed suffix arrays
- Sadakane
- 2002
Citation Context ...at searching. Their search time is O(m + log n) time, but they require a copy of the text; the space cost is only n log n bits (which in some cases can be reduced about 40%). Compressed suffix arrays [17, 32, 34, 36] and opportunistic FM-indexes [11, 12] represent new trends in the design of advanced indexes for full-text searching of documents, in that they support the functionalities of suffix arrays and suffix...

49 | Adding compression to block addressing inverted indices. Inf. Retrieval
- Navarro, Moura, et al.
- 2000
Citation Context ...tten in Eastern languages, or phrase searching [1]. An efficient combination of inverted file compression, block addressing, and sequential search on word-based Huffman compressed text is described in [27]. The suffix tree is a much more powerful text index (in the form of a compact trie) whose leaves store each of the n suffixes contained in the text T. Suffix trees [21, 23] allow fast, general searc...

49 | Low redundancy in static dictionaries with constant query time
- Pagh
- 2001
Citation Context ...ve, and alternative compression. In particular, such dictionaries avoid a complete sequential scan of the data when retrieving portions of it. 2 A Simple Yet Powerful Dictionary Succinct dictionaries [9, 30] are constant-time rank and select data structures occupying tiny space. They store t entries chosen from a bounded universe [0 . . . n − 1] (or any translation of it) in $\lceil \log \binom{n}{t} \rceil \le n$ bits, plus...

44 | The enhanced suffix array and its applications to genome analysis
- Abouelhoda, Kurtz, et al.
- 2002
Citation Context ...the suffix tree for the human genome). We exploit a folklore relationship between suffix tree nodes and intervals in the suffix array, which has been also used recently to devise efficient algorithms [4, 5, 6]. For each node u, there are two integers 1 ≤ u_l ≤ u_r ≤ n such that SA[u_l . . . u_r] contains all the suffixes stored in the leaves descending from u. Thus, u ≡ (u_l, u_r, ℓ_u) is a triple of integers in ...

38 | Optimal Exact String Matching Based on Suffix Arrays
- Abouelhoda, Ohlebusch, et al.
- 2002
Citation Context ...the suffix tree for the human genome). We exploit a folklore relationship between suffix tree nodes and intervals in the suffix array, which has been also used recently to devise efficient algorithms [4, 5, 6]. For each node u, there are two integers 1 ≤ u_l ≤ u_r ≤ n such that SA[u_l . . . u_r] contains all the suffixes stored in the leaves descending from u. Thus, u ≡ (u_l, u_r, ℓ_u) is a triple of integers in ...

15 | New indices for text: Pat trees and pat arrays. Information retrieval: data structures and algorithms
- Gonnet, Baeza-Yates, et al.
- 1992
Citation Context ...s another well-known index structure. It maintains the permuted order of 1, 2, . . . , n that corresponds to the locations of the suffixes of the text in lexicographically sorted order. Suffix arrays [15, 21] (also storing the length of the longest common prefix) are nearly as good at searching. Their search time is O(m + log n) time, but they require a copy of the text; the space cost is only n log n bit...

13 | Time-space trade-offs for compressed suffix arrays
- Rao
Citation Context ...at searching. Their search time is O(m + log n) time, but they require a copy of the text; the space cost is only n log n bits (which in some cases can be reduced about 40%). Compressed suffix arrays [17, 32, 34, 36] and opportunistic FM-indexes [11, 12] represent new trends in the design of advanced indexes for full-text searching of documents, in that they support the functionalities of suffix arrays and suffix...

12 | Compression boosting in optimal linear time using the Burrows-Wheeler Transform
- Ferragina, Manzini
- 2004
Citation Context ...arithmetic, or multi-table Huffman encoding. Giancarlo and Sciortino [14] also avoided using the MTF encoder, but it came at the price of a quadratic dynamic programming scheme. Ferragina and Manzini [13] recently devised a linear-time method to partition BWT optimally for any given H_0 compressor, so as to achieve high-order entropy without using a MTF encoding. Moreover, finding a tighter encoding th...

8 | Optimal partitions of strings: A new class of Burrows-Wheeler compression algorithms
- Giancarlo, Sciortino
- 2003
Citation Context ... structures. Said more mathematically, we can split the cost in [16] as nH_h + M(h), where M(h) refers to the overhead necessary to encode a statistical model for contexts of length h. In other papers [14, 22], it is assumed that M(h) is a constant bounded by O(Σ^h). However, this assumption fails for sufficiently large values in our experiments (h ≥ 4). In fact, it is trivial to show that for sufficientl...

7 | The LZ-index: A text index based on the Ziv-Lempel trie
- Navarro
- 2003
Citation Context ... devised the FM-index [11, 12], which is based on the Burrows-Wheeler transform (BWT) and is the first to encode the index size with respect to the hth-order empirical entropy H_h of the text. Navarro [28] recently developed an index requiring 4nH_h + o(n) bits, and boasts fast search. Grossi, Gupta, and Vitter [16] exploited the higher-order entropy H_h of the text to represent a compressed suffix array...

6 | Efficient discovery of proximity patterns with suffix arrays
- Arimura, Asaka, et al.
- 2001
Citation Context ...the suffix tree for the human genome). We exploit a folklore relationship between suffix tree nodes and intervals in the suffix array, which has been also used recently to devise efficient algorithms [4, 5, 6]. For each node u, there are two integers 1 ≤ u_l ≤ u_r ≤ n such that SA[u_l . . . u_r] contains all the suffixes stored in the leaves descending from u. Thus, u ≡ (u_l, u_r, ℓ_u) is a triple of integers in ...

4 | An analysis of the Burrows-Wheeler transform
- Manzini
- 2001
Citation Context ... structures. Said more mathematically, we can split the cost in [16] as nH_h + M(h), where M(h) refers to the overhead necessary to encode a statistical model for contexts of length h. In other papers [14, 22], it is assumed that M(h) is a constant bounded by O(Σ^h). However, this assumption fails for sufficiently large values in our experiments (h ≥ 4). In fact, it is trivial to show that for sufficientl...

2 | Space Efficient Suffix Trees
- Munro, Raman, Rao
Citation Context ...est common prefix (LCP) information, which requires at least 6n bits [36]. As an alternative, the same information can be maintained in at least 4n bits to retain the tree shape of at most 2n−1 nodes [26], though there is a slowdown since the LCP information is not stored explicitly. In either case, a separate (compressed) suffix array is needed. As a result, the best theoretical representation of suf...

1 | Space-efficient static trees and graphs. FOCS
- Jacobson
- 1989
Citation Context ... also retrieve any individual bit in constant time. Hence, succinct dictionaries are not only of theoretical interest but they provide the basis for space-efficient representation of trees and graphs [19, 25]. Recently, dictionaries have been shown to be crucial for text indexing data structures [16]. Specifically, the data structuring framework in [16] uses suffix arrays to transform dictionaries into hi...

1 | Run length encoding/RLE
- Nelson
Citation Context .... Each of the encoding schemes is used in conjunction with RLE (unless noted otherwise) to provide the results in the table. Golomb uses the median value as its parameter b. Maniscalco refers to code [29] that is specially tailored for RLE in the Burrows-Wheeler transform (BWT). Bernoulli is the skewed Bernoulli model with the median value as its parameter b. MixBernoulli uses just one bit to encode g...
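
Several of the citation contexts above rely on rank and select queries over a bit sequence B, including the identity B[i] = rank_1(i) − rank_1(i − 1). The following is a minimal Python sketch of those two operations, assuming 1-based positions and plain lookup tables; it is not the succinct constant-time structure of [31], and the class name `BitDictionary` and its method names are hypothetical, chosen only for illustration.

```python
from itertools import accumulate


class BitDictionary:
    """Uncompressed rank/select over a bit sequence B[1..n] (illustrative only)."""

    def __init__(self, bits):
        self.bits = list(bits)
        # prefix[i] = number of 1s among B[1..i]; prefix[0] = 0.
        self.prefix = [0] + list(accumulate(self.bits))
        # ones[k - 1] = 1-based position of the k-th 1 in B.
        self.ones = [i for i, b in enumerate(self.bits, start=1) if b == 1]

    def rank1(self, i):
        """Number of 1s in B[1..i], for 0 <= i <= n."""
        return self.prefix[i]

    def select1(self, k):
        """Position of the k-th 1 in B, for 1 <= k <= number of 1s."""
        return self.ones[k - 1]

    def access(self, i):
        """Recover B[i] from rank alone: B[i] = rank1(i) - rank1(i - 1)."""
        return self.rank1(i) - self.rank1(i - 1)


B = [0, 1, 1, 0, 1, 0, 0, 1]
d = BitDictionary(B)
```

With this sketch, `d.access(i)` reproduces every bit of `B`, and `d.rank1(d.select1(k))` equals `k` for each valid `k`. The tables here take far more than the entropy-bounded space discussed in the contexts; the sketch only shows the query semantics.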