## Implicit compression boosting with applications to self-indexing (2007)

Venue: In Proc. SPIRE'07, LNCS 4726

Citations: 29 (16 self)

### BibTeX

@INPROCEEDINGS{Mäkinen07implicitcompression,

author = {Veli Mäkinen and Gonzalo Navarro},

title = {Implicit compression boosting with applications to self-indexing},

booktitle = {In Proc. SPIRE'07, LNCS 4726},

year = {2007},

pages = {229--241}

}

### Abstract

Compression boosting (Ferragina & Manzini, SODA 2004) is a new technique to enhance zeroth order entropy compressors' performance to k-th order entropy. It works by constructing the Burrows-Wheeler transform of the input text, finding optimal partitioning of the transform, and then compressing each piece using an arbitrary zeroth order compressor. The optimal partitioning has the property that the achieved compression is boosted to k-th order entropy, for any k. The technique has an application to text indexing: Essentially, building a wavelet tree (Grossi et al., SODA 2003) for each piece in the partitioning yields a k-th order compressed full-text self-index providing efficient substring searches on the indexed text (Ferragina et al., SPIRE 2004). In this paper, we show that using explicit compression boosting with wavelet trees is not necessary; our new analysis reveals that the size of the wavelet tree built for the complete Burrows-Wheeler transformed text is, in essence, the sum of those built for the pieces in the optimal partitioning. Hence, the technique provides a way to do compression boosting implicitly, with a trivial linear time algorithm, but fixed to a specific zeroth order compressor (Raman et al., SODA 2002). In addition to having these consequences on compression and static full-text self-indexes, the analysis shows that a recent dynamic zeroth order compressed self-index (Mäkinen & Navarro, CPM 2006) occupies in fact space proportional to k-th order entropy.
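The entropy accounting behind the abstract can be illustrated with a small sketch. It is not the authors' algorithm: it builds the Burrows-Wheeler transform naively from sorted rotations, splits it by fixed-length k-contexts (a simplification of the optimal partitioning the abstract mentions), and compares the zeroth-order cost of the whole transform with the summed zeroth-order costs of the pieces. The function names `bwt`, `h0_bits`, and `context_pieces` are hypothetical, and no wavelet tree or compressed bit-vector encoding is implemented here.

```python
from collections import Counter
from math import log2


def bwt(text):
    """Burrows-Wheeler transform of a cyclic text via sorted rotations.

    Quadratic-space illustration only; real indexes use suffix sorting.
    """
    n = len(text)
    order = sorted(range(n), key=lambda i: text[i:] + text[:i])
    # The BWT symbol of a rotation is the symbol cyclically preceding it.
    return "".join(text[i - 1] for i in order)


def h0_bits(s):
    """|s| * H0(s): zeroth-order empirical entropy of s, in total bits."""
    n = len(s)
    return sum(c * log2(n / c) for c in Counter(s).values())


def context_pieces(text, k):
    """Split the BWT into pieces, one per distinct k-context.

    Assumption: a fixed context length k is used, not the optimal
    partitioning of the boosting technique.
    """
    n = len(text)
    order = sorted(range(n), key=lambda i: text[i:] + text[:i])
    pieces = {}
    for i in order:
        ctx = (text[i:] + text[:i])[:k]  # the k symbols following text[i-1]
        pieces.setdefault(ctx, []).append(text[i - 1])
    # Rotations sharing a k-prefix are adjacent after sorting, so the
    # pieces are contiguous runs of the BWT, in BWT order.
    return ["".join(p) for p in pieces.values()]


if __name__ == "__main__":
    t = "mississippi$"
    whole = h0_bits(bwt(t))
    for k in (1, 2):
        boosted = sum(h0_bits(p) for p in context_pieces(t, k))
        print(f"k={k}: whole BWT {whole:.2f} bits, pieces {boosted:.2f} bits")
```

Summing `h0_bits` over the pieces gives n times the k-th order empirical entropy under the cyclic-text convention, which never exceeds the zeroth-order cost of the whole transform. The paper's claim is that a single wavelet tree over the whole transform, with suitably compressed bit vectors, already achieves essentially the summed figure.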

### Citations

644 | Suffix arrays: a new method for on-line string searches
- Manber, Myers
- 1990
Citation Context ... There are several classical full-text indexes requiring O(n log n) bits of space which can answer counting queries in O(m log σ) time (like suffix trees [1]) or O(m + log n) time (like suffix arrays [18]). Both locate each occurrence in constant time once the counting is done. (Footnotes: ⋆ Funded by the Academy of Finland under grant 108219. ⋆⋆ Partially funded by Fondecyt Grant 1-050493, Chile.) Similar complex...

565 | A Block-sorting Lossless Data Compression Algorithm
- Burrows, Wheeler
- 1994
Citation Context ...y to work without the text and even fully replace it, by delivering any text substring without accessing T. The main building blocks in compressed self-indexes are the Burrows-Wheeler transform T^bwt [3] and function rank_c(T^bwt, i) that counts how many times symbol c appears in T^bwt[1, i]. Function rank_c can be efficiently provided by building the wavelet tree [11] on T^bwt; this reduces the pro...

193 | High-order entropy-compressed text indexes
- Grossi, Gupta, et al.
- 2003
Citation Context ... counting is done. Similar complexities are obtained with modern compressed data structures [6,11,9], requiring space nH_k + o(n log σ) bits (for some small k), where H_k ≤ log σ is the k-th order empirical entropy of T. These indexes are often called compressed self-indexes referring to their space re...

191 | Succinct Indexable Dictionaries with Applications to Encoding k-ary Trees and Multisets
- Raman, Raman, et al.
- 2002
Citation Context ... ; this reduces the problem to rank queries on binary sequences, which are already studied by Jacobson [14] in his seminal work on compressed data structures. Using a more recent binary rank solution [23] inside wavelet trees, one almost automatically achieves a compressed self-index taking nH_0 + o(n log σ) bits of space [11,9,16]. Let us call this index Succinct Suffix Array (SSA) following [16]. Wha...

180 | Opportunistic Data Structures with Application
- Ferragina, Manzini
- 2000
Citation Context ... counting is done. Similar complexities are obtained with modern compressed data structures [6,11,9], requiring space nH_k + o(n log σ) bits (for some small k), where H_k ≤ log σ is the k-th order empirical entropy of T. These indexes are often called compressed self-indexes referring to their space re...

170 | Space-efficient static trees and graphs
- Jacobson
- 1989
Citation Context ...bwt[1, i]. Function rank_c can be efficiently provided by building the wavelet tree [11] on T^bwt; this reduces the problem to rank queries on binary sequences, which are already studied by Jacobson [14] in his seminal work on compressed data structures. Using a more recent binary rank solution [23] inside wavelet trees, one almost automatically achieves a compressed self-index taking nH_0 + o(n log σ)...

131 | An analysis of the Burrows-Wheeler transform
- Manzini
Citation Context ...ore standard definition. It is anyway known that both definitions do not differ by much [8]. ... w in T and T|_w denotes the concatenation of the symbols appearing immediately before those n_w occurrences [19]. Substring w = T[i + 1, i + k] is called the k-context of symbol t_i. We take T here as a cyclic string, such that t_n precedes t_1, and thus the amount of k-contexts is exactly n. 3 Previous Results 3.1...

119 | Indexing compressed text
- Ferragina, Manzini
Citation Context ...ion for technical convenience. If this is an issue, the texts can be handled reversed to obtain results on the more standard definition. It is anyway known that both definitions do not differ by much [8]. ... w in T and T|_w denotes the concatenation of the symbols appearing immediately before those n_w occurrences [19]. Substring w = T[i + 1, i + k] is called the k-context of symbol t_i. We take T here as ...

115 | The myriad virtues of subword trees
- Apostolico
- 1985
Citation Context ...f P in T; (b) locate those occ positions in T. There are several classical full-text indexes requiring O(n log n) bits of space which can answer counting queries in O(m log σ) time (like suffix trees [1]) or O(m + log n) time (like suffix arrays [18]). Both locate each occurrence in constant time once the ...

110 | Compressed representations of sequences and full-text indexes
- Ferragina, Manzini, et al.
Citation Context ... counting is done. Similar complexities are obtained with modern compressed data structures [6,11,9], requiring space nH_k + o(n log σ) bits (for some small k), where H_k ≤ log σ is the k-th order empirical entropy of T. These indexes are often called compressed self-indexes referring to their space re...

53 | Succinct suffix arrays based on run-length encoding
- Mäkinen, Navarro
Citation Context ... work on compressed data structures. Using a more recent binary rank solution [23] inside wavelet trees, one almost automatically achieves a compressed self-index taking nH_0 + o(n log σ) bits of space [11,9,16]. Let us call this index Succinct Suffix Array (SSA) following [16]. What has remained unnoticed so far is that SSA actually takes only nH_k + o(n log σ) bits of space. This result makes some of the mo...

50 | Breaking a time-and-space barrier in constructing full-text indices
- Hon, Sadakane, et al.
- 2003
Citation Context ...ce essentially equal to the k-th order empirical entropy of the text collection, which in addition can be built within this working space. Alternative dynamic indexes or constructions of self-indexes [6,13,2] achieve at best O(nH_k) bits of space (with constants larger than 4), and in many cases worse time complexities. Note also that, from the dynamic index just built, it is very easy to obtain the BWT of...

49 | Dynamic entropy-compressed sequences and full-text indexes
- Mäkinen, Navarro
- 2008

43 | When indexing equals compression: experiments with compressing suffix arrays and applications
- Grossi, Gupta, et al.
- 2004
Citation Context ...ding implies that all rows could have been concatenated into a single wavelet tree and the same space would have been achieved. This would greatly simplify the original arrangement. Interestingly, in [12] they find out that, if they use gap encoding over the successive values along a column, and they then concatenate all the columns, the total space is O(nH_k) without any table partitioning as well. Bo...

25 | Succinct dynamic data structures
- Raman, Raman, et al.
- 2001
Citation Context ... this. Once more, our finding is that this is not really necessary to achieve k-th order compression if the levels of the wavelet tree are represented using the technique of block identifier encoding [22]. 6 Application to Space-Efficient Construction of (Dynamic) Self-Indexes Another consequence of our result is that we obtain an O(n log n log σ) time construction algorithm for a compressed self-inde...

19 | The myriad virtues of wavelet trees
- Ferragina, Giancarlo, et al.
Citation Context ... without any table partitioning as well. Both results stem from the same fact: the cell entropies can be added in any order to get nH_k. Finally, it is interesting to point out that, in a recent paper [5], the possibility of achieving k-th order compression when applying wavelet trees over the BWT is explored (among many other results), yet they resort to run-length compression to achieve this. Once m...

18 | Low redundancy in dictionaries with O(1) worst case lookup time
- Pagh
- 1999
Citation Context ...of B^j of length blk has l^j_1 + ... + l^j_t ≤ l^j bits set. The [...] take in the compressed [...] + t ≤ log C(|B^j|, l^j) + t ≤ |B^j| H_0(B^j) + t, where C(·, ·) denotes the binomial coefficient and all the inequalities hold by simple combinatorial arguments [21] and have been reviewed in Section 3.1. Note that those B^j bit vectors are precisely those that would result if we built the wavelet tree just for L^j. According to Theorem 1, adding up those |B^j|...

16 | Compressed index for a dynamic collection of texts
- Chan, Hon, et al.
Citation Context ...ame nH_0 + o(n) space as the static, but supports rank and select, and in addition insertions and deletions of bits, in O(log n) time. This can then be used to improve the dynamic index of Chan et al. [4] to obtain the above result. Exactly the same analysis as in Sect. 4 applies to this dynamic variant, and Theorem 4 is boosted into the following. Corollary 1. There is a data structure maintaining a ...

14 | Space-efficient construction of LZ-index
- Arroyuelo, Navarro
- 2005
Citation Context ...ce essentially equal to the k-th order empirical entropy of the text collection, which in addition can be built within this working space. Alternative dynamic indexes or constructions of self-indexes [6,13,2] achieve at best O(nH_k) bits of space (with constants larger than 4), and in many cases worse time complexities. Note also that, from the dynamic index just built, it is very easy to obtain the BWT of...

14 | Fast BWT in small space by blockwise suffix sorting
- Kärkkäinen
- 2007
Citation Context ...uild the BWT of a text within entropy bounds. The best result in terms of space complexity takes O(n) bits working space, O(n log^2 n) time in the worst case, and O(n log n) time in the expected case [15]. Using O(n log σ) working space, there is a faster algorithm achieving O(n log log σ) time requirement [13]. Finally, one can achieve the optimal O(n) time with the price of O(n log^ε n log σ) bits o...

12 | Compression boosting in optimal linear time using the Burrows-Wheeler Transform
- Ferragina, Manzini
- 2004
Citation Context ...man, and Rao [23]. In the following, we first define the entropy concepts more formally, then explain the encoding in [23], wavelet trees [11], Burrows-Wheeler transform [3], and compression boosting [7] in order to give our new analysis in a self-contained manner. We conclude with the application to space-efficient construction of (dynamic) full-text self-indexes. 2 Definitions We assume our text T ...

11 | Large alphabets and incompressibility
- Gagie
Citation Context ... it may seem when one puts numbers to the condition on k and realizes that the achievable k values are rather low. Worse than that, it is unlikely that this theoretical limit can be sensibly improved [10]. Yet, those limits are worst-case, and different methods may not have to pay the Θ(σ^(k+1) log n) space overhead in practice. For example, in our case, this overhead comes from the fact that we are una...

9 | Alphabet-independent linear-time construction of compressed suffix arrays using o(n log n)-bit working space
- Na, Park
- 2007
Citation Context ...pace, there is a faster algorithm achieving O(n log log σ) time requirement [13]. Finally, one can achieve the optimal O(n) time with the price of O(n log^ε n log σ) bits of space, for some 0 < ε < 1 [20]. 7 Final Practical Considerations Our main finding is that all the sophistications [16,9,11] built over the simple “wavelet tree on top of the BWT” scheme in order to boost its zero-order to high-or...