## Improved compressed indexes for full-text document retrieval (2011)

Venue: Proc. 18th SPIRE

Citations: 10 (7 self)

### BibTeX

    @INPROCEEDINGS{Belazzougui11improvedcompressed,
        author = {Djamal Belazzougui and Gonzalo Navarro},
        title = {Improved compressed indexes for full-text document retrieval},
        booktitle = {Proc. 18th SPIRE},
        year = {2011},
        pages = {286--297}
    }

### Abstract

We give new space/time tradeoffs for compressed indexes that answer document retrieval queries on general sequences. On a collection of D documents of total length n, current approaches require at least |CSA| + O(n lg D / lg lg D) or 2|CSA| + o(n) bits of space, where CSA is a full-text index. Using monotone minimum perfect hash functions, we give new algorithms for document listing with frequencies and top-k document retrieval using just |CSA| + O(n lg lg lg D) bits. We also improve current solutions that use 2|CSA| + o(n) bits, and consider other problems such as colored range listing, top-k most important documents, and computing arbitrary frequencies.
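
To make the two main queries concrete, the following is a minimal, uncompressed Python sketch of document listing with frequencies and top-k retrieval. It is not the paper's method: it uses a plain suffix array and a plain document array (one entry per suffix, naming the document that suffix starts in), with none of the compressed structures or hash functions the abstract refers to. The function names, the `$` separator, and the naive suffix sorting are assumptions made only for illustration.

```python
from bisect import bisect_right
from collections import Counter

SEP = "$"  # assumed separator; must not occur in any document or pattern


def build_index(docs):
    """Concatenate the documents and build a plain suffix array plus a
    document array (doc_array[i] = document containing suffix sa[i])."""
    text = SEP.join(docs) + SEP
    starts, pos = [], 0
    for d in docs:
        starts.append(pos)
        pos += len(d) + 1  # +1 skips the separator
    # Naive suffix sorting, fine for a small illustration only.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    doc_array = [bisect_right(starts, p) - 1 for p in sa]
    return text, sa, doc_array


def find_interval(text, sa, pattern):
    """Return the half-open suffix array range [sp, ep) of suffixes
    that start with pattern, using two binary searches."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    sp, hi = lo, len(sa)
    while lo < hi:  # first suffix whose m-prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sp, lo


def list_with_frequencies(index, pattern):
    """Document listing with frequencies: {document id: occurrences}."""
    text, sa, doc_array = index
    sp, ep = find_interval(text, sa, pattern)
    return dict(sorted(Counter(doc_array[sp:ep]).items()))


def top_k(index, pattern, k):
    """Top-k retrieval: the k documents where pattern occurs most often."""
    text, sa, doc_array = index
    sp, ep = find_interval(text, sa, pattern)
    return Counter(doc_array[sp:ep]).most_common(k)
```

For example, after `index = build_index(["abracadabra", "cadabra", "banana"])`, calling `list_with_frequencies(index, "abra")` reports two occurrences in document 0 and one in document 1. This baseline scans the whole interval, so its query time grows with the number of occurrences rather than the number of documents reported, and the document array alone takes n lg D bits; the abstract's space bounds are about avoiding exactly that overhead.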

### Citations

645 | Suffix arrays: a new method for on-line string searches
- Manber, Myers
- 1993
Citation Context ...ull-text index [17] is used as the base data structure. This is usually a compressed suffix array of T (we call this structure CSA and its bit space |CSA|). The CSA simulates the suffix array A[1, n] [13], where A[i] points to the ith lexicographically smallest suffix in T. The CSA finds the interval A[sp, ep] of occurrences of P in time t_search, usually O(m lg σ) or less [9, 5]. It can ...

193 | High-order entropy-compressed text indexes
- Grossi, Gupta, et al.
- 2003
Citation Context ...he suffix array A[1, n] [13], where A[i] points to the ith lexicographically smallest suffix in T. The CSA finds the interval A[sp, ep] of occurrences of P in time t_search, usually O(m lg σ) or less [9, 5]. It can also compute any cell A[i], and even A^{-1}[i], in ti... (Footnote: Partially funded by Fondecyt Grant 1-110066, Chile. First author also partially supported by the French ANR-2010-COSI-004 MAPPI Project.)

192 | Succinct indexable dictionaries with applications to encoding k-ary trees and multisets
- Raman, Raman, et al.
Citation Context ... bit b ∈ {0, 1} in B[1, i], whereas select_b(B, j) is the position of the jth occurrence of bit b in B. There exists a representation for B using lg (n choose m) + O(lg lg m) + o(n) = m lg(n/m) + O(m) + o(n) bits [21], solving both operations in constant time. As B can be reconstructed using operation rank, this space is asymptotically optimal. A mmphf can be seen as a weaker structure on B, able to answer only ra...

172 | Compressed full-text indexes
- Navarro, Mäkinen
Citation Context ...es: List the distinct documents where P appears, and the frequency (number of occurrences) of P in each. Top-k retrieval: List the k documents where P appears most times. A compressed full-text index [17] is used as the base data structure. This is usually a compressed suffix array of T (we call this structure CSA and its bit space |CSA|). The CSA simulates the suffix array A[1, n] [13], where A[i] po...

131 | An analysis of the Burrows-Wheeler transform
- Manzini
- 2001
Citation Context ... > 0. These indexes represent the text and the suffix array within as little as n H_h(T) + o(n lg σ) bits, for any h ≤ α log_σ n and constant α < 1. Here H_h(T) is the empirical h-th order entropy of T [14], a lower bound on the bits-per-symbol a statistical order-h compressor may achieve on T. In the rest of the section we describe our contributions in context. We introduce at this point the concepts ...

114 | The myriad virtues of subword trees
- Apostolico
- 1985
Citation Context ...ular sorting problem, which can be of independent interest. Table 1 summarizes our results on this part. 1.2 Top-k Document Retrieval. The pioneering work of Hon et al. [11] uses a sampled suffix tree [1] of o(n) extra bits to reduce this problem to that of accessing E[i] and computing arbitrary frequencies (document listing with frequencies turns out to be a simpler problem). They achieve time O(tsea...

111 | Compressed representations of sequences and full-text indexes
- Ferragina, Manzini, et al.
Citation Context ...he suffix array A[1, n] [13], where A[i] points to the ith lexicographically smallest suffix in T. The CSA finds the interval A[sp, ep] of occurrences of P in time t_search, usually O(m lg σ) or less [9, 5]. It can also compute any cell A[i], and even A^{-1}[i], in ti...

87 | New text indexing functionalities of the compressed suffix arrays
- Sadakane
- 2003

84 | Log-logarithmic worst-case range queries are possible in space Θ(n), Information Processing Letters 17 (2
- Willard
- 1983
Citation Context ...recomputed candidates and the (at most 2b) distinct documents mentioned in these remaining intervals, compute their frequencies in A[sp, ep], and take the k highest frequencies. By using y-fast tries [25] on the identifiers and on the frequencies, the process takes time O(t_op b), where t_op = t_SA + t_count + lg lg n and t_count is the time to count an arbitrary frequency (the lg lg n will be absorbed by a...

73 | Efficient algorithms for document retrieval problems, in
- Muthukrishnan
Citation Context ... general sequences over alphabet [1, σ]), concatenated into a text T[1, n], preprocess T so as to later answer various queries of significance in IR. The problem has received much attention recently [16, 22, 24, 11, 8, 7, 4, 12] as a natural evolution over plain full-text indexing (which just counts and locates all the individual occurrences of a pattern P[1, m] in T) and for its applications in IR on Oriental languages, s...

52 | Succinct data structures for flexible text retrieval systems
- Sadakane
- 2007
Citation Context ... general sequences over alphabet [1, σ]), concatenated into a text T[1, n], preprocess T so as to later answer various queries of significance in IR. The problem has received much attention recently [16, 22, 24, 11, 8, 7, 4, 12] as a natural evolution over plain full-text indexing (which just counts and locates all the individual occurrences of a pattern P[1, m] in T) and for its applications in IR on Oriental languages, s...

49 | Practical entropy-compressed rank/select dictionary
- Okanohara, Sadakane
- 2007
Citation Context ...er x_i of the sampled tree node lca(p_i, p_{i+1}). As x_i ≥ x_{i−1}, values x_i + i are increasing, and thus can be stored in a structure of (n/b) lg(2n / (n/b)) + O(n/b) bits that retrieves any x_i in constant time [19]. This space is O((n/b) lg b) = O(n (lg k + lg lg n) / (k lg D lg(D/k) lg^ε n)) = o(n). Now we can find in constant time the lowest sampled node covering chunk interval [L, R] as lca(preorder^{-1}(x_L), preo...

34 | Fully-functional succinct trees
- Sadakane, Navarro
- 2010
Citation Context ... that if the trees are stored using pointers, then there is a component of O((n/b) lg n) bits for k = 1, and thus ℓ must be at least lg^{1+ε} n. To avoid this we store the sampled tree in succinct form [23] using just 2 + o(1) bits per node and supporting in O(1) time many operations, including lca, preorder (whose consecutive values are used to index an array storing the top-k candidate data on each no...

33 | Space-efficient algorithms for document retrieval, in
- Välimäki, Mäkinen
Citation Context ... general sequences over alphabet [1, σ]), concatenated into a text T[1, n], preprocess T so as to later answer various queries of significance in IR. The problem has received much attention recently [16, 22, 24, 11, 8, 7, 4, 12] as a natural evolution over plain full-text indexing (which just counts and locates all the individual occurrences of a pattern P[1, m] in T) and for its applications in IR on Oriental languages, s...

26 | Range quantile queries: Another virtue of wavelet trees
- Gagie, Puglisi, et al.
Citation Context ...ent i is the ith most important in the collection. Then the problem becomes that of finding the k smallest distinct values in E[sp, ep]. While methods based on range quantile queries on wavelet trees [8] naturally report the documents in sorted order and thus automatically solve this problem in O(k lg D) time by pruning the process after reporting k results, the situation is not that easy for the oth...

24 | Space-efficient framework for top-k string retrieval problems, in
- Hon, Shah, et al.

21 | Top-k ranked document search in general text databases
- Culpepper, Navarro, et al.

21 | Optimal succinctness for range minimum queries, in
- Fischer
Citation Context ...ent listing algorithm [16] within time O(t_SA) per document reported, in addition to the time t_search. The total space is |CSA| + O(n), the latter coming from range minimum query (RMQ) data structures [6]. The space was made succinct by Hon et al. [11], by sparsifying the RMQ structures over array blocks of size lg^ε n, so that the time raises to O(t_SA lg^ε n) and the space decreases to |CSA| + o(n) +...

19 | Monotone minimal perfect hashing: searching a sorted table with o(1) accesses
- Belazzougui, Boldi, et al.
Citation Context ... mmphf can be represented within less space than the previous lower bound: within O(m lg lg(n/m)) bits it answers the limited rank query in constant time, and using O(m lg lg lg(n/m)) bits it takes time O(lg lg(n/m)) [2]. 1.1 Document Listing with Frequencies. The pioneering work in this area [16] defines a document array E[1, n], where E[i] tells the document to which suffix A[i] belongs. As noted by...

16 | Colored range queries and document retrieval, in
- Gagie, Navarro, et al.

14 | Alphabet-independent compressed text indexing, in
- Belazzougui, Navarro

12 | Theory and practise of monotone minimal perfect hashing
- Belazzougui, Boldi, et al.
- 2009
Citation Context ...time adds up to O(t_SA k lg k lg^ε n). 6 Final Remarks. A natural next step is to implement these solutions. Many of our improvements are easy to implement, and practical implementations of mmphfs exist [3]. A recent empirical work [18] shows that the individual CSA_d's pose much space overhead, at least if implemented naively. Instead, they compress wavelet trees to 7-17 bpc (bits per text character), c...

10 | Practical compressed document retrieval, in
- Navarro, Puglisi, et al.
Citation Context ...g^ε n). 6 Final Remarks. A natural next step is to implement these solutions. Many of our improvements are easy to implement, and practical implementations of mmphfs exist [3]. A recent empirical work [18] shows that the individual CSA_d's pose much space overhead, at least if implemented naively. Instead, they compress wavelet trees to 7-17 bpc (bits per text character), compared to the 4.5-6.0 bpc of ...

7 | Optimal trade-offs for succinct string indexes, in
- Grossi, Orlandi, et al.
Citation Context ... need an auxiliary mechanism to compute the cells of E are able to reduce space. For example, they achieved O(n lg D / lg lg D) bits with O(t_SA lg lg D) time by using a succinct index by Grossi et al. [10]. The very same lower bounds on sequence rank given by Grossi et al. show that this tradeoff is optimal. Our first major contribution improves upon this apparent lower bound. We obtain a succinct inde...

6 | Top-k color queries for document retrieval, in
- Karpinski, Nekrich

3 | Modern Information Retrieval, 2nd Edition
- Baeza-Yates, Ribeiro
- 2011