## Colored Range Queries and Document Retrieval

Citations: 16 (9 self)

### BibTeX

@MISC{Gagie_coloredrange,

author = {Travis Gagie and Gonzalo Navarro and Simon J. Puglisi},

title = {Colored Range Queries and Document Retrieval},

year = {}

}

### Abstract

Colored range queries are a well-studied topic in computational geometry and database research that, in the past decade, have found exciting applications in information retrieval. In this paper we give improved time and space bounds for three important one-dimensional colored range queries — colored range listing, colored range top-k queries and colored range counting — and, thus, new bounds for various document retrieval problems on general collections of sequences. Specifically, we first describe a framework including almost all recent results on colored range listing and document listing, which suggests new combinations of data structures for these problems. For example, we give the fastest compressed data structures for colored range listing and document listing, and an efficient data structure for document listing whose size is bounded in terms of the high-order entropies of the library of documents. We then show how (approximate) colored top-k queries can be reduced to (approximate) range-mode queries on subsequences, yielding the first efficient data structure for this problem. Finally, we show how a modified wavelet tree can support colored range counting in logarithmic time and space that is succinct whenever the number of colors is superpolylogarithmic in the length of the sequence.

### Citations

2368 | Modern Information Retrieval - Baeza-Yates, Ribeiro-Neto - 1999 |

645 | Suffix arrays: a new method for on-line string searches
- Manber, Myers
- 1993
Citation Context ...hen, given a pattern, we can quickly find the lexicographic ranks i and j of the first and last suffixes starting with the pattern. This is equivalent to finding the range A[i..j] in the suffix array [27] A for T that lists the starting positions of all the suffixes of T that start with the pattern. Once we know i and j, we can implement a DL query as a CRL query on E[i..j]. Altogether, Muthukrishnan’... |

427 | Linear pattern matching algorithms
- Weiner
- 1973
Citation Context ...d the array E[1, n] such that E[i] is the document containing the starting position of the lexicographically ith suffix in T. If we store a suffix tree [38, 1] for T then, given a pattern, we can quickly find the lexicographic ranks i and j of the first and last suffixes starting with the pattern. This is equivalent to finding the range A[i..j] in the suffi... |

193 | High-order entropy-compressed text indexes
- Grossi, Gupta, et al.
- 2003
Citation Context ...y noted that, using one select query per occurrence, they can list the positions of the pattern’s occurrences in a specified document. Gagie, Puglisi and Turpin [15] showed that a binary wavelet tree [18] can be used to compute range quantile queries on S in O(log σ) time, and that these queries can be used to enumerate the distinct elements in S[i..j], eliminating the need for RMQs. A binary wavelet ... |

192 | Succinct indexable dictionaries with applications to encoding k-ary trees and multisets
- Raman, Raman, et al.
Citation Context ... documents start; then, for 1 ≤ ℓ ≤ n, E[ℓ] = rank1(V, CSA[ℓ]), where rank1(V, r) is the number of 1s in V[1..r]. It takes D log(n/D) + O(D) + o(n) bits to store V such that a rank query takes O(1) time [33]. Sadakane did not store C at all so, when listing the distinct documents containing a pattern, he used a D-bit string to mark which documents he had already listed. He used a recursion similar to Mut... |

178 | Scaling and related techniques for geometry problems
- Gabow, Bentley, et al.
- 1984
Citation Context ... i, so that S[ℓ] is the first occurrence of a color in S[i..j] if and only if i ≤ ℓ ≤ j and C[ℓ] < i. He showed how, if we store C in an O(n log n)-bit data structure due to Gabow, Bentley and Tarjan [14] that supports O(1)-time range-minimum queries (RMQs), we can quickly find all the values in C[i..j] less than i and, thus, list all the colors in S[i..j]. To do this, we find the minimum value C[ℓ] i... |

176 | PATRICIA - Practical Algorithm To Retrieve Information Coded - Morrison - 1968 |

172 | Compressed full-text indexes - Navarro, Mäkinen |

131 | An analysis of the Burrows-Wheeler transform
- Manzini
- 2001
Citation Context ...reported color, matching the time of Muthukrishnan’s O(n log n)-bit space solution [31]. The k-th order empirical entropy Hk(S) measures the compressibility of S when we use contexts of length k; see [28] for details. The frequency of any color can be obtained in time O(log log σ). 6+9: is similar to the above but the n o(log σ) space term is avoided, as the structure by Barbay, Gagie, Navarro and Nek... |

114 | The myriad virtues of subword trees
- Apostolico
- 1985
Citation Context ...d the array E[1, n] such that E[i] is the document containing the starting position of the lexicographically ith suffix in T. If we store a suffix tree [38, 1] for T then, given a pattern, we can quickly find the lexicographic ranks i and j of the first and last suffixes starting with the pattern. This is equivalent to finding the range A[i..j] in the suffi... |

111 | Compressed representations of sequences and full-text indexes
- Ferragina, Manzini, et al.
Citation Context ... specifically, for 1 ≤ ℓ ≤ n, C[ℓ] = select_{S[ℓ]}(S, rank_{S[ℓ]}(S, ℓ) − 1), where select_a(S, r) is the position of the rth occurrence of a in S. Välimäki and Mäkinen stored S in a multiary wavelet tree [10], which takes nH0(S) + o(n) log σ bits and O(1 + log σ/ log log n) time; when σ is polylogarithmic in n, it takes nH0(S) + o(n) bits and O(1) time. The 0-th order empirical entropy H0(S) = ∑ occ(a,S) n ... |

98 | A survey of top-k query processing techniques in relational database systems
- Ilyas, Beskales, et al.
Citation Context ...ons to this problem, although finding the k most frequent or important items in various data sets and models is a well studied problem and there has been work on interesting special cases (see, e.g., [22, 24]). Greve, Jørgensen, Larsen and Truelsen [17] recently gave a data structure that, for any ɛ > 0, stores S in O((n/ɛ) log n) bits such that we can find an element such that no element is more than 1 +... |

87 | New text indexing functionalities of the compressed suffix arrays
- Sadakane
- 2003
Citation Context ... same compression algorithm [16] to E we obtain at most (R + D) log(n/(R + D)) rules and hence the space given in the theorem. □ As a final note applying only to document collections, Sadakane’s CSA [34] essentially represents a function Ψ such that A[Ψ(i)] = A[i] + 1, which is stored in compressed form and any value computed in constant time. Thus one advances virtually in the text by successively app... |

73 | Efficient algorithms for document retrieval problems
- Muthukrishnan
Citation Context ...d be, for example, the minimum or maximum value in S[i, j] [12], the element with a specified rank in sorted order [15] (e.g., the median [7]), the mode [17], a complete list of the distinct elements [31], the frequencies of the elements [35], a list of the k most frequent elements for a given k [20], or the number of distinct elements [6]. In this paper, motivated by problems in document retrieval, w... |

53 | Succinct suffix arrays based on run-length encoding
- Mäkinen, Navarro
Citation Context ...ere is some other area A[j..j + ℓ] where A[j + k] = A[i + k] + 1 for all 0 ≤ k ≤ ℓ. Let R be the number of runs with which the SA can be covered; it is known that R ≤ min(n, nHk(T) + σ^k) for any k [25]. González and Navarro represent the SA differentially so that these areas become true repetitions, and use a grammar-based compression algorithm that represents A using at most R log(n/R) rules. We n... |

52 | Succinct data structures for flexible text retrieval systems
- Sadakane
- 2007
Citation Context ...mum value in S[i, j] [12], the element with a specified rank in sorted order [15] (e.g., the median [7]), the mode [17], a complete list of the distinct elements [31], the frequencies of the elements [35], a list of the k most frequent elements for a given k [20], or the number of distinct elements [6]. In this paper, motivated by problems in document retrieval, we consider the latter three kinds of p... |

49 | Practical entropy-compressed rank/select dictionary - Okanohara, Sadakane - 2007 |

41 | A new succinct representation of rmq-information and improvements in the enhanced suffix array - Fischer, Heun - 2007 |

41 | Succinct indexes for strings, binary relations and multilabeled trees - Barbay, He, et al. - 2011 |

39 | Inverted index compression and query processing with optimized document ordering
- Yan, Ding, et al.
- 2009
Citation Context ...suggests. For example, if the documents are webpages sorted lexicographically by URL, then it is more likely that interesting patterns will occur often in clusters of documents than widely spread out [36, 39]. In this case, leaves in a balanced wavelet tree for E that are labelled with the k distinct documents that contain the pattern most often, are likely to share many ancestors; if so, our data structu... |

36 | Optimal lower bounds for rank and select indexes. Theoretical Computer Science 387 - Golynski - 2007 |

34 | Fully-functional succinct trees - Sadakane, Navarro - 2010 |

33 | A simple storage scheme for strings achieving entropy bounds
- Ferragina, Venturini
- 2007
Citation Context ...), the latter term for computing frequencies. For example, 2+9: is Välimäki and Mäkinen’s scheme [37]. 1: is the scheme by Gagie, Puglisi, and Turpin [15]. 3+9+10: combining Ferragina and Venturini’s [11] data structure with Fischer’s [12] succinct index for RMQ and Grossi, Orlandi and Raman’s [19] succinct index for rank gives a solution for CRL that takes nHk(S) + 2n + o(n) log σ + n o(log σ) bits a... |

33 | Space-efficient algorithms for document retrieval
- Välimäki, Mäkinen
Citation Context ...ng (with or without color frequencies), colored range top-k queries, and colored range counting. These have been associated, respectively, to very relevant document retrieval queries on general texts [31, 35, 37, 20, 15, 12, 9]: listing the documents where a pattern appears (possibly computing ⋆ Partially funded by the Millennium Institute for Cell Dynamics and Biotechnology (ICDB), Grant ICM P05-001-F, Mideplan, Chile. 2 T... |

32 | Rank and select revisited and extended
- Mäkinen, Navarro
Citation Context ...Table 1. Space and time bounds for some data structures supporting operations on S[1, n] over [1, σ]. The O(σ log n) extra bits of wavelet trees [18, 10] can be avoided [26] so we have not included it. The space bound in rows 3 and 6 holds for k = o(log_σ n). In rows 7 and 8, g is the size (in bits) of a given context-free grammar generating S and only S and α is the inv... |

30 | Generalized Intersection Searching Problems
- Janardan, Lopez
- 1993
Citation Context ...kly list all the distinct elements (“colors”) in that range. Almost all recent data structures for CRL (and the related problem of document listing) are based on a key idea by Muthukrishnan [31] (see [23] for older work). He defined C[1, n] to be the array in which C[j] is the largest value i < j such that S[i] = S[j], or 0 if there is no such i, so that S[ℓ] is the first occurrence of a color in S[i.... |

29 | An implicit binomial queue with constant insertion time
- Carlsson, Munro, et al.
- 1988
Citation Context ... to the queue. There are always O(k log σ) nodes in the queue (the tree is of height O(log σ)) so, if we use a priority queue allowing O(log(k log σ)) = O(log σ) time deletion and O(1) time insertion [8], then we can find the k most frequent elements in S in O(k log σ log(1/ɛ)) time. We can deal with general i and j by using the wavelet tree to compute the appropriate range in each subsequence [26]. ... |

29 | Implicit compression boosting with applications to self-indexing - Mäkinen, Navarro |

26 | Range quantile queries: Another virtue of wavelet trees
- Gagie, Puglisi, et al.
Citation Context ...ents two indices i and j and returns information about S[i, j]. This information could be, for example, the minimum or maximum value in S[i, j] [12], the element with a specified rank in sorted order [15] (e.g., the median [7]), the mode [17], a complete list of the distinct elements [31], the frequencies of the elements [35], a list of the k most frequent elements for a given k [20], or the number of... |

24 | Compressed text indexes with fast locate
- González, Navarro
Citation Context ...)+9+10: combines Bille, Landau and Weimann’s [5] grammar-based data structure for access, Fischer’s [12] succinct index for RMQ, and Grossi et al.’s [19] succinct index for rank. González and Navarro [16] showed how to build a grammar generating an array that, together with some other small data structures, gives access to the suffix array (SA) A. Building Bille, Landau and Weimann’s data structure fo... |

24 | Space-efficient framework for top-k string retrieval problems
- Hon, Shah, et al.
Citation Context ...k in sorted order [15] (e.g., the median [7]), the mode [17], a complete list of the distinct elements [31], the frequencies of the elements [35], a list of the k most frequent elements for a given k [20], or the number of distinct elements [6]. In this paper, motivated by problems in document retrieval, we consider the latter three kinds of problems, which are often referred to as “colored” range que... |

23 | Sorting out the document identifier assignment problem
- Silvestri
- 2007
Citation Context ...suggests. For example, if the documents are webpages sorted lexicographically by URL, then it is more likely that interesting patterns will occur often in clusters of documents than widely spread out [36, 39]. In this case, leaves in a balanced wavelet tree for E that are labelled with the k distinct documents that contain the pattern most often, are likely to share many ancestors; if so, our data structu... |

21 | Top-k ranked document search in general text databases
- Culpepper, Navarro, et al.
Citation Context ...ng (with or without color frequencies), colored range top-k queries, and colored range counting. These have been associated, respectively, to very relevant document retrieval queries on general texts [31, 35, 37, 20, 15, 12, 9]: listing the documents where a pattern appears (possibly computing ⋆ Partially funded by the Millennium Institute for Cell Dynamics and Biotechnology (ICDB), Grant ICM P05-001-F, Mideplan, Chile. 2 T... |

21 | Optimal succinctness for range minimum queries
- Fischer
Citation Context ... a sequence S[1, n] of elements in [1, σ] takes as arguments two indices i and j and returns information about S[i, j]. This information could be, for example, the minimum or maximum value in S[i, j] [12], the element with a specified rank in sorted order [15] (e.g., the median [7]), the mode [17], a complete list of the distinct elements [31], the frequencies of the elements [35], a list of the k mos... |

18 | Alphabet partitioning for compressed rank/select and applications
- Barbay, Gagie, et al.
- 2010
Citation Context ...details. The frequency of any color can be obtained in time O(log log σ). 6+9: is similar to the above but the n o(log σ) space term is avoided, as the structure by Barbay, Gagie, Navarro and Nekrich [4] computes rank as well. This becomes the least-space reported solution to CRL, listing in O(1) time. (4 or 5)+9: combining Barbay et al.’s [4] access and rank data structure with Fischer’s [12] succin... |

14 | Bounding the inefficiency of length-restricted prefix codes
- Milidiú, Laber
- 2001
Citation Context ... log n) bits. However, since a Huffman tree can be very deep (height n−1 for a very skewed distribution), this would compromise our time bound. Therefore, we use an O(log σ)-restricted Huffman tree [30], which yields both the space and time bounds we want. Theorem 3. Given a sequence S[1, n] over an alphabet of size σ and a constant ɛ > 0, we can store S in O((n/ɛ)(H0(S) + 1) log n) bits such that, ... |

13 | Top-k document retrieval in optimal time and linear space - Navarro, Nekrich - 2012 |

13 | Range mode and range median queries in constant time and sub-quadratic - Petersen, Grabowski - 2009 |

12 | New algorithms on wavelet trees and applications to information retrieval. Theoretical Computer Science - Gagie, Navarro, et al. - 2012 |

11 | Applications of web query mining
- Baeza-Yates
Citation Context ...s not need access to the underlying data (i.e., C), but the succinct index for rank in row 10 does (i.e., S), hence the time of the latter depends on tacc. Due to space constraints, here we write log^[2] and log^[3] for log log and log log log. Table columns: row; source; space (in bits); tacc; tenum; trank. Row 1: [18, 15]; nH0(S) + o(n) log σ; O(log σ); O(log σ); O(log σ). Row 2: [10, Cor. 3.3]; nH0(S) + o(n) log σ; O(1 + l... |

11 | Cell probe lower bounds and approximations for range mode
- Greve, Jørgensen, et al.
- 2010
Citation Context ...nformation about S[i, j]. This information could be, for example, the minimum or maximum value in S[i, j] [12], the element with a specified rank in sorted order [15] (e.g., the median [7]), the mode [17], a complete list of the distinct elements [31], the frequencies of the elements [35], a list of the k most frequent elements for a given k [20], or the number of distinct elements [6]. In this paper,... |

11 | On the redundancy of succinct data structures - Golynski, Raman, et al. - 2008 |

10 | Random access to grammar-compressed strings
- Bille, Landau, et al.
- 2011
Citation Context ...be less or more. We can then also discard the D-bit string marking documents used by both solutions [35, 20] and replace it with rank queries on E. (7 or 8)+9+10: combines Bille, Landau and Weimann’s [5] grammar-based data structure for access, Fischer’s [12] succinct index for RMQ, and Grossi et al.’s [19] succinct index for rank. González and Navarro [16] showed how to build a grammar generating an... |

10 | New upper bounds for generalized intersection searching problems
- Bozanis, Kitsios, et al.
Citation Context ...[7]), the mode [17], a complete list of the distinct elements [31], the frequencies of the elements [35], a list of the k most frequent elements for a given k [20], or the number of distinct elements [6]. In this paper, motivated by problems in document retrieval, we consider the latter three kinds of problems, which are often referred to as “colored” range queries: colored range listing (with or wit... |

10 | Augmenting suffix trees, with applications
- Matias, Muthukrishnan, et al.
- 1998
Citation Context ...t listing (DL), in which we are given a library of documents and asked to preprocess them such that later, given a pattern, we can quickly list all the distinct documents containing that pattern (see [29] for older work). Let T[1, n] be the concatenation of the D documents. Muthukrishnan defined the array E[1, n] such that E[i] is the document containing the starting... |

10 | Improved compressed indexes for full-text document retrieval - Belazzougui, Navarro, et al. - 2012 |

10 | Practical compressed document retrieval - Navarro, Puglisi, et al. |

9 | Efficient index for retrieving top-k most frequent documents
- Hon, Shah, et al.
- 2009
Citation Context ...n and D, this can be an interesting alternative to using lookup and marking the document beginnings [35]. 3 Top-k Queries. Improving the current-best solution for documents. Recently, Hon, Shah and Wu [21] described a data structure that stores a library T of D documents of total length n in O(n log² n) bits such that later, given a pattern of length m and an integer k ≥ 1, we can find the k docume... |

9 | New lower and upper bounds for representing sequences - Belazzougui, Navarro - 2012 |

9 | Efficient colored orthogonal range counting - Kaplan, Rubin, et al. |
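
The citation contexts above describe Muthukrishnan's idea for colored range listing: build the array C, where C[ℓ] is the previous position holding the same color as S[ℓ] (or 0), and then repeatedly take range minima over C[i..j] to find every position with C[ℓ] < i. The following Python sketch illustrates that recursion under simplifying assumptions. The function names are hypothetical, positions are 1-based to match the text, and the range-minimum step is a plain linear scan standing in for the O(1)-time RMQ structures the paper cites, so this version does not achieve the stated time bounds.

```python
def build_prev_array(S):
    """Return C (1-based; C[0] unused) where C[l] is the largest i < l
    with S[i] == S[l], or 0 if the color S[l] has not appeared before."""
    last_seen = {}
    C = [0] * (len(S) + 1)
    for l, color in enumerate(S, start=1):
        C[l] = last_seen.get(color, 0)
        last_seen[color] = l
    return C


def range_min_pos(C, lo, hi):
    """Position of a minimum of C[lo..hi].
    Assumption: a linear scan replaces a real succinct RMQ index."""
    return min(range(lo, hi + 1), key=lambda p: C[p])


def colored_range_listing(S, C, i, j):
    """List each distinct color of S[i..j] once (1-based, inclusive).
    S[l] is the first occurrence of its color in the range iff C[l] < i."""
    colors = []
    stack = [(i, j)]  # explicit stack instead of recursion
    while stack:
        lo, hi = stack.pop()
        if lo > hi:
            continue
        l = range_min_pos(C, lo, hi)
        if C[l] >= i:
            # The minimum is not below i, so no first occurrences remain here.
            continue
        colors.append(S[l - 1])  # S is a 0-based Python sequence
        stack.append((lo, l - 1))
        stack.append((l + 1, hi))
    return colors


if __name__ == "__main__":
    S = list("abracadabra")
    C = build_prev_array(S)
    print(colored_range_listing(S, C, 4, 9))
```

Because each reported position is a first occurrence, the length of the returned list is also the number of distinct colors in the range. That gives a simple (not succinct) way to answer colored range counting, as opposed to the modified wavelet tree described in the abstract.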