Colored Range Queries and Document Retrieval
Cited by 8 (5 self)
Abstract. Colored range queries are a well-studied topic in computational geometry and database research that, in the past decade, have found exciting applications in information retrieval. In this paper we give improved time and space bounds for three important one-dimensional colored range queries (colored range listing, colored range top-k queries and colored range counting) and, thus, new bounds for various document retrieval problems on general collections of sequences. Specifically, we first describe a framework including almost all recent results on colored range listing and document listing, which suggests new combinations of data structures for these problems. For example, we give the fastest compressed data structures for colored range listing and document listing, and an efficient data structure for document listing whose size is bounded in terms of the high-order entropies of the library of documents. We then show how (approximate) colored top-k queries can be reduced to (approximate) range-mode queries on subsequences, yielding the first efficient data structure for this problem. Finally, we show how a modified wavelet tree can support colored range counting in logarithmic time and space that is succinct whenever the number of colors is superpolylogarithmic in the length of the sequence.
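As a naive reference for what the three queries named in this abstract compute, the following sketch scans the interval directly. It is not the paper's data structure: it takes time linear in the interval length per query and uses no preprocessing. The function names and the 0-based (start, length) interval convention are assumptions made for illustration.

```python
from collections import Counter


def colored_range_listing(colors, start, length):
    """Distinct colors in colors[start : start + length], in order of first occurrence."""
    return list(dict.fromkeys(colors[start:start + length]))


def colored_range_counting(colors, start, length):
    """Number of distinct colors in colors[start : start + length]."""
    return len(set(colors[start:start + length]))


def colored_range_top_k(colors, start, length, k):
    """The k most frequent colors in the interval (ties broken by first occurrence)."""
    counts = Counter(colors[start:start + length])
    return [color for color, _ in counts.most_common(k)]


colors = ["a", "b", "a", "c", "b", "a"]
print(colored_range_listing(colors, 1, 4))
print(colored_range_counting(colors, 1, 4))
print(colored_range_top_k(colors, 0, 6, 2))
```

A brute-force version like this is mainly useful as a correctness oracle when testing a compressed index against small inputs.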
Wavelet Trees for All
Cited by 1 (0 self)
The wavelet tree is a versatile data structure that serves a number of purposes, from string processing to geometry. It can be regarded as a device that represents a sequence, a reordering, or a grid of points. In addition, its space adapts to various entropy measures of the data it encodes, enabling compressed representations. New competitive solutions to a number of problems, based on wavelet trees, are appearing every year. In this survey we give an overview of wavelet trees and the surprising number of applications in which we have found them useful: basic and weighted point grids, sets of rectangles, strings, permutations, binary relations, graphs, inverted indexes, document retrieval indexes, full-text indexes, XML indexes, and general numeric sequences.
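To make the idea of a wavelet tree representing a sequence more concrete, here is a minimal pointer-based sketch over an integer alphabet. It is an illustration under simplifying assumptions, not the survey's construction: each node stores plain prefix counts in a Python list instead of a compressed bitvector, so it is neither succinct nor entropy-compressed. The class name `WaveletTree` and its methods are hypothetical.

```python
class WaveletTree:
    """Toy wavelet tree over integers in [lo, hi]; prefix sums stand in for succinct rank."""

    def __init__(self, seq, lo=None, hi=None):
        seq = list(seq)
        self.lo = min(seq) if lo is None else lo
        self.hi = max(seq) if hi is None else hi
        self.left = self.right = None
        # zeros[p] = how many of the first p symbols are routed to the left child
        self.zeros = [0]
        if self.lo == self.hi or not seq:
            return  # leaf, or an empty node with nothing to route
        mid = (self.lo + self.hi) // 2
        for s in seq:
            self.zeros.append(self.zeros[-1] + (s <= mid))
        self.left = WaveletTree([s for s in seq if s <= mid], self.lo, mid)
        self.right = WaveletTree([s for s in seq if s > mid], mid + 1, self.hi)

    def access(self, p):
        """Symbol at 0-based position p, recovered by walking down the tree."""
        node = self
        while node.left is not None:
            if node.zeros[p + 1] - node.zeros[p]:  # symbol at p went left
                p = node.zeros[p]
                node = node.left
            else:
                p = p - node.zeros[p]
                node = node.right
        return node.lo

    def rank(self, symbol, p):
        """Number of occurrences of symbol among the first p positions."""
        node = self
        if not (node.lo <= symbol <= node.hi):
            return 0
        while node.left is not None:
            mid = (node.lo + node.hi) // 2
            if symbol <= mid:
                p = node.zeros[p]
                node = node.left
            else:
                p = p - node.zeros[p]
                node = node.right
        return p


seq = [3, 1, 4, 1, 5, 9, 2, 6]
wt = WaveletTree(seq)
print([wt.access(i) for i in range(len(seq))])
print(wt.rank(1, 4))
```

Both operations visit one node per level, so their cost is proportional to the tree height (logarithmic in the alphabet range) rather than to the sequence length.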
Improved grammar-based compressed indexes
In Proc. 19th SPIRE, LNCS 7608, 2012
Cited by 1 (1 self)
Abstract. We introduce the first grammar-compressed representation of a sequence that supports searches in time that depends only logarithmically on the size of the grammar. Given a text T[1..u] that is represented by a (context-free) grammar of n (terminal and nonterminal) symbols and size N (measured as the sum of the lengths of the right-hand sides of the rules), a basic grammar-based representation of T takes N lg n bits of space. Our representation requires 2N lg n + N lg u + ε n lg n + o(N lg n) bits of space, for any 0 < ε ≤ 1. It can find the positions of the occ occurrences of a pattern of length m in T in O((m^2/ε) lg(lg u / lg n) + (m + occ) lg n) time, and extract any substring of length ℓ of T in time O(ℓ + h lg(N/h)), where h is the height of the grammar tree.
Optimal Dynamic Sequence Representations
We describe a data structure that supports access, rank and select queries, as well as symbol insertions and deletions, on a string S[1, n] over alphabet [1..σ] in time O(lg n / lg lg n), which is optimal. The time is worst-case for the queries and amortized for the updates. This complexity is better than the best previous ones by a Θ(1 + lg σ / lg lg n) factor. Our structure uses nH_0(S) + O(n + σ(lg σ + lg^{1+ε} n)) bits, where H_0(S) is the zero-order entropy of S and 0 < ε < 1 is any constant. This space redundancy over nH_0(S) is also better, almost always, than that of the best previous dynamic structures, o(n lg σ) + O(σ(lg σ + lg n)). We can also handle general alphabets in optimal time, which has been an open problem in dynamic sequence representations.
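For readers unfamiliar with the operations this abstract refers to, the sketch below spells out their semantics on a plain Python list. It is only a reference for what access, rank, select, insert and delete mean: every operation here takes O(n) time, nowhere near the O(lg n / lg lg n) bound of the paper. The class name `NaiveDynamicSequence` is hypothetical, and positions are 1-based to match the S[1, n] notation.

```python
class NaiveDynamicSequence:
    """List-backed reference for the semantics of a dynamic sequence (not efficient)."""

    def __init__(self, symbols=()):
        self._s = list(symbols)

    def access(self, i):
        """Symbol at 1-based position i."""
        return self._s[i - 1]

    def rank(self, c, i):
        """Number of occurrences of c in S[1..i]."""
        return self._s[:i].count(c)

    def select(self, c, j):
        """1-based position of the j-th occurrence of c, or None if there is none."""
        seen = 0
        for pos, sym in enumerate(self._s, start=1):
            if sym == c:
                seen += 1
                if seen == j:
                    return pos
        return None

    def insert(self, i, c):
        """Insert c so that it becomes S[i]; later symbols shift right."""
        self._s.insert(i - 1, c)

    def delete(self, i):
        """Remove S[i]; later symbols shift left."""
        del self._s[i - 1]


s = NaiveDynamicSequence("abracadabra")
print(s.access(1), s.rank("a", 4), s.select("a", 3))
```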
Colored Range Queries and Document Retrieval
Colored range queries are a well-studied topic in computational geometry and database research that, in the past decade, have found exciting applications in information retrieval. In this paper, we give improved time and space bounds for three important one-dimensional colored range queries (colored range listing, colored range top-k queries and colored range counting) and, as a consequence, new bounds for various document retrieval problems on general collections of sequences. Colored range listing is the problem of preprocessing a sequence S[1, n] of colors so that, later, given an interval [i, i + ℓ − 1], we list the different colors in S[i, i + ℓ − 1]. Colored range top-k queries ask instead for the k most frequent colors in the interval. Colored range counting asks for the number of different colors in the interval. We first describe a framework including almost all recent results on colored range listing and document listing, which suggests new combinations of data structures for these problems. For example, we give the first compressed data structure (using nH_k(S) + o(n log σ) bits, for any k = o(log_σ n), where H_k(S) is the k-th order empirical entropy of S and σ the number of different colors in S) that answers colored range listing queries in constant time per returned result. We also give an efficient data structure for document listing whose size is bounded in terms of the k-th order entropy of the library of documents. We then show how (approximate) colored top-k
Smaller Self-Indexes for Natural Language
Abstract. Self-indexes for natural-language texts, where these are regarded as token (word or separator) sequences, achieve very attractive space and search time. However, they suffer from a space penalty due to their large vocabulary. In this paper we show that by replacing the Huffman encoding they implicitly use by the slightly weaker Hu-Tucker encoding, which respects the lexical order of the vocabulary, both their space and time are improved.

