Wavelet Trees for All
Cited by 10 (5 self)
The wavelet tree is a versatile data structure that serves a number of purposes, from string processing to geometry. It can be regarded as a device that represents a sequence, a reordering, or a grid of points. In addition, its space adapts to various entropy measures of the data it encodes, enabling compressed representations. New competitive solutions to a number of problems, based on wavelet trees, are appearing every year. In this survey we give an overview of wavelet trees and the surprising number of applications in which we have found them useful: basic and weighted point grids, sets of rectangles, strings, permutations, binary relations, graphs, inverted indexes, document retrieval indexes, full-text indexes, XML indexes, and general numeric sequences.
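To make the "represents a sequence" view concrete, here is a minimal sketch of a wavelet tree in Python. It is an illustration only, not the survey's construction: it is pointer-based and uncompressed, it stores plain prefix-count lists instead of succinct rank bitvectors, and the names `WaveletTree`, `access`, and `rank` are chosen here for the example.

```python
class WaveletTree:
    """Pointer-based, uncompressed wavelet tree (illustrative sketch)."""

    def __init__(self, seq, alphabet=None):
        # Sorted list of the symbols handled by this node.
        self.alphabet = sorted(set(seq)) if alphabet is None else alphabet
        self.left = self.right = None
        if len(self.alphabet) <= 1:
            return  # leaf: a single symbol (or an empty sequence)
        mid = len(self.alphabet) // 2
        self.left_syms = set(self.alphabet[:mid])
        # prefix_ones[i] = how many of the first i symbols go to the right child.
        self.prefix_ones = [0]
        for c in seq:
            self.prefix_ones.append(self.prefix_ones[-1] + (c not in self.left_syms))
        self.left = WaveletTree([c for c in seq if c in self.left_syms],
                                self.alphabet[:mid])
        self.right = WaveletTree([c for c in seq if c not in self.left_syms],
                                 self.alphabet[mid:])

    def access(self, i):
        """Return the symbol at position i (0-based)."""
        node = self
        while node.left is not None:
            ones = node.prefix_ones[i]
            if node.prefix_ones[i + 1] - ones == 0:  # bit 0: go left
                i, node = i - ones, node.left
            else:                                    # bit 1: go right
                i, node = ones, node.right
        return node.alphabet[0]

    def rank(self, c, i):
        """Return the number of occurrences of c among the first i symbols."""
        node = self
        while node.left is not None:
            ones = node.prefix_ones[i]
            if c in node.left_syms:
                i, node = i - ones, node.left
            else:
                i, node = ones, node.right
        return i if node.alphabet == [c] else 0


text = "abracadabra"
wt = WaveletTree(text)
print(wt.access(4), wt.rank("a", len(text)))
```

Both queries walk one root-to-leaf path, so they take time proportional to the tree height, which is logarithmic in the alphabet size for this balanced split.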
Faster Compact Top-k Document Retrieval
Cited by 4 (3 self)
An optimal index solving top-k document retrieval [Navarro and Nekrich, SODA’12] takes O(m + k) time for a pattern of length m, but its space is at least 80n bytes for a collection of n symbols. We reduce it to 1.5n–3n bytes, with O(m + (k + log log n) log log n) time, on typical texts. The index is up to 25 times faster than the best previous compressed solutions, and requires at most 5% more space in practice (and in some cases as little as one half). Apart from replacing classical by compressed data structures, our main idea is to replace suffix tree sampling by frequency thresholding to achieve compression.
Efficient Fully-Compressed Sequence Representations
, 2010
Cited by 2 (2 self)
We present a data structure that stores a sequence s[1..n] over alphabet [1..σ] in nH0(s) + o(n)(H0(s)+1) bits, where H0(s) is the zero-order entropy of s. This structure supports the queries access, rank and select, which are fundamental building blocks for many other compressed data structures, in worst-case time O(lg lg σ) and average time O(lg H0(s)). The worst-case complexity matches the best previous results, yet these had been achieved with data structures using nH0(s) + o(n lg σ) bits. On highly compressible sequences the o(n lg σ) bits of the redundancy may be significant compared to the nH0(s) bits that encode the data. Our representation, instead, compresses the redundancy as well. Moreover, our average-case complexity is unprecedented. Our technique is based on partitioning the alphabet into characters of similar frequency. The subsequence corresponding to each group can then be encoded using fast uncompressed representations without harming the overall compression ratios, even in the redundancy. The result also improves upon the best current compressed representations of several other data structures. For example, we achieve (i) compressed redundancy, retaining the best time complexities, for the smallest existing full-text self-indexes; (ii) compressed permutations π with times for π() and π^{-1}() improved to log-logarithmic; and (iii) the first compressed representation of dynamic collections of disjoint sets. We also point out various applications to inverted indexes, suffix arrays, binary relations, and data compressors. Our structure is practical on large alphabets. Our experiments show that, as predicted by theory, it dominates the space/time tradeoff map of all the sequence representations, both in synthetic and application scenarios.
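The space bound above is stated in terms of the zero-order entropy H0(s). As a small worked illustration (not part of the paper), the following sketch computes H0(s) from symbol frequencies using the standard definition H0(s) = Σ (n_c/n) lg(n/n_c); the function name `zero_order_entropy` is chosen here for the example.

```python
from collections import Counter
from math import log2


def zero_order_entropy(s):
    """Zero-order empirical entropy H0(s), in bits per symbol."""
    n = len(s)
    if n == 0:
        return 0.0
    counts = Counter(s)
    # Sum over distinct symbols c of (n_c / n) * lg(n / n_c).
    return sum((c / n) * log2(n / c) for c in counts.values())


s = "aaaaaaab"
h0 = zero_order_entropy(s)
# n * H0(s) is the leading term of the space bound, in bits.
print(h0, len(s) * h0)
```

For a skewed sequence such as the one above, n·H0(s) is far below the n lg σ bits of a plain encoding, which is why a redundancy term of o(n lg σ) bits can dominate the total on highly compressible inputs.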
Faster Top-k Document Retrieval in Optimal Space
Cited by 2 (2 self)
We consider the problem of retrieving the k documents from a collection of strings where a given pattern P appears most often. We show that, by representing the collection using a Compressed Suffix Array CSA, a data structure using the asymptotically optimal |CSA| + o(n) bits can answer queries in the time needed by CSA to find the suffix array interval of the pattern plus O(k lg^2 k lg^ε n) accesses to suffix array cells, for any constant ε > 0. This is lg n / lg k times faster than the only previous solution using optimal space, lg k times slower than the fastest structure that uses twice the space, and lg^2 k lg^ε n times the lower-bound cost of obtaining k document identifiers from the CSA. To obtain the result we introduce a tool called the sampled document array, which can be of independent interest.
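To pin down what a top-k query returns, the sketch below is a brute-force baseline in Python. It scans every document and is therefore nothing like the compressed index described above; it only fixes the problem's semantics. The function names, the tie-breaking rule (lower document index first), and the choice to omit documents with zero occurrences are assumptions made for this example.

```python
def count_occurrences(text, pattern):
    """Count occurrences of a non-empty pattern in text, overlaps included."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def top_k_documents(docs, pattern, k):
    """Return indices of the k documents where pattern occurs most often."""
    scored = [(count_occurrences(d, pattern), i) for i, d in enumerate(docs)]
    scored = [(c, i) for c, i in scored if c > 0]   # skip non-matching documents
    scored.sort(key=lambda t: (-t[0], t[1]))        # by frequency, then by index
    return [i for _, i in scored[:k]]


docs = ["abab", "ab", "ba", "cc"]
print(top_k_documents(docs, "ab", 2))
```

This baseline takes time proportional to the total collection size per query, which is the cost that suffix-array-based indexes avoid.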
Compressed String Dictionary Lookup with Edit Distance One
Cited by 1 (0 self)
In this paper we present different solutions for the problem of indexing a dictionary of strings in compressed space. Given a pattern P, the index has to report all the strings in the dictionary having edit distance at most one with P. Our first solution is able to solve queries in (almost optimal) O(|P| + occ) time, where occ is the number of strings in the dictionary having edit distance at most one with P. The space complexity of this solution is bounded in terms of the k-th order entropy of the indexed dictionary. Our second solution further improves this space complexity at the cost of increasing the query time.
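For reference, the query itself can be stated as a brute-force scan. The sketch below is not the paper's index: it checks every dictionary string, taking time proportional to the dictionary size rather than O(|P| + occ). It assumes the usual edit operations (one insertion, deletion, or substitution), and the names `within_one_edit` and `lookup_edit_distance_one` are hypothetical.

```python
def within_one_edit(a, b):
    """True if a and b are at edit distance at most one."""
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a                      # make a the shorter (or equal) string
    i = 0
    while i < len(a) and a[i] == b[i]:   # skip the common prefix
        i += 1
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]    # identical, or one substitution at i
    return a[i:] == b[i + 1:]            # one extra character in b at i


def lookup_edit_distance_one(dictionary, pattern):
    """Report every dictionary string within edit distance one of pattern."""
    return [w for w in dictionary if within_one_edit(w, pattern)]


words = ["cat", "cart", "dog", "at", "cats", "coat"]
print(lookup_edit_distance_one(words, "cat"))
```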
A Lempel-Ziv Compressed Structure for Document Listing
Cited by 1 (1 self)
Document listing is the problem of preprocessing a set of sequences, called documents, so that later, given a short string called the pattern, we retrieve the documents where the pattern appears. While optimal-time and linear-space solutions exist, the current emphasis is on reducing the space requirements. Current document listing solutions build on compressed suffix arrays. This paper is the first attempt to solve the problem using a Lempel-Ziv compressed index of the text collections. We show that the resulting solution is very fast to output most of the resulting documents, taking more time for the final ones. This makes this index particularly useful for interactive scenarios or when listing some documents is sufficient. Yet, it also offers a competitive space/time tradeoff when returning the full answers.
From Time to Space: Fast Algorithms that yield Small and Fast Data Structures
Cited by 1 (0 self)
In many cases, the relation between encoding space and execution time translates into combinatorial lower bounds on the computational complexity of algorithms in the comparison or external memory models. We describe a few cases which illustrate this relation in a distinct direction, where fast algorithms inspire compressed encodings or data structures. In particular, we describe the relation between searching in an ordered array and encoding integers; merging sets and encoding a sequence of symbols; and sorting and compressing permutations.
Document Listing on Repetitive Collections
Many document collections consist largely of repeated material, and several indexes have been designed to take advantage of this. There has been only preliminary work, however, on document retrieval for repetitive collections. In this paper we show how one of those indexes, the run-length compressed suffix array (RLCSA), can be extended to support document listing. In our experiments, our additional structures on top of the RLCSA can reduce the query time for document listing by an order of magnitude while still using total space that is only a fraction of the raw collection size. As a byproduct, we develop a new document listing technique for general collections that is of independent interest.
Better Space Bounds for Parameterized Range Majority and Minority
Karpinski and Nekrich (2008) introduced the problem of parameterized range majority, which asks to preprocess a string of length n such that, given the endpoints of a range, one can quickly find all the distinct elements whose relative frequencies in that range are more than a threshold τ. Subsequent authors have reduced their time and space bounds such that, when τ is given at preprocessing time, we need either O(n lg(1/τ)) space and optimal O(1/τ) query time or linear space and O((1/τ) lg lg σ) query time, where σ is the alphabet size. In this paper we give the first linear-space solution with optimal O(1/τ) query time. For the case when τ is given at query time, we significantly improve previous bounds, achieving either O(n lg lg σ) space and optimal O(1/τ) query time or compressed space and O((1/τ) lg (lg(1/τ) / lg lg n)) query time. Along the way, we consider the complementary problem of parameterized range minority that was recently introduced by Chan et al. (2012), who achieved linear space and O(1/τ) query time even for variable τ. We improve their solution to use either nearly optimally compressed space with no slowdown, or optimally compressed space with nearly no slowdown. Some of our intermediate results, such as density-sensitive query time for one-dimensional range counting, may be of independent interest.
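The query definition can be made concrete with a brute-force sketch in Python. It is not one of the data structures above: it counts the whole range on every query and uses no preprocessing. The half-open range convention `seq[lo:hi]` and the name `range_tau_majorities` are assumptions made for this example.

```python
from collections import Counter


def range_tau_majorities(seq, lo, hi, tau):
    """Distinct elements whose relative frequency in seq[lo:hi] exceeds tau."""
    length = hi - lo
    if length <= 0:
        return []
    counts = Counter(seq[lo:hi])
    # An element qualifies when its count is strictly more than tau * length.
    return sorted(x for x, c in counts.items() if c > tau * length)


seq = [1, 2, 1, 3, 1, 2, 2, 1]
print(range_tau_majorities(seq, 0, 8, 0.3))
```

Since each reported element occupies more than a τ fraction of the range, at most 1/τ elements can be reported, which is why O(1/τ) is the optimal query time quoted above.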