Practical rank/select queries over arbitrary sequences
In Proc. 15th SPIRE, LNCS 5280, 2008
Cited by 49 (26 self)
Abstract. We present a practical study on the compact representation of sequences supporting rank, select, and access queries. While there are several theoretical solutions to the problem, only a few have been tried out, and there is little idea on how the others would perform, especially in the case of sequences with very large alphabets. We first present a new practical implementation of the compressed representation for bit sequences proposed by Raman, Raman, and Rao [SODA 2002], that is competitive with the existing ones when the sequences are not too compressible. It also has nice local compression properties, and we show that this makes it an excellent tool for compressed text indexing in combination with the Burrows-Wheeler transform. This shows the practicality of a recent theoretical proposal [Mäkinen and Navarro, SPIRE 2007], achieving spaces never seen before. Second, for general sequences, we tune wavelet trees for the case of very large alphabets, by removing their pointer information. We show that this gives an excellent solution for representing a sequence within zero-order entropy space, in cases where the large alphabet poses a serious challenge to typical encoding methods. We also present the first implementation of Golynski et al.'s representation [SODA 2006], which offers another interesting time/space tradeoff.
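For readers unfamiliar with the three queries named in this abstract, the following is a minimal sketch of access, rank, and select on a plain bit sequence. It is not the Raman, Raman, and Rao representation: the bits are stored uncompressed, and the class name `BitSequence`, the `block` parameter, and the method names are assumptions made for illustration only.

```python
class BitSequence:
    """Uncompressed bit sequence with sampled counters (illustrative only)."""

    def __init__(self, bits, block=4):
        self.bits = list(bits)
        self.block = block
        # prefix[j] = number of 1s in bits[0 : j * block]
        self.prefix = [0]
        for start in range(0, len(self.bits), block):
            chunk = self.bits[start:start + block]
            self.prefix.append(self.prefix[-1] + sum(chunk))

    def access(self, i):
        """Return the bit stored at position i."""
        return self.bits[i]

    def rank1(self, i):
        """Return the number of 1s in bits[0:i] (position i excluded)."""
        j = i // self.block
        return self.prefix[j] + sum(self.bits[j * self.block:i])

    def select1(self, k):
        """Return the position of the k-th 1 (k >= 1), or -1 if absent."""
        if k < 1 or k > self.prefix[-1]:
            return -1
        # Binary search for the first block whose cumulative count reaches k.
        lo, hi = 0, len(self.prefix) - 2
        while lo < hi:
            mid = (lo + hi) // 2
            if self.prefix[mid + 1] >= k:
                hi = mid
            else:
                lo = mid + 1
        # Scan inside that block for the exact position.
        count = self.prefix[lo]
        end = min(len(self.bits), (lo + 1) * self.block)
        for p in range(lo * self.block, end):
            count += self.bits[p]
            if count == k:
                return p
        return -1
```

With `b = BitSequence([1, 0, 1, 1, 0, 0, 1, 0, 1])`, `b.rank1(7)` counts the 1s among the first seven bits and `b.select1(4)` locates the fourth 1. Practical structures replace the linear scans with table lookups and compress the blocks.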
Storage and Retrieval of Highly Repetitive Sequence Collections
2009
Cited by 34 (16 self)
A repetitive sequence collection is a set of sequences which are small variations of each other. A prominent example is the set of genome sequences of individuals of the same or close species, where the differences can be expressed by short lists of basic edit operations. Flexible and efficient data analysis on such a typically huge collection is plausible using suffix trees. However, the suffix tree occupies much space, which very soon inhibits in-memory analyses. Recent advances in full-text indexing reduce the space of the suffix tree to, essentially, that of the compressed sequences, while retaining its functionality with only a polylogarithmic slowdown. However, the underlying compression model considers only the predictability of the next sequence symbol given the k previous ones, where k is a small integer. This is unable to capture longer-term repetitiveness. For example, r identical copies of an incompressible sequence will be incompressible under this model. We develop new static and dynamic full-text indexes that are able to capture the fact that a collection is highly repetitive, and require space basically proportional to the length of one typical sequence plus the total number of edit operations. The new indexes can be plugged into a recent dynamic fully-compressed suffix tree, achieving full functionality for sequence analysis, while retaining the reduced space and the polylogarithmic slowdown. Our experimental results confirm the practicality of our proposal.
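The limitation of the k-th order model mentioned in this abstract can be checked numerically. The sketch below computes the empirical entropy of a string using the standard definition (the function names `h0` and `hk` are assumptions, not taken from the paper). For k = 0, the per-symbol entropy of r concatenated copies equals that of a single copy, so the total bound n·H0 grows r-fold even though the copies add no new information.

```python
import math
from collections import Counter, defaultdict


def h0(symbols):
    """Zero-order empirical entropy, in bits per symbol."""
    n = len(symbols)
    if n == 0:
        return 0.0
    counts = Counter(symbols)
    return sum(c / n * math.log2(n / c) for c in counts.values())


def hk(s, k):
    """k-th order empirical entropy of string s, in bits per symbol."""
    if k == 0 or len(s) == 0:
        return h0(s)
    # Group every symbol by the k symbols that precede it.
    followers = defaultdict(list)
    for i in range(k, len(s)):
        followers[s[i - k:i]].append(s[i])
    total_bits = sum(len(f) * h0(f) for f in followers.values())
    return total_bits / len(s)
```

For example, `h0("abracadabra" * 5)` matches `h0("abracadabra")` up to rounding, so a zero-order compressor spends about five times as many bits on the five copies as on one.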
Wavelet Trees for All
Cited by 32 (12 self)
The wavelet tree is a versatile data structure that serves a number of purposes, from string processing to geometry. It can be regarded as a device that represents a sequence, a reordering, or a grid of points. In addition, its space adapts to various entropy measures of the data it encodes, enabling compressed representations. New competitive solutions to a number of problems, based on wavelet trees, are appearing every year. In this survey we give an overview of wavelet trees and the surprising number of applications in which we have found them useful: basic and weighted point grids, sets of rectangles, strings, permutations, binary relations, graphs, inverted indexes, document retrieval indexes, full-text indexes, XML indexes, and general numeric sequences.
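As a concrete picture of the structure surveyed here, the following is a minimal pointer-based wavelet tree over a non-empty list of integers, supporting access and rank. It is a teaching sketch under simplifying assumptions: the bitmaps are represented only through uncompressed prefix counts, and the class and method names are illustrative rather than taken from the survey.

```python
class WaveletTree:
    """Pointer-based wavelet tree over a non-empty list of integers."""

    def __init__(self, seq, lo=None, hi=None):
        if lo is None:
            lo, hi = min(seq), max(seq)
        self.lo, self.hi = lo, hi
        self.left = self.right = None
        if lo == hi:
            return  # leaf: every symbol routed here equals lo
        mid = (lo + hi) // 2
        # ones[i] = number of symbols greater than mid among seq[0:i];
        # this plays the role of the node bitmap plus its rank structure.
        self.ones = [0]
        for x in seq:
            self.ones.append(self.ones[-1] + (x > mid))
        self.left = WaveletTree([x for x in seq if x <= mid], lo, mid)
        self.right = WaveletTree([x for x in seq if x > mid], mid + 1, hi)

    def access(self, i):
        """Return the symbol at position i of the original sequence."""
        node = self
        while node.lo != node.hi:
            if node.ones[i + 1] - node.ones[i]:  # bit is 1: go right
                i = node.ones[i]
                node = node.right
            else:                                # bit is 0: go left
                i = i - node.ones[i]
                node = node.left
        return node.lo

    def rank(self, c, i):
        """Return the number of occurrences of c in the first i symbols."""
        if c < self.lo or c > self.hi:
            return 0
        node = self
        while node.lo != node.hi:
            mid = (node.lo + node.hi) // 2
            if c > mid:
                i = node.ones[i]
                node = node.right
            else:
                i = i - node.ones[i]
                node = node.left
        return i
```

Both queries walk one root-to-leaf path, mapping the position at each level with a single rank on that node's bitmap, which is where the logarithmic dependence on the alphabet size comes from.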
Colored Range Queries and Document Retrieval
Cited by 31 (18 self)
Colored range queries are a well-studied topic in computational geometry and database research that, in the past decade, have found exciting applications in information retrieval. In this paper we give improved time and space bounds for three important one-dimensional colored range queries (colored range listing, colored range top-k queries, and colored range counting) and, thus, new bounds for various document retrieval problems on general collections of sequences. Specifically, we first describe a framework including almost all recent results on colored range listing and document listing, which suggests new combinations of data structures for these problems. For example, we give the fastest compressed data structures for colored range listing and document listing, and an efficient data structure for document listing whose size is bounded in terms of the high-order entropies of the library of documents. We then show how (approximate) colored top-k queries can be reduced to (approximate) range-mode queries on subsequences, yielding the first efficient data structure for this problem. Finally, we show how a modified wavelet tree can support colored range counting in logarithmic time and space that is succinct whenever the number of colors is superpolylogarithmic in the length of the sequence.
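To make the query definitions concrete, here is a small sketch of colored range listing and counting based on a commonly used observation: a color occurs in positions l..r exactly once with a previous occurrence before l. The code scans the range linearly, so it illustrates only what the queries return, not the data structures or bounds of the paper. All function names are assumptions.

```python
def build_prev(colors):
    """prev[i] = position of the previous occurrence of colors[i], or -1."""
    last_seen = {}
    prev = []
    for i, c in enumerate(colors):
        prev.append(last_seen.get(c, -1))
        last_seen[c] = i
    return prev


def colored_range_listing(colors, prev, l, r):
    """List each distinct color in colors[l..r] (inclusive) exactly once."""
    return [colors[i] for i in range(l, r + 1) if prev[i] < l]


def colored_range_counting(prev, l, r):
    """Count the distinct colors in positions l..r (inclusive)."""
    return sum(1 for i in range(l, r + 1) if prev[i] < l)
```

For `colors = list("abacabad")`, the range 2..5 covers `a c a b`; only the first `a`, the `c`, and the `b` have their previous occurrence before position 2, so three colors are reported.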
Compressed representations of permutations, and applications
Symposium on Theoretical Aspects of Computer Science
Cited by 30 (17 self)
We explore various techniques to compress a permutation π over n integers, taking advantage of ordered subsequences in π, while supporting its application π(i) and the application of its inverse π^{-1}(i) in small time. Our compression schemes yield several interesting byproducts, in many cases matching, improving or extending the best existing results on applications such as the encoding of a permutation in order to support iterated applications π^k(i) of it, of integer functions, and of inverted lists and suffix arrays.
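The operations mentioned in this abstract can be illustrated on an uncompressed permutation stored as a 0-based Python list. The sketch below splits a permutation into maximal ascending runs (one kind of ordered subsequence; fewer runs suggests more room for compression) and answers π^{-1}(i) and π^k(i) by walking the cycle that contains i. This is not the paper's compressed scheme, and the function names are assumptions.

```python
def ascending_runs(perm):
    """Split perm into maximal ascending runs of consecutive positions."""
    runs, start = [], 0
    for i in range(1, len(perm) + 1):
        if i == len(perm) or perm[i] < perm[i - 1]:
            runs.append(perm[start:i])
            start = i
    return runs


def inverse(perm, i):
    """Return j such that perm[j] == i, by walking the cycle through i."""
    j = i
    while perm[j] != i:
        j = perm[j]
    return j


def power(perm, k, i):
    """Return perm applied k times to i (negative k uses the inverse)."""
    cycle = [i]
    j = perm[i]
    while j != i:
        cycle.append(j)
        j = perm[j]
    return cycle[k % len(cycle)]
```

For `perm = [2, 0, 3, 1, 5, 4]`, `ascending_runs(perm)` yields four runs, and `power(perm, -1, i)` agrees with `inverse(perm, i)` for every i. The cycle walk takes time proportional to the cycle length, which is the cost that more refined representations aim to reduce.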
New algorithms on wavelet trees and applications to information retrieval
Theoretical Computer Science, 2012
Practical Compressed Document Retrieval
Cited by 19 (16 self)
Recent research on document retrieval for general texts has established the virtues of explicitly representing the so-called document array, which stores the document each pointer of the suffix array belongs to. While it makes document retrieval faster, this array occupies a significant amount of redundant space and is not easily compressible. In this paper we present the first practical proposal to compress the document array. We show that the resulting structure is significantly smaller than the uncompressed counterpart, and than alternatives to the document array proposed in the literature. We also compare various known algorithms for document listing and top-k retrieval, and find that the most useful combinations of algorithms run over our new compressed document arrays.
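The document array described here can be built directly from its definition. The sketch below concatenates the documents with a `$` terminator (an assumption: neither documents nor patterns contain `$`), sorts the suffixes naively, and records the document of each suffix. Document listing then reduces to finding the suffix array range of the pattern and collecting the distinct entries of the document array in that range. Everything is uncompressed, and the function names are illustrative.

```python
def build_index(docs):
    """Return (text, suffix array, document array) for a list of strings."""
    text = "".join(d + "$" for d in docs)
    # doc_of[p] = index of the document that covers text position p
    doc_of = []
    for d, doc in enumerate(docs):
        doc_of.extend([d] * (len(doc) + 1))
    # Naive suffix sorting; fine for a demonstration, too slow for real data.
    sa = sorted(range(len(text)), key=lambda p: text[p:])
    da = [doc_of[p] for p in sa]
    return text, sa, da


def document_listing(text, sa, da, pattern):
    """Return the sorted indices of documents containing pattern."""
    m = len(pattern)
    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(set(da[start:lo]))
```

With `docs = ["banana", "bandana", "cabana"]`, listing the pattern `"band"` returns only document 1, while `"ana"` returns all three. The document array has one entry per text position, which is the space overhead the paper sets out to compress.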
Compressed permuterm index
In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
Cited by 17 (6 self)
The Permuterm index (Garfield, 1976) is a time-efficient and elegant solution to the string dictionary problem in which pattern queries may include one wildcard symbol (the so-called Tolerant Retrieval problem). Unfortunately, the Permuterm index is space inefficient because it quadruples the dictionary size. In this paper we propose the Compressed Permuterm Index, which solves the Tolerant Retrieval problem in time proportional to the length of the searched pattern, and in space close to the k-th order empirical entropy of the indexed dictionary. We also design a dynamic version of this index that efficiently supports the insertion and deletion of individual strings in the dictionary. The result is based on a simple variant of the Burrows-Wheeler Transform defined on a dictionary of strings of variable length, which makes it possible to solve the Tolerant Retrieval problem efficiently via known (dynamic) compressed indexes [17]. We complement our theoretical study with a rich set of experiments showing that the Compressed Permuterm Index supports fast queries within a space occupancy close to the one achievable by compressing the string dictionary via gzip or bzip2. This improves on known approaches based on Front-Coding [19] by more than 50% in absolute space occupancy, while still guaranteeing comparable query time.
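For context, the classic uncompressed Permuterm idea can be sketched in a few lines: every rotation of `word$` is stored in sorted order, and a query `X*Y` becomes a prefix search for `Y$X`. The code below is that plain version, not the compressed index proposed in the paper. It assumes words contain neither `$` nor `*`, and the function names are illustrative.

```python
from bisect import bisect_left


def build_permuterm(words):
    """Return a sorted list of (rotation of word + '$', word) pairs."""
    entries = []
    for w in words:
        t = w + "$"
        for r in range(len(t)):
            entries.append((t[r:] + t[:r], w))
    entries.sort()
    return entries


def wildcard_query(index, pattern):
    """Match a pattern containing at most one '*' against the dictionary."""
    exact = "*" not in pattern
    if exact:
        key = pattern + "$"            # must equal a whole rotation
    else:
        prefix, suffix = pattern.split("*", 1)
        key = suffix + "$" + prefix    # rotate so the wildcard comes last
    results = set()
    # (key,) sorts before every (key, word) pair, so this finds the first
    # rotation that is >= key.
    i = bisect_left(index, (key,))
    while i < len(index) and index[i][0].startswith(key):
        rotation, word = index[i]
        if not exact or rotation == key:
            results.add(word)
        i += 1
    return results
```

For the dictionary `["hello", "help", "held", "shell", "yellow"]`, the query `"hel*"` is rewritten to the prefix `"$hel"` and `"h*o"` to `"o$h"`. Storing every rotation is what inflates the space of the plain index.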