Results 1  10
of
63
Practical rank/select queries over arbitrary sequences
 In Proc. 15th SPIRE, LNCS 5280
, 2008
"... Abstract. We present a practical study on the compact representation of sequences supporting rank, select, and access queries. While there are several theoretical solutions to the problem, only a few have been tried out, and there is little idea on how the others would perform, especially in the cas ..."
Abstract

Cited by 49 (26 self)
 Add to MetaCart
(Show Context)
Abstract. We present a practical study on the compact representation of sequences supporting rank, select, and access queries. While there are several theoretical solutions to the problem, only a few have been tried out, and there is little idea on how the others would perform, especially in the case of sequences with very large alphabets. We first present a new practical implementation of the compressed representation for bit sequences proposed by Raman, Raman, and Rao [SODA 2002], that is competitive with the existing ones when the sequences are not too compressible. It also has nice local compression properties, and we show that this makes it an excellent tool for compressed text indexing in combination with the BurrowsWheeler transform. This shows the practicality of a recent theoretical proposal [Mäkinen and Navarro, SPIRE 2007], achieving spaces never seen before. Second, for general sequences, we tune wavelet trees for the case of very large alphabets, by removing their pointer information. We show that this gives an excellent solution for representing a sequence within zeroorder entropy space, in cases where the large alphabet poses a serious challenge to typical encoding methods. We also present the first implementation of Golynski et al.’s representation [SODA 2006], which offers another interesting time/space tradeoff. 1
Faster EntropyBounded Compressed Suffix Trees
, 2009
"... Suffix trees are among the most important data structures in stringology, with a number of applications in flourishing areas like bioinformatics. Their main problem is space usage, which has triggered much research striving for compressed representations that are still functional. A smaller suffix t ..."
Abstract

Cited by 31 (15 self)
 Add to MetaCart
Suffix trees are among the most important data structures in stringology, with a number of applications in flourishing areas like bioinformatics. Their main problem is space usage, which has triggered much research striving for compressed representations that are still functional. A smaller suffix tree representation could fit in a faster memory, outweighing by far the theoretical slowdown brought by the space reduction. We present a novel compressed suffix tree, which is the first achieving at the same time sublogarithmic complexity for the operations, and space usage that asymptotically goes to zero as the entropy of the text does. The main ideas in our development are compressing the longest common prefix information, totally getting rid of the suffix tree topology, and expressing all the suffix tree operations using range minimum queries and a novel primitive called next/previous smaller value in a sequence. Our solutions to those operations are of independent interest.
Alphabetindependent compressed text indexing
 In ESA
, 2011
"... Abstract. Selfindexes can represent a text in asymptotically optimal space under the kth order entropy model, give access to text substrings, and support indexed pattern searches. Their time complexities are not optimal, however: they always depend on the alphabet size. In this paper we achieve, f ..."
Abstract

Cited by 25 (17 self)
 Add to MetaCart
Abstract. Selfindexes can represent a text in asymptotically optimal space under the kth order entropy model, give access to text substrings, and support indexed pattern searches. Their time complexities are not optimal, however: they always depend on the alphabet size. In this paper we achieve, for the first time, full alphabetindependence in the time complexities of selfindexes, while retaining space optimality. We obtain also some relevant byproducts on compressed suffix trees. 1
Textual data compression in computational biology: a synopsis
 Bioinformatics
, 2009
"... Motivation: Textual data compression, and the associated techniques coming from information theory, are often perceived as being of interest for data communication and storage. However, they are also deeply related to classification and data mining and analysis. In recent years, a substantial effort ..."
Abstract

Cited by 24 (1 self)
 Add to MetaCart
Motivation: Textual data compression, and the associated techniques coming from information theory, are often perceived as being of interest for data communication and storage. However, they are also deeply related to classification and data mining and analysis. In recent years, a substantial effort has been made for the application of textual data compression techniques to various computational biology tasks, ranging from storage and indexing of large datasets to comparison and reverse engineering of biological networks. Results: The main focus of this review is on a systematic presentation of the key areas of bioinformatics and computational biology where compression has been used. When possible, a unifying organization of the main ideas and techniques is also provided. Availability: It goes without saying that most of the research results reviewed here offer software prototypes to the bioinformatics community. The supplementary material (see next) provides pointers to software and benchmark datasets for a range of applications of broad interest. Contact:
Compressing and indexing labeled trees, with applications
, 2009
"... Consider an ordered, static tree T where each node has a label from alphabet �. Tree T may be of arbitrary degree and shape. Our goal is designing a compressed storage scheme of T that supports basic navigational operations among the immediate neighbors of a node (i.e. parent, ith child, or any chi ..."
Abstract

Cited by 22 (1 self)
 Add to MetaCart
Consider an ordered, static tree T where each node has a label from alphabet �. Tree T may be of arbitrary degree and shape. Our goal is designing a compressed storage scheme of T that supports basic navigational operations among the immediate neighbors of a node (i.e. parent, ith child, or any child with some label,...) as well as more sophisticated pathbased search operations over its labeled structure. We present a novel approach to this problem by designing what we call the XBWtransform of the tree in the spirit of the wellknown BurrowsWheeler transform for strings [1994]. The XBWtransform uses pathsorting to linearize the labeled tree T into two coordinated arrays, one capturing the structure and the other the labels. For the first time, by using the properties of the XBWtransform, our compressed indexes go beyond the informationtheoretic lower bound, and support navigational and pathsearch operations over labeled trees within (near)optimal time bounds and entropybounded space. Our XBWtransform is simple and likely to spur new results in the theory of tree compression and indexing, as well as interesting application contexts. As an example, we use the XBWtransform to design and implement a compressed index for XML documents whose compression ratio is significantly better than the one achievable by stateoftheart tools, and its query time performance is order
Practical Compressed Suffix Trees
"... The suffix tree is an extremely important data structure for stringology, with a wealth of applications in bioinformatics. Classical implementations require much space, which renders them useless for large problems. Recent research has yielded two implementations offering widely different spacetime ..."
Abstract

Cited by 20 (5 self)
 Add to MetaCart
(Show Context)
The suffix tree is an extremely important data structure for stringology, with a wealth of applications in bioinformatics. Classical implementations require much space, which renders them useless for large problems. Recent research has yielded two implementations offering widely different spacetime tradeoffs. However, each of them has practicality problems regarding either space or time requirements. In this paper we implement a recent theoretical proposal and show it yields an extremely interesting structure that lies in between, offering both practical times and affordable space. The implementation of the theoretical proposal is by no means trivial and involves significant algorithm engineering.
Practical Compressed Document Retrieval
"... Recent research on document retrieval for general texts has established the virtues of explicitly representing the socalled document array, which stores the document each pointer of the suffix array belongs to. While it makes document retrieval faster, this array occupies a significative amount of ..."
Abstract

Cited by 19 (16 self)
 Add to MetaCart
Recent research on document retrieval for general texts has established the virtues of explicitly representing the socalled document array, which stores the document each pointer of the suffix array belongs to. While it makes document retrieval faster, this array occupies a significative amount of redundant space and is not easily compressible. In this paper we present the first practical proposal to compress the document array. We show that the resulting structureis significatively smaller than the uncompressed counterpart, and than alternatives to the document array proposed in the literature. We also compare various known algorithms for document listing and topk retrieval, and find that the most useful combinations of algorithms run over our new compressed document arrays.
Directly Addressable VariableLength Codes
"... We introduce a symbol reordering technique that implicitly synchronizes variablelength codes, such that it is possible to directly access the ith codeword without need of any sampling method. The technique is practical and has many applications to the representation of ordered sets, sparse bitmaps ..."
Abstract

Cited by 18 (10 self)
 Add to MetaCart
(Show Context)
We introduce a symbol reordering technique that implicitly synchronizes variablelength codes, such that it is possible to directly access the ith codeword without need of any sampling method. The technique is practical and has many applications to the representation of ordered sets, sparse bitmaps, partial sums, and compressed data structures for suffix trees, arrays, and inverted indexes, to name just a few. We show experimentally that the technique offers a competitive alternative to other data structures that handle this problem.
Stronger LempelZiv Based Compressed Text Indexing
, 2008
"... Given a text T[1..u] over an alphabet of size σ, the fulltext search problem consists in finding the occ occurrences of a given pattern P[1..m] in T. In indexed text searching we build an index on T to improve the search time, yet increasing the space requirement. The current trend in indexed text ..."
Abstract

Cited by 17 (7 self)
 Add to MetaCart
Given a text T[1..u] over an alphabet of size σ, the fulltext search problem consists in finding the occ occurrences of a given pattern P[1..m] in T. In indexed text searching we build an index on T to improve the search time, yet increasing the space requirement. The current trend in indexed text searching is that of compressed fulltext selfindices, which replace the text with a more spaceefficient representation of it, at the same time providing indexed access to the text. Thus, we can provide efficient access within compressed space. The LZindex of Navarro is a compressed fulltext selfindex able to represent T using 4uHk(T) + o(u log σ) bits of space, where Hk(T) denotes the kth order empirical entropy of T, for any k = o(log σ u). This space is about four times the compressed text size. It can locate all the occ occurrences of a pattern P in T in O(m 3 log σ+(m+occ) log u) worstcase time. Despite this index has shown to be very competitive in practice, the O(m 3 log σ) term can be excessive for long patterns. Also, the factor 4 in its space complexity makes it larger than other stateoftheart alternatives. In this paper we present stronger LempelZiv based indices, improving the overall performance of the LZindex. We achieve indices requiring (2+ǫ)uHk(T)+o(u log σ) bits of space, for any constant ǫ> 0, which makes our indices the smallest existing LZindices. We simultaneously improve the search time to
Compressed String Dictionaries
"... The problem of storing a set of strings – a string dictionary – in compact form appears naturally in many cases. While classically it has represented a small part of the whole data to be processed (e.g., for Natural Language processing or for indexing text collections), recent applications in Web ..."
Abstract

Cited by 16 (6 self)
 Add to MetaCart
(Show Context)
The problem of storing a set of strings – a string dictionary – in compact form appears naturally in many cases. While classically it has represented a small part of the whole data to be processed (e.g., for Natural Language processing or for indexing text collections), recent applications in Web engines, RDF graphs, Bioinformatics, and many others, handle very large string dictionaries, whose size is a significant fraction of the whole data. Thus efficient approaches to compress them are necessary. In this paper we empirically compare time and space performance of some existing alternatives, as well as new ones we propose. We show that space reductions of up to 20 % of the original size of the strings is possible while supporting dictionary searches within a few microseconds, and up to 10 % within a few tens or hundreds of microseconds.