Fully-compressed suffix trees
In: PACS 2000. LNCS, 2000
Cited by 20 (14 self)
Suffix trees are by far the most important data structure in stringology, with myriads of applications in fields like bioinformatics and information retrieval. Classical representations of suffix trees require O(n log n) bits of space, for a string of size n. This is considerably more than the n log₂ σ bits needed for the string itself, where σ is the alphabet size. The size of suffix trees has been a barrier to their wider adoption in practice. Recent compressed suffix tree representations require just the space of the compressed string plus Θ(n) extra bits. This is already spectacular, but still unsatisfactory when σ is small as in DNA sequences. In this paper we introduce the first compressed suffix tree representation that breaks this linear-space barrier. Our representation requires sublinear extra space and supports a large set of navigational operations in logarithmic time. An essential ingredient of our representation is the lowest common ancestor (LCA) query. We reveal important connections between LCA queries and suffix tree navigation.
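The space gap described in this abstract can be made concrete with a little arithmetic. The following is only a rough sketch: it assumes a hypothetical layout with a single ⌈log₂ n⌉-bit pointer per text position (real pointer-based suffix trees store several such fields per node, so the hidden constant is larger), and the function names are illustrative, not taken from the paper.

```python
def ceil_log2(x: int) -> int:
    """Smallest k with 2**k >= x, for x >= 1 (integer arithmetic, no floats)."""
    return (x - 1).bit_length()


def text_bits(n: int, sigma: int) -> int:
    """Bits for the plain string: n symbols at ceil(log2 sigma) bits each."""
    return n * ceil_log2(sigma)


def pointer_bits(n: int) -> int:
    """Assumed stand-in for O(n log n): one ceil(log2 n)-bit pointer per position."""
    return n * ceil_log2(n)


# A DNA-like alphabet (sigma = 4) over a text of 2**30 symbols.
n, sigma = 2**30, 4
ratio = pointer_bits(n) // text_bits(n, sigma)
print(f"text: {text_bits(n, sigma)} bits, pointers: {pointer_bits(n)} bits, ratio: {ratio}x")
```

Under these assumptions the pointer structure is 15 times larger than the text, and the ratio grows with n because the text cost per symbol stays fixed at 2 bits.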
Practical Compressed Suffix Trees
Cited by 8 (2 self)
The suffix tree is an extremely important data structure for stringology, with a wealth of applications in bioinformatics. Classical implementations require much space, which renders them useless for large problems. Recent research has yielded two implementations offering widely different space-time tradeoffs. However, each of them has practicality problems regarding either space or time requirements. In this paper we implement a recent theoretical proposal and show it yields an extremely interesting structure that lies in between, offering both practical times and affordable space. The implementation of the theoretical proposal is by no means trivial and involves significant algorithm engineering.
Range median of minima queries, super cartesian trees, and text indexing
Cited by 4 (3 self)
A Range Minimum Query asks for the position of a minimal element between two specified array indices. We consider a natural extension of this, where our further constraint is that if the minimum in a query interval is not unique, then the query should return an approximation of the median position among all positions that attain this minimum. We present a succinct preprocessing scheme using only about 2.54n + o(n) bits in addition to the static input array, such that subsequent “range median of minima queries” can be answered in constant time. This data structure can be constructed in linear time, and only o(n) additional bits are needed at construction time. We introduce several new combinatorial concepts such as Super-Cartesian Trees and Super-Ballot Numbers, which we believe will have other interesting applications in the future. We stress the importance of our result by giving two applications in text indexing; in particular, we show that our ideas are needed for fast construction of one component in Compressed Suffix Trees [19], a versatile tool for numerous tasks in text processing, and that they can be used for fast pattern matching in (compressed) suffix arrays [14].
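To illustrate what a range median of minima query returns, here is a naive reference sketch. It is not the succinct constant-time scheme from the paper: it scans the interval in linear time and returns the exact (lower) median position rather than an approximation. The function name and the tie-breaking rule are assumptions made for illustration.

```python
def range_median_of_minima(a, lo, hi):
    """Among positions in a[lo..hi] (inclusive) that attain the minimum,
    return the middle one (the lower middle when their count is even).

    Naive O(hi - lo) scan, intended only to show the query semantics.
    """
    if not 0 <= lo <= hi < len(a):
        raise IndexError("query interval out of range")
    smallest = min(a[lo:hi + 1])
    positions = [i for i in range(lo, hi + 1) if a[i] == smallest]
    return positions[(len(positions) - 1) // 2]


a = [3, 1, 4, 1, 5, 1, 2]
print(range_median_of_minima(a, 0, 6))  # minimum 1 occurs at positions 1, 3, 5
print(range_median_of_minima(a, 4, 6))  # minimum 1 occurs only at position 5
```

A plain range minimum query would be free to return any of the positions 1, 3, or 5 for the first call; the extension asks specifically for one near the middle.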
Space-Efficient String Mining under Frequency Constraints
Cited by 4 (1 self)
Let D1 and D2 be two databases (i.e. multisets) of d strings, over an alphabet Σ, with overall length n. We study the problem of mining discriminative patterns between D1 and D2 (e.g., patterns that are frequent in one database but not in the other, emerging patterns, or patterns satisfying other frequency-related constraints). Using the algorithmic framework by Hui (CPM 1992), one can solve several variants of this problem in the optimal linear time with the aid of suffix trees or suffix arrays. This stands in sharp contrast to other pattern domains such as itemsets or subgraphs, where superlinear lower bounds are known. However, the space requirement of existing solutions is O(n log n) bits, which is not optimal for |Σ| << n (in particular for constant |Σ|), as the databases themselves occupy only n log |Σ| bits. Because in many real-life applications space is a more critical resource than time, the aim of this article is to reduce the space, at the cost of an increased running time. In particular, we give a solution for the above problems that uses O(n log |Σ| + d log n) bits, while the time requirement is increased from the optimal linear time to O(n log n). Our new method is tested extensively on biologically relevant datasets and shown to be usable even on genome-scale data.
Inverted indexes for phrases and strings
In Proc. SIGIR, 2011
Cited by 2 (0 self)
Inverted indexes are the most fundamental and widely used data structures in information retrieval. For each unique word occurring in a document collection, the inverted index stores a list of the documents in which this word occurs. Compression techniques are often applied to further reduce the space requirement of these lists. However, the index has a shortcoming, in that only predefined pattern queries can be supported efficiently. In terms of string documents where word boundaries are undefined, if we have to index all the substrings of a given document, then the storage quickly becomes quadratic in the data size. Also, if we want to apply the same type of indexes for querying phrases or sequences of words, then the inverted index will end up storing redundant information. In this paper, we show the first set of inverted
String Searching in Referentially Compressed Genomes
Cited by 2 (2 self)
Keywords: genome compression, referential compression, string search. Background: Improved sequencing techniques have led to large amounts of biological sequence data. One of the challenges in managing sequence data is efficient storage. Recently, referential compression schemes, storing only the differences between a to-be-compressed input and a known reference sequence, gained a lot of interest in this field. However, so far sequences always have to be decompressed prior to an analysis. There is a need for algorithms working on compressed data directly, avoiding costly decompression. Summary: In our work, we address this problem by proposing an algorithm for exact string search over compressed data. The algorithm works directly on referentially compressed genome sequences, without needing an index for each genome and only using partial decompression. Results: Our string search algorithm for referentially compressed genomes performs exact string matching for large sets of genomes faster than using an index structure, e.g. suffix trees, for each genome, especially for short queries. We think that this is an important step towards space and runtime efficient management of large biological data sets.
On performance and cache effects in substring indexes
2007
Cited by 1 (1 self)
This report evaluates the performance of uncompressed and compressed substring indexes on build time, space usage and search performance. It is shown how the structures react to increasing data size, alphabet size and repetitiveness in the data. The main contribution is the strong relationship shown between time performance and locality in the data structures. As an example, it is shown that for a large alphabet, suffix tree construction can be sped up by a factor of 16, and query lookup by a factor of 8, if dynamic arrays are used to store the lists of children for each node instead of linked lists, at the cost of using about 20% more space. For enhanced suffix arrays, query lookup is up to twice as fast if the data structure is stored as an array of structs instead of a set of arrays, at no extra space cost.
Finding Range Minima in the Middle: Approximations and Applications
A Range Minimum Query asks for the position of a minimal element between two specified array indices. We consider a natural extension of this, where our further constraint is that if the minimum in a query interval is not unique, then the query should return an approximation of the median position among all positions that attain this minimum. We present a succinct preprocessing scheme using Dn + o(n) bits in addition to the static input array (small constant D), such that subsequent “range median of minima queries” can be answered in constant time. This data structure can be built in linear time, with little extra space needed at construction time. We introduce several new combinatorial concepts such as Super-Cartesian Trees and Super-Ballot Numbers. We give applications of our preprocessing scheme in text indexes such as (compressed) suffix arrays and trees.
IEEE/ACM Transactions on Computational Biology and Bioinformatics, Vol. ??, No. ??, XX 20XX
Finding repetitive structures in genomes and proteins is important to understand their biological functions. Many data compressors for modern genomic sequences rely heavily on finding repeats in the sequences. Small-scale and local repetitive structures are better understood than large and complex interspersed ones. The notion of maximal repeats captures all the repeats in the data in a space-efficient way. Prior work on maximal repeat finding used either a suffix tree or a suffix array along with other auxiliary data structures. Their space usage is 19–50 times the text size with the best engineering efforts, prohibiting their usability on massive data such as the whole human genome. We focus on finding all the maximal repeats from massive texts in a time- and space-efficient manner. Our technique uses the Burrows-Wheeler Transform and wavelet trees. For data sets consisting of natural language texts and protein data, the space usage of our method is no more than three times the text size. For genomic sequences stored using one byte per base, the space usage of our method is less than double the sequence size. Our space-efficient method keeps the timing performance fast. In fact, our method is orders of magnitude faster than the prior methods for processing massive texts such as the whole human genome, since the prior methods must use external memory. For the first time, our method enables a desktop computer with 8GB internal memory (actual internal memory usage is less than 6GB) to find all the maximal repeats in the whole human genome in less than 17 hours. We have implemented our method as general-purpose open-source software for public use.
Memory-Efficient GroupBy-Aggregate using Compressed Buffer Trees
Memory is rapidly becoming a precious resource in many data processing environments. This paper introduces a new data structure called a Compressed Buffer Tree (CBT). Using a combination of buffering, compression, and lazy aggregation, CBTs can improve the memory efficiency of the GroupBy-Aggregate abstraction which forms the basis of many data processing models like MapReduce and databases. We evaluate CBTs in the context of MapReduce aggregation, and show that CBTs can provide significant advantages over existing hash-based aggregation techniques: up to 2× less memory and 1.5× the throughput, at the cost of 2.5× CPU.