Results 1-10 of 13
Fully-compressed suffix trees
In: PACS 2000. LNCS, 2000
Cited by 21 (15 self)
Suffix trees are by far the most important data structure in stringology, with myriads of applications in fields like bioinformatics and information retrieval. Classical representations of suffix trees require O(n log n) bits of space, for a string of size n. This is considerably more than the n log₂ σ bits needed for the string itself, where σ is the alphabet size. The size of suffix trees has been a barrier to their wider adoption in practice. Recent compressed suffix tree representations require just the space of the compressed string plus Θ(n) extra bits. This is already spectacular, but still unsatisfactory when σ is small as in DNA sequences. In this paper we introduce the first compressed suffix tree representation that breaks this linear-space barrier. Our representation requires sublinear extra space and supports a large set of navigational operations in logarithmic time. An essential ingredient of our representation is the lowest common ancestor (LCA) query. We reveal important connections between LCA queries and suffix tree navigation.
Practical Compressed Suffix Trees
Cited by 8 (2 self)
The suffix tree is an extremely important data structure for stringology, with a wealth of applications in bioinformatics. Classical implementations require much space, which renders them useless for large problems. Recent research has yielded two implementations offering widely different space-time tradeoffs. However, each of them has practicality problems regarding either space or time requirements. In this paper we implement a recent theoretical proposal and show it yields an extremely interesting structure that lies in between, offering both practical times and affordable space. The implementation of the theoretical proposal is by no means trivial and involves significant algorithm engineering.
Inverted indexes for phrases and strings
In Proc. SIGIR, 2011
Cited by 4 (0 self)
Inverted indexes are the most fundamental and widely used data structures in information retrieval. For each unique word occurring in a document collection, the inverted index stores a list of the documents in which this word occurs. Compression techniques are often applied to further reduce the space requirement of these lists. However, the index has a shortcoming, in that only predefined pattern queries can be supported efficiently. For string documents where word boundaries are undefined, if we have to index all the substrings of a given document, then the storage quickly becomes quadratic in the data size. Also, if we want to apply the same type of index for querying phrases or sequences of words, then the inverted index will end up storing redundant information. In this paper, we show the first set of inverted
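The word-level inverted index described in this abstract, and the quadratic growth it mentions for substrings, can be illustrated with a minimal sketch. This is not the paper's index: the function names `build_inverted_index` and `substring_count`, the whitespace tokenization, and the sample documents are all assumptions made for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each unique word to the sorted list of document ids containing it."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs):
        # Assumption: words are separated by whitespace and compared case-insensitively.
        for word in text.lower().split():
            postings[word].add(doc_id)
    return {word: sorted(ids) for word, ids in postings.items()}


def substring_count(n):
    """Number of (start, end) substring occurrences in a string of length n."""
    return n * (n + 1) // 2


docs = ["the cat sat", "the dog", "cat and dog"]  # hypothetical collection
index = build_inverted_index(docs)
print(index["cat"])           # documents containing the word "cat"
print(substring_count(1000))  # grows quadratically with the string length
```

Indexing every substring as if it were a "word" would need one posting per substring occurrence in the worst case, which is why the naive approach becomes quadratic.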
Space-Efficient String Mining under Frequency Constraints
Cited by 4 (1 self)
Let D1 and D2 be two databases (i.e. multisets) of d strings, over an alphabet Σ, with overall length n. We study the problem of mining discriminative patterns between D1 and D2, e.g., patterns that are frequent in one database but not in the other, emerging patterns, or patterns satisfying other frequency-related constraints. Using the algorithmic framework by Hui (CPM 1992), one can solve several variants of this problem in the optimal linear time with the aid of suffix trees or suffix arrays. This stands in high contrast to other pattern domains such as itemsets or subgraphs, where superlinear lower bounds are known. However, the space requirement of existing solutions is O(n log n) bits, which is not optimal for |Σ| << n (in particular for constant |Σ|), as the databases themselves occupy only n log |Σ| bits. Because in many real-life applications space is a more critical resource than time, the aim of this article is to reduce the space, at the cost of an increased running time. In particular, we give a solution for the above problems that uses O(n log |Σ| + d log n) bits, while the time requirement is increased from the optimal linear time to O(n log n). Our new method is tested extensively on biologically relevant datasets and shown to be usable even on genome-scale data.
Range median of minima queries, super cartesian trees, and text indexing
Cited by 4 (3 self)
A Range Minimum Query asks for the position of a minimal element between two specified array indices. We consider a natural extension of this, where our further constraint is that if the minimum in a query interval is not unique, then the query should return an approximation of the median position among all positions that attain this minimum. We present a succinct preprocessing scheme using only about 2.54 n + o(n) bits in addition to the static input array, such that subsequent “range median of minima queries” can be answered in constant time. This data structure can be constructed in linear time, and only o(n) additional bits are needed at construction time. We introduce several new combinatorial concepts such as Super-Cartesian Trees and Super-Ballot Numbers, which we believe will have other interesting applications in the future. We stress the importance of our result by giving two applications in text indexing; in particular, we show that our ideas are needed for fast construction of one component in Compressed Suffix Trees [19], a versatile tool for numerous tasks in text processing, and that they can be used for fast pattern matching in (compressed) suffix arrays [14].
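To make the two query types in this abstract concrete, here is a brute-force sketch of their definitions. It is not the succinct constant-time structure the paper describes: it scans the interval on every query, and it returns the exact lower median rather than an approximation. The function names and the sample array are assumptions chosen for illustration.

```python
def range_min_positions(a, lo, hi):
    """Return every position in a[lo..hi] (inclusive) that attains the minimum."""
    window_min = min(a[lo:hi + 1])
    return [i for i in range(lo, hi + 1) if a[i] == window_min]


def range_minimum_query(a, lo, hi):
    """Plain RMQ: position of a minimal element (here, the leftmost one)."""
    return range_min_positions(a, lo, hi)[0]


def range_median_of_minima(a, lo, hi):
    """Median position among all positions attaining the minimum.

    With an even number of minima this picks the lower of the two middle ones.
    """
    positions = range_min_positions(a, lo, hi)
    return positions[(len(positions) - 1) // 2]


a = [3, 1, 4, 1, 5, 1, 2]  # hypothetical input; the minimum 1 occurs at positions 1, 3, 5
print(range_minimum_query(a, 0, 6))
print(range_median_of_minima(a, 0, 6))
```

Each query here costs time linear in the interval length, which is the cost the preprocessing scheme in the abstract is designed to avoid.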
String Searching in Referentially Compressed Genomes
Cited by 3 (2 self)
Genome compression, referential compression, string search. Background: Improved sequencing techniques have led to large amounts of biological sequence data. One of the challenges in managing sequence data is efficient storage. Recently, referential compression schemes, storing only the differences between a to-be-compressed input and a known reference sequence, gained a lot of interest in this field. However, so far sequences always have to be decompressed prior to an analysis. There is a need for algorithms working on compressed data directly, avoiding costly decompression. Summary: In our work, we address this problem by proposing an algorithm for exact string search over compressed data. The algorithm works directly on referentially compressed genome sequences, without needing an index for each genome and only using partial decompression. Results: Our string search algorithm for referentially compressed genomes performs exact string matching for large sets of genomes faster than using an index structure, e.g. suffix trees, for each genome, especially for short queries. We think that this is an important step towards space and runtime efficient management of large biological data sets.
Memory-Efficient GroupBy-Aggregate using Compressed Buffer Trees
Cited by 1 (0 self)
Memory is rapidly becoming a precious resource in many data processing environments. This paper introduces a new data structure called a Compressed Buffer Tree (CBT). Using a combination of buffering, compression, and lazy aggregation, CBTs can improve the memory efficiency of the GroupBy-Aggregate abstraction which forms the basis of many data processing models like MapReduce and databases. We evaluate CBTs in the context of MapReduce aggregation, and show that CBTs can provide significant advantages over existing hash-based aggregation techniques: up to 2× less memory and 1.5× the throughput, at the cost of 2.5× CPU.
On performance and cache effects in substring indexes
2007
Cited by 1 (1 self)
This report evaluates the performance of uncompressed and compressed substring indexes on build time, space usage and search performance. It is shown how the structures react to increasing data size, alphabet size and repetitiveness in the data. The main contribution is the strong relationship shown between time performance and locality in the data structures. As an example, it is shown that for a large alphabet, suffix tree construction can be sped up by a factor of 16, and query lookup by a factor of 8, if dynamic arrays are used to store the lists of children for each node instead of linked lists, at the cost of using about 20% more space. And for enhanced suffix arrays, query lookup is up to twice as fast if the data structure is stored as an array of structs instead of a set of arrays, at no extra space cost.