Fully-compressed suffix trees
- IN: PACS 2000. LNCS
, 2000
Cited by 17 (12 self)
Suffix trees are by far the most important data structure in stringology, with myriads of applications in fields like bioinformatics and information retrieval. Classical representations of suffix trees require O(n log n) bits of space, for a string of size n. This is considerably more than the n log₂ σ bits needed for the string itself, where σ is the alphabet size. The size of suffix trees has been a barrier to their wider adoption in practice. Recent compressed suffix tree representations require just the space of the compressed string plus Θ(n) extra bits. This is already spectacular, but still unsatisfactory when σ is small as in DNA sequences. In this paper we introduce the first compressed suffix tree representation that breaks this linear-space barrier. Our representation requires sublinear extra space and supports a large set of navigational operations in logarithmic time. An essential ingredient of our representation is the lowest common ancestor (LCA) query. We reveal important connections between LCA queries and suffix tree navigation.
Optimal Succinctness for Range Minimum Queries
Cited by 17 (2 self)
For an array A of n objects from a totally ordered universe, a range minimum query rmq_A(i, j) asks for the position of the minimum element in the sub-array A[i, j]. We focus on the setting where the array A is static and known in advance, and can hence be preprocessed into a scheme in order to answer future queries faster. We make the further assumption that the input array A cannot be used at query time. Under this assumption, a natural lower bound of 2n − Θ(log n) bits for RMQ-schemes exists. We give the first truly succinct preprocessing scheme for O(1)-RMQs. Its final space consumption is 2n + o(n) bits, thus being asymptotically optimal. We also give a simple linear-time construction algorithm for this scheme that needs only n + o(n) bits of space in addition to the 2n + o(n) bits needed for the final data structure, thereby lowering the peak space consumption of previous schemes from O(n log n) to O(n) bits. We also improve on LCA-computation in BPS- and DFUDS-encoded trees.
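To make the query defined in this abstract concrete, here is a minimal Python sketch of rmq_A(i, j) answered in O(1) time with a classical sparse table. This is not the succinct scheme of the paper: the table below takes O(n log n) words and reads the input array at query time, which is exactly the setting the paper improves on. The function names are illustrative assumptions.

```python
def build_sparse_table(a):
    """table[k][i] is the position of the leftmost minimum of a[i : i + 2**k]."""
    n = len(a)
    table = [list(range(n))]          # level 0: every element is its own minimum
    k = 1
    while (1 << k) <= n:
        prev = table[k - 1]
        half = 1 << (k - 1)
        row = []
        for i in range(n - (1 << k) + 1):
            left, right = prev[i], prev[i + half]
            # "<=" keeps the leftmost position on ties
            row.append(left if a[left] <= a[right] else right)
        table.append(row)
        k += 1
    return table


def rmq(a, table, i, j):
    """Position of the leftmost minimum in a[i..j] (0-based, inclusive)."""
    k = (j - i + 1).bit_length() - 1  # largest k with 2**k <= j - i + 1
    left = table[k][i]                # minimum of the first 2**k elements
    right = table[k][j - (1 << k) + 1]  # minimum of the last 2**k elements
    return left if a[left] <= a[right] else right
```

The two overlapping power-of-two windows cover the whole query range, so comparing their two candidates is enough to answer the query.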
Fully-functional static and dynamic succinct trees
- CoRR abs/0905.0768. http://arxiv.org/abs/0905.0768. Version 4
, 2010
Cited by 14 (9 self)
We propose new succinct representations of ordinal trees, which have been studied extensively. It is known that any n-node static tree can be represented in 2n + o(n) bits and various operations on the tree can be supported in constant time under the word-RAM model. However, the data structures are complicated and difficult to dynamize. We propose a simple and flexible data structure, called the range min-max tree, that reduces the large number of relevant tree operations considered in the literature to a few primitives that are carried out in constant time on sufficiently small trees. The result is extended to trees of arbitrary size, achieving 2n + O(n / polylog(n)) bits of space. The redundancy is significantly lower than any previous proposal. For the dynamic case, where insertion/deletion of nodes is allowed, the existing data structures support very limited operations. Our data structure builds on the range min-max tree to achieve 2n + O(n / log n) bits of space and O(log n) time for all the operations. We also propose an improved data structure using 2n + O(n log log n / log n) bits and improving the time to O(log n / log log n) for most operations.
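The 2n-bit figure in this abstract comes from writing an n-node ordinal tree as a balanced-parentheses sequence. The Python sketch below shows that encoding and one navigation primitive answered by tracking the excess (opens minus closes). It assumes a tree given as nested lists and scans linearly; the range min-max tree of the paper, which stores minimum and maximum excess per block to avoid such scans, is not implemented here. All names are illustrative.

```python
def encode_bp(tree):
    """Encode a tree as balanced parentheses: 2 symbols (2 bits) per node.

    A node is represented as the list of its children, so [] is a leaf.
    """
    return "(" + "".join(encode_bp(child) for child in tree) + ")"


def find_close(bp, i):
    """Index of the ')' matching the '(' at position i (naive linear scan)."""
    if bp[i] != "(":
        raise ValueError("position i must hold an opening parenthesis")
    excess = 0
    for j in range(i, len(bp)):
        excess += 1 if bp[j] == "(" else -1
        if excess == 0:               # excess returns to zero at the match
            return j
    raise ValueError("unbalanced sequence")


def subtree_size(bp, i):
    """Number of nodes in the subtree whose opening parenthesis is at i."""
    return (find_close(bp, i) - i + 1) // 2
```

For a root with three children, the second of which has two leaf children, `encode_bp([[], [[], []], []])` gives a 12-symbol sequence for 6 nodes.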
Fast BWT in small space by blockwise suffix sorting
- In Proc. DIMACS Working Group on the Burrows-Wheeler Transform: Ten Years Later
Cited by 13 (2 self)
The usual way to compute the Burrows–Wheeler transform (BWT) [3] of a text is by constructing the suffix array of the text. Even with space-efficient suffix array construction algorithms [12, 2], the space requirement of the suffix array itself is often the main factor limiting the size of the text that can be handled in one piece, which is crucial for constructing compressed text indexes [4, 5]. Typically, the suffix array needs 4n bytes while the text and the BWT need only n bytes each and sometimes even less, for example 2n bits each for a DNA sequence. We reduce the space dramatically by constructing the suffix array in blocks of lexicographically consecutive suffixes. Given such a block, the corresponding block of the BWT is trivial to compute. Theorem 1. The BWT of a text of length n can be computed in O(n log n + n√v + D_v) time (with high probability) and O(n/√v + v) space (in addition to the text and the BWT), for any v ∈ [1, n]. Here D_v = Σ_{i∈[0,n)} min(d_i, v) = O(nv), where d_i is the length of the shortest unique substring starting at i. Proof (sketch). Assume first that the text has no repetitions longer than v, i.e., d_i ≤ v for all i. Choose a set of O(v) random suffixes that divide the suffix array into blocks. The sizes of the blocks
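The two ideas in this abstract, reading the BWT off a suffix array and producing it one block of lexicographically consecutive suffixes at a time, can be sketched in Python as follows. This is a naive illustration only: suffixes are compared as full strings, the splitter positions are supplied by the caller rather than chosen at random, and none of the time or space bounds of Theorem 1 are achieved. The `$` sentinel and all function names are assumptions.

```python
def suffix_array(text):
    """Naive suffix array: start positions sorted by the suffix they begin."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt(text, sentinel="$"):
    """BWT via the full suffix array; the sentinel must not occur in text."""
    assert sentinel not in text
    t = text + sentinel
    # The BWT symbol for suffix i is the character preceding it (cyclically).
    return "".join(t[i - 1] for i in suffix_array(t))


def bwt_blockwise(text, splitters, sentinel="$"):
    """Same output as bwt(), but only one block of suffixes is sorted at a time.

    splitters: start positions of suffixes used as block boundaries.
    """
    assert sentinel not in text
    t = text + sentinel
    bounds = sorted(splitters, key=lambda i: t[i:])
    out = []
    lo = None
    for hi in bounds + [None]:
        # Collect the suffixes that fall lexicographically in [lo, hi).
        block = [i for i in range(len(t))
                 if (lo is None or t[i:] >= t[lo:])
                 and (hi is None or t[i:] < t[hi:])]
        block.sort(key=lambda i: t[i:])
        out.extend(t[i - 1] for i in block)   # BWT block from the sorted block
        lo = hi
    return "".join(out)
```

For example, `bwt("banana")` and `bwt_blockwise("banana", [1, 4])` produce the same string.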
New Search Algorithms and Time/Space Tradeoffs for Succinct Suffix Arrays
, 2004
Cited by 12 (9 self)
This paper is about compressed full-text indexes. That is, our goal is to represent full-text indexes in as small a space as possible and, at the same time, retain the functionality of the index. The most important functionality for a full-text index is the ability to find out whether a given pattern string occurs inside the text string on which the index is built. In addition to supporting this existence query, full-text indexes usually support counting queries and reporting queries; the former is for counting the number of times the pattern occurs in the text, and the latter is for reporting the exact locations of the occurrences. Suffix trees and arrays are well-known full-text indexes that support the above queries nearly optimally. This optimality refers only to the time complexity of the queries, since in space requirement neither is optimal; both structures occupy O(n log n) bits, where n is the length of the text. Notice that the text itself can be represented in n log σ bits, where σ is the alphabet size. Since the text (in some form) is crucial for the full-text index, it is convenient to express the size of an index as the total size of the structure plus the text. Then obviously O(n log σ) space for a full-text index would be optimal. For compressible texts it is still possible to achieve a space requirement that is proportional to the entropy of the text.
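The three query types named in this abstract (existence, counting, reporting) can be illustrated on a plain, uncompressed suffix array with two binary searches. The Python sketch below is an assumption-laden illustration of the baseline the abstract describes, not of the compressed indexes the paper proposes; the suffix array is built naively and all names are hypothetical.

```python
def build_suffix_array(text):
    """Naive construction: sort suffix start positions by suffix."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def _suffix_range(text, sa, pattern):
    """Half-open range [start, end) of sa whose suffixes begin with pattern."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                    # first suffix whose m-prefix >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:                    # first suffix whose m-prefix > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return start, lo


def exists(text, sa, pattern):
    """Existence query: does pattern occur in text?"""
    start, end = _suffix_range(text, sa, pattern)
    return end > start


def count(text, sa, pattern):
    """Counting query: number of occurrences of pattern in text."""
    start, end = _suffix_range(text, sa, pattern)
    return end - start


def locate(text, sa, pattern):
    """Reporting query: sorted start positions of all occurrences."""
    start, end = _suffix_range(text, sa, pattern)
    return sorted(sa[start:end])
```

Existence and counting need only the two binary searches; reporting additionally reads one suffix array entry per occurrence.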
Antisequential Suffix Sorting for BWT-Based Data Compression
- IEEE Transactions on Computers
, 2005
Cited by 4 (0 self)
Suffix sorting requires ordering all suffixes of all symbols in an input sequence and has applications in running queries on large texts and in universal lossless data compression based on the Burrows Wheeler transform (BWT). We propose a new suffix lists data structure that leads to three fast, antisequential, and memory-efficient algorithms for suffix sorting. For a length-N input over a size-|X| alphabet, the worst-case complexities of these algorithms are Θ(N²), O(|X| N log(N/|X|)), and O(N √(|X| log(N/|X|))), respectively. Furthermore, simulation results indicate performance that is competitive with other suffix sorting methods. In contrast, the suffix sorting methods that are fastest on standard test corpora have poor worst-case performance. Therefore, in comparison with other suffix sorting methods, suffix lists offer a useful trade-off between practical performance and worst-case behavior. Another distinguishing feature of suffix lists is that these algorithms are simple; some of them can be implemented in VLSI. This could accelerate suffix sorting by at least an order of magnitude and enable high-speed BWT-based compression systems.
Suffix arrays on words
- In Proceedings of the 18th Annual Symposium on Combinatorial Pattern Matching, volume 4580 of LNCS
, 2007
Cited by 4 (2 self)
Surprisingly enough, it is not yet known how to build directly a suffix array that indexes just the k positions at word-boundaries of a text T[1,n], taking O(n) time and O(k) space in addition to T. We propose a class-note solution to this problem that achieves such optimal time and space bounds. Word-based versions of indexes achieving the same time/space bounds were already known for suffix trees [1,2] and (compact) DAWGs [3,4]. Our solution inherits the simplicity and efficiency of suffix arrays, with respect to such other word-indexes, and thus it foresees applications in word-based approaches to data compression [5] and computational linguistics [6]. To support this, we have run a large set of experiments showing that word-based suffix arrays may be constructed twice as fast as their full-text counterparts, and with a working space as low as 20%. The space reduction of the final word-based suffix array also improves query time (i.e., fewer random-access binary-search steps), which becomes faster by a factor of up to 3.
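As a small illustration of the object this abstract is about, the Python sketch below builds a suffix array that indexes only the k word-start positions of a text. It assumes words are maximal runs of non-whitespace characters and sorts by full string comparison, so it shows what the structure contains but does not achieve the O(n) time and O(k) extra space of the paper. The function name is hypothetical.

```python
import re


def word_suffix_array(text):
    """Start positions of words, ordered by the text suffix beginning there."""
    # Assumption: a word is a maximal run of non-whitespace characters.
    starts = [m.start() for m in re.finditer(r"\S+", text)]
    return sorted(starts, key=lambda i: text[i:])
```

On the text "to be or not to be" the array has 6 entries (one per word) instead of 18 (one per character), which is the source of the space and query-time savings the abstract reports.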
Stronger Lempel-Ziv Based Compressed Text Indexing
, 2008
Cited by 4 (3 self)
Given a text T[1..u] over an alphabet of size σ, the full-text search problem consists in finding the occ occurrences of a given pattern P[1..m] in T. In indexed text searching we build an index on T to improve the search time, at the cost of increasing the space requirement. The current trend in indexed text searching is that of compressed full-text self-indices, which replace the text with a more space-efficient representation of it, at the same time providing indexed access to the text. Thus, we can provide efficient access within compressed space. The LZ-index of Navarro is a compressed full-text self-index able to represent T using 4uH_k(T) + o(u log σ) bits of space, where H_k(T) denotes the k-th order empirical entropy of T, for any k = o(log_σ u). This space is about four times the compressed text size. It can locate all the occ occurrences of a pattern P in T in O(m³ log σ + (m + occ) log u) worst-case time. Although this index has been shown to be very competitive in practice, the O(m³ log σ) term can be excessive for long patterns. Also, the factor 4 in its space complexity makes it larger than other state-of-the-art alternatives. In this paper we present stronger Lempel-Ziv based indices, improving the overall performance of the LZ-index. We achieve indices requiring (2+ε)uH_k(T) + o(u log σ) bits of space, for any constant ε > 0, which makes our indices the smallest existing LZ-indices. We simultaneously improve the search time to
Space-Efficient String Mining under Frequency Constraints
Cited by 2 (0 self)
Let D1 and D2 be two databases (i.e. multisets) of d strings, over an alphabet Σ, with overall length n. We study the problem of mining discriminative patterns between D1 and D2 (e.g., patterns that are frequent in one database but not in the other, emerging patterns, or patterns satisfying other frequency-related constraints). Using the algorithmic framework by Hui (CPM 1992), one can solve several variants of this problem in the optimal linear time with the aid of suffix trees or suffix arrays. This stands in high contrast to other pattern domains such as itemsets or subgraphs, where super-linear lower bounds are known. However, the space requirement of existing solutions is O(n log n) bits, which is not optimal for |Σ| ≪ n (in particular for constant |Σ|), as the databases themselves occupy only n log |Σ| bits. Because in many real-life applications space is a more critical resource than time, the aim of this article is to reduce the space, at the cost of an increased running time. In particular, we give a solution for the above problems that uses O(n log |Σ| + d log n) bits, while the time requirement is increased from the optimal linear time to O(n log n). Our new method is tested extensively on biologically relevant datasets and shown to be usable even on genome-scale data.

