Space-efficient algorithms for document retrieval
- In Proc. CPM, volume 4580 of LNCS, 2007
Cited by 22 (1 self)
We study the Document Listing problem, where a collection D of documents d1, ..., dk of total length ∑_i d_i = n is to be preprocessed, so that one can later efficiently list all the ndoc documents containing a given query pattern P of length m as a substring. Muthukrishnan (SODA 2002) gave an optimal solution to the problem; with O(n) time preprocessing, one can answer the queries in O(m + ndoc) time. In this paper, we improve the space requirement of Muthukrishnan’s solution from O(n log n) bits to |CSA| + 2n + n log k (1 + o(1)) bits, where |CSA| ≤ n log |Σ| (1 + o(1)) is the size of any suitable compressed suffix array (CSA), and Σ is the underlying alphabet of the documents. The time requirement depends on the CSA used, but we can obtain e.g. the optimal O(m + ndoc) time when |Σ|, k = O(polylog(n)). For general |Σ|, k the time requirement becomes O(m log |Σ| + ndoc log k). Sadakane (ISAAC
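The listing idea attributed to Muthukrishnan in this abstract can be sketched in a few lines. The version below is a minimal illustration under stated assumptions, not the space-efficient structure of the paper: it uses a plain (uncompressed) suffix array built by sorting, a naive linear-scan range minimum instead of a constant-time RMQ structure, and a hypothetical separator character `SEP` that must not occur in documents or patterns. The names `build_index` and `list_documents` are invented for this example.

```python
from bisect import bisect_left, bisect_right

SEP = "\x00"  # assumed separator: must not occur in documents or patterns


def build_index(docs):
    """Build suffix array SA, document array DA and the array C, where
    C[j] is the previous position in DA holding the same document as DA[j]
    (or -1 if there is none)."""
    text = "".join(d + SEP for d in docs)
    doc_of = [i for i, d in enumerate(docs) for _ in range(len(d) + 1)]
    sa = sorted(range(len(text)), key=lambda i: text[i:])  # naive construction
    da = [doc_of[s] for s in sa]
    last, c = {}, []
    for j, d in enumerate(da):
        c.append(last.get(d, -1))
        last[d] = j
    return text, sa, da, c


def list_documents(index, pattern):
    """Return the sorted ids of all documents containing `pattern`."""
    text, sa, da, c = index
    m = len(pattern)
    # Suffix array range [sp, ep] of suffixes that start with the pattern.
    prefixes = [text[s:s + m] for s in sa]
    sp = bisect_left(prefixes, pattern)
    ep = bisect_right(prefixes, pattern) - 1
    found, stack = [], [(sp, ep)]
    while stack:
        lo, hi = stack.pop()
        if lo > hi:
            continue
        # Naive range minimum over C[lo..hi]; a real index would use O(1) RMQ.
        j = min(range(lo, hi + 1), key=c.__getitem__)
        if c[j] >= sp:
            continue  # every document in this subrange was already reported
        found.append(da[j])  # first occurrence of this document in [sp, ep]
        stack.append((lo, j - 1))
        stack.append((j + 1, hi))
    return sorted(found)
```

For example, `list_documents(build_index(["banana", "bandana", "cabana", "apple"]), "nd")` reports only document 1. Each document is reported exactly once because only its leftmost occurrence in the range has a C value below `sp`.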
Fully-compressed suffix trees
- In: PACS 2000. LNCS, 2000
Cited by 17 (12 self)
Suffix trees are by far the most important data structure in stringology, with myriads of applications in fields like bioinformatics and information retrieval. Classical representations of suffix trees require O(n log n) bits of space, for a string of size n. This is considerably more than the n log_2 σ bits needed for the string itself, where σ is the alphabet size. The size of suffix trees has been a barrier to their wider adoption in practice. Recent compressed suffix tree representations require just the space of the compressed string plus Θ(n) extra bits. This is already spectacular, but still unsatisfactory when σ is small as in DNA sequences. In this paper we introduce the first compressed suffix tree representation that breaks this linear-space barrier. Our representation requires sublinear extra space and supports a large set of navigational operations in logarithmic time. An essential ingredient of our representation is the lowest common ancestor (LCA) query. We reveal important connections between LCA queries and suffix tree navigation.
Optimal Succinctness for Range Minimum Queries
Cited by 17 (2 self)
Abstract. For an array A of n objects from a totally ordered universe, a range minimum query rmq_A(i, j) asks for the position of the minimum element in the sub-array A[i, j]. We focus on the setting where the array A is static and known in advance, and can hence be preprocessed into a scheme in order to answer future queries faster. We make the further assumption that the input array A cannot be used at query time. Under this assumption, a natural lower bound of 2n − Θ(log n) bits for RMQ-schemes exists. We give the first truly succinct preprocessing scheme for O(1)-RMQs. Its final space consumption is 2n + o(n) bits, thus being asymptotically optimal. We also give a simple linear-time construction algorithm for this scheme that needs only n + o(n) bits of space in addition to the 2n + o(n) bits needed for the final data structure, thereby lowering the peak space consumption of previous schemes from O(n log n) to O(n) bits. We also improve on LCA-computation in BPS- and DFUDS-encoded trees.
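To make the query itself concrete, here is a sketch of the classical sparse-table approach to O(1)-time range minimum queries. It is a swapped-in baseline, not the succinct scheme of the abstract: it stores O(n log n) positions and, unlike that scheme, it reads the input array at query time. The function names are chosen for this example only.

```python
def build_sparse_table(a):
    """table[k][i] = position of the leftmost minimum of a[i : i + 2**k]."""
    n = len(a)
    table = [list(range(n))]  # level 0: windows of length 1
    k = 1
    while (1 << k) <= n:
        prev, half = table[-1], 1 << (k - 1)
        row = []
        for i in range(n - (1 << k) + 1):
            left, right = prev[i], prev[i + half]
            row.append(left if a[left] <= a[right] else right)
        table.append(row)
        k += 1
    return table


def rmq(a, table, i, j):
    """Position of the leftmost minimum in a[i..j] (both ends inclusive)."""
    k = (j - i + 1).bit_length() - 1  # largest k with 2**k <= j - i + 1
    left = table[k][i]                 # window starting at i
    right = table[k][j - (1 << k) + 1]  # window ending at j
    return left if a[left] <= a[right] else right
```

The two windows of length 2**k overlap but together cover a[i..j] exactly, so comparing their two minima is enough; ties are resolved toward the left window, which yields the leftmost minimum.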
Fully-functional static and dynamic succinct trees
- CoRR abs/0905.0768. http://arxiv.org/abs/0905.0768. Version 4, 2010
Cited by 14 (9 self)
We propose new succinct representations of ordinal trees, which have been studied extensively. It is known that any n-node static tree can be represented in 2n + o(n) bits and various operations on the tree can be supported in constant time under the word-RAM model. However the data structures are complicated and difficult to dynamize. We propose a simple and flexible data structure, called the range min-max tree, that reduces the large number of relevant tree operations considered in the literature to a few primitives that are carried out in constant time on sufficiently small trees. The result is extended to trees of arbitrary size, achieving 2n + O(n/polylog(n)) bits of space. The redundancy is significantly lower than any previous proposal. For the dynamic case, where insertion/deletion of nodes is allowed, the existing data structures support very limited operations. Our data structure builds on the range min-max tree to achieve 2n + O(n / log n) bits of space and O(log n) time for all the operations. We also propose an improved data structure using 2n + O(n log log n / log n) bits and improving the time to O(log n / log log n) for most operations.
Fully-functional succinct trees
- In Proc. 21st SODA, 2010
Cited by 13 (6 self)
We propose new succinct representations of ordinal trees, which have been studied extensively. It is known that any n-node static tree can be represented in 2n + o(n) bits and a large number of operations on the tree can be supported in constant time under the word-RAM model. However existing data structures are not satisfactory in both theory and practice because (1) the lower-order term is Ω(n log log n / log n), which cannot be neglected in practice, (2) the hidden constant is also large, (3) the data structures are complicated and difficult to implement, and (4) the techniques do not extend to dynamic trees supporting insertions and deletions of nodes. We propose a simple and flexible data structure, called the range min-max tree, that reduces the large number of relevant tree operations considered in the literature to a few primitives, which are carried out in constant time on sufficiently small trees. The result is then extended to trees of arbitrary size, achieving 2n + O(n/polylog(n)) bits of space. The redundancy is significantly lower than in any previous proposal, and the data structure is easily implemented. Furthermore, using the same framework, we derive the first fully-functional dynamic succinct trees.
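The flavor of reducing tree operations to searches over excess values can be shown on a balanced-parentheses encoding. The sketch below is a heavily simplified, flat two-level stand-in for the idea (one minimum per block rather than a tree of minima and maxima), with an arbitrary block size `BLOCK` and invented function names; it is not the range min-max tree of the paper and does not achieve its bounds.

```python
BLOCK = 4  # assumed tiny block size, chosen only for illustration


def build(bp):
    """bp is a string of '(' and ')'. Returns the excess array (number of
    opens minus closes up to each position) and the minimum excess per block."""
    excess, e = [], 0
    for ch in bp:
        e += 1 if ch == "(" else -1
        excess.append(e)
    block_min = [min(excess[b:b + BLOCK]) for b in range(0, len(bp), BLOCK)]
    return excess, block_min


def find_close(bp, excess, block_min, i):
    """Position of the ')' matching the '(' at position i."""
    assert bp[i] == "("
    target = excess[i] - 1
    # 1) Scan the rest of the block containing i.
    for j in range(i + 1, min(len(bp), (i // BLOCK + 1) * BLOCK)):
        if excess[j] == target:
            return j
    # 2) Skip whole blocks whose minimum excess stays above the target.
    b = i // BLOCK + 1
    while b < len(block_min) and block_min[b] > target:
        b += 1
    if b == len(block_min):
        raise ValueError("unbalanced parentheses")
    # 3) The answer lies in block b; excess moves in steps of 1, so the
    #    first value not above the target equals the target.
    for j in range(b * BLOCK, min(len(bp), (b + 1) * BLOCK)):
        if excess[j] == target:
            return j
    raise ValueError("unbalanced parentheses")


def subtree_size(bp, excess, block_min, i):
    """Number of nodes in the subtree rooted at the node opened at i."""
    return (find_close(bp, excess, block_min, i) - i + 1) // 2
```

With `bp = "((()())(()))"`, the root opens at position 0 and closes at position 11, so its subtree has 6 nodes. Replacing the linear skip in step 2 by a balanced tree over the block minima is what would turn the skip into a logarithmic-time search.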
Top-k Ranked Document Search in General Text Databases
Cited by 13 (10 self)
Abstract. Text search engines return a set of k documents ranked by similarity to a query. Typically, documents and queries are drawn from natural language text, which can readily be partitioned into words, allowing optimizations of data structures and algorithms for ranking. However, in many new search domains (DNA, multimedia, OCR texts, Far East languages) there is often no obvious definition of words and traditional indexing approaches are not so easily adapted, or break down entirely. We present two new algorithms for ranking documents against a query without making any assumptions on the structure of the underlying text. We build on existing theoretical techniques, which we have implemented and compared empirically with new approaches introduced in this paper. Our best approach is significantly faster than existing methods in RAM, and is even three times faster than a state-of-the-art inverted file implementation for English text when word queries are issued.
An(other) entropy-bounded compressed suffix tree
- In Proceedings of the 19th Annual Symposium on Combinatorial Pattern Matching, volume 5029 of LNCS, 2008
Cited by 11 (9 self)
Abstract. Suffix trees are among the most important data structures in stringology, with myriads of applications. Their main problem is space usage, which has triggered much research striving for compressed representations that are still functional. We present a novel compressed suffix tree. Compared to the existing ones, ours is the first achieving at the same time sublogarithmic complexity for the operations, and space usage which goes to zero as the entropy of the text does. Our development contains several novel ideas, such as compressing the longest common prefix information, and totally getting rid of the suffix tree topology, expressing all the suffix tree operations using range minimum queries and a new primitive called next/previous smaller value in a sequence.
Faster Entropy-Bounded Compressed Suffix Trees
- 2009
Cited by 10 (7 self)
Suffix trees are among the most important data structures in stringology, with a number of applications in flourishing areas like bioinformatics. Their main problem is space usage, which has triggered much research striving for compressed representations that are still functional. A smaller suffix tree representation could fit in a faster memory, outweighing by far the theoretical slowdown brought by the space reduction. We present a novel compressed suffix tree, which is the first achieving at the same time sublogarithmic complexity for the operations, and space usage that asymptotically goes to zero as the entropy of the text does. The main ideas in our development are compressing the longest common prefix information, totally getting rid of the suffix tree topology, and expressing all the suffix tree operations using range minimum queries and a novel primitive called next/previous smaller value in a sequence. Our solutions to those operations are of independent interest.
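The next/previous smaller value primitive named in this abstract has a simple definition that a short sketch can pin down. The code below precomputes all answers with the standard linear-time stack technique and stores them in plain arrays; it is only a reference baseline for what the queries return, not the compressed solution the paper describes. The function names and the sentinel values (-1 and n for "no smaller value") are conventions chosen for this example.

```python
def previous_smaller(values):
    """psv[i] = largest j < i with values[j] < values[i], or -1 if none."""
    psv, stack = [], []
    for i, v in enumerate(values):
        # Drop candidates that are not strictly smaller than the current value.
        while stack and values[stack[-1]] >= v:
            stack.pop()
        psv.append(stack[-1] if stack else -1)
        stack.append(i)
    return psv


def next_smaller(values):
    """nsv[i] = smallest j > i with values[j] < values[i], or n if none."""
    n = len(values)
    nsv, stack = [n] * n, []
    for i in range(n - 1, -1, -1):
        while stack and values[stack[-1]] >= values[i]:
            stack.pop()
        nsv[i] = stack[-1] if stack else n
        stack.append(i)
    return nsv
```

Each index is pushed and popped at most once, so both passes run in linear time. For position i, the open interval between `psv[i]` and `nsv[i]` is the maximal range around i in which no value is smaller than `values[i]`.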
Colored Range Queries and Document Retrieval
Cited by 8 (5 self)
Abstract. Colored range queries are a well-studied topic in computational geometry and database research that, in the past decade, have found exciting applications in information retrieval. In this paper we give improved time and space bounds for three important one-dimensional colored range queries (colored range listing, colored range top-k queries and colored range counting) and, thus, new bounds for various document retrieval problems on general collections of sequences. Specifically, we first describe a framework including almost all recent results on colored range listing and document listing, which suggests new combinations of data structures for these problems. For example, we give the fastest compressed data structures for colored range listing and document listing, and an efficient data structure for document listing whose size is bounded in terms of the high-order entropies of the library of documents. We then show how (approximate) colored top-k queries can be reduced to (approximate) range-mode queries on subsequences, yielding the first efficient data structure for this problem. Finally, we show how a modified wavelet tree can support colored range counting in logarithmic time and space that is succinct whenever the number of colors is superpolylogarithmic in the length of the sequence.
Succinct Trees in Practice
Cited by 6 (6 self)
We implement and compare the major current techniques for representing general trees in succinct form. This is important because a general tree of n nodes is usually represented in pointer form, requiring O(n log n) bits, whereas the succinct representations we study require just 2n + o(n) bits and carry out many sophisticated operations in constant time. Yet, there is no exhaustive study in the literature comparing the practical magnitudes of the o(n)-space and the O(1)-time terms. The techniques can be classified into three broad trends: those based on BP (balanced parentheses in preorder), those based on DFUDS (depth-first unary degree sequence), and those based on LOUDS (level-ordered unary degree sequence). BP and DFUDS require a balanced parentheses representation that supports the core operations

