Results 11–20 of 35
Optimal Succinctness for Range Minimum Queries
"... Abstract. For an array A of n objects from a totally ordered universe, a range minimum query rmq A(i, j) asks for the position of the minimum element in the subarray A[i, j]. We focus on the setting where the array A is static and known in advance, and can hence be preprocessed into a scheme in ord ..."
Abstract

Cited by 22 (2 self)
Abstract. For an array A of n objects from a totally ordered universe, a range minimum query rmq_A(i, j) asks for the position of the minimum element in the subarray A[i, j]. We focus on the setting where the array A is static and known in advance, and can hence be preprocessed into a scheme in order to answer future queries faster. We make the further assumption that the input array A cannot be used at query time. Under this assumption, a natural lower bound of 2n − Θ(log n) bits for RMQ schemes exists. We give the first truly succinct preprocessing scheme for O(1) RMQs. Its final space consumption is 2n + o(n) bits, thus being asymptotically optimal. We also give a simple linear-time construction algorithm for this scheme that needs only n + o(n) bits of space in addition to the 2n + o(n) bits needed for the final data structure, thereby lowering the peak space consumption of previous schemes from O(n log n) to O(n) bits. We also improve on LCA computation in BPS- and DFUDS-encoded trees.
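For contrast with the 2n + o(n)-bit scheme above, the classical non-succinct baseline is the sparse table, which also answers RMQs in O(1) time but needs O(n log n) words. A minimal sketch of that baseline (not the paper's data structure):

```python
def build_sparse_table(A):
    """Classical sparse table for O(1) range-minimum queries.
    It uses O(n log n) words, i.e. far more than the 2n + o(n) bits
    achieved by the succinct scheme described in the abstract."""
    n = len(A)
    # st[k][i] = position of the minimum of A[i .. i + 2^k - 1]
    st = [list(range(n))]
    k = 1
    while (1 << k) <= n:
        prev = st[k - 1]
        half = 1 << (k - 1)
        row = []
        for i in range(n - (1 << k) + 1):
            l, r = prev[i], prev[i + half]
            row.append(l if A[l] <= A[r] else r)
        st.append(row)
        k += 1
    return st

def rmq(A, st, i, j):
    """Position of the minimum of A[i..j] (inclusive) in O(1) time:
    cover the range with two overlapping power-of-two blocks."""
    k = (j - i + 1).bit_length() - 1
    l, r = st[k][i], st[k][j - (1 << k) + 1]
    return l if A[l] <= A[r] else r
```

Ties are broken toward the leftmost position, so repeated minima still yield a valid answer.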
Fully-compressed suffix trees
 In: PACS 2000, LNCS
, 2000
"... Suffix trees are by far the most important data structure in stringology, with myriads of applications in fields like bioinformatics and information retrieval. Classical representations of suffix trees require O(n log n) bits of space, for a string of size n. This is considerably more than the nlog ..."
Abstract

Cited by 20 (14 self)
Suffix trees are by far the most important data structure in stringology, with myriads of applications in fields like bioinformatics and information retrieval. Classical representations of suffix trees require O(n log n) bits of space, for a string of size n. This is considerably more than the n log₂ σ bits needed for the string itself, where σ is the alphabet size. The size of suffix trees has been a barrier to their wider adoption in practice. Recent compressed suffix tree representations require just the space of the compressed string plus Θ(n) extra bits. This is already spectacular, but still unsatisfactory when σ is small as in DNA sequences. In this paper we introduce the first compressed suffix tree representation that breaks this linear-space barrier. Our representation requires sublinear extra space and supports a large set of navigational operations in logarithmic time. An essential ingredient of our representation is the lowest common ancestor (LCA) query. We reveal important connections between LCA queries and suffix tree navigation.
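The LCA connection mentioned above can be seen on plain suffixes: the string depth of the LCA of two suffix-tree leaves equals the length of the longest common prefix of the corresponding suffixes. A naive illustration (quadratic-time construction, nothing compressed about it):

```python
def suffix_array(s):
    """Naive suffix array: sort suffix start positions lexicographically.
    O(n^2 log n) in the worst case; for illustration only."""
    return sorted(range(len(s)), key=lambda i: s[i:])

def lcp_len(s, i, j):
    """Length of the longest common prefix of suffixes s[i:] and s[j:].
    In the suffix tree of s this is exactly the string depth of the
    LCA of the two corresponding leaves."""
    k = 0
    while i + k < len(s) and j + k < len(s) and s[i + k] == s[j + k]:
        k += 1
    return k
```

For "banana", suffixes 1 ("anana") and 3 ("ana") share the prefix "ana", so their LCA has string depth 3.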
Fast BWT in small space by blockwise suffix sorting
 In Proc. DIMACS Working Group on the Burrows-Wheeler Transform: Ten Years Later
"... The usual way to compute the Burrows–Wheeler transform (BWT) [3] of a text is by constructing the suffix array of the text. Even with spaceefficient suffix array construction algorithms [12, 2], the space requirement of the suffix array itself is often the main factor limiting the size of the text ..."
Abstract

Cited by 14 (2 self)
The usual way to compute the Burrows–Wheeler transform (BWT) [3] of a text is by constructing the suffix array of the text. Even with space-efficient suffix array construction algorithms [12, 2], the space requirement of the suffix array itself is often the main factor limiting the size of the text that can be handled in one piece, which is crucial for constructing compressed text indexes [4, 5]. Typically, the suffix array needs 4n bytes while the text and the BWT need only n bytes each and sometimes even less, for example 2n bits each for a DNA sequence. We reduce the space dramatically by constructing the suffix array in blocks of lexicographically consecutive suffixes. Given such a block, the corresponding block of the BWT is trivial to compute. Theorem 1. The BWT of a text of length n can be computed in O(n log n + n√v + D_v) time (with high probability) and O(n/√v + v) space (in addition to the text and the BWT), for any v ∈ [1, n]. Here D_v = Σ_{i∈[0,n)} min(d_i, v) = O(nv), where d_i is the length of the shortest unique substring starting at i. Proof (sketch). Assume first that the text has no repetitions longer than v, i.e., d_i ≤ v for all i. Choose a set of O(v) random suffixes that divide the suffix array into blocks. The sizes of the blocks
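The reduction from suffix array to BWT that the theorem exploits is simple: once the suffixes in a block are sorted, the BWT characters of that block are just the characters preceding those suffixes. A whole-array sketch (the paper's point is producing the array, and hence the BWT, block by block instead):

```python
def bwt_via_suffix_array(text):
    """BWT of text from its full suffix array. Assumes text ends with a
    unique sentinel '$' smaller than every other character. This is the
    4n-byte baseline; the blockwise method never holds the whole array."""
    assert text.endswith('$')
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT[k] is the character preceding the k-th smallest suffix;
    # i == 0 wraps around to text[-1], which is the sentinel.
    return ''.join(text[i - 1] for i in sa)
```

For "banana$" this yields the familiar "annb$aa".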
New Search Algorithms and Time/Space Tradeoffs for Succinct Suffix Arrays
, 2004
"... Abstract This paper is about compressed fulltext indexes. That is, our goal is to represent fulltext indexes in as small space as possible and, at the same time, retain the functionality of the index. The most important functionality for a fulltext index is the ability to find out whether a given ..."
Abstract

Cited by 12 (9 self)
Abstract. This paper is about compressed full-text indexes. That is, our goal is to represent full-text indexes in as small space as possible and, at the same time, retain the functionality of the index. The most important functionality for a full-text index is the ability to find out whether a given pattern string occurs inside the text string on which the index is built. In addition to supporting this existence query, full-text indexes usually support counting queries and reporting queries; the former is for counting the number of times the pattern occurs in the text, and the latter is for reporting the exact locations of the occurrences. Suffix trees and arrays are well-known full-text indexes that support the above queries nearly optimally. This optimality refers only to the time complexity of the queries, since in space requirement neither is optimal; both structures occupy O(n log n) bits, where n is the length of the text. Notice that the text itself can be represented in n log σ bits, where σ is the alphabet size. Since the text (in some form) is crucial for the full-text index, it is convenient to express the size of an index as the total size of the structure plus the text. Then obviously O(n log σ) space for a full-text index would be optimal. For compressible texts it is still possible to achieve space requirement that is proportional to the entropy of the text.
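The counting query discussed above can be answered on a plain suffix array with two binary searches over suffix prefixes; an illustrative sketch of exactly the O(n log n)-bit structure whose space cost the paper seeks to avoid:

```python
def count_occurrences(text, pattern):
    """Counting query via binary search on a (naively built) suffix array:
    occurrences of pattern form a contiguous range of the array."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    m = len(pattern)
    # Lower bound: first suffix whose m-char prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Upper bound: first suffix whose m-char prefix is > pattern.
    lo, hi = start, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo - start
```

Reporting queries would simply list sa[start:lo] instead of returning the count.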
Stronger Lempel-Ziv Based Compressed Text Indexing
, 2008
"... Given a text T[1..u] over an alphabet of size σ, the fulltext search problem consists in finding the occ occurrences of a given pattern P[1..m] in T. In indexed text searching we build an index on T to improve the search time, yet increasing the space requirement. The current trend in indexed text ..."
Abstract

Cited by 8 (5 self)
Given a text T[1..u] over an alphabet of size σ, the full-text search problem consists in finding the occ occurrences of a given pattern P[1..m] in T. In indexed text searching we build an index on T to improve the search time, yet increasing the space requirement. The current trend in indexed text searching is that of compressed full-text self-indices, which replace the text with a more space-efficient representation of it, at the same time providing indexed access to the text. Thus, we can provide efficient access within compressed space. The LZ-index of Navarro is a compressed full-text self-index able to represent T using 4uH_k(T) + o(u log σ) bits of space, where H_k(T) denotes the k-th order empirical entropy of T, for any k = o(log_σ u). This space is about four times the compressed text size. It can locate all the occ occurrences of a pattern P in T in O(m³ log σ + (m + occ) log u) worst-case time. Although this index has been shown to be very competitive in practice, the O(m³ log σ) term can be excessive for long patterns. Also, the factor 4 in its space complexity makes it larger than other state-of-the-art alternatives. In this paper we present stronger Lempel-Ziv based indices, improving the overall performance of the LZ-index. We achieve indices requiring (2 + ε)uH_k(T) + o(u log σ) bits of space, for any constant ε > 0, which makes our indices the smallest existing LZ-indices. We simultaneously improve the search time to
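LZ-indices like the one described are built on a Lempel-Ziv phrase parse of the text rather than on its suffixes. An LZ78-style parse (shown here only to illustrate the phrase structure such indices store) can be computed in one pass over the text:

```python
def lz78_parse(text):
    """LZ78 parse: each phrase extends a previously seen phrase by one
    character. Returns a list of (parent_phrase_id, char) pairs, where
    phrase ids start at 1 and id 0 is the empty phrase."""
    trie = {}       # (parent_phrase_id, char) -> phrase_id
    phrases = []
    node = 0        # current position in the phrase trie
    for c in text:
        if (node, c) in trie:
            node = trie[(node, c)]          # keep extending the match
        else:
            phrases.append((node, c))       # close a new phrase
            trie[(node, c)] = len(phrases)
            node = 0
    if node:
        phrases.append((node, ''))          # flush a pending partial phrase
    return phrases
```

"abababa" parses into the phrases a | b | ab | aba, and the number of phrases relates to the compressed size the index occupies.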
Suffix arrays on words
 In Proceedings of the 18th Annual Symposium on Combinatorial Pattern Matching, volume 4580 of LNCS
, 2007
"... Abstract. Surprisingly enough, it is not yet known how to build directly a suffix array that indexes just the k positions at wordboundaries of a text T[1,n], taking O(n)timeandO(k) space in addition to T.Wepropose a classnote solution to this problem that achieves such optimal time and space bound ..."
Abstract

Cited by 6 (2 self)
Abstract. Surprisingly enough, it is not yet known how to build directly a suffix array that indexes just the k positions at word-boundaries of a text T[1, n], taking O(n) time and O(k) space in addition to T. We propose a class-note solution to this problem that achieves such optimal time and space bounds. Word-based versions of indexes achieving the same time/space bounds were already known for suffix trees [1, 2] and (compact) DAWGs [3, 4]. Our solution inherits the simplicity and efficiency of suffix arrays, with respect to such other word-indexes, and thus it foresees applications in word-based approaches to data compression [5] and computational linguistics [6]. To support this, we have run a large set of experiments showing that word-based suffix arrays may be constructed twice as fast as their full-text counterparts, and with a working space as low as 20%. The space reduction of the final word-based suffix array also impacts their query time (i.e. fewer random-access binary-search steps!), being faster by a factor of up to 3.
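The object being built can be sketched naively: keep only the suffixes starting at word boundaries and sort them. The sketch below uses full suffix comparisons, so it does not meet the paper's O(n) time and O(k) extra-space bounds, but it shows the structure being indexed:

```python
def word_suffix_array(text, separator=' '):
    """Naive word suffix array: index only the k suffixes that start at
    word boundaries (position 0 and each position following a separator).
    Sorting by whole suffixes is far from the paper's optimal bounds;
    for illustration only."""
    boundaries = [0] + [i + 1 for i, c in enumerate(text)
                        if c == separator and i + 1 < len(text)]
    return sorted(boundaries, key=lambda i: text[i:])
```

Binary searching this array then finds only word-aligned occurrences of a pattern, which is the point of word-based indexing.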
Self-Indexed Grammar-Based Compression
, 2001
"... Selfindexes aim at representing text collections in a compressed format that allows extracting arbitrary portions and also offers indexed searching on the collection. Current selfindexes are unable of fully exploiting the redundancy of highly repetitive text collections that arise in several appl ..."
Abstract

Cited by 5 (3 self)
Self-indexes aim at representing text collections in a compressed format that allows extracting arbitrary portions and also offers indexed searching on the collection. Current self-indexes are unable to fully exploit the redundancy of highly repetitive text collections that arise in several applications. Grammar-based compression is well suited to exploit such repetitiveness. We introduce the first grammar-based self-index. It builds on Straight-Line Programs (SLPs), a rather general kind of context-free grammars. If an SLP of n rules represents a text T[1, u], then an SLP-compressed representation of T requires 2n log₂ n bits. For that same SLP, our self-index takes O(n log n) + n log₂ u bits. It extracts any text substring of length m in time O((m + h) log n), and finds occ occurrences of a pattern string of length m in time O((m(m + h) + h occ) log n), where h is the height of the parse tree of the SLP. No previous grammar representation had achieved o(n) search time. As byproducts we introduce (i) a representation of SLPs that takes 2n log₂ n(1 + o(1)) bits and efficiently supports more operations than a plain array of rules; (ii) a representation for binary relations with labels supporting various extended queries; (iii) a generalization of our self-index to grammar
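Substring extraction from an SLP, the operation the abstract bounds by O((m + h) log n), works by descending the grammar using precomputed expansion lengths, visiting only the rules that overlap the queried range. A minimal sketch with a plain dictionary of rules (none of the paper's succinct machinery):

```python
def slp_length(rules, sym, memo):
    """Length of the expansion of sym; terminals are single characters."""
    if sym not in rules:
        return 1
    if sym not in memo:
        left, right = rules[sym]
        memo[sym] = slp_length(rules, left, memo) + slp_length(rules, right, memo)
    return memo[sym]

def slp_extract(rules, sym, i, j, memo):
    """Return positions [i, j) of the expansion of sym without expanding
    the whole text: recurse only into children overlapping the range."""
    if sym not in rules:
        return sym[i:j]                     # terminal: expansion is sym itself
    left, right = rules[sym]
    llen = slp_length(rules, left, memo)
    out = ''
    if i < llen:                            # range overlaps the left child
        out += slp_extract(rules, left, i, min(j, llen), memo)
    if j > llen:                            # range overlaps the right child
        out += slp_extract(rules, right, max(0, i - llen), j - llen, memo)
    return out
```

With rules X1 → ab and X2 → X1 X1 (so X2 expands to "abab"), extracting [1, 3) returns "ba" while touching only the rules on the two root-to-leaf paths.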
MORE HASTE, LESS WASTE: LOWERING THE REDUNDANCY IN FULLY INDEXABLE DICTIONARIES
, 2009
"... We consider the problem of representing, in a compressed format, a bitvector S of m bits with n 1s, supporting the following operations, where b ∈ {0,1}: • rankb(S, i) returns the number of occurrences of bit b in the prefix S [1..i]; • selectb(S, i) returns the position of the ith occurrence of bi ..."
Abstract

Cited by 5 (0 self)
We consider the problem of representing, in a compressed format, a bitvector S of m bits with n 1s, supporting the following operations, where b ∈ {0, 1}: • rank_b(S, i) returns the number of occurrences of bit b in the prefix S[1..i]; • select_b(S, i) returns the position of the i-th occurrence of bit b in S. Such a data structure is called a fully indexable dictionary (fid) [Raman, Raman, and Rao, 2007], and is at least as powerful as predecessor data structures. Viewing S as a set X = {x1, x2, ..., xn} of n distinct integers drawn from a universe [m] = {1, ..., m}, the predecessor of integer y ∈ [m] in X is given by select1(S, rank1(S, y − 1)). fids have many applications in succinct and compressed data structures, as they are often involved in the construction of succinct representations for a variety of abstract data types. Our focus is on space-efficient fids on the ram model with word size Θ(lg m) and constant time for all operations, so that the time cost is independent of the input size. Given the bitstring S to be encoded, having length m and containing n ones, the minimal amount of information that needs to be stored is B(n, m) = ⌈log (m choose n)⌉. The state of the art in building a fid for S is given in [Pătraşcu, 2008] using B(n, m) + O(m/(log m/t)^t) + O(m^(3/4)) bits, to support the operations in O(t) time. Here, we propose a parametric data structure exhibiting a time/space tradeoff such that, for any real constants 0 < δ ≤ 1/2, 0 < ε ≤ 1, and integer s > 0, it uses
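The rank/select operations and the predecessor identity select1(S, rank1(S, y − 1)) can be checked against a plain, uncompressed implementation. The sketch below uses linear scans; a real fid answers both queries in O(1) time within B(n, m) bits plus the redundancy the paper lowers:

```python
class BitVector:
    """Uncompressed rank/select over a 0/1 string, with 1-based positions
    as in the abstract. For checking the definitions only: both operations
    scan in O(m) time here instead of O(1)."""

    def __init__(self, bits):
        self.bits = bits  # e.g. '0101001'

    def rank1(self, i):
        """Number of 1s in the prefix S[1..i]."""
        return self.bits[:i].count('1')

    def select1(self, i):
        """Position of the i-th 1 (requires i >= 1)."""
        if i < 1:
            raise ValueError("select1 expects i >= 1")
        seen = 0
        for pos, b in enumerate(self.bits, 1):
            seen += b == '1'
            if seen == i:
                return pos
        raise ValueError("fewer than i ones in the bitvector")

    def predecessor(self, y):
        """Largest element of X (positions of 1s) strictly below y,
        via the identity select1(S, rank1(S, y - 1)) from the abstract."""
        return self.select1(self.rank1(y - 1))
```

For S = 0101001, X = {2, 4, 7}, and the predecessor of 5 is select1(rank1(4)) = select1(2) = 4.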
Word-Based Self-Indexes for Natural Language Text
"... The inverted index supports efficient fulltext searches on natural language text collections. It requires some extra space over the compressed text that can be traded for search speed. It is usually fast for singleword searches, yet phrase searches require more expensive intersections. In this art ..."
Abstract

Cited by 5 (3 self)
The inverted index supports efficient full-text searches on natural language text collections. It requires some extra space over the compressed text that can be traded for search speed. It is usually fast for single-word searches, yet phrase searches require more expensive intersections. In this article we introduce a different kind of index. It replaces the text using essentially the same space required by the compressed text alone (compression ratio around 35%). Within this space it supports not only decompression of arbitrary passages, but efficient word and phrase searches. Searches are orders of magnitude faster than those over inverted indexes when looking for phrases, and still faster on single-word searches when little space is available. Our new indexes are particularly fast at counting the occurrences of words or phrases. This is useful for computing relevance of words or phrases. We adapt self-indexes that succeeded in indexing arbitrary strings within compressed space to deal with large alphabets. Natural language texts are then regarded as sequences of words, not characters, to achieve word-based self-indexes. We design an architecture that separates the searchable sequence from its presentation aspects. This permits applying case folding, stemming, removing stopwords, etc. as is usual on inverted indexes.
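The first step of a word-based self-index as described, mapping the text to a sequence over a large word alphabet while a presentation layer handles case folding and stopwords, can be sketched as follows (whitespace tokenization is an assumption here; real systems use more careful tokenizers):

```python
def wordize(text, stopwords=frozenset()):
    """Map text to a sequence of integer word ids over a word alphabet.
    Case folding and stopword removal happen in this presentation layer,
    so the searchable sequence that a self-index is built on stays clean."""
    vocab = {}
    seq = []
    for w in text.lower().split():
        if w in stopwords:
            continue
        seq.append(vocab.setdefault(w, len(vocab)))
    return seq, vocab
```

A character-alphabet self-index built over the returned id sequence then answers word and phrase queries directly.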