Colored Range Queries and Document Retrieval
Cited by 31 (18 self)
Colored range queries are a well-studied topic in computational geometry and database research that, in the past decade, have found exciting applications in information retrieval. In this paper we give improved time and space bounds for three important one-dimensional colored range queries (colored range listing, colored range top-k queries, and colored range counting) and, thus, new bounds for various document retrieval problems on general collections of sequences. Specifically, we first describe a framework including almost all recent results on colored range listing and document listing, which suggests new combinations of data structures for these problems. For example, we give the fastest compressed data structures for colored range listing and document listing, and an efficient data structure for document listing whose size is bounded in terms of the high-order entropies of the library of documents. We then show how (approximate) colored top-k queries can be reduced to (approximate) range-mode queries on subsequences, yielding the first efficient data structure for this problem. Finally, we show how a modified wavelet tree can support colored range counting in logarithmic time and space that is succinct whenever the number of colors is superpolylogarithmic in the length of the sequence.
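To make the query definitions concrete, the following is a minimal Python sketch of colored range listing and counting. It relies on the classical previous-occurrence idea (a position holds the first occurrence of its color inside the range exactly when the previous occurrence of that color lies before the range start). This is an illustrative assumption, not the compressed data structures of the paper, and the function names are hypothetical. The sketch scans the range linearly; practical structures replace the scan with range-minimum queries or a wavelet tree.

```python
def build_prev(colors):
    """prev[i] = index of the previous occurrence of colors[i], or -1 if none."""
    last_seen = {}
    prev = []
    for i, color in enumerate(colors):
        prev.append(last_seen.get(color, -1))
        last_seen[color] = i
    return prev


def colored_range_listing(colors, prev, lo, hi):
    """List each distinct color in colors[lo..hi] (inclusive) exactly once."""
    # A position is the first occurrence of its color within the range
    # exactly when its previous occurrence lies before lo.
    return [colors[i] for i in range(lo, hi + 1) if prev[i] < lo]


def colored_range_counting(prev, lo, hi):
    """Count the distinct colors in the range [lo..hi] (inclusive)."""
    return sum(1 for i in range(lo, hi + 1) if prev[i] < lo)


colors = list("abacbca")
prev = build_prev(colors)
print(colored_range_listing(colors, prev, 2, 5))  # ['a', 'c', 'b']
print(colored_range_counting(prev, 2, 5))         # 3
```

Document listing reduces to the same query when `colors[i]` is taken to be the document that contains the i-th suffix in suffix-array order.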
On Compressing and Indexing Repetitive Sequences
2011
Cited by 26 (6 self)
We introduce LZ-End, a new member of the Lempel-Ziv family of text compressors, which achieves compression ratios close to those of LZ77 but performs much faster at extracting arbitrary text substrings. We then build the first self-index based on LZ77 (or LZ-End) compression, which in addition to text extraction offers fast indexed searches on the compressed text. This self-index is particularly effective for representing highly repetitive sequence collections, which arise, for example, when storing versioned documents, software repositories, periodic publications, and biological sequence databases.
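As background for the Lempel-Ziv parsing this entry builds on, here is a minimal Python sketch of a greedy LZ77-style parse and its decoder. It illustrates plain LZ77 only, not the LZ-End variant or the self-index. The quadratic-time search and the function names are assumptions chosen for brevity.

```python
def lz77_parse(text):
    """Greedy LZ77 parse: each phrase is (source, length, next_char)."""
    phrases = []
    n = len(text)
    i = 0
    while i < n:
        best_len, best_src = 0, -1
        for src in range(i):
            length = 0
            # Keep one character in reserve so every phrase ends in a literal.
            while i + length < n - 1 and text[src + length] == text[i + length]:
                length += 1
            if length > best_len:
                best_len, best_src = length, src
        phrases.append((best_src, best_len, text[i + best_len]))
        i += best_len + 1
    return phrases


def lz77_decode(phrases):
    """Rebuild the text by copying each phrase from its earlier source."""
    out = []
    for src, length, next_char in phrases:
        for k in range(length):
            out.append(out[src + k])  # the copy may overlap the phrase itself
        out.append(next_char)
    return "".join(out)


print(len(lz77_parse("abababab")))  # 3 phrases for a repetitive text
print(len(lz77_parse("abcdefgh")))  # 8 phrases when nothing repeats
```

The number of phrases is what shrinks on highly repetitive collections, which is why indexes of this family measure their space in terms of the parse size rather than the text length.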
Self-Indexed Grammar-Based Compression
2001
Cited by 21 (7 self)
Self-indexes aim at representing text collections in a compressed format that allows extracting arbitrary portions and also offers indexed searching on the collection. Current self-indexes are unable to fully exploit the redundancy of highly repetitive text collections that arise in several applications. Grammar-based compression is well suited to exploit such repetitiveness. We introduce the first grammar-based self-index. It builds on Straight-Line Programs (SLPs), a rather general kind of context-free grammar. If an SLP of n rules represents a text T[1, u], then an SLP-compressed representation of T requires 2n log_2 n bits. For that same SLP, our self-index takes O(n log n) + n log_2 u bits. It extracts any text substring of length m in time O((m + h) log n), and finds occ occurrences of a pattern string of length m in time O((m(m + h) + h occ) log n), where h is the height of the parse tree of the SLP. No previous grammar representation had achieved o(n) search time. As byproducts we introduce (i) a representation of SLPs that takes 2n log_2 n (1 + o(1)) bits and efficiently supports more operations than a plain array of rules; (ii) a representation for binary relations with labels supporting various extended queries; (iii) a generalization of our self-index to grammar
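The role of the height h in the extraction bound can be seen in a minimal Python sketch of substring extraction from an SLP. The representation is an assumption made for illustration (a plain list in which each rule is either a single character or a pair of earlier rule indices), not the compressed representation of the paper, and the function names are hypothetical. The recursion touches about m + h nodes of the parse tree; the additional log n factor in the stated bound comes from the paper's compressed rule representation, which this sketch does not model.

```python
def slp_lengths(rules):
    """Expansion length of every rule; pair rules refer to earlier rules only."""
    lengths = []
    for rule in rules:
        if isinstance(rule, str):
            lengths.append(1)
        else:
            left, right = rule
            lengths.append(lengths[left] + lengths[right])
    return lengths


def slp_extract(rules, lengths, symbol, start, end, out):
    """Append positions [start, end) of the expansion of `symbol` to `out`."""
    if start >= end:
        return  # this subtree does not intersect the requested range
    rule = rules[symbol]
    if isinstance(rule, str):
        out.append(rule)
        return
    left, right = rule
    mid = lengths[left]
    slp_extract(rules, lengths, left, start, min(end, mid), out)
    slp_extract(rules, lengths, right, max(start, mid) - mid, end - mid, out)


# Hypothetical Fibonacci-like SLP whose last rule expands to "abaababa".
rules = ["b", "a", (1, 0), (2, 1), (3, 2), (4, 3)]
lengths = slp_lengths(rules)
piece = []
slp_extract(rules, lengths, 5, 2, 6, piece)
print("".join(piece))  # aaba
```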
Improved grammar-based compressed indexes
In Proc. 19th SPIRE, LNCS 7608, 2012
Cited by 14 (6 self)
Abstract. We introduce the first grammar-compressed representation of a sequence that supports searches in time that depends only logarithmically on the size of the grammar. Given a text T[1..u] that is represented by a (context-free) grammar of n (terminal and nonterminal) symbols and size N (measured as the sum of the lengths of the right-hand sides of the rules), a basic grammar-based representation of T takes N lg n bits of space. Our representation requires 2N lg n + N lg u + ε n lg n + o(N lg n) bits of space, for any 0 < ε ≤ 1. It can find the positions of the occ occurrences of a pattern of length m in T in O((m^2/ε) lg(lg u / lg n) + (m + occ) lg n) time, and extract any substring of length ℓ of T in time O(ℓ + h lg(N/h)), where h is the height of the grammar tree.
Faster approximate pattern matching in compressed repetitive texts
In Proceedings of the 22nd International Symposium on Algorithms and Computation (ISAAC), 2011
Cited by 13 (5 self)
Abstract. Motivated by the imminent growth of massive, highly redundant genomic databases, we study the problem of compressing a string database while simultaneously supporting fast random access, substring extraction and pattern matching to the underlying string(s). Bille et al. (2011) recently showed how, given a straight-line program with r rules for a string s of length n, we can build an O(r)-word data structure that allows us to extract any substring s[i..j] in O(log n + j − i) time. They also showed how, given a pattern p of length m and an edit distance k ≤ m, their data structure supports finding all occ approximate matches to p in s in O(r(min(mk, k^4 + m) + log n) + occ) time. Rytter (2003) and Charikar et al. (2005) showed that r is always at least the number z of phrases in the LZ77 parse of s, and gave algorithms for building straight-line programs with O(z log n) rules. In this paper we give a simple O(z log n)-word data structure that takes the same time for substring extraction but only O(z(min(mk, k^4 + m)) + occ) time for approximate pattern matching.
Indexing Highly Repetitive Collections
Cited by 9 (3 self)
Abstract. The need to index and search huge highly repetitive sequence collections is rapidly arising in various fields, including computational biology, software repositories, versioned collections, and others. In this short survey we briefly describe the progress made along three research lines to address the problem: compressed suffix arrays, grammar-compressed indexes, and Lempel-Ziv compressed indexes.
Grammar Compressed Sequences with Rank/Select Support
Cited by 5 (3 self)
Abstract. Sequence representations supporting not only direct access to their symbols, but also rank/select operations, are a fundamental building block in many compressed data structures. In several recent applications, the need to represent highly repetitive sequences arises, where statistical compression is ineffective. We introduce grammar-based representations for repetitive sequences, which use up to 10% of the space needed by representations based on statistical compression, and support direct access and rank/select operations within tens of microseconds.
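To illustrate what a rank query over a grammar-represented sequence involves, here is a minimal Python sketch. It assumes a toy grammar format (a list in which each rule is a single character or a pair of earlier rule indices) and stores a full symbol counter per rule. Both are simplifying assumptions for illustration, not the representation of the paper, which must avoid exactly this kind of per-rule overhead.

```python
from collections import Counter


def build_tables(rules):
    """Per-rule expansion lengths and per-rule symbol counts."""
    lengths, counts = [], []
    for rule in rules:
        if isinstance(rule, str):
            lengths.append(1)
            counts.append(Counter(rule))
        else:
            left, right = rule
            lengths.append(lengths[left] + lengths[right])
            counts.append(counts[left] + counts[right])
    return lengths, counts


def rank(rules, lengths, counts, symbol, c, i):
    """Occurrences of c among the first i characters of the expansion of `symbol`."""
    total = 0
    while i > 0:
        rule = rules[symbol]
        if isinstance(rule, str):
            return total + (1 if rule == c else 0)
        left, right = rule
        if i >= lengths[left]:
            # The whole left child lies inside the prefix: use its stored count.
            total += counts[left][c]
            i -= lengths[left]
            symbol = right
        else:
            symbol = left
    return total


# Hypothetical grammar whose last rule expands to "abaababa".
rules = ["b", "a", (1, 0), (2, 1), (3, 2), (4, 3)]
lengths, counts = build_tables(rules)
print(rank(rules, lengths, counts, 5, "a", 4))  # 3
```

Each query follows a single root-to-leaf path, so its cost is proportional to the height of the grammar rather than to the length of the sequence.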
XML Compression via DAGs
2013
Cited by 4 (2 self)
Unranked trees can be represented using their minimal dag (directed acyclic graph). For XML this achieves high compression ratios due to its repetitive markup. Unranked trees are often represented through first child/next sibling (fcns) encoded binary trees. We study the difference in size (= number of edges) between the minimal dag and the minimal dag of the fcns-encoded binary tree. One main finding is that the size of the dag of the binary tree can never be smaller than the square root of the size of the minimal dag, and that there are examples that match this bound. We introduce a new combined structure, the hybrid dag, which is guaranteed to be smaller than (or equal in size to) both dags. Interestingly, we find through experiments that last child/previous sibling encodings are much better for XML compression via dags than fcns encodings. This is because optional elements are more likely to appear towards the end of child sequences.
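The two sizes being compared can be computed with a minimal Python sketch. It assumes trees are given as nested tuples `(label, child, child, ...)`, shares identical subtrees by hashing them, and counts edges with multiplicity. The tuple format and function names are assumptions for illustration; the hybrid dag and the last child/previous sibling encoding are not covered.

```python
def minimal_dag_edges(tree):
    """Number of edges in the minimal dag of a tree (label, child, ...).

    A child may be None, which marks an absent child in a binary encoding.
    """
    seen = {}

    def visit(node):
        if node is None:
            return None
        key = (node[0],) + tuple(visit(child) for child in node[1:])
        if key not in seen:
            seen[key] = len(seen)  # identical subtrees share one dag node
        return seen[key]

    visit(tree)
    return sum(sum(child is not None for child in key[1:]) for key in seen)


def fcns(forest):
    """Encode a sequence of sibling trees as (label, first_child, next_sibling)."""
    if not forest:
        return None
    first, rest = forest[0], forest[1:]
    return (first[0], fcns(first[1:]), fcns(rest))


item = ("item", ("a",), ("b",))
doc = ("root", item, item, item)
print(minimal_dag_edges(doc))           # 5
print(minimal_dag_edges(fcns((doc,))))  # 7
```

In this small example the fcns encoding shares less, because each `item` node in the binary tree also encodes how many siblings follow it.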
Grammar-Based Compression in a Streaming Model
LATA 2010. LNCS, 2010
Cited by 4 (2 self)
We show that, given a string s of length n, with constant memory and logarithmic passes over a constant number of streams we can build a context-free grammar that generates s and only s and whose size is within an O(min(g log g, sqrt(n / log n)))-factor of the minimum g. This stands in contrast to our previous result that, with polylogarithmic memory and polylogarithmic passes over a single stream, we cannot build such a grammar whose size is within any polynomial of g.