Results 1  10
of
11
Colored Range Queries and Document Retrieval
"... Colored range queries are a wellstudied topic in computational geometry and database research that, in the past decade, have found exciting applications in information retrieval. In this paper we give improved time and space bounds for three important onedimensional colored range queries — colore ..."
Abstract

Cited by 16 (9 self)
 Add to MetaCart
Colored range queries are a wellstudied topic in computational geometry and database research that, in the past decade, have found exciting applications in information retrieval. In this paper we give improved time and space bounds for three important onedimensional colored range queries — colored range listing, colored range topk queries and colored range counting — and, thus, new bounds for various document retrieval problems on general collections of sequences. Specifically, we first describe a framework including almost all recent results on colored range listing and document listing, which suggests new combinations of data structures for these problems. For example, we give the fastest compressed data structures for colored range listing and document listing, and an efficient data structure for document listing whose size is bounded in terms of the highorder entropies of the library of documents. We then show how (approximate) colored topk queries can be reduced to (approximate) rangemode queries on subsequences, yielding the first efficient data structure for this problem. Finally, we show how a modified wavelet tree can support colored range counting in logarithmic time and space that is succinct whenever the number of colors is superpolylogarithmic in the length of the sequence.
SelfIndexed GrammarBased Compression
, 2001
"... Selfindexes aim at representing text collections in a compressed format that allows extracting arbitrary portions and also offers indexed searching on the collection. Current selfindexes are unable of fully exploiting the redundancy of highly repetitive text collections that arise in several appl ..."
Abstract

Cited by 5 (3 self)
 Add to MetaCart
Selfindexes aim at representing text collections in a compressed format that allows extracting arbitrary portions and also offers indexed searching on the collection. Current selfindexes are unable of fully exploiting the redundancy of highly repetitive text collections that arise in several applications. Grammarbased compression is well suited to exploit such repetitiveness. We introduce the first grammarbased selfindex. It builds on StraightLine Programs (SLPs), a rather general kind of contextfree grammars. If an SLP of n rules represents a text T [1, u], then an SLPcompressed representation of T requires 2n log 2 n bits. For that same SLP, our selfindex takes O(n log n) + n log 2 u bits. It extracts any text substring of length m in time O((m + h) log n), and finds occ occurrences of a pattern string of length m in time O((m(m + h) + h occ) log n), where h is the height of the parse tree of the SLP. No previous grammar representation had achieved o(n) search time. As byproducts we introduce (i) a representation of SLPs that takes 2n log 2 n(1 + o(1)) bits and efficiently supports more operations than a plain array of rules; (ii) a representation for binary relations with labels supporting various extended queries; (iii) a generalization of our selfindex to grammar
On Compressing and Indexing Repetitive Sequences
, 2011
"... We introduce LZEnd, a new member of the LempelZiv family of text compressors, which achieves compression ratios close to those of LZ77 but performs much faster at extracting arbitrary text substrings. We then build the first selfindex based on LZ77 (or LZEnd) compression, which in addition to te ..."
Abstract

Cited by 4 (1 self)
 Add to MetaCart
We introduce LZEnd, a new member of the LempelZiv family of text compressors, which achieves compression ratios close to those of LZ77 but performs much faster at extracting arbitrary text substrings. We then build the first selfindex based on LZ77 (or LZEnd) compression, which in addition to text extraction offers fast indexed searches on the compressed text. This selfindex is particularly effective to represent highly repetitive sequence collections, which arise for example when storing versioned documents, software repositories, periodic publications, and biological sequence databases.
Improved grammarbased compressed indexes
 In Proc. 19th SPIRE, LNCS 7608
, 2012
"... Abstract. We introduce the first grammarcompressed representation of a sequence that supports searches in time that depends only logarithmically on the size of the grammar. Given a text T [1..u] that is represented by a (contextfree) grammar of n (terminal and nonterminal) symbols and size N (meas ..."
Abstract

Cited by 3 (2 self)
 Add to MetaCart
Abstract. We introduce the first grammarcompressed representation of a sequence that supports searches in time that depends only logarithmically on the size of the grammar. Given a text T [1..u] that is represented by a (contextfree) grammar of n (terminal and nonterminal) symbols and size N (measured as the sum of the lengths of the right hands of the rules), a basic grammarbased representation of T takes N lg n bits of space. Our representation requires 2N lg n + N lg u + ɛ n lg n + o(N lg n) bits of space, for any 0 < ɛ ≤ 1. It can find the positions of the occ occurrences of a pattern of length m in T in O (m 2 /ɛ) lg lg u lg n + (m + occ) lg n time, and extract any substring of length ℓ of T in time O(ℓ + h lg(N/h)), where h is the height of the grammar tree.
Indexing Highly Repetitive Collections
"... Abstract. The need to index and search huge highly repetitive sequence collections is rapidly arising in various fields, including computational biology, software repositories, versioned collections, and others. In this short survey we briefly describe the progress made along three research lines to ..."
Abstract

Cited by 2 (0 self)
 Add to MetaCart
Abstract. The need to index and search huge highly repetitive sequence collections is rapidly arising in various fields, including computational biology, software repositories, versioned collections, and others. In this short survey we briefly describe the progress made along three research lines to address the problem: compressed suffix arrays, grammar compressed indexes, and LempelZiv compressed indexes. 1
Algorithms and Limits for Compact Plan Representations
"... Compact representations of objects is a common concept in computer science. Automated planning can be viewed as a case of this concept: a planning instance is a compact implicit representation of a graph and the problem is to find a path (a plan) in this graph. While the graphs themselves are repres ..."
Abstract

Cited by 1 (1 self)
 Add to MetaCart
Compact representations of objects is a common concept in computer science. Automated planning can be viewed as a case of this concept: a planning instance is a compact implicit representation of a graph and the problem is to find a path (a plan) in this graph. While the graphs themselves are represented compactly as planning instances, the paths are usually represented explicitly as sequences of actions. Some cases are known where the plans always have compact representations, for example, using macros. We show that these results do not extend to the general case, by proving a number of bounds for compact representations of plans under various criteria, like efficient sequential or random access of actions. In addition to this, we show that our results have consequences for what can be gained from reformulating planning into some other problem. As a contrast to this we also prove a number of positive results, demonstrating restricted cases where plans do have useful compact representations, as well as proving that macro plans have favourable access properties. Our results are finally discussed in relation to other relevant contexts. 1.
XML Compression via DAGs
, 2013
"... Unranked trees can be represented using their minimal dag (directed acyclic graph). For XML this achieves high compression ratios due to their repetitive mark up. Unranked trees are often represented through first child/next sibling (fcns) encoded binary trees. We study the difference in size ( = nu ..."
Abstract

Cited by 1 (0 self)
 Add to MetaCart
Unranked trees can be represented using their minimal dag (directed acyclic graph). For XML this achieves high compression ratios due to their repetitive mark up. Unranked trees are often represented through first child/next sibling (fcns) encoded binary trees. We study the difference in size ( = number of edges) of minimal dag versus minimal dag of the fcns encoded binary tree. One main finding is that the size of the dag of the binary tree can never be smaller than the square root of the size of the minimal dag, and that there are examples that match this bound. We introduce a new combined structure, the hybrid dag, which is guaranteed to be smaller than (or equal in size to) both dags. Interestingly, we find through experiments that last child/previous sibling encodings are much better for XML compression via dags, than fcns encodings. This is because optional elements are more likely to appear towards the end of child sequences.
Sorting a Compressed List
, 2012
"... We consider the task of sorting and performing kth order statistics on a list that is stored in compressed form. The most common approach to this problem is to first decompress the array (usually in linear time), and then apply standard algorithmic tools. This approach, however, ignores the rich inf ..."
Abstract
 Add to MetaCart
We consider the task of sorting and performing kth order statistics on a list that is stored in compressed form. The most common approach to this problem is to first decompress the array (usually in linear time), and then apply standard algorithmic tools. This approach, however, ignores the rich information about the input that is implicit in the compressed form. In particular, exploiting this information from the compression may eliminate the need to decompress, and may also enable algorithmic improvements that provide substantial speedups. We thus suggest a more rigorous study of what we call compressionaware algorithms. Already the stringmatching community has applied this idea to developing surprisingly efficient pattern matching and edit distance algorithms on compressed strings. In this paper, we begin to study the problem of sorting on compressed lists. Given an LZ77 representation of size C that decompresses to an array of length n, our algorithm can output an LZ77compressed representation of the sorted dataset in O(C + Σ  log Σ  + n) time, with Σ as the alphabet. Secondly, we consider a compression scheme in which an ninteger array is represented as the union of C arithmetic sequences. Using priority queues, we can sort the array in O(n log C) time. Lastly, given an array compressed with a context free grammar of size C we can find the sorted array in O(C · Σ), where Σ is the alphabet of the string. Additionally we present algorithms for indexing an LZ77 compressed string in O(C), and 1.1
Macros, Reactive Plans and Compact Representations
"... Abstract. The use and study of compact representations of objects is widespread in computer science. AI planning can be viewed as the problem of finding a path in a graph that is implicitly described by a compact representation in a planning language. However, compact representations of the path its ..."
Abstract
 Add to MetaCart
Abstract. The use and study of compact representations of objects is widespread in computer science. AI planning can be viewed as the problem of finding a path in a graph that is implicitly described by a compact representation in a planning language. However, compact representations of the path itself (the plan) have not received much attention in the literature. Although both macro plans and reactive plans can be considered as such compact representations, little emphasis has been placed on this aspect in earlier work. There are also compact plan representations that are defined by their access properties, for instance, that they have efficient random access or efficient sequential access. We formally compare two such concepts with macro plans and reactive plans, viewed as compact representations, and provide a complete map of the relationships between them. 1