Results 1 - 9 of 9
Application of Lempel-Ziv factorization to the approximation of grammar-based compression
, 2003
Cited by 36 (1 self)
We introduce a new type of context-free grammars, AVL-grammars, and show their applicability to grammar-based compression. Using this type of grammars we present an O(n log |Σ|) time and O(log n)-ratio approximation of minimal grammar-based compression of a given string of length n over an alphabet Σ, and an O(k log n) time transformation of an LZ77 encoding of size k into a grammar-based encoding of size O(k log n).
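To make "LZ77 encoding of size k" concrete, here is a minimal sketch of a greedy LZ77-style factorization, where k is the number of factors. It is a naive quadratic-time illustration, not the algorithm from the paper; the function names `lz77_factorize` and `lz77_decode` are hypothetical, and the sketch assumes the self-referential variant in which a factor may overlap its own source (definitions of LZ77 vary on this point).

```python
def lz77_factorize(s):
    """Greedy LZ77-style parse (assumed variant: overlapping copies allowed).

    Each factor is either a single literal character or a pair
    (source_position, length) that copies from earlier in the text.
    """
    factors = []
    i, n = 0, len(s)
    while i < n:
        best_len, best_src = 0, -1
        # Naive search: try every earlier starting position j < i.
        for j in range(i):
            length = 0
            while i + length < n and s[j + length] == s[i + length]:
                length += 1
            if length > best_len:
                best_len, best_src = length, j
        if best_len == 0:
            factors.append(s[i])  # character not seen before
            i += 1
        else:
            factors.append((best_src, best_len))
            i += best_len
    return factors


def lz77_decode(factors):
    """Invert lz77_factorize by replaying the copies character by character."""
    out = []
    for factor in factors:
        if isinstance(factor, str):
            out.append(factor)
        else:
            src, length = factor
            for k in range(length):
                out.append(out[src + k])  # may read characters just written
    return "".join(out)
```

For the string "abababab" this parse has three factors: two literals and one overlapping copy of length 6.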
The Smallest Grammar Problem
- IEEE TRANSACTIONS ON INFORMATION THEORY
, 2005
Cited by 17 (0 self)
This paper addresses the smallest grammar problem: What is the smallest context-free grammar that generates exactly one given string σ? This is a natural question about a fundamental object connected to many fields, including data compression, Kolmogorov complexity, pattern identification, and addition chains. Due to the problem's inherent complexity, our objective is to find an approximation algorithm which finds a small grammar for the input string. We focus attention on the approximation ratio of the algorithm (and implicitly, worst-case behavior) to establish provable performance guarantees and to address shortcomings in the classical measure of redundancy in the literature. Our first results are a variety of hardness results, most notably that every efficient algorithm for the smallest grammar problem has approximation ratio at least 8569/8568 unless P = NP. We then bound approximation ratios for several of the best-known grammar-based compression algorithms, including LZ78, BISECTION, SEQUENTIAL, LONGEST MATCH, GREEDY, and RE-PAIR. Among these, the best upper bound we show is O(n^(1/2)). We finish by presenting two novel algorithms with exponentially better ratios of O(log^3 n) and O(log(n/m*)), where m* is the size of the smallest grammar for that input. The latter highlights a connection between grammar-based compression and LZ77.
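As an illustration of what a grammar-based compressor produces, the following is a minimal, unoptimized sketch in the spirit of RE-PAIR: it repeatedly replaces a most frequent adjacent pair with a fresh nonterminal. The names `repair`, `expand`, and `grammar_size` are hypothetical, pair counts here include overlapping occurrences (a simplification), and grammar size is assumed to be the total length of all right-hand sides.

```python
from collections import Counter


def repair(text):
    """Naive Re-Pair-style sketch. Terminals are str, nonterminals are int.

    Returns (start_sequence, rules) where rules maps nonterminal -> (left, right).
    """
    seq = list(text)
    rules = {}
    while len(seq) >= 2:
        counts = Counter(zip(seq, seq[1:]))
        pair, freq = max(counts.items(), key=lambda kv: kv[1])
        if freq < 2:
            break  # no pair repeats, so nothing more to gain
        nonterminal = len(rules)
        rules[nonterminal] = pair
        # Replace occurrences of the pair left to right, without overlap.
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(nonterminal)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules


def expand(symbol, rules):
    """Expand one symbol back into the string it derives."""
    if isinstance(symbol, str):
        return symbol
    left, right = rules[symbol]
    return expand(left, rules) + expand(right, rules)


def grammar_size(seq, rules):
    """Assumed size measure: total number of symbols on all right-hand sides."""
    return len(seq) + 2 * len(rules)
```

On "abababab" the sketch produces two rules and a start sequence of two symbols, for a size of 6 against an input length of 8.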
Approximating the Smallest Grammar: Kolmogorov Complexity in Natural Models
, 2002
Cited by 13 (0 self)
We consider the problem of finding the smallest context-free grammar that generates exactly one given string of length n. The size of this grammar is of theoretical interest as an efficiently computable variant of Kolmogorov complexity. The problem is of practical importance in areas such as data compression and pattern extraction. The smallest grammar...
Random Access to Grammar-Compressed Strings
, 2011
Cited by 6 (0 self)
Let S be a string of length N compressed into a context-free grammar S of size n. We present two representations of S achieving O(log N) random access time, and either O(n · α_k(n)) construction time and space on the pointer machine model, or O(n) construction time and space on the RAM. Here, α_k(n) is the inverse of the k-th row of Ackermann's function. Our representations also efficiently support decompression of any substring in S: we can decompress any substring of length m in the same complexity as a single random access query and additional O(m) time. Combining these results with fast algorithms for uncompressed approximate string matching leads to several efficient algorithms for approximate string matching on grammar-compressed strings without decompression. For instance, we can find all approximate occurrences of a pattern P with at most k errors in time O(n(min{|P|k, k^4 + |P|} + log N) + occ), where occ is the number of occurrences of P in S. Finally, we are able to generalize our results to navigation and other operations on grammar-compressed trees. All of the above bounds significantly improve the currently best known results. To achieve these bounds, we introduce several new techniques and data structures of independent interest, including a predecessor data structure, two "biased" weighted ancestor data structures, and a compact representation of heavy-paths in grammars.
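For contrast with the O(log N) bound above, here is a sketch of the simple baseline for random access in a grammar-compressed string: store the expansion length of every nonterminal and walk down from the start symbol. This takes time proportional to the grammar's height, which can be far worse than O(log N), and it is not the data structure from the paper. The name `make_accessor` and the rule encoding (binary rules, str terminals, int nonterminals) are assumptions made for illustration.

```python
def make_accessor(rules, start):
    """Baseline random access on a straight-line grammar.

    rules maps each nonterminal (int) to a pair (left, right); terminals are
    single-character str. Returns (access, total_length).
    """
    lengths = {}

    def length(symbol):
        # Expansion length of a symbol, memoized per nonterminal.
        if isinstance(symbol, str):
            return 1
        if symbol not in lengths:
            left, right = rules[symbol]
            lengths[symbol] = length(left) + length(right)
        return lengths[symbol]

    total = length(start)

    def access(i):
        """Return character i (0-based) of the derived string."""
        if not 0 <= i < total:
            raise IndexError(i)
        symbol = start
        while not isinstance(symbol, str):
            left, right = rules[symbol]
            left_len = length(left)
            if i < left_len:
                symbol = left       # position falls in the left child
            else:
                i -= left_len       # skip the left child's expansion
                symbol = right
        return symbol

    return access, total
```

With the rules `{0: ("a", "b"), 1: (0, 0), 2: (1, 1)}` and start symbol 2, the grammar derives "abababab" and `access(i)` returns its i-th character without expanding the whole string.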
Efficient Discovery of Loop Nests in Communication Traces of Parallel Programs
, 2008
Cited by 2 (2 self)
Execution and communication traces are central to performance modeling and analysis of parallel applications. Since the traces can be very long, meaningful compression and extraction of representative behavior is important. Commonly used compression procedures identify repeating patterns in sections of the input string and replace each instance with a representative symbol. This can prevent the identification of long repeating sequences corresponding to outer loops in a trace. This paper introduces and analyzes a framework for identifying the maximal loop nest from a trace based on Crochemore's algorithm. The paper also introduces a greedy algorithm for fast "near optimal" loop nest discovery with well-defined bounds. Results of compressing MPI communication traces of NAS parallel benchmarks show that both algorithms identified the basic loop structures correctly. The greedy algorithm was also very efficient with an average
Self-Indexed Grammar-Based Compression
, 2001
Cited by 2 (2 self)
Self-indexes aim at representing text collections in a compressed format that allows extracting arbitrary portions and also offers indexed searching on the collection. Current self-indexes are unable to fully exploit the redundancy of highly repetitive text collections that arise in several applications. Grammar-based compression is well suited to exploit such repetitiveness. We introduce the first grammar-based self-index. It builds on Straight-Line Programs (SLPs), a rather general kind of context-free grammar. If an SLP of n rules represents a text T[1, u], then an SLP-compressed representation of T requires 2n log_2 n bits. For that same SLP, our self-index takes O(n log n) + n log_2 u bits. It extracts any text substring of length m in time O((m + h) log n), and finds occ occurrences of a pattern string of length m in time O((m(m + h) + h occ) log n), where h is the height of the parse tree of the SLP. No previous grammar representation had achieved o(n) search time. As byproducts we introduce (i) a representation of SLPs that takes 2n log_2 n (1 + o(1)) bits and efficiently supports more operations than a plain array of rules; (ii) a representation for binary relations with labels supporting various extended queries; (iii) a generalization of our self-index to grammar
Indexing Straight-Line Programs
, 2008
Straight-line programs offer powerful text compression by representing a text T[1, u] in terms of a context-free grammar of n rules, so that T can be recovered in O(u) time. However, the problem of operating on the grammar in compressed form has not been studied much. We present the first grammar representation capable of extracting text substrings, and of searching the text for patterns, in time o(n). Its size is of the same order as that of a plain SLP representation, and it can be of independent interest for other grammar-based problems. We also give some byproducts on representing binary relations.
1 Introduction and Related Work
Grammar-based compression is a well-known technique since at least the seventies [53, 50, 3, 28, 47], and still a very active area of research stimulated by the recent interest in XML compression [33, 22, 37]. The main idea is to replace a given text T[1, u] by a context-free grammar (CFG) from which T can be derived. In fact, two different approaches fall under the same name [28]. In the
On the Complexity of Optimal Grammar-Based Compression
, 2004
The task of grammar-based compression is to find a small context-free grammar generating exactly one given string. We investigate the relationship between grammar-based compression of strings over unbounded and bounded alphabets. Specifically, we show how to transform a grammar for a string over an unbounded alphabet into a grammar for a block coding of that string over a fixed bounded alphabet and vice versa. From these constructions, we obtain asymptotically tight relationships between the minimum grammar sizes for strings and their block codings. Finally, we exploit an improved bound of our construction for overlap-free block codings to show that a polynomial-time algorithm for approximating the minimum grammar for binary strings within a factor of c yields a polynomial-time algorithm for approximating the minimum grammar for strings over arbitrary alphabets within a factor of 24c + ε (for arbitrary ε > 0). Since the latter problem is known to be NP-hard to approximate within a factor of 8569/8568, we provide a first step towards solving the long-standing open question whether minimum grammar-based compression of binary strings is NP-complete.
Approximating the Smallest Grammar: . . .
We consider the problem of finding the smallest context-free grammar that generates exactly one given string of length n. The size of this grammar is of theoretical interest as an efficiently computable variant of Kolmogorov complexity. The problem is of practical importance in areas such as data compression and pattern extraction. The smallest grammar is known to be hard to approximate to within a constant factor, and an o(log n / log log n) approximation would require progress on a long-standing algebraic problem [10]. Previously, the best proved approximation ratio was O(n^(1/2)) for the Bisection algorithm [8]. Our main result is an exponential improvement of this ratio; we give an O(log(n/g*)) approximation algorithm, where g* is the size of the smallest grammar. We then consider other computable variants of Kolmogorov complexity. In particular we give an O(log² n) approximation for the smallest non-deterministic finite automaton with advice that produces a given string. We also apply our techniques to "advice-grammars" and "edit-grammars", two other natural models of string complexity.

