Results 1–10 of 16
Application of Lempel-Ziv factorization to the approximation of grammar-based compression
, 2003
Cited by 42 (1 self)
We introduce a new type of context-free grammars, AVL-grammars, and show their application to grammar-based compression. Using this type of grammars we present an O(n log n)-time and O(log n)-ratio approximation of minimal grammar-based compression of a given string of length n over an alphabet Σ, and an O(k log n)-time transformation of the LZ77 encoding of size k into a grammar-based encoding of size O(k log n).
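The LZ77 factorization this abstract builds on can be illustrated with a minimal sketch, here a simplified variant that disallows self-referential copies (the name `lz77_factorize` is hypothetical, not from the paper):

```python
def lz77_factorize(s):
    """Greedy LZ77-style factorization: each factor is either a fresh
    character or the longest prefix of the remainder that already
    occurs in the processed part of the string (naive O(n^2) scan)."""
    factors = []
    i = 0
    while i < len(s):
        best = 0
        for length in range(1, len(s) - i + 1):
            if s[i:i + length] in s[:i]:
                best = length
            else:
                break  # longer prefixes cannot match if this one fails
        if best == 0:
            factors.append(s[i])            # new character
            i += 1
        else:
            factors.append(s[i:i + best])   # copy of an earlier substring
            i += best
    return factors
```

The number of factors k is the quantity the paper's O(k log n) transformation is measured in; repetitive strings yield few, long factors.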
The Smallest Grammar Problem
 IEEE TRANSACTIONS ON INFORMATION THEORY
, 2005
Cited by 24 (0 self)
This paper addresses the smallest grammar problem: What is the smallest context-free grammar that generates exactly one given string σ? This is a natural question about a fundamental object connected to many fields, including data compression, Kolmogorov complexity, pattern identification, and addition chains. Due to the problem's inherent complexity, our objective is to find an approximation algorithm which finds a small grammar for the input string. We focus attention on the approximation ratio of the algorithm (and implicitly, worst-case behavior) to establish provable performance guarantees and to address shortcomings in the classical measure of redundancy in the literature. Our first results are a variety of hardness results, most notably that every efficient algorithm for the smallest grammar problem has approximation ratio at least 8569/8568 unless P = NP. We then bound approximation ratios for several of the best-known grammar-based compression algorithms, including LZ78, BISECTION, SEQUENTIAL, LONGEST MATCH, GREEDY, and RE-PAIR. Among these, the best upper bound we show is O(n^(1/2)). We finish by presenting two novel algorithms with exponentially better ratios of O(log^3 n) and O(log(n/m*)), where m* is the size of the smallest grammar for that input. The latter highlights a connection between grammar-based compression and LZ77.
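For intuition, a grammar that generates exactly one string (a straight-line grammar) and the size measure the approximation ratios refer to can be sketched as follows; the grammar and names here are illustrative, not the paper's:

```python
# A straight-line grammar: each nonterminal has exactly one rule and the
# rules are acyclic, so the grammar derives exactly one string. Grammar
# "size" = total number of symbols on the right-hand sides, which is the
# measure the approximation ratios in the abstract refer to.
grammar = {                       # hypothetical grammar for "abababab"
    "S": ["A", "A"],
    "A": ["B", "B"],
    "B": ["a", "b"],
}

def expand(sym, grammar):
    """Recursively expand a symbol to the unique string it derives."""
    if sym not in grammar:        # terminal symbol
        return sym
    return "".join(expand(x, grammar) for x in grammar[sym])

size = sum(len(rhs) for rhs in grammar.values())  # 6 symbols vs. 8 characters
```

Doubling the string's length adds only one rule (two symbols) to such a grammar, which is why grammar size can be exponentially smaller than the string.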
Approximating the Smallest Grammar: Kolmogorov Complexity in Natural Models
, 2002
Cited by 14 (0 self)
We consider the problem of finding the smallest context-free grammar that generates exactly one given string of length n. The size of this grammar is of theoretical interest as an efficiently computable variant of Kolmogorov complexity. The problem is of practical importance in areas such as data compression and pattern extraction. The smallest grammar...
Random Access to Grammar-Compressed Strings
, 2011
Cited by 10 (0 self)
Let S be a string of length N compressed into a context-free grammar G of size n. We present two representations of G achieving O(log N) random access time to S, and either O(n · α_k(n)) construction time and space on the pointer machine model, or O(n) construction time and space on the RAM. Here, α_k(n) is the inverse of the kth row of Ackermann's function. Our representations also efficiently support decompression of any substring of S: we can decompress any substring of length m in the same complexity as a single random access query plus additional O(m) time. Combining these results with fast algorithms for uncompressed approximate string matching leads to several efficient algorithms for approximate string matching on grammar-compressed strings without decompression. For instance, we can find all approximate occurrences of a pattern P with at most k errors in time O(n(min{|P|k, k^4 + |P|} + log N) + occ), where occ is the number of occurrences of P in S. Finally, we are able to generalize our results to navigation and other operations on grammar-compressed trees. All of the above bounds significantly improve the currently best known results. To achieve these bounds, we introduce several new techniques and data structures of independent interest, including a predecessor data structure, two "biased" weighted ancestor data structures, and a compact representation of heavy paths in grammars.
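The core idea behind random access without decompression, precomputing the expansion length of every nonterminal and walking a single root-to-leaf path of the parse tree, can be sketched as follows. This simplified version runs in time proportional to the parse-tree height; the paper's O(log N) bound requires heavy-path machinery on top of it, and the grammar below is illustrative:

```python
from functools import lru_cache

# Binary straight-line grammar (an SLP) deriving "abababab".
rules = {"S": ("A", "A"), "A": ("B", "B"), "B": ("a", "b")}

@lru_cache(maxsize=None)
def length(sym):
    """Expansion length of a symbol (memoized, so O(n) total work)."""
    if sym not in rules:
        return 1
    left, right = rules[sym]
    return length(left) + length(right)

def access(sym, i):
    """Return character i of the expansion of sym by descending one
    parse-tree path, choosing a child via the precomputed lengths."""
    while sym in rules:
        left, right = rules[sym]
        if i < length(left):
            sym = left
        else:
            i -= length(left)
            sym = right
    return sym
```

Each query touches one rule per parse-tree level, so a string of length N never has to be materialized.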
Self-Indexed Grammar-Based Compression
, 2001
Cited by 5 (3 self)
Self-indexes aim at representing text collections in a compressed format that allows extracting arbitrary portions and also offers indexed searching on the collection. Current self-indexes are unable to fully exploit the redundancy of highly repetitive text collections that arise in several applications. Grammar-based compression is well suited to exploit such repetitiveness. We introduce the first grammar-based self-index. It builds on Straight-Line Programs (SLPs), a rather general kind of context-free grammar. If an SLP of n rules represents a text T[1, u], then an SLP-compressed representation of T requires 2n log₂ n bits. For that same SLP, our self-index takes O(n log n) + n log₂ u bits. It extracts any text substring of length m in time O((m + h) log n), and finds the occ occurrences of a pattern string of length m in time O((m(m + h) + h·occ) log n), where h is the height of the parse tree of the SLP. No previous grammar representation had achieved o(n) search time. As by-products we introduce (i) a representation of SLPs that takes 2n log₂ n(1 + o(1)) bits and efficiently supports more operations than a plain array of rules; (ii) a representation for binary relations with labels supporting various extended queries; and (iii) a generalization of our self-index to grammar ...
Efficient Discovery of Loop Nests in Communication Traces of Parallel Programs
, 2008
Cited by 2 (2 self)
Execution and communication traces are central to performance modeling and analysis of parallel applications. Since the traces can be very long, meaningful compression and extraction of representative behavior is important. Commonly used compression procedures identify repeating patterns in sections of the input string and replace each instance with a representative symbol. This can prevent the identification of long repeating sequences corresponding to outer loops in a trace. This paper introduces and analyzes a framework for identifying the maximal loop nest from a trace based on Crochemore's algorithm. The paper also introduces a greedy algorithm for fast "near-optimal" loop nest discovery with well-defined bounds. Results of compressing MPI communication traces of NAS parallel benchmarks show that both algorithms identified the basic loop structures correctly. The greedy algorithm was also very efficient with an average ...
MicroRNA Target Detection and Analysis for Genes Related to Breast Cancer Using MDLcompress (doi:10.1155/2007/43670)
We describe initial results of miRNA sequence analysis with the optimal symbol compression ratio (OSCR) algorithm and recast this grammar inference algorithm as an improved minimum description length (MDL) learning tool: MDLcompress. We apply this tool to explore the relationship between miRNAs, single nucleotide polymorphisms (SNPs), and breast cancer. Our new algorithm outperforms other grammar-based coding methods, such as DNA Sequitur, while retaining a two-part code that highlights biologically significant phrases. The deep recursion of MDLcompress, together with its explicit two-part coding, enables it to identify biologically meaningful sequences without needlessly restrictive priors. The ability to quantify cost in bits for phrases in the MDL model allows prediction of regions where SNPs may have the most impact on biological activity. MDLcompress improves on our previous algorithm in execution time through an innovative data structure, and in specificity of motif detection (compression) through improved heuristics. An MDLcompress analysis of 144 overexpressed genes from the breast cancer cell line BT474 has identified novel motifs, including potential microRNA (miRNA) binding sites that are candidates for experimental validation. Copyright © 2007 General Electric Company. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Efficient Discovery of Loop Nests in Execution Traces
Abstract—Execution and communication traces are central to performance modeling and analysis. Since the traces can be very long, meaningful compression and extraction of representative behavior is important. Commonly used compression procedures identify repeating patterns in sections of the input string and replace each instance with a representative symbol. This can prevent the identification of long repeating sequences corresponding to outer loops in a trace. This paper introduces and analyzes a framework for identifying the maximal loop nest from a trace. The discovery of loop nests makes construction of compressed representative traces straightforward. The paper also introduces a greedy algorithm for fast "near-optimal" loop nest discovery with well-defined bounds. Results of compressing MPI communication traces of NAS parallel benchmarks show that both algorithms identified the basic loop structures correctly. The greedy algorithm was also very efficient, with an average processing time of 16.5 seconds for an average trace length of 71,695 MPI events. Index Terms—Trace compression, loop discovery, performance modeling
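The flavor of greedy loop discovery these abstracts describe can be sketched as a one-pass scan that, at each position, picks the repeated body covering the most events and emits a (body, count) pair. This is an illustrative toy with a single nesting level, not the papers' algorithms, which come with stated optimality bounds:

```python
def compress_trace(trace):
    """Greedy one-level loop discovery: at each position, try every
    candidate loop-body length, count how many consecutive repetitions
    of it follow, and keep the (body, count) covering the most events."""
    out, i, n = [], 0, len(trace)
    while i < n:
        best_len, best_cnt = 1, 1
        for body in range(1, (n - i) // 2 + 1):
            cnt = 1
            while trace[i + cnt*body : i + (cnt+1)*body] == trace[i : i + body]:
                cnt += 1
            if cnt * body > best_cnt * best_len:
                best_len, best_cnt = body, cnt
        out.append((trace[i : i + best_len], best_cnt))
        i += best_len * best_cnt
    return out
```

A trace like send, recv, send, recv, send, recv compresses to a single body repeated three times, which is exactly the outer-loop structure naive pattern replacement can destroy.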
Approximating the Smallest Grammar: ...
We consider the problem of finding the smallest context-free grammar that generates exactly one given string of length n. The size of this grammar is of theoretical interest as an efficiently computable variant of Kolmogorov complexity. The problem is of practical importance in areas such as data compression and pattern extraction. The smallest grammar is known to be hard to approximate to within a constant factor, and an o(log n / log log n) approximation would require progress on a long-standing algebraic problem [10]. Previously, the best proved approximation ratio was O(n^(1/2)) for the Bisection algorithm [8]. Our main result is an exponential improvement of this ratio; we give an O(log(n/g*)) approximation algorithm, where g* is the size of the smallest grammar. We then consider other computable variants of Kolmogorov complexity. In particular we give an O(log² n) approximation for the smallest nondeterministic finite automaton with advice that produces a given string. We also apply our techniques to "advice-grammars" and "edit-grammars", two other natural models of string complexity.
Approximation Algorithms for Grammar-Based Compression
Several recently proposed data compression algorithms are based on the idea of representing a string by a context-free grammar. Most of these algorithms are known to be asymptotically optimal with respect to a stationary ergodic source and to achieve a low redundancy rate. However, such results do not reveal how effectively these algorithms exploit the grammar model itself; that is, are the compressed strings produced as small as possible? We address this issue by analyzing the approximation ratio of several algorithms, that is, the maximum ratio between the size of the generated grammar and the smallest possible grammar over all inputs. On the negative side, we show that every polynomial-time grammar-compression algorithm has approximation ratio at least 8569/8568 unless P = NP. Moreover, achieving an approximation ratio of o(log n/log log n) would require progress on an algebraic problem in a well-studied area. We then upper and lower bound approximation ratios for the following four previously proposed grammar-based compression algorithms: SEQUENTIAL, BISECTION, GREEDY, and LZ78, each of which employs a distinct approach to compression. These results seem to indicate that there is much room to improve grammar-based compression algorithms.
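As one concrete instance of the greedy pairing style of grammar construction analyzed in this literature, a Re-Pair-like procedure repeatedly replaces the most frequent adjacent pair of symbols with a fresh nonterminal. This is a minimal sketch (the function name and nonterminal naming are hypothetical, and it ignores Re-Pair's linear-time priority-queue machinery):

```python
from collections import Counter

def repair(s):
    """Re-Pair-style greedy grammar construction: replace the most
    frequent adjacent pair with a new nonterminal until no pair repeats.
    Returns (final sequence, rules); the generated grammar's size, the
    numerator of the approximation ratio, is len(seq) + 2*len(rules)."""
    seq, rules, next_id = list(s), {}, 0
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, freq = pairs.most_common(1)[0]
        if freq < 2:                      # nothing repeats any more
            break
        nt = f"R{next_id}"; next_id += 1
        rules[nt] = pair
        out, i = [], 0
        while i < len(seq):               # left-to-right replacement pass
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(nt); i += 2
            else:
                out.append(seq[i]); i += 1
        seq = out
    return seq, rules
```

Comparing len(seq) + 2*len(rules) against the smallest grammar for the same string is exactly the per-input ratio whose worst case the paper bounds.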