Results 1-10 of 11
Application of Lempel-Ziv factorization to the approximation of grammar-based compression
, 2003
Cited by 42 (1 self)
We introduce a new type of context-free grammar, AVL-grammars, and show their applicability to grammar-based compression. Using this type of grammar we present an O(n log |Σ|) time and O(log n)-ratio approximation of minimal grammar-based compression of a given string of length n over an alphabet Σ, and an O(k log n) time transformation of an LZ77 encoding of size k into a grammar-based encoding of size O(k log n).
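To make "an LZ77 encoding of size k" concrete, here is a minimal sketch of a greedy LZ77-style factorization, where k is the number of factors. It is a naive illustration and not the AVL-grammar construction from the abstract; the function names and the choice to allow self-overlapping factors are assumptions made for this example.

```python
def lz77_factorize(s):
    """Greedy LZ77-style factorization (naive, for illustration only).

    Each factor is either a literal character or a (start, length) pair
    that points at an earlier occurrence. A factor may overlap the
    position it is copied to (assumption: self-referential variant).
    """
    factors = []
    n = len(s)
    i = 0
    while i < n:
        best_len, best_start = 0, -1
        # Try every earlier start position and keep the longest match.
        for j in range(i):
            length = 0
            while i + length < n and s[j + length] == s[i + length]:
                length += 1
            if length > best_len:
                best_len, best_start = length, j
        if best_len == 0:
            factors.append(s[i])          # new character: literal factor
            i += 1
        else:
            factors.append((best_start, best_len))
            i += best_len
    return factors


def lz77_expand(factors):
    """Rebuild the string, copying character by character so that
    overlapping references resolve correctly."""
    out = []
    for f in factors:
        if isinstance(f, str):
            out.append(f)
        else:
            start, length = f
            for k in range(length):
                out.append(out[start + k])
    return "".join(out)
```

For the string "abababab" this yields three factors (two literals and one copy of length 6), so k = 3 for a string of length n = 8.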
The Smallest Grammar Problem
 IEEE TRANSACTIONS ON INFORMATION THEORY
, 2005
Cited by 24 (0 self)
This paper addresses the smallest grammar problem: What is the smallest context-free grammar that generates exactly one given string σ? This is a natural question about a fundamental object connected to many fields, including data compression, Kolmogorov complexity, pattern identification, and addition chains. Due to the problem's inherent complexity, our objective is to find an approximation algorithm which finds a small grammar for the input string. We focus attention on the approximation ratio of the algorithm (and implicitly, worst-case behavior) to establish provable performance guarantees and to address shortcomings in the classical measure of redundancy in the literature. Our first results are a variety of hardness results, most notably that every efficient algorithm for the smallest grammar problem has approximation ratio at least 8569/8568 unless P = NP. We then bound approximation ratios for several of the best-known grammar-based compression algorithms, including LZ78, BISECTION, SEQUENTIAL, LONGEST MATCH, GREEDY, and RE-PAIR. Among these, the best upper bound we show is O(n^(1/2)). We finish by presenting two novel algorithms with exponentially better ratios of O(log^3 n) and O(log(n/m*)), where m* is the size of the smallest grammar for that input. The latter highlights a connection between grammar-based compression and LZ77.
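As a rough illustration of what a grammar-based compressor produces, the following sketch repeatedly replaces the most frequent adjacent pair of symbols with a fresh nonterminal, in the spirit of RE-PAIR. It is a simplified assumption-laden version (overlapping pairs are counted naively, ties are broken arbitrarily, and the rule names `R0`, `R1`, ... are invented here), not the algorithm as analyzed in the paper. Grammar size is taken to be the total number of symbols on all right-hand sides.

```python
from collections import Counter


def repair(s):
    """Simplified RE-PAIR-style compressor (illustrative sketch).

    Returns (start_sequence, rules), where rules maps each nonterminal
    name to the pair of symbols it expands to.
    """
    seq = list(s)
    rules = {}
    next_id = 0
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break                          # no pair repeats: stop
        name = f"R{next_id}"               # hypothetical naming scheme
        next_id += 1
        rules[name] = pair
        # Replace non-overlapping occurrences, scanning left to right.
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(name)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules


def expand(symbol, rules):
    """Expand one symbol back into the substring it derives."""
    if symbol not in rules:
        return symbol                      # terminal character
    left, right = rules[symbol]
    return expand(left, rules) + expand(right, rules)


def grammar_size(seq, rules):
    """Total right-hand-side length: the start rule plus two per pair rule."""
    return len(seq) + 2 * len(rules)
```

On "abababab" this sketch produces the start sequence R1 R1 with rules R0 → ab and R1 → R0 R0, a grammar of size 6 for a string of length 8.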
Approximating the Smallest Grammar: Kolmogorov Complexity in Natural Models
, 2002
Cited by 14 (0 self)
We consider the problem of finding the smallest context-free grammar that generates exactly one given string of length n. The size of this grammar is of theoretical interest as an efficiently computable variant of Kolmogorov complexity. The problem is of practical importance in areas such as data compression and pattern extraction. The smallest grammar...
Random Access to Grammar-Compressed Strings
, 2011
Cited by 9 (0 self)
Let S be a string of length N compressed into a context-free grammar 𝒮 of size n. We present two representations of 𝒮 achieving O(log N) random access time, and either O(n · α_k(n)) construction time and space on the pointer machine model, or O(n) construction time and space on the RAM. Here, α_k(n) is the inverse of the k-th row of Ackermann's function. Our representations also efficiently support decompression of any substring in S: we can decompress any substring of length m in the time of a single random access query plus an additional O(m) time. Combining these results with fast algorithms for uncompressed approximate string matching leads to several efficient algorithms for approximate string matching on grammar-compressed strings without decompression. For instance, we can find all approximate occurrences of a pattern P with at most k errors in time O(n(min{|P|k, k^4 + |P|} + log N) + occ), where occ is the number of occurrences of P in S. Finally, we are able to generalize our results to navigation and other operations on grammar-compressed trees. All of the above bounds significantly improve the currently best known results. To achieve these bounds, we introduce several new techniques and data structures of independent interest, including a predecessor data structure, two "biased" weighted ancestor data structures, and a compact representation of heavy paths in grammars.
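For orientation, the baseline that such results improve on can be sketched as follows: store the expansion length of every nonterminal and walk down from the start symbol, which costs time proportional to the grammar's height rather than O(log N). This is only that simple baseline, not the representation from the abstract; the dictionary-based grammar encoding and the names `build_lengths` and `access` are assumptions for this example.

```python
def build_lengths(rules):
    """Compute the expansion length of every nonterminal.

    `rules` maps a nonterminal name to a (left, right) pair; any symbol
    that is not a key of `rules` is treated as a terminal of length 1.
    """
    lengths = {}

    def length(sym):
        if sym not in rules:
            return 1
        if sym not in lengths:
            left, right = rules[sym]
            lengths[sym] = length(left) + length(right)
        return lengths[sym]

    for name in rules:
        length(name)
    return lengths


def access(start, i, rules, lengths):
    """Return the i-th character (0-based) of the expansion of `start`
    without decompressing; runs in time proportional to the height of
    the derivation tree."""
    if not 0 <= i < lengths.get(start, 1):
        raise IndexError("position out of range")
    sym = start
    while sym in rules:
        left, right = rules[sym]
        left_len = lengths.get(left, 1)
        if i < left_len:
            sym = left                     # target lies in the left child
        else:
            i -= left_len                  # skip the left child's expansion
            sym = right
    return sym


# Example grammar (hypothetical): X3 derives "ababa".
example_rules = {"X1": ("a", "b"), "X2": ("X1", "X1"), "X3": ("X2", "a")}
example_lengths = build_lengths(example_rules)
```

With the example grammar, `access("X3", 4, example_rules, example_lengths)` descends X3 → right child and returns "a". For an unbalanced grammar the height can be as large as n, which is the gap the O(log N) bound in the abstract closes.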
Self-Indexed Grammar-Based Compression
, 2001
Cited by 5 (3 self)
Self-indexes aim at representing text collections in a compressed format that allows extracting arbitrary portions and also offers indexed searching on the collection. Current self-indexes are unable to fully exploit the redundancy of highly repetitive text collections that arise in several applications. Grammar-based compression is well suited to exploit such repetitiveness. We introduce the first grammar-based self-index. It builds on Straight-Line Programs (SLPs), a rather general kind of context-free grammar. If an SLP of n rules represents a text T[1, u], then an SLP-compressed representation of T requires 2n log_2 n bits. For that same SLP, our self-index takes O(n log n) + n log_2 u bits. It extracts any text substring of length m in time O((m + h) log n), and finds occ occurrences of a pattern string of length m in time O((m(m + h) + h occ) log n), where h is the height of the parse tree of the SLP. No previous grammar representation had achieved o(n) search time. As byproducts we introduce (i) a representation of SLPs that takes 2n log_2 n(1 + o(1)) bits and efficiently supports more operations than a plain array of rules; (ii) a representation for binary relations with labels supporting various extended queries; (iii) a generalization of our self-index to grammar
Efficient Discovery of Loop Nests in Communication Traces of Parallel Programs
, 2008
Cited by 2 (2 self)
Execution and communication traces are central to performance modeling and analysis of parallel applications. Since the traces can be very long, meaningful compression and extraction of representative behavior is important. Commonly used compression procedures identify repeating patterns in sections of the input string and replace each instance with a representative symbol. This can prevent the identification of long repeating sequences corresponding to outer loops in a trace. This paper introduces and analyzes a framework for identifying the maximal loop nest from a trace based on Crochemore's algorithm. The paper also introduces a greedy algorithm for fast "near optimal" loop nest discovery with well-defined bounds. Results of compressing MPI communication traces of NAS parallel benchmarks show that both algorithms identified the basic loop structures correctly. The greedy algorithm was also very efficient with an average
Indexing Straight-Line Programs
, 2008
Straight-line programs offer powerful text compression by representing a text T[1, u] in terms of a context-free grammar of n rules, so that T can be recovered in O(u) time. However, the problem of operating on the grammar in compressed form has not been studied much. We present the first grammar representation able to extract text substrings, and to search the text for patterns, in time o(n). Its size is of the same order as that of a plain SLP representation, and it can be of independent interest for other grammar-based problems. We also give some byproducts on representing binary relations.
1 Introduction and Related Work
Grammar-based compression has been a well-known technique since at least the seventies [53, 50, 3, 28, 47], and is still a very active area of research, stimulated by the recent interest in XML compression [33, 22, 37]. The main idea is to replace a given text T[1, u] by a context-free grammar (CFG) from which T can be derived. In fact, two different approaches fall under the same name [28]. In the
On the Complexity of Optimal Grammar-Based Compression
, 2004
The task of grammar-based compression is to find a small context-free grammar generating exactly one given string. We investigate the relationship between grammar-based compression of strings over unbounded and bounded alphabets. Specifically, we show how to transform a grammar for a string over an unbounded alphabet into a grammar for a block coding of that string over a fixed bounded alphabet and vice versa. From these constructions, we obtain asymptotically tight relationships between the minimum grammar sizes for strings and their block codings. Finally, we exploit an improved bound of our construction for overlap-free block codings to show that a polynomial-time algorithm for approximating the minimum grammar for binary strings within a factor of c yields a polynomial-time algorithm for approximating the minimum grammar for strings over arbitrary alphabets within a factor of 24c + ε (for arbitrary ε > 0). Since the latter problem is known to be NP-hard to approximate within a factor of 8569/8568, we provide a first step towards solving the long-standing open question whether minimum grammar-based compression of binary strings is NP-complete.
Approximating the Smallest Grammar: . . .
We consider the problem of finding the smallest context-free grammar that generates exactly one given string of length n. The size of this grammar is of theoretical interest as an efficiently computable variant of Kolmogorov complexity. The problem is of practical importance in areas such as data compression and pattern extraction. The smallest grammar is known to be hard to approximate to within a constant factor, and an o(log n / log log n) approximation would require progress on a longstanding algebraic problem [10]. Previously, the best proved approximation ratio was O(n^(1/2)) for the Bisection algorithm [8]. Our main result is an exponential improvement of this ratio; we give an O(log(n/g*)) approximation algorithm, where g* is the size of the smallest grammar. We then consider other computable variants of Kolmogorov complexity. In particular we give an O(log² n) approximation for the smallest nondeterministic finite automaton with advice that produces a given string. We also apply our techniques to "advice-grammars" and "edit-grammars", two other natural models of string complexity.
MicroRNA Target Detection and Analysis for Genes Related to Breast Cancer Using MDLcompress
doi:10.1155/2007/43670
We describe initial results of miRNA sequence analysis with the optimal symbol compression ratio (OSCR) algorithm and recast this grammar inference algorithm as an improved minimum description length (MDL) learning tool: MDLcompress. We apply this tool to explore the relationship between miRNAs, single nucleotide polymorphisms (SNPs), and breast cancer. Our new algorithm outperforms other grammar-based coding methods, such as DNA Sequitur, while retaining a two-part code that highlights biologically significant phrases. The deep recursion of MDLcompress, together with its explicit two-part coding, enables it to identify biologically meaningful sequences without needlessly restrictive priors. The ability to quantify cost in bits for phrases in the MDL model allows prediction of regions where SNPs may have the most impact on biological activity. MDLcompress improves on our previous algorithm in execution time through an innovative data structure, and in specificity of motif detection (compression) through improved heuristics. An MDLcompress analysis of 144 overexpressed genes from the breast cancer cell line BT474 has identified novel motifs, including potential microRNA (miRNA) binding sites that are candidates for experimental validation. Copyright © 2007 General Electric Company. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.