Results 1–10 of 18
Application of Lempel-Ziv factorization to the approximation of grammar-based compression
, 2003
Abstract

Cited by 80 (1 self)
We introduce a new type of context-free grammars, AVL-grammars, and show their application to grammar-based compression. Using this type of grammars we present an O(n log n)-time and O(log n)-ratio approximation of minimal grammar-based compression of a given string of length n over an alphabet Σ, and an O(k log n)-time transformation of an LZ77 encoding of size k into a grammar-based encoding of size O(k log n).
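To make the LZ77 connection concrete, here is a minimal sketch of a greedy LZ77-style factorization (function name is illustrative, not from the paper; classical LZ77 also allows a factor's source to overlap the factor itself, which this sketch omits):

```python
def lz77_factorize(s):
    # Greedy factorization: each factor is the longest prefix of the
    # remaining suffix that already occurs in the processed part of s,
    # or a single fresh character when no such prefix exists.
    factors, i = [], 0
    while i < len(s):
        length = 0
        # grow the factor while it still occurs strictly before position i
        while i + length < len(s) and s[i : i + length + 1] in s[:i]:
            length += 1
        factors.append(s[i : i + max(length, 1)])
        i += max(length, 1)
    return factors
```

For "abababab" this yields the 4 factors a, b, ab, abab; per the abstract, a k-factor parse of this kind can be turned into a grammar of size O(k log n).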
The Smallest Grammar Problem
 IEEE TRANSACTIONS ON INFORMATION THEORY
, 2005
Abstract

Cited by 62 (0 self)
This paper addresses the smallest grammar problem: What is the smallest context-free grammar that generates exactly one given string σ? This is a natural question about a fundamental object connected to many fields, including data compression, Kolmogorov complexity, pattern identification, and addition chains. Due to the problem’s inherent complexity, our objective is to find an approximation algorithm which finds a small grammar for the input string. We focus attention on the approximation ratio of the algorithm (and, implicitly, worst-case behavior) to establish provable performance guarantees and to address shortcomings in the classical measure of redundancy in the literature. Our first results are a variety of hardness results, most notably that every efficient algorithm for the smallest grammar problem has approximation ratio at least 8569/8568 unless P = NP. We then bound approximation ratios for several of the best-known grammar-based compression algorithms, including LZ78, BISECTION, SEQUENTIAL, LONGEST MATCH, GREEDY, and RE-PAIR. Among these, the best upper bound we show is O(n^{1/2}). We finish by presenting two novel algorithms with exponentially better ratios of O(log^3 n) and O(log(n/m*)), where m* is the size of the smallest grammar for that input. The latter highlights a connection between grammar-based compression and LZ77.
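A toy illustration of the problem (a sketch with hypothetical rule names, not an algorithm from the paper): a straight-line grammar, in which every nonterminal has exactly one rule, derives exactly one string, and its size is conventionally the total number of symbols on all right-hand sides.

```python
def expand(grammar, symbol):
    # In a straight-line grammar each nonterminal has exactly one
    # right-hand side, so every symbol derives exactly one string.
    if symbol not in grammar:
        return symbol  # terminal character
    return "".join(expand(grammar, s) for s in grammar[symbol])

def grammar_size(grammar):
    # Size measure: total symbols over all right-hand sides.
    return sum(len(rhs) for rhs in grammar.values())

# "abababab" (8 symbols) admits a grammar of size 6:
g = {"S": ["A", "A"], "A": ["B", "B"], "B": ["a", "b"]}
```

The smallest grammar problem asks for the grammar minimizing this size, which is what the hardness and approximation results above are about.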
Random Access to Grammar-Compressed Strings
, 2011
Abstract

Cited by 31 (3 self)
Let S be a string of length N compressed into a context-free grammar S of size n. We present two representations of S achieving O(log N) random access time, and either O(n · α_k(n)) construction time and space on the pointer machine model, or O(n) construction time and space on the RAM. Here, α_k(n) is the inverse of the k-th row of Ackermann’s function. Our representations also efficiently support decompression of any substring of S: we can decompress any substring of length m in the same complexity as a single random access query plus additional O(m) time. Combining these results with fast algorithms for uncompressed approximate string matching leads to several efficient algorithms for approximate string matching on grammar-compressed strings without decompression. For instance, we can find all approximate occurrences of a pattern P with at most k errors in time O(n(min{|P|k, k^4 + |P|} + log N) + occ), where occ is the number of occurrences of P in S. Finally, we are able to generalize our results to navigation and other operations on grammar-compressed trees. All of the above bounds significantly improve the currently best known results. To achieve these bounds, we introduce several new techniques and data structures of independent interest, including a predecessor data structure, two "biased" weighted ancestor data structures, and a compact representation of heavy paths in grammars.
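The core idea behind such random access can be sketched as follows (a simplified O(height) baseline, not the paper's O(log N) structure; rule names are hypothetical): store the expansion length of every grammar symbol, then answer S[i] by walking a single root-to-leaf path of the derivation tree.

```python
def expansion_lengths(rules):
    # rules maps each symbol to a one-character string (terminal rule)
    # or a pair of symbols. Memoize the length of every expansion.
    lens = {}
    def rec(sym):
        if sym not in lens:
            r = rules[sym]
            lens[sym] = 1 if isinstance(r, str) else rec(r[0]) + rec(r[1])
        return lens[sym]
    for sym in rules:
        rec(sym)
    return lens

def access(rules, lens, sym, i):
    # Fetch character i of sym's expansion by descending one path:
    # go left if i falls in the left child, else shift i and go right.
    while not isinstance(rules[sym], str):
        left, right = rules[sym]
        if i < lens[left]:
            sym = left
        else:
            i -= lens[left]
            sym = right
    return rules[sym]

rules = {"S": ("A", "A"), "A": ("B", "B"), "B": ("Ta", "Tb"),
         "Ta": "a", "Tb": "b"}   # derives "abababab"
lens = expansion_lengths(rules)
```

The paper's contribution is bringing the cost of this walk down from the grammar height (which can be Θ(n)) to O(log N).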
Textual data compression in computational biology: a synopsis
 Bioinformatics
, 2009
Abstract

Cited by 24 (1 self)
Motivation: Textual data compression, and the associated techniques coming from information theory, are often perceived as being of interest for data communication and storage. However, they are also deeply related to classification and data mining and analysis. In recent years, a substantial effort has been made for the application of textual data compression techniques to various computational biology tasks, ranging from storage and indexing of large datasets to comparison and reverse engineering of biological networks. Results: The main focus of this review is on a systematic presentation of the key areas of bioinformatics and computational biology where compression has been used. When possible, a unifying organization of the main ideas and techniques is also provided. Availability: It goes without saying that most of the research results reviewed here offer software prototypes to the bioinformatics community. The supplementary material (see next) provides pointers to software and benchmark datasets for a range of applications of broad interest. Contact:
Self-Indexed Grammar-Based Compression
, 2001
Abstract

Cited by 19 (7 self)
Self-indexes aim at representing text collections in a compressed format that allows extracting arbitrary portions and also offers indexed searching on the collection. Current self-indexes are unable to fully exploit the redundancy of highly repetitive text collections that arise in several applications. Grammar-based compression is well suited to exploit such repetitiveness. We introduce the first grammar-based self-index. It builds on Straight-Line Programs (SLPs), a rather general kind of context-free grammars. If an SLP of n rules represents a text T[1, u], then an SLP-compressed representation of T requires 2n log₂ n bits. For that same SLP, our self-index takes O(n log n) + n log₂ u bits. It extracts any text substring of length m in time O((m + h) log n), and finds the occ occurrences of a pattern string of length m in time O((m(m + h) + h·occ) log n), where h is the height of the parse tree of the SLP. No previous grammar representation had achieved o(n) search time. As byproducts we introduce (i) a representation of SLPs that takes 2n log₂ n (1 + o(1)) bits and efficiently supports more operations than a plain array of rules; (ii) a representation for binary relations with labels supporting various extended queries; (iii) a generalization of our self-index to grammar...
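The substring-extraction primitive such an index relies on can be sketched like this (a plain recursive pruning of the derivation tree visiting only nodes that overlap the window; it illustrates the idea but not the paper's O((m + h) log n) machinery, and the SLP below is a hypothetical example):

```python
rules = {"S": ("A", "A"), "A": ("B", "B"), "B": ("Ta", "Tb"),
         "Ta": "a", "Tb": "b"}   # an SLP deriving "abababab"

def expansion_lengths(rules):
    # Memoized length of every symbol's expansion.
    lens = {}
    def rec(sym):
        if sym not in lens:
            r = rules[sym]
            lens[sym] = 1 if isinstance(r, str) else rec(r[0]) + rec(r[1])
        return lens[sym]
    for sym in rules:
        rec(sym)
    return lens

lens = expansion_lengths(rules)

def slp_substring(rules, lens, sym, i, j):
    # Return expansion(sym)[i:j]; subtrees falling entirely outside
    # the window are pruned, so only O(j - i + height) nodes are visited.
    i, j = max(i, 0), min(j, lens[sym])
    if i >= j:
        return ""
    r = rules[sym]
    if isinstance(r, str):
        return r
    left, right = r
    return (slp_substring(rules, lens, left, i, j)
            + slp_substring(rules, lens, right, i - lens[left], j - lens[left]))
```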
Approximating the Smallest Grammar: Kolmogorov Complexity in Natural Models
, 2002
Abstract

Cited by 17 (0 self)
We consider the problem of finding the smallest context-free grammar that generates exactly one given string of length n. The size of this grammar is of theoretical interest as an efficiently computable variant of Kolmogorov complexity. The problem is of practical importance in areas such as data compression and pattern extraction. The smallest grammar...
On the Complexity of Optimal Grammar-Based Compression
, 2004
Abstract

Cited by 3 (0 self)
The task of grammar-based compression is to find a small context-free grammar generating exactly one given string. We investigate the relationship between grammar-based compression of strings over unbounded and bounded alphabets. Specifically, we show how to transform a grammar for a string over an unbounded alphabet into a grammar for a block coding of that string over a fixed bounded alphabet, and vice versa. From these constructions, we obtain asymptotically tight relationships between the minimum grammar sizes for strings and their block codings. Finally, we exploit an improved bound of our construction for overlap-free block codings to show that a polynomial-time algorithm for approximating the minimum grammar for binary strings within a factor of c yields a polynomial-time algorithm for approximating the minimum grammar for strings over arbitrary alphabets within a factor of 24c + ε (for arbitrary ε > 0). Since the latter problem is known to be NP-hard to approximate within a factor of 8569/8568, we provide a first step towards solving the long-standing open question whether minimum grammar-based compression of binary strings is NP-complete.
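The notion of block coding used here can be illustrated with a fixed-width toy (a hedged sketch, not the paper's overlap-free construction): each symbol of a possibly large alphabet is replaced by a block of ⌈log₂ σ⌉ bits, and the transformation is invertible because all blocks share one width.

```python
import math

def block_encode(s, alphabet):
    # Fixed-width block coding: every symbol becomes a binary block of
    # width ceil(log2 |alphabet|), so block boundaries are implicit.
    width = max(1, math.ceil(math.log2(len(alphabet))))
    code = {c: format(i, "0%db" % width) for i, c in enumerate(alphabet)}
    return "".join(code[c] for c in s), width

def block_decode(bits, alphabet, width):
    # Invert the coding by cutting the bit string back into blocks.
    rev = {format(i, "0%db" % width): c for i, c in enumerate(alphabet)}
    return "".join(rev[bits[k : k + width]] for k in range(0, len(bits), width))
```

The paper's question is how the minimum grammar size of s relates to that of its (binary) block coding, in both directions.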
Efficient Discovery of Loop Nests in Communication Traces of Parallel Programs
, 2008
Abstract

Cited by 3 (2 self)
Execution and communication traces are central to performance modeling and analysis of parallel applications. Since the traces can be very long, meaningful compression and extraction of representative behavior is important. Commonly used compression procedures identify repeating patterns in sections of the input string and replace each instance with a representative symbol. This can prevent the identification of long repeating sequences corresponding to outer loops in a trace. This paper introduces and analyzes a framework for identifying the maximal loop nest from a trace based on Crochemore's algorithm. The paper also introduces a greedy algorithm for fast "near optimal" loop nest discovery with well-defined bounds. Results of compressing MPI communication traces of NAS parallel benchmarks show that both algorithms identified the basic loop structures correctly. The greedy algorithm was also very efficient with an average...
Textual data compression in computational biology: a synopsis
Abstract
Motivation: Textual data compression, and the associated techniques coming from information theory, are often perceived as being of interest for data communication and storage. However, they are also deeply related to classification and data mining and analysis. In recent years, a substantial effort has been made for the application of textual data compression techniques to various computational biology tasks, ranging from storage and indexing of large datasets to comparison and reverse engineering of biological networks. Results: The main focus of this review is on a systematic presentation of the key areas of bioinformatics and computational biology where compression has been used. When possible, a unifying organization of the main ideas and techniques is also provided. Availability: It goes without saying that most of the research results reviewed here offer software prototypes to the bioinformatics community. The Supplementary Material provides pointers to software and benchmark datasets for a range of applications of broad interest. In addition to providing references to software, the Supplementary Material also gives a brief presentation of some fundamental results and techniques related to this paper. It is at:
Efficient Discovery of Loop Nests in Execution Traces
Abstract
Abstract—Execution and communication traces are central to performance modeling and analysis. Since the traces can be very long, meaningful compression and extraction of representative behavior is important. Commonly used compression procedures identify repeating patterns in sections of the input string and replace each instance with a representative symbol. This can prevent the identification of long repeating sequences corresponding to outer loops in a trace. This paper introduces and analyzes a framework for identifying the maximal loop nest from a trace. The discovery of loop nests makes construction of compressed representative traces straightforward. The paper also introduces a greedy algorithm for fast "near optimal" loop nest discovery with well-defined bounds. Results of compressing MPI communication traces of NAS parallel benchmarks show that both algorithms identified the basic loop structures correctly. The greedy algorithm was also very efficient, with an average processing time of 16.5 seconds for an average trace length of 71695 MPI events. Index Terms—Trace compression, loop discovery, performance modeling
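A one-level version of the repeating-pattern idea can be sketched as follows (a greedy toy, not Crochemore's algorithm or the paper's nest-discovery framework; names are illustrative): at each trace position, pick the period whose repetition covers the most events and fold it into a (body, count) pair.

```python
def fold_loops(trace):
    # Greedy one-level loop discovery: at position i, try every period
    # p, count how many times trace[i:i+p] repeats back-to-back, and
    # fold the run covering the most events into (body, count).
    out, i, n = [], 0, len(trace)
    while i < n:
        best_p, best_reps = 1, 1
        for p in range(1, (n - i) // 2 + 1):
            reps = 1
            while trace[i + reps * p : i + (reps + 1) * p] == trace[i : i + p]:
                reps += 1
            if reps * p > best_reps * best_p:
                best_p, best_reps = p, reps
        out.append((trace[i : i + best_p], best_reps))
        i += best_p * best_reps
    return out
```

Folding only one level at a time is exactly the limitation the abstract describes: recognizing a whole outer loop requires discovering the full loop nest, which is what the paper's framework addresses.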