Results 1  10
of
32
The Smallest Grammar Problem
 IEEE TRANSACTIONS ON INFORMATION THEORY
, 2005
"... This paper addresses the smallest grammar problem: What is the smallest contextfree grammar that generates exactly one given string σ? This is a natural question about a fundamental object connected to many fields, including data compression, Kolmogorov complexity, pattern identification, and addi ..."
Abstract

Cited by 62 (0 self)
 Add to MetaCart
(Show Context)
This paper addresses the smallest grammar problem: What is the smallest contextfree grammar that generates exactly one given string σ? This is a natural question about a fundamental object connected to many fields, including data compression, Kolmogorov complexity, pattern identification, and addition chains. Due to the problem’s inherent complexity, our objective is to find an approximation algorithm which finds a small grammar for the input string. We focus attention on the approximation ratio of the algorithm (and implicitly, worstcase behavior) to establish provable performance guarantees and to address shortcomings in the classical measure of redundancy in the literature. Our first results are a variety of hardness results, most notably that every efficient algorithm for the smallest grammar problem has approximation ratio at least 8569 unless P = NP. 8568 We then bound approximation ratios for several of the bestknown grammarbased compression algorithms, including LZ78, BISECTION, SEQUENTIAL, LONGEST MATCH, GREEDY, and REPAIR. Among these, the best upper bound we show is O(n 1/2). We finish by presenting two novel algorithms with exponentially better ratios of O(log 3 n) and O(log(n/m ∗)), where m ∗ is the size of the smallest grammar for that input. The latter highlights a connection between grammarbased compression and LZ77.
Random Access to GrammarCompressed Strings
, 2011
"... Let S be a string of length N compressed into a contextfree grammar S of size n. We present two representations of S achieving O(log N) random access time, and either O(n · αk(n)) construction time and space on the pointer machine model, or O(n) construction time and space on the RAM. Here, αk(n) is ..."
Abstract

Cited by 31 (3 self)
 Add to MetaCart
Let S be a string of length N compressed into a contextfree grammar S of size n. We present two representations of S achieving O(log N) random access time, and either O(n · αk(n)) construction time and space on the pointer machine model, or O(n) construction time and space on the RAM. Here, αk(n) is the inverse of the k th row of Ackermann’s function. Our representations also efficiently support decompression of any substring in S: we can decompress any substring of length m in the same complexity as a single random access query and additional O(m) time. Combining these results with fast algorithms for uncompressed approximate string matching leads to several efficient algorithms for approximate string matching on grammarcompressed strings without decompression. For instance, we can find all approximate occurrences of a pattern P with at most k errors in time O(n(min{P k, k 4 + P } + log N) + occ), where occ is the number of occurrences of P in S. Finally, we are able to generalize our results to navigation and other operations on grammarcompressed trees. All of the above bounds significantly improve the currently best known results. To achieve these bounds, we introduce several new techniques and data structures of independent interest, including a predecessor data structure, two ”biased” weighted ancestor data structures, and a compact representation of heavypaths in grammars.
A simple and fast DNA compressor
 Software  Practice and Experience
, 2004
"... In this paper we consider the problem of DNA compression. It is well known that one of the main features of DNA sequences is that they contain substrings which are duplicated except for a few random mutations. For this reason most DNA compressors work by searching and encoding approximate repeats. W ..."
Abstract

Cited by 19 (0 self)
 Add to MetaCart
(Show Context)
In this paper we consider the problem of DNA compression. It is well known that one of the main features of DNA sequences is that they contain substrings which are duplicated except for a few random mutations. For this reason most DNA compressors work by searching and encoding approximate repeats. We depart from this strategy by searching and encoding only exact repeats. However, we use an encoding designed to take advantage of the possible presence of approximate repeats. Our approach leads to an algorithm which is an order of magnitude faster than any other algorithm and achieves a compression ratio very close to the best DNA compressors. Another important feature of our algorithm is its small space occupancy which makes it possible to compress sequences hundreds of megabytes long, well beyond the range of any previous DNA compressor. 1
Grammarbased Compression of DNA Sequences
, 2004
"... Grammarbased compression algorithms infer contextfree grammars to represent the input data. The grammar is then transformed into a symbol stream and finally encoded in binary. We explore the utility of grammarbased compression of DNA sequences. We strive to optimize the three stages of grammarba ..."
Abstract

Cited by 17 (0 self)
 Add to MetaCart
(Show Context)
Grammarbased compression algorithms infer contextfree grammars to represent the input data. The grammar is then transformed into a symbol stream and finally encoded in binary. We explore the utility of grammarbased compression of DNA sequences. We strive to optimize the three stages of grammarbased compression to work optimally for DNA. DNA is notoriously hard to compress, and ultimately, our algorithm fails to achieve better compression than the best competitor. 1
Approximation Algorithms for GrammarBased Data Compression
, 2002
"... This thesis considers the smallest grammar problem: find the smallest contextfree grammar that generates exactly one given string. We show that this problem is intractable, and so our objective is to find approximation algorithms. This simple question is connected to many areas of research. Most im ..."
Abstract

Cited by 10 (0 self)
 Add to MetaCart
(Show Context)
This thesis considers the smallest grammar problem: find the smallest contextfree grammar that generates exactly one given string. We show that this problem is intractable, and so our objective is to find approximation algorithms. This simple question is connected to many areas of research. Most importantly, there is a link to data compression; instead of storing a long string, one can store a small grammar that generates it. A small grammar for a string also naturally brings out underlying patterns, a fact that is useful, for example, in DNA analysis. Moreover, the size of the smallest contextfree grammar generating a string can be regarded as a computable relaxation of Kolmogorov complexity. Finally, work on the smallest grammar problem qualitatively extends the study of approximation algorithms to hierarchicallystructured objects. In this thesis, we establish hardness results, evaluate several previously proposed algorithms, and then present new procedures with much stronger approximation guarantees.
Structure induction by lossless graph compression
 In Proc. of the IEEE Data Compression Conference, 53–62
, 2007
"... ..."
(Show Context)
Tree structure compression with RePair
, 2010
"... Larsson and Moffat’s RePair algorithm is generalized from strings to trees. The new algorithm (TreeRePair) produces straightline linear contextfree tree (SLT) grammars which are smaller than those produced by previous grammarbased compressors such as BPLEX. Experiments show that a Huffmanbased ..."
Abstract

Cited by 10 (4 self)
 Add to MetaCart
(Show Context)
Larsson and Moffat’s RePair algorithm is generalized from strings to trees. The new algorithm (TreeRePair) produces straightline linear contextfree tree (SLT) grammars which are smaller than those produced by previous grammarbased compressors such as BPLEX. Experiments show that a Huffmanbased coding of the resulting grammars gives compression ratios comparable to the best known XML file compressors. Moreover, SLT grammars can be used as efficient memory representation of trees. Our investigations show that tree traversals over TreeRePair grammars are 14 times slower than over pointer structures and 5 times slower than over succinct trees, while memory consumption is only 1/43 and 1/6, respectively. 1
An approach to phrase selection for offline data compression
 Proc. 25th Australasian Computer Science Conference
, 2002
"... Recently several oJfline data compression schemes have been published that expend large amounts of computing resources when encoding a file, but decode the file quickly. These compressors work by identifying phrases in the input data, and storing the data as a series of pointer to these phrases. Thi ..."
Abstract

Cited by 5 (0 self)
 Add to MetaCart
Recently several oJfline data compression schemes have been published that expend large amounts of computing resources when encoding a file, but decode the file quickly. These compressors work by identifying phrases in the input data, and storing the data as a series of pointer to these phrases. This paper explores the application of an algorithm for computing all repeating substrings within a string for phrase selection in an offiine data compressor. Using our approach, we obtain compression similar to that of the best known offiine compressors on genetic data, but poor results on general text. It seems, however, that an alternate approach based on selecting repeating substrings is feasible. Keywords: strings, of_ fline data compression, textual substitution, repeating substrings 1