Results 1  10
of
27
Data Compression
 ACM Computing Surveys
, 1987
"... This paper surveys a variety of data compression methods spanning almost forty years of research, from the work of Shannon, Fano and Huffman in the late 40's to a technique developed in 1986. The aim of data compression is to reduce redundancy in stored or communicated data, thus increasing effectiv ..."
Abstract

Cited by 87 (3 self)
 Add to MetaCart
This paper surveys a variety of data compression methods spanning almost forty years of research, from the work of Shannon, Fano and Huffman in the late 40's to a technique developed in 1986. The aim of data compression is to reduce redundancy in stored or communicated data, thus increasing effective data density. Data compression has important application in the areas of file storage and distributed systems. Concepts from information theory, as they relate to the goals and evaluation of data compression methods, are discussed briefly. A framework for evaluation and comparison of methods is constructed and applied to the algorithms presented. Comparisons of both theoretical and empirical natures are reported and possibilities for future research are suggested. INTRODUCTION Data compression is often referred to as coding, where coding is a very general term encompassing any special representation of data which satisfies a given need. Information theory is defined to be the study of eff...
A compression algorithm for DNA sequences and its applications in genome comparison
, 1999
"... We present a lossless compression algorithm, GenCompress, for genetic sequences, based on searching for approximate repeats. Our algorithm achieves the best compression ratios for benchmark DNA sequences. Significantly better compression results show that the approximate repeats are one of the main ..."
Abstract

Cited by 67 (4 self)
 Add to MetaCart
We present a lossless compression algorithm, GenCompress, for genetic sequences, based on searching for approximate repeats. Our algorithm achieves the best compression ratios for benchmark DNA sequences. Significantly better compression results show that the approximate repeats are one of the main hidden regularities in DNA sequences. We then describe a theory of measuring the relatedness between two DNA sequences. Using our algorithm, we present strong experimental support for this theory, and demonstrate its application in comparing genomes and constructing evolutionary trees. 1
Efficient Decoding of Prefix Codes
 Communications of the ACM
, 1990
"... We discuss representations of prefix codes and the corresponding storage space and decoding time requirements. We assume that a dictionary of words to be encoded has been defined and that a prefix code appropriate to the dictionary has been constructed. The encoding operation becomes simple given th ..."
Abstract

Cited by 31 (0 self)
 Add to MetaCart
We discuss representations of prefix codes and the corresponding storage space and decoding time requirements. We assume that a dictionary of words to be encoded has been defined and that a prefix code appropriate to the dictionary has been constructed. The encoding operation becomes simple given these assumptions and given an appropriate parsing strategy, therefore we concentrate on decoding. The application which led us to this work constrains the use of internal memory during the decode operation. As a result, we seek a method of decoding which has a small memory requirement. Introduction Data compression is an important and muchstudied problem. Compressing data to be stored or transmitted can result in significant improvements in the use of computing resources. The degree of improvement that can be achieved depends not only on the selection of a data compression method, but also on the characteristics of the particular application. That is, no single data compression algorithm wi...
SelfAlignment in Words and their Applications
 J. Algorithms
, 1992
"... Some quantities associated with periodicities in words are analyzed within the Bernoulli probabilistic model. In particular, the following problem is addressed. Assume that a string X is given, with symbols emitted randomly but independently according to some known distribution of probabilities. T ..."
Abstract

Cited by 27 (8 self)
 Add to MetaCart
Some quantities associated with periodicities in words are analyzed within the Bernoulli probabilistic model. In particular, the following problem is addressed. Assume that a string X is given, with symbols emitted randomly but independently according to some known distribution of probabilities. Then, for each pair (W , Z) of distinct suffixes of X, the expected length of the longest common prefix of W and Z is sought. The collection of these lengths, that are called here selfalignments, plays a crucial role in several algorithmic problems on words, such as building suffix trees or inverted files, detecting squares and other regularities, computing substring statistics, etc. The asymptotically best algorithms for these problems are quite complex and thus risk to be unpractical. The present analysis of selfalignments and related measures suggests that, in a variety of cases, more straightforward algorithmic solutions may yield comparable or even better performances. Key words and ph...
Offline compression by greedy textual substitution
 PROC. IEEE
, 2000
"... Greedy offline textual substitution refers to the following approach to compression or structural inference. Given a long textstring x, a substring w is identified such that replacing all instances of w in x except one by a suitable pair of pointers yields the highest possible contraction of x; the ..."
Abstract

Cited by 25 (1 self)
 Add to MetaCart
Greedy offline textual substitution refers to the following approach to compression or structural inference. Given a long textstring x, a substring w is identified such that replacing all instances of w in x except one by a suitable pair of pointers yields the highest possible contraction of x; the process is then repeated on the contracted textstring until substrings capable of producing contractions can no longer be found. This paper examines computational issues arising in the implementation of this paradigm and describes some applications and experiments.
Some Theory and Practice of Greedy Offline Textual Substitution
 Proc. Data Compression Conference, IEEE Computer
, 1998
"... Purdue University and Universit�a di Padova Greedy o��line textual substitution refers to the following steepest descent approach ..."
Abstract

Cited by 16 (0 self)
 Add to MetaCart
Purdue University and Universit�a di Padova Greedy o��line textual substitution refers to the following steepest descent approach
A Guaranteed Compression Scheme for Repetitive DNA Sequences
, 1995
"... We present a text compression scheme dedicated to DNA sequences. This algorithm has two computation phases. In the parsing phase, the suffix tree is built to select repeats for the dictionary. In the encoding phase, selected repetitions for which a guarantee of gain is established, are encoded. W ..."
Abstract

Cited by 12 (0 self)
 Add to MetaCart
We present a text compression scheme dedicated to DNA sequences. This algorithm has two computation phases. In the parsing phase, the suffix tree is built to select repeats for the dictionary. In the encoding phase, selected repetitions for which a guarantee of gain is established, are encoded. We prove a theorem that guarantees the compression gain and report some comparisons with classical compression schemes. Complete results are available by anonymous FTP at ftp.lifl.fr:/pub/BIC/biologie/Cfact. These experiments establish that DNA sequences require special compression methods and show our algorithm utility for classification purposes in biology.
Robust Universal Complete Codes for Transmission and Compression
 Discrete Applied Mathematics
, 1996
"... Several measures are defined and investigated, which allow the comparison of codes as to their robustness against errors. Then new universal and complete sequences of variablelength codewords are proposed, based on representing the integers in a binary Fibonacci numeration system. Each sequence is ..."
Abstract

Cited by 10 (4 self)
 Add to MetaCart
Several measures are defined and investigated, which allow the comparison of codes as to their robustness against errors. Then new universal and complete sequences of variablelength codewords are proposed, based on representing the integers in a binary Fibonacci numeration system. Each sequence is constant and need not be generated for every probability distribution. These codes can be used as alternatives to Huffman codes when the optimal compression of the latter is not required, and simplicity, faster processing and robustness are preferred. The codes are compared on several "reallife" examples. 1. Motivation and Introduction Let A = fA 1 ; A 2 ; \Delta \Delta \Delta ; An g be a finite set of elements, called cleartext elements, to be encoded by a static uniquely decipherable (UD) code. For notational ease, we use the term `code' as abbreviation for `set of codewords'; the corresponding encoding and decoding algorithms are always either given or clear from the context. A code i...
A simpler analysis of BurrowsWheeler based compression
 In Proc. of the 17th Symposium on Combinatorial Pattern Matching (CPM ’06). SpringerVerlag LNCS
, 2006
"... In this paper we present a new technique for worstcase analysis of compression algorithms which are based on the BurrowsWheeler Transform. We deal mainly with the algorithm proposed by Burrows and Wheeler in their first paper on the subject [6], called bw0. This algorithm consists of the following ..."
Abstract

Cited by 10 (0 self)
 Add to MetaCart
In this paper we present a new technique for worstcase analysis of compression algorithms which are based on the BurrowsWheeler Transform. We deal mainly with the algorithm proposed by Burrows and Wheeler in their first paper on the subject [6], called bw0. This algorithm consists of the following three essential steps: 1) Obtain the BurrowsWheeler Transform of the text, 2) Convert the transform into a sequence of integers using the movetofront algorithm, 3) Encode the integers using Arithmetic code or any order0 encoding (possibly with runlength encoding). We achieve a strong upper bound on the worstcase compression ratio of this algorithm. This bound is significantly better than bounds known to date and is obtained via simple analytical techniques. Specifically, we show that for any input string s, and µ> 1, the length of the compressed string is bounded by µ · sHk(s)+ log(ζ(µ)) · s  + µgk + O(log n) where Hk is the kth order empirical entropy, gk is a constant depending only on k and on the size of the alphabet, and ζ(µ) = 1 1 1 µ+ 2 µ+... is the standard zeta function. As part of the analysis we prove a result on the compressibility of integer sequences, which is of independent interest. Finally, we apply our techniques to prove a worstcase bound on the compression ratio of a compression algorithm based on the BurrowsWheeler Transform followed by distance coding, for which worstcase guarantees have never been given. We prove that the length of the compressed string is bounded by 1.7286 · sHk(s) + gk + O(log n). This bound is better than the bound we give for bw0.
Fast Discerning Repeats in DNA Sequences with a Compression Algorithm
, 1997
"... Long direct repeats in genomes arise from molecular duplication mechanisms like retrotransposition, copy of genes, exon shuffling, . . . Their study in a given sequence reveals its internal repeat structure as well as part of its evolutionary history. Moreover, detailed knowledge about the mechanism ..."
Abstract

Cited by 6 (1 self)
 Add to MetaCart
Long direct repeats in genomes arise from molecular duplication mechanisms like retrotransposition, copy of genes, exon shuffling, . . . Their study in a given sequence reveals its internal repeat structure as well as part of its evolutionary history. Moreover, detailed knowledge about the mechanisms can be gained from a systematic investigation of repeats. The problem of finding such repeats is viewed as an NPcomplete problem of the optimal compression of a sequence thanks to the encoding of its exact repeats. The repeats chosen for compression must not overlap each other as do the repeats which result from molecular duplications. We present a new heuristic algorithm, Search Repeats, where the selection of exact repeats is guided by two biologically sound criteria: their length and the absence of overlap between those repeats. Search Repeats detects approximate repeats, as clusters of exact subrepeats, and points out large insertions/deletions in them. Search Repeats takes only 3 s...