Results 1 - 10
of
42
XMill: an Efficient Compressor for XML Data
, 1999
"... We describe a tool for compressing XML data, with applications in data exchange and archiving, which usually achieves about twice the compression ratio of gzip at roughly the same speed. The compressor, called XMill, incorporates and combines existing compressors in order to apply them to heterogene ..."
Abstract
-
Cited by 165 (0 self)
- Add to MetaCart
We describe a tool for compressing XML data, with applications in data exchange and archiving, which usually achieves about twice the compression ratio of gzip at roughly the same speed. The compressor, called XMill, incorporates and combines existing compressors in order to apply them to heterogeneous XML data: it uses zlib, the library function for gzip, a collection of datatype specific compressors for simple data types, and, possibly, user defined compressors for application specific data types. 1 Introduction We have implemented a compressor/decompressor for XML data, to be used in data exchange and archiving, that achieves about twice the compression rate of general-purpose compressors (gzip), at about the same speed. The tool can be downloaded from www.research.att.com/sw/tools/xmill/. XML is now being adopted by many organizations and industry groups, like the healthcare, banking, chemical, and telecommunications industries. The attraction in XML is that it is a self-describi...
The Similarity Metric
- IEEE TRANSACTIONS ON INFORMATION THEORY
, 2003
"... A new class of distances appropriate for measuring similarity relations between sequences, say one type of similarity per distance, is studied. We propose a new "normalized information distance", based on the noncomputable notion of Kolmogorov complexity, and show that it is in this class and it min ..."
Abstract
-
Cited by 137 (15 self)
- Add to MetaCart
A new class of distances appropriate for measuring similarity relations between sequences, say one type of similarity per distance, is studied. We propose a new "normalized information distance", based on the noncomputable notion of Kolmogorov complexity, and show that it is in this class and it minorizes every computable distance in the class (that is, it is universal in that it discovers all computable similarities). We demonstrate that it is a metric and call it the similarity metric. This theory forms the foundation for a new practical tool. To evidence generality and robustness we give two distinctive applications in widely divergent areas using standard compression programs like gzip and GenCompress. First, we compare whole mitochondrial genomes and infer their evolutionary history. This results in a first completely automatic computed whole mitochondrial phylogeny tree. Secondly, we fully automatically compute the language tree of 52 different languages.
A compression algorithm for DNA sequences and its applications in genome comparison
, 1999
"... We present a lossless compression algorithm, GenCompress, for genetic sequences, based on searching for approximate repeats. Our algorithm achieves the best compression ratios for benchmark DNA sequences. Significantly better compression results show that the approximate repeats are one of the main ..."
Abstract
-
Cited by 54 (3 self)
- Add to MetaCart
We present a lossless compression algorithm, GenCompress, for genetic sequences, based on searching for approximate repeats. Our algorithm achieves the best compression ratios for benchmark DNA sequences. Significantly better compression results show that the approximate repeats are one of the main hidden regularities in DNA sequences. We then describe a theory of measuring the relatedness between two DNA sequences. Using our algorithm, we present strong experimental support for this theory, and demonstrate its application in comparing genomes and constructing evolutionary trees. 1
Significantly Lower Entropy Estimates for Natural DNA Sequences
- Journal of Computational Biology
, 1996
"... If DNA were a random string over its alphabet fA; C; G; Tg, an optimal code would assign 2 bits to each nucleotide. We imagine DNA to be a highly ordered, purposeful molecule, and might therefore reasonably expect statistical models of its string representation to produce much lower entropy estima ..."
Abstract
-
Cited by 28 (1 self)
- Add to MetaCart
If DNA were a random string over its alphabet fA; C; G; Tg, an optimal code would assign 2 bits to each nucleotide. We imagine DNA to be a highly ordered, purposeful molecule, and might therefore reasonably expect statistical models of its string representation to produce much lower entropy estimates. Surprisingly this has not been the case for many natural DNA sequences, including portions of the human genome. We introduce a new statistical model (compression algorithm), the strongest reported to date, for naturally occurring DNA sequences. Conventional techniques code a nucleotide using only slightly fewer bits (1.90) than one obtains by relying only on the frequency statistics of individual nucleotides (1.95). Our method in some cases increases this gap by more than five-fold (1.66) and may lead to better performance in microbiological pattern recognition applications. One of our main contributions, and the principle source of these improvements, is the formal inclusion of inexac...
DNACompress: fast and effective DNA sequence compression. Bioinformatics 18:1696–1698
, 2002
"... Summary: While achieving the best compression ratios for DNA sequences, our new DNACompress program significantly improves the running time of all previous DNA compression programs. ..."
Abstract
-
Cited by 22 (1 self)
- Add to MetaCart
Summary: While achieving the best compression ratios for DNA sequences, our new DNACompress program significantly improves the running time of all previous DNA compression programs.
Detection of significant patterns by compression algorithms: the case of Approximate Tandem Repeats in DNA sequences.
, 1997
"... We use compression algorithms to analyse genetic sequences. The basic idea is that a compression algorithm is associated with a property. The more a sequence is compressed by the algorithm, the more significant is the property for that sequence. Here we present an algorithm to detect a particular ty ..."
Abstract
-
Cited by 18 (3 self)
- Add to MetaCart
We use compression algorithms to analyse genetic sequences. The basic idea is that a compression algorithm is associated with a property. The more a sequence is compressed by the algorithm, the more significant is the property for that sequence. Here we present an algorithm to detect a particular type of dosDNA (Defined Ordered Sequence-DNA): approximate tandem repeats of small motifs (i.e. of lengths ! 4). This algorithm has been experimented over four yeast chromosomes. The presence of approximate tandem repeats seems to be a uniform structural property of yeast chromosomes. The algorithms in C are available by the World Wide Web (URL: http://www.lifl.fr/ rivals/Doc/RTA/ ). 3 Introduction A compression algorithm detects significant patterns in a text, if it encodes such patterns and achieves by the way a concise description of the whole text. The shorter the output description is, the more significant the patterns are. Consequently for a given text, the significance of the detect...
Improving Table Compression with Combinatorial Optimization
, 2002
"... We study the problem of compressing massive tables within the partition-training paradigm introduced by Buchsbaum et al. [SODA'00], in which a table is partitioned by an off-line training procedure into disjoint intervals of columns, each of which is compressed separately by a standard, on-line comp ..."
Abstract
-
Cited by 14 (2 self)
- Add to MetaCart
We study the problem of compressing massive tables within the partition-training paradigm introduced by Buchsbaum et al. [SODA'00], in which a table is partitioned by an off-line training procedure into disjoint intervals of columns, each of which is compressed separately by a standard, on-line compressor like gzip. We provide a new theory that unifies previous experimental observations on partitioning and heuristic observations on column permutation, all of which are used to improve compression rates. Based on the theory, we devise the first on-line training algorithms for table compression, which can be applied to individual files, not just continuously operating sources; and also a new, off-line training algorithm, based on a link to the asymmetric traveling salesman problem, which improves on prior work by rearranging columns prior to partitioning. We demonstrate these results experimentally. On various test files, the on-line algorithms provide 35--55% improvement over gzip with negligible slowdown; the off-line reordering provides up to 20% further improvement over partitioning alone. We also show that a variation of the table compression problem is MAX-SNP hard.
Implementing the Context Tree Weighting Method for Text Compression
- In Data Compression Conference
, 2000
"... Context tree weighting method is a universal compression algorithm for FSMX sources. Though we expect that it will have good compression ratio in practice, it is difficult to implement it and in many cases the implementation is only for estimating compression ratio. Though Willems and Tjalkens showe ..."
Abstract
-
Cited by 13 (2 self)
- Add to MetaCart
Context tree weighting method is a universal compression algorithm for FSMX sources. Though we expect that it will have good compression ratio in practice, it is difficult to implement it and in many cases the implementation is only for estimating compression ratio. Though Willems and Tjalkens showed practical implementation using not block probabilities but conditional probabilities, it is used for only binary alphabet sequences. We extend the method for multi-alphabet sequences and show a simple implementation using PPM techniques. We also propose a method to optimize a parameter of the context tree weighting for binary alphabet case. Experimental results on texts and DNA sequences show that the performance of PPM can be improved by combining the context tree weighting and that DNA sequences can be compressed in less than 2.0 bpc.
Biological Sequence Compression Algorithms
- Genome Informatics
, 2000
"... Today, more and more DNA sequences are becoming available. The information about DNA sequences are stored in molecular biology databases. The size and importance of these databases will be bigger and bigger in the future, therefore this information must be stored or communicated efficiently. Further ..."
Abstract
-
Cited by 13 (0 self)
- Add to MetaCart
Today, more and more DNA sequences are becoming available. The information about DNA sequences are stored in molecular biology databases. The size and importance of these databases will be bigger and bigger in the future, therefore this information must be stored or communicated efficiently. Furthermore, sequence compression can be used to define similarities between biological sequenices. The standard compression algorithms such as gzip or compress cannot compress DNA sequences but only expand them in size. On the other hand CTW (Context Tree Weighting Method) can compress DNA sequences less than two bits per symbol. These algorithms do nJ use special structures of biological sequencal Two characteristic structures of DNAsequen.# are kn wn On is calledpalinzLqqq or reverse complemen ts an the other structure is approximate repeats. Several specific algorithms for DNA sequenNz that use these structurescan compress them lessthan two bits per symbol. In this paper, we improve the CTW so that characteristic structures of DNAsequenzO are available. Beforeeno din the neJ symbol, the algorithm searchesan approximate repeat an palinNLFJ usin hashan dynJLx programminq If there is apalinJwq. oran approximate repeat withenhJL lenJL then our algorithmrepresen ts it withlenJq an distanNF By using this preprocessing anL program achieves a little higher compression ratio than that of existing DNA-oriented compression algorithms. We also describe new compression algorithm for protein sequences.
DNA-Based Cryptography
- 5th DIMACS Workshop on DNA Based Computers, MIT
, 1999
"... Recent research has considered DNA as a medium for ultra-scale computation and for ultracompact information storage. One potential key application is DNA-based, molecular cryptography systems. We present some procedures for DNA-based cryptography based on one-time-pads that are in principle unbre ..."
Abstract
-
Cited by 12 (4 self)
- Add to MetaCart
Recent research has considered DNA as a medium for ultra-scale computation and for ultracompact information storage. One potential key application is DNA-based, molecular cryptography systems. We present some procedures for DNA-based cryptography based on one-time-pads that are in principle unbreakable. Practical applications of cryptographic systems based on onetime -pads are limited in conventional electronic media by the size of the one-time-pad; however DNA provides a much more compact storage medium, and an extremely small amount of DNA su#ces even for huge one-time-pads. We detail procedures for two DNA one-time-pad encryption schemes: (i) a substitution method using libraries of distinct pads, each of which defines a specific, randomly generated, pair-wise mapping; and (ii) an XOR scheme utilizing molecular computation and indexed, random key strings. These methods can be applied either for the encryption of natural DNA or for artificial DNA encoding binary data. In the latter case, we also present a novel use of chip-based DNA micro-array technology for 2D data input and output.

