Clustering by compression
IEEE Transactions on Information Theory, 2005
Cited by 178 (23 self)
We present a new method for clustering based on compression. The method does not use subject-specific features or background knowledge, and works as follows: First, we determine a parameter-free, universal, similarity distance, the normalized compression distance or NCD, computed from the lengths of compressed data files (singly and in pairwise concatenation). Second, we apply a hierarchical clustering method. The NCD is not restricted to a specific application area, and works across application area boundaries. A theoretical precursor, the normalized information distance, co-developed by one of the authors, is provably optimal. However, the optimality comes at the price of using the noncomputable notion of Kolmogorov complexity. We propose axioms to capture the real-world setting, and show that the NCD approximates optimality. To extract a hierarchy of clusters from the distance matrix, we determine a dendrogram (ternary tree) by a new quartet method and a fast heuristic to implement it. The method is implemented and available as public software, and is robust under choice of different compressors. To substantiate our claims of universality and robustness, we report evidence of successful application in areas as diverse as genomics, virology, languages, literature, music, handwritten digits, astronomy, and combinations of objects from completely different domains, using statistical, dictionary, and block sorting compressors. In genomics, we presented new evidence for major questions in Mammalian evolution, based on whole-mitochondrial genomic analysis: the Eutherian orders and the Marsupionta hypothesis against the Theria hypothesis. Index Terms: Heterogeneous data analysis, hierarchical unsupervised clustering, Kolmogorov complexity, normalized compression distance, parameter-free data mining, quartet tree method, universal dissimilarity distance.
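The abstract describes the NCD as computed from compressed lengths of files taken singly and in pairwise concatenation. A minimal sketch of that idea follows, using the commonly stated form NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)). The choice of zlib as the compressor C and the helper names `compressed_len`, `ncd`, and `distance_matrix` are assumptions made for illustration; this is not the authors' public software, and the quartet-tree clustering step is not shown.

```python
import zlib


def compressed_len(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (stand-in for C(x))."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx = compressed_len(x)
    cy = compressed_len(y)
    cxy = compressed_len(x + y)  # compressed length of the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


def distance_matrix(items: list[bytes]) -> list[list[float]]:
    """Pairwise NCD values, the input a clustering step would consume."""
    return [[ncd(a, b) for b in items] for a in items]
```

With a real compressor the value is only approximately a metric: `ncd(x, x)` is typically small but not exactly zero, and unrelated inputs tend to land near 1.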
A Linear Method for Deviation Detection in Large Databases
1996
Cited by 82 (1 self)
We describe the problem of finding deviations in large databases. Normally, explicit information outside the data, like integrity constraints or predefined patterns, is used for deviation detection. In contrast, we approach the problem from the inside of the data, using the implicit redundancy of the data. We give a formal description of the problem and present a linear algorithm for detecting deviations. Our solution simulates a mechanism familiar to human beings: after seeing a series of similar data, an element disturbing the series is considered an exception. We also present experimental results from the application of this algorithm on real-life datasets showing its effectiveness.
A compression algorithm for DNA sequences and its applications in genome comparison
1999
Cited by 67 (4 self)
We present a lossless compression algorithm, GenCompress, for genetic sequences, based on searching for approximate repeats. Our algorithm achieves the best compression ratios for benchmark DNA sequences. Significantly better compression results show that the approximate repeats are one of the main hidden regularities in DNA sequences. We then describe a theory of measuring the relatedness between two DNA sequences. Using our algorithm, we present strong experimental support for this theory, and demonstrate its application in comparing genomes and constructing evolutionary trees.
COMPLEXITY OF SELF-ASSEMBLED SHAPES
2007
Cited by 60 (4 self)
The connection between self-assembly and computation suggests that a shape can be considered the output of a self-assembly “program,” a set of tiles that fit together to create a shape. It seems plausible that the size of the smallest self-assembly program that builds a shape and the shape’s descriptional (Kolmogorov) complexity should be related. We show that when using a notion of a shape that is independent of scale, this is indeed so: in the tile assembly model, the minimal number of distinct tile types necessary to self-assemble a shape, at some scale, can be bounded both above and below in terms of the shape’s Kolmogorov complexity. As part of the proof, we develop a universal constructor for this model of self-assembly that can execute an arbitrary Turing machine program specifying how to grow a shape. Our result implies, somewhat counterintuitively, that self-assembly of a scaled-up version of a shape often requires fewer tile types. Furthermore, the independence of scale in self-assembly theory appears to play the same crucial role as the independence of running time in the theory of computability. This leads to an elegant formulation of languages of shapes generated by self-assembly. Considering functions from bit strings to shapes, we show that the running-time complexity, with respect to Turing machines, is polynomially equivalent to the scale complexity of the same function implemented via self-assembly by a finite set of tile types. Our results also hold for shapes defined by Wang tiling (where there is no sense of a self-assembly process), except that here time complexity must be measured with respect to nondeterministic Turing machines.
Quantum Coding
Physical Review A, 1995
Cited by 54 (2 self)
The quantum analogues of classical variable-length codes are indeterminate-length quantum codes, in which codewords may exist in superpositions of different lengths. This paper explores some of their properties. The length observable for such codes is governed by a quantum version of the Kraft-McMillan inequality. Indeterminate-length quantum codes also provide an alternate approach to quantum data compression.
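For background, the classical Kraft-McMillan inequality says that the codeword lengths l_1, ..., l_n of any uniquely decodable code over an alphabet of size D satisfy the condition that the sum of D^(-l_i) is at most 1. The sketch below checks only this classical condition. The quantum version mentioned in the abstract is not reproduced here, and the function names `kraft_sum` and `satisfies_kraft` are hypothetical.

```python
from fractions import Fraction


def kraft_sum(lengths, alphabet_size=2):
    """Exact value of the sum of alphabet_size ** (-l) over codeword lengths."""
    return sum(Fraction(1, alphabet_size ** l) for l in lengths)


def satisfies_kraft(lengths, alphabet_size=2):
    """True if the lengths are achievable by some uniquely decodable code."""
    return kraft_sum(lengths, alphabet_size) <= 1
```

For example, the binary lengths 1, 2, 3, 3 (as in the prefix code 0, 10, 110, 111) sum to exactly 1, while the lengths 1, 1, 2 exceed 1, so no uniquely decodable binary code has them. Exact rational arithmetic avoids floating-point rounding at the boundary.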
Kolmogorov complexity and the Recursion Theorem. Manuscript, submitted for publication
2005
Cited by 46 (11 self)
Several classes of diagonally nonrecursive (DNR) functions are characterized in terms of Kolmogorov complexity. In particular, a set of natural numbers A can wtt-compute a DNR function iff there is a nontrivial recursive lower bound on the Kolmogorov complexity of the initial segments of A. Furthermore, A can Turing compute a DNR function iff there is a nontrivial A-recursive lower bound on the Kolmogorov complexity of the initial segments of A. A is PA-complete, that is, A can compute a {0, 1}-valued DNR function, iff A can compute a function F such that F(n) is a string of length n and maximal C-complexity among the strings of length n. A ≥T K iff A can compute a function F such that F(n) is a string of length n and maximal H-complexity among the strings of length n. Further characterizations for these classes are given. The existence of a DNR function in a Turing degree is equivalent to the failure of the Recursion Theorem for this degree; thus the provided results characterize, in terms of Kolmogorov complexity, those Turing degrees that no longer permit the use of the Recursion Theorem.
Complexities for Generalized Models of Self-Assembly
In SODA, 2004
Cited by 38 (4 self)
In this paper, we study the complexity of self-assembly under models that are natural generalizations of the tile self-assembly model. In particular, we extend Rothemund and Winfree’s study of the tile complexity of tile self-assembly [9]. They provided a lower bound of Ω(log N / log log N) on the tile complexity of assembling an N × N square for almost all N. Adleman et al. [1] gave a construction which achieves this bound. We consider whether the tile complexity for self-assembly can be reduced through several natural generalizations of the model. One of our results is a tile set of size O(√(log N)) which assembles an N × N square in a model which allows flexible glue strength between non-equal glues. This result is matched for almost all N by a lower bound dictated by Kolmogorov complexity. For three other generalizations, we show that the Ω(log N / log log N) lower bound applies to N × N squares. At the same time, we demonstrate that there are some other shapes for which these generalizations allow reduced tile sets. Specifically, for thin rectangles with length N and width k, we provide a tighter lower bound of Ω(N^(1/k) / k) for the standard model, yet we also give a construction which achieves O(log N / log log N) complexity in a model in which the temperature of the tile system is adjusted during assembly. We also investigate the problem of verifying whether a given tile system uniquely assembles into a given shape; we show that this problem is NP-hard for three of the generalized models.
Information Distance
1997
Cited by 36 (4 self)
While Kolmogorov complexity is the accepted absolute measure of information content in an individual finite object, a similarly absolute notion is needed for the information distance between two individual objects, for example, two pictures. We give several natural definitions of a universal information metric, based on length of shortest programs for either ordinary computations or reversible (dissipationless) computations. It turns out that these definitions are equivalent up to an additive logarithmic term. We show that the information distance is a universal cognitive similarity distance. We investigate the maximal correlation of the shortest programs involved, the maximal uncorrelation of programs (a generalization of the Slepian-Wolf theorem of classical information theory), and the density properties of the discrete metric spaces induced by the information distances. A related distance measures the amount of nonreversibility of a computation. Using the physical theo...
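For orientation, one of the definitions in this line of work, usually called the max distance, can be written as below. The notation is assumed here rather than taken from the abstract: K(x | y) denotes the length of a shortest program that computes x given y.

```latex
E_1(x, y) = \max\{\, K(x \mid y),\; K(y \mid x) \,\}
```

Because K is noncomputable, this quantity can only be approximated in practice, for instance by substituting the output length of a real compressor.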
Using random sets as oracles
Cited by 34 (15 self)
Let R be a notion of algorithmic randomness for individual subsets of N. We say B is a base for R-randomness if there is a Z ≥T B such that Z is R-random relative to B. We show that the bases for 1-randomness are exactly the K-trivial sets and discuss several consequences of this result. We also show that the bases for computable randomness include every Δ^0_2 set that is not diagonally noncomputable, but no set of PA-degree. As a consequence, we conclude that an n-c.e. set is a base for computable randomness iff it is Turing incomplete.