Results 1–10 of 10
Clustering by compression
IEEE Transactions on Information Theory, 2005
Abstract

Cited by 183 (23 self)
Abstract—We present a new method for clustering based on compression. The method does not use subject-specific features or background knowledge, and works as follows: First, we determine a parameter-free, universal similarity distance, the normalized compression distance or NCD, computed from the lengths of compressed data files (singly and in pairwise concatenation). Second, we apply a hierarchical clustering method. The NCD is not restricted to a specific application area, and works across application area boundaries. A theoretical precursor, the normalized information distance, co-developed by one of the authors, is provably optimal. However, the optimality comes at the price of using the noncomputable notion of Kolmogorov complexity. We propose axioms to capture the real-world setting, and show that the NCD approximates optimality. To extract a hierarchy of clusters from the distance matrix, we determine a dendrogram (ternary tree) by a new quartet method and a fast heuristic to implement it. The method is implemented and available as public software, and is robust under choice of different compressors. To substantiate our claims of universality and robustness, we report evidence of successful application in areas as diverse as genomics, virology, languages, literature, music, handwritten digits, astronomy, and combinations of objects from completely different domains, using statistical, dictionary, and block-sorting compressors. In genomics, we presented new evidence for major questions in mammalian evolution, based on whole-mitochondrial genomic analysis: the Eutherian orders and the Marsupionta hypothesis against the Theria hypothesis. Index Terms—Heterogeneous data analysis, hierarchical unsupervised clustering, Kolmogorov complexity, normalized compression distance, parameter-free data mining, quartet tree method, universal dissimilarity distance.
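As a concrete illustration of the distance the abstract describes, the NCD of two strings can be approximated with any off-the-shelf real-world compressor. The sketch below uses Python's zlib; the choice of compressor and compression level is an assumption for illustration, not the paper's fixed setting:

```python
import zlib

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance:
    (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C(.) is the compressed length under a real compressor (here zlib)."""
    cx = len(zlib.compress(x, 9))
    cy = len(zlib.compress(y, 9))
    cxy = len(zlib.compress(x + y, 9))
    return (cxy - min(cx, cy)) / max(cx, cy)
```

Values near 0 indicate highly similar objects and values near 1 dissimilar ones; because practical compressors are imperfect, the result can fall slightly outside [0, 1].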
Kolmogorov’s structure functions and model selection
 IEEE Transactions on Information Theory
Abstract

Cited by 33 (13 self)
approach to statistics and model selection. Let data be finite binary strings and models be finite sets of binary strings. Consider model classes consisting of models of given maximal (Kolmogorov) complexity. The “structure function” of the given data expresses the relation between the complexity-level constraint on a model class and the least log-cardinality of a model in the class containing the data. We show that the structure function determines all stochastic properties of the data: for every constrained model class it determines the individual best-fitting model in the class, irrespective of whether the “true” model is in the model class considered or not. In this setting, this happens with certainty, rather than with high probability as in the classical case. We precisely quantify the goodness-of-fit of an individual model with respect to individual data. We show that—within the obvious constraints—every graph is realized by the structure function of some data. We determine the (un)computability properties of the various functions contemplated and of the “algorithmic minimal sufficient statistic.” Index Terms—Constrained minimum description length (MDL), constrained maximum likelihood (ML), constrained best-fit model selection, computability, lossy compression, minimal sufficient statistic, nonprobabilistic statistics, Kolmogorov complexity, Kolmogorov structure function, prediction, sufficient statistic.
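The structure function sketched in this abstract is usually written as follows, where S ranges over finite sets of binary strings (the models) and K(S) denotes the Kolmogorov complexity of S; this is the standard notation of the literature, not a formula quoted from the listing:

```latex
h_x(\alpha) \;=\; \min_{S}\bigl\{\, \log |S| \;:\; x \in S,\ K(S) \le \alpha \,\bigr\}
```

That is, h_x(α) is the least log-cardinality of a model containing x among all models whose complexity does not exceed the constraint α.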
The miraculous universal distribution
Mathematical Intelligencer, 1997
Abstract

Cited by 20 (3 self)
scientific hypothesis formulated? How does one choose one hypothesis over another? It may be surprising that questions such as these are still discussed. Even more surprising, perhaps, is the fact that the discussion is still moving forward, that new ideas are still being added to the debate. Certainly most surprising of all is the fact that, over the last thirty years or so, the normally concrete field of computer science has provided fundamental new insights. Scientists engage in what is usually called inductive reasoning. Inductive reasoning entails making predictions about future behavior based on past observations. But defining the proper method of formulating such predictions has occupied philosophers throughout the ages. In fact, the British philosopher David Hume (1711–1776) argued convincingly that in some sense proper induction is impossible [3]. It is impossible because we can only reach conclusions by using known data and methods. Therefore, the conclusion is logically already contained in the start configuration. Consequently, the only form of induction possible is deduction. Philosophers have tried to find
A New Quartet Tree Heuristic for Hierarchical Clustering
 arXiv:cs/0606048, 2006
Abstract

Cited by 12 (3 self)
We consider the problem of constructing an optimal-weight tree from the 3·(n choose 4) weighted quartet topologies on n objects, where optimality means that the summed weight of the embedded quartet topologies is optimal (so it can be the case that the optimal tree embeds all quartets as nonoptimal topologies). We present a heuristic for reconstructing the optimal-weight tree, and a canonical manner to derive the quartet-topology weights from a given distance matrix. The method repeatedly transforms a bifurcating tree, with all objects involved as leaves, achieving a monotonic approximation to the exact single globally optimal tree. This contrasts with other heuristic search methods from biological phylogeny, like DNAML or quartet puzzling, which repeatedly and incrementally construct a solution from a random order of objects, and subsequently add agreement values. We do not assume that there exists a true bifurcating supertree that embeds each quartet in the optimal topology, or represents the distance matrix faithfully—not even under the assumption that the weights or distances are corrupted by a measuring process. Our aim is to hierarchically cluster the input data as faithfully as possible, both phylogenetic data and data of completely different types. In our experiments with natural data, like genomic data, texts, or music, the global optimum appears to be reached. Our method is capable of handling over 100 objects, possibly up to 1000 objects, while no existing quartet heuristic can computationally approximate the exact optimal solution of a quartet tree of more than about 20–30 objects without running for years. The method is implemented and available as public software.
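The objects of this search are the 3·(n choose 4) quartet topologies: every 4-set {a, b, c, d} admits exactly three pairings, ab|cd, ac|bd, ad|bc. The sketch below derives topology costs from a distance matrix using the additive rule cost(ab|cd) = d(a,b) + d(c,d); this rule is one natural choice, assumed here for illustration rather than quoted from the paper's canonical derivation:

```python
from itertools import combinations

def quartet_costs(d, quartet):
    """Cost of each of the three pairings of a 4-set: the topology ab|cd
    pairs a with b and c with d, and its (assumed) cost is d[a][b] + d[c][d]."""
    a, b, c, e = quartet
    return {
        ((a, b), (c, e)): d[a][b] + d[c][e],
        ((a, c), (b, e)): d[a][c] + d[b][e],
        ((a, e), (b, c)): d[a][e] + d[b][c],
    }

def all_quartet_costs(d):
    """All 3 * C(n, 4) weighted quartet topologies for an n x n distance matrix."""
    n = len(d)
    return {t: c for q in combinations(range(n), 4)
            for t, c in quartet_costs(d, q).items()}
```

For n = 6 objects this yields 3·C(6, 4) = 45 weighted topologies, matching the count in the abstract.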
Clustering, 2009
Abstract

Cited by 2 (0 self)
The problem is to construct an optimal-weight tree from the 3·(n choose 4) weighted quartet topologies on n objects, where optimality means that the summed weight of the embedded quartet topologies is optimal (so it can be the case that the optimal tree embeds all quartets as nonoptimal topologies). We present a Monte Carlo heuristic, based on randomized hill climbing, for approximating the optimal-weight tree, given the quartet-topology weights. The method repeatedly transforms a bifurcating tree, with all objects involved as leaves, achieving a monotonic approximation to the exact single globally optimal tree. The method has been extensively used for general hierarchical clustering of non-tree-like (non-phylogeny) data in various domains and across domains with heterogeneous data, and is implemented and available as part of the CompLearn package. We compare performance and running time with those of UPGMA, BioNJ, and NJ, as implemented in the SplitsTree package, on genomic data for which the latter are optimized.
DOI 10.1007/s00283-012-9342-8
Abstract
Your article is protected by copyright and all rights are held exclusively by Springer Science+Business Media New York. This e-offprint is for personal use only and shall not be self-archived in electronic repositories. If you wish to self-archive your work, please use the accepted author’s version for posting to your own website or your institution’s repository. You may further deposit the accepted author’s version on a funder’s repository at a funder’s request, provided it is not made publicly available until 12 months after publication.
Optimizing Quartet Trees Through Monte Carlo Methods
Abstract
The problem is to construct an optimal-weight tree from the 3·(n choose 4) weighted quartet topologies on n objects, where optimality means that the summed weight of the embedded quartet topologies is optimal (so it can be the case that the optimal tree embeds all quartets as nonoptimal topologies). We present a Monte Carlo heuristic, based on randomized hill climbing, for approximating the optimal-weight tree, given the quartet-topology weights. The method repeatedly transforms a bifurcating tree, with all objects involved as leaves, achieving a monotonic approximation to the exact single globally optimal tree. The method has been extensively used, and is implemented and available as part of the CompLearn package. We compare performance and running time with those of UPGMA, BioNJ, and NJ, as implemented in the SplitsTree package. Index Terms—Evolutionary tree, global optimization, Monte Carlo method, quartet method, randomized hill climbing.
Clustering by Compression
Abstract
Abstract—We present a new method for clustering based on compression. The method does not use subject-specific features or background knowledge, and works as follows: First, we determine a parameter-free, universal similarity distance, the normalized compression distance or NCD, computed from the lengths of compressed data files (singly and in pairwise concatenation). Second, we apply a hierarchical clustering method. The NCD is not restricted to a specific application area, and works across application area boundaries. A theoretical precursor, the normalized information distance, co-developed by one of the authors, is provably optimal. However, the optimality comes at the price of using the noncomputable notion of Kolmogorov complexity. We propose axioms to capture the real-world setting, and show that the NCD approximates optimality. To extract a hierarchy of clusters from the distance matrix, we determine a dendrogram
Tolstoy’s Mathematics in “War and Peace”, 2001
Abstract
views on sociology and history on mathematical and probabilistic views, and he also proposed a mathematical theory of waging war.