Clustering by compression
- IEEE Transactions on Information Theory, 2005
Cited by 120 (12 self)
Abstract: We present a new method for clustering based on compression. The method does not use subject-specific features or background knowledge, and works as follows: First, we determine a parameter-free, universal, similarity distance, the normalized compression distance or NCD, computed from the lengths of compressed data files (singly and in pairwise concatenation). Second, we apply a hierarchical clustering method. The NCD is not restricted to a specific application area, and works across application area boundaries. A theoretical precursor, the normalized information distance, co-developed by one of the authors, is provably optimal. However, the optimality comes at the price of using the noncomputable notion of Kolmogorov complexity. We propose axioms to capture the real-world setting, and show that the NCD approximates optimality. To extract a hierarchy of clusters from the distance matrix, we determine a dendrogram (ternary tree) by a new quartet method and a fast heuristic to implement it. The method is implemented and available as public software, and is robust under choice of different compressors. To substantiate our claims of universality and robustness, we report evidence of successful application in areas as diverse as genomics, virology, languages, literature, music, handwritten digits, astronomy, and combinations of objects from completely different domains, using statistical, dictionary, and block sorting compressors. In genomics, we present new evidence for major questions in Mammalian evolution, based on whole-mitochondrial genomic analysis: the Eutherian orders and the Marsupionta hypothesis against the Theria hypothesis.
Index Terms: Heterogeneous data analysis, hierarchical unsupervised clustering, Kolmogorov complexity, normalized compression distance, parameter-free data mining, quartet tree method, universal dissimilarity distance.
The miraculous universal distribution
- Mathematical Intelligencer, 1997
Cited by 15 (3 self)
scientific hypothesis formulated? How does one choose one hypothesis over another? It may be surprising that questions such as these are still discussed. Even more surprising, perhaps, is the fact that the discussion is still moving forward, that new ideas are still being added to the debate. Certainly most surprising of all is the fact that, over the last thirty years or so, the normally concrete field of computer science has provided fundamental new insights. Scientists engage in what is usually called inductive reasoning. Inductive reasoning entails making predictions about future behavior based on past observations. But defining the proper method of formulating such predictions has occupied philosophers throughout the ages. In fact, the British philosopher David Hume (1711–1776) argued convincingly that in some sense proper induction is impossible [3]. It is impossible because we can only reach conclusions by using known data and methods. Therefore, the conclusion is logically already contained in the start configuration. Consequently, the only form of induction possible is deduction. Philosophers have tried to find
A New Quartet Tree Heuristic for Hierarchical Clustering
- EU-PASCAL Statistics and Optimization of Clustering Workshop, 5-6 July 2005, 2006
Cited by 6 (3 self)
We consider the problem of constructing an optimal-weight tree from the $3\binom{n}{4}$ weighted quartet topologies on n objects, where optimality means that the summed weight of the embedded quartet topologies is optimal (so it can be the case that the optimal tree embeds all quartets as non-optimal topologies). We present a heuristic for reconstructing the optimal-weight tree, and a canonical manner to derive the quartet-topology weights from a given distance matrix. The method repeatedly transforms a bifurcating tree, with all objects involved as leaves, achieving a monotonic approximation to the exact single globally optimal tree. This contrasts with other heuristic search methods from biological phylogeny, like DNAML or quartet puzzling, which repeatedly and incrementally construct a solution from a random order of objects, and subsequently add agreement values. We do not assume that there exists a true bifurcating supertree that embeds each quartet in the optimal topology, or represents the distance matrix faithfully, not even under the assumption that the weights or distances are corrupted by a measuring process. Our aim is to hierarchically cluster the input data as faithfully as possible, both phylogenetic data and data of completely different types. In our experiments with natural data, like genomic data, texts or music, the global optimum appears to be reached. Our method is capable of handling over 100 objects, possibly up to 1000 objects, while no existing quartet heuristic can computationally approximate the exact optimal solution of a quartet tree of more than about 20–30 objects without running for years. The method is implemented and available as public software.
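To make the count of quartet topologies concrete, the sketch below enumerates all three pairings of every four-object subset and assigns each a cost from a distance matrix. The cost rule used here, d(u,v) + d(w,x) for the topology uv|wx, is an assumption of this sketch about how weights might be derived from distances; the function name and the sample matrix are hypothetical.

```python
from itertools import combinations
from math import comb


def quartet_costs(dist):
    """Map each quartet topology ((u, v), (w, x)) to the cost d(u,v) + d(w,x)."""
    costs = {}
    for a, b, c, d in combinations(range(len(dist)), 4):
        # The three ways to split four leaves into two pairs.
        pairings = (((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c)))
        for (u, v), (w, x) in pairings:
            costs[((u, v), (w, x))] = dist[u][v] + dist[w][x]
    return costs


# Hypothetical symmetric distance matrix on 5 objects (zero diagonal).
dist = [
    [0.0, 0.2, 0.9, 0.8, 0.9],
    [0.2, 0.0, 0.8, 0.9, 0.8],
    [0.9, 0.8, 0.0, 0.3, 0.4],
    [0.8, 0.9, 0.3, 0.0, 0.3],
    [0.9, 0.8, 0.4, 0.3, 0.0],
]
costs = quartet_costs(dist)
print(len(costs), 3 * comb(len(dist), 4))
```

For n objects the dictionary holds 3 * C(n, 4) entries, which grows as n^4 and is one reason exhaustive approaches become impractical for large n.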
Clustering
- 2009
Cited by 2 (0 self)
The problem is to construct an optimal weight tree from the $3\binom{n}{4}$ weighted quartet topologies on n objects, where optimality means that the summed weight of the embedded quartet topologies is optimal (so it can be the case that the optimal tree embeds all quartets as nonoptimal topologies). We present a Monte Carlo heuristic, based on randomized hill climbing, for approximating the optimal weight tree, given the quartet topology weights. The method repeatedly transforms a bifurcating tree, with all objects involved as leaves, achieving a monotonic approximation to the exact single globally optimal tree. The method has been extensively used for general hierarchical clustering of non-treelike (non-phylogeny) data in various domains and across domains with heterogeneous data, and is implemented and available as part of the CompLearn package. We compare performance and running time with those of UPGMA, BioNJ, and NJ, as implemented in the SplitsTree package, on genomic data for which the latter are optimized.
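The randomized hill climbing idea can be sketched on a deliberately restricted version of the problem. The assumptions here are substantial: the tree shape is fixed to a caterpillar (so a tree is just a leaf ordering), the only mutation is a swap of two leaves, the cost of topology uv|wx is taken to be d(u,v) + d(w,x), and lower summed cost is treated as better. The method described above transforms the tree itself, so this is an illustration of the accept-only-improvements loop, not of the CompLearn implementation; all names are hypothetical.

```python
import random
from itertools import combinations


def tree_cost(order, dist):
    """Summed cost of the quartet topologies embedded in a caterpillar tree.

    `order` lists the objects in leaf order along the caterpillar. Four leaves
    taken in that order, (a, b, c, d), are embedded as ab|cd, costing
    d(a,b) + d(c,d) under this sketch's cost rule.
    """
    return sum(dist[a][b] + dist[c][d] for a, b, c, d in combinations(order, 4))


def hill_climb(dist, steps=500, seed=0):
    """Randomized hill climbing over leaf orders; accepts only improvements."""
    rng = random.Random(seed)
    order = list(range(len(dist)))
    rng.shuffle(order)
    best = tree_cost(order, dist)
    history = [best]
    for _ in range(steps):
        i, j = rng.sample(range(len(order)), 2)
        order[i], order[j] = order[j], order[i]      # propose a leaf swap
        cost = tree_cost(order, dist)
        if cost < best:
            best = cost                              # keep the improvement
        else:
            order[i], order[j] = order[j], order[i]  # undo the swap
        history.append(best)
    return order, best, history


# Hypothetical data: points on a line forming three loose groups.
points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 9.0, 9.1]
dist = [[abs(p - q) for q in points] for p in points]
order, best, history = hill_climb(dist)
print(order, best)
```

Because a proposal is kept only when it lowers the cost, the recorded history never increases, which is the sense in which such a search approximates the optimum monotonically. It can still stall in a local optimum, which is why restarts or richer mutations are commonly used.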
Tolstoy's Mathematics in "War and Peace"
Introduction. It is interesting to consider the excursions of mathematicians and scientists into prose and poetry and, conversely and less well known, the explorations of poets and novelists into mathematics. An example of the first is Luitzen E.J. Brouwer's excursion into literature and environmentalism [1], an appeal avant la lettre to save the earth's natural environment from human pollution. In particular he wants to abolish the technology that enables man's supremacy over nature and the physics and mathematics that make this possible. Only pure ('intuitionistic') mathematics, which by its nature is unapplied and inapplicable for evil purposes, and which is the ultimate creation of the noble mind, should be saved. In another direction, the great Russian mathematician Andrei N. Kolmogorov was particularly interested in the form and structure of the poetry of the Russian author Pushkin [3]. He also remarks [4]: "what real meaning is there, for example, in asking how much information is cont

