Results 1 - 5 of 5
Clustering by compression
- IEEE Transactions on Information Theory, 2005
Cited by 120 (12 self)
Abstract: We present a new method for clustering based on compression. The method does not use subject-specific features or background knowledge, and works as follows: First, we determine a parameter-free, universal, similarity distance, the normalized compression distance or NCD, computed from the lengths of compressed data files (singly and in pairwise concatenation). Second, we apply a hierarchical clustering method. The NCD is not restricted to a specific application area, and works across application area boundaries. A theoretical precursor, the normalized information distance, co-developed by one of the authors, is provably optimal. However, the optimality comes at the price of using the noncomputable notion of Kolmogorov complexity. We propose axioms to capture the real-world setting, and show that the NCD approximates optimality. To extract a hierarchy of clusters from the distance matrix, we determine a dendrogram (ternary tree) by a new quartet method and a fast heuristic to implement it. The method is implemented and available as public software, and is robust under choice of different compressors. To substantiate our claims of universality and robustness, we report evidence of successful application in areas as diverse as genomics, virology, languages, literature, music, handwritten digits, astronomy, and combinations of objects from completely different domains, using statistical, dictionary, and block sorting compressors. In genomics, we present new evidence for major questions in Mammalian evolution, based on whole-mitochondrial genomic analysis: the Eutherian orders and the Marsupionta hypothesis against the Theria hypothesis.
Index Terms: Heterogeneous data analysis, hierarchical unsupervised clustering, Kolmogorov complexity, normalized compression distance, parameter-free data mining, quartet tree method, universal dissimilarity distance.
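As a rough illustration of the first step described in this abstract, the sketch below computes a compression-based distance from the compressed lengths of two inputs, singly and concatenated, using the usual NCD definition NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)). The function names `compressed_len`, `ncd` and `ncd_matrix` are hypothetical, and zlib (dictionary) and bz2 (block sorting) from the Python standard library are stand-ins chosen for convenience, not the compressors or software used by the authors. Setting the diagonal of the matrix to zero is a simplifying assumption.

```python
import bz2
import zlib


def compressed_len(data: bytes, compressor=zlib.compress) -> int:
    """Length in bytes of the compressed form of `data`."""
    return len(compressor(data))


def ncd(x: bytes, y: bytes, compressor=zlib.compress) -> float:
    """Normalized compression distance between two byte strings."""
    cx = compressed_len(x, compressor)
    cy = compressed_len(y, compressor)
    cxy = compressed_len(x + y, compressor)  # pairwise concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


def ncd_matrix(items, compressor=zlib.compress):
    """Pairwise distance matrix; the diagonal is set to 0.0 by assumption."""
    n = len(items)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = ncd(items[i], items[j], compressor)
            dist[i][j] = d
            dist[j][i] = d  # mirrored, so the matrix is symmetric by construction
    return dist


if __name__ == "__main__":
    english = b"the quick brown fox jumps over the lazy dog " * 50
    digits = b"0123456789" * 200
    print(ncd(english, english), ncd(english, digits))
    print(ncd(english, digits, compressor=bz2.compress))
```

A matrix produced this way could then be handed to any hierarchical clustering routine; the quartet method mentioned in the abstract is one such choice.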
A New Quartet Tree Heuristic for Hierarchical Clustering
- EUPASCAL Statistics and Optimization of Clustering Workshop, 5-6 July 2005, 2006
Cited by 6 (3 self)
We consider the problem of constructing an optimal-weight tree from the $3\binom{n}{4}$ weighted quartet topologies on n objects, where optimality means that the summed weight of the embedded quartet topologies is optimal (so it can be the case that the optimal tree embeds all quartets as non-optimal topologies). We present a heuristic for reconstructing the optimal-weight tree, and a canonical manner to derive the quartet-topology weights from a given distance matrix. The method repeatedly transforms a bifurcating tree, with all objects involved as leaves, achieving a monotonic approximation to the exact single globally optimal tree. This contrasts with other heuristic search methods from biological phylogeny, such as DNAML or quartet puzzling, which repeatedly and incrementally construct a solution from a random order of objects and subsequently add agreement values. We do not assume that there exists a true bifurcating supertree that embeds each quartet in the optimal topology, or represents the distance matrix faithfully, not even under the assumption that the weights or distances are corrupted by a measuring process. Our aim is to hierarchically cluster the input data as faithfully as possible, both phylogenetic data and data of completely different types. In our experiments with natural data, like genomic data, texts or music, the global optimum appears to be reached. Our method is capable of handling over 100 objects, possibly up to 1000 objects, while no existing quartet heuristic can computationally approximate the exact optimal solution of a quartet tree of more than about 20–30 objects without running for years. The method is implemented and available as public software.
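To make the count of quartet topologies concrete, the sketch below enumerates, for every set of four objects, the three ways of splitting it into two pairs, which gives 3 * C(n, 4) topologies in total. The abstract does not spell out how weights are derived from the distance matrix; the cost used here, d(u, v) + d(w, x) for the topology uv|wx, is an assumption made for illustration, and the names `quartet_topologies` and `topology_costs` are hypothetical.

```python
from itertools import combinations
from math import comb


def quartet_topologies(n):
    """Yield every quartet topology uv|wx on objects 0..n-1 as ((u, v), (w, x))."""
    for a, b, c, d in combinations(range(n), 4):
        # Each set of four objects can be paired up in exactly three ways.
        yield ((a, b), (c, d))
        yield ((a, c), (b, d))
        yield ((a, d), (b, c))


def topology_costs(dist):
    """Map each topology to an assumed cost: the sum of its two within-pair distances."""
    n = len(dist)
    return {
        ((u, v), (w, x)): dist[u][v] + dist[w][x]
        for (u, v), (w, x) in quartet_topologies(n)
    }


if __name__ == "__main__":
    n = 5
    assert sum(1 for _ in quartet_topologies(n)) == 3 * comb(n, 4)

    # Toy distance matrix: objects 0 and 1 are close, as are objects 2 and 3.
    dist = [
        [0, 1, 4, 4],
        [1, 0, 4, 4],
        [4, 4, 0, 1],
        [4, 4, 1, 0],
    ]
    costs = topology_costs(dist)
    print(min(costs, key=costs.get))
```

Under this assumed cost, the cheapest topology for the toy matrix groups the two close pairs together, which is the behaviour a tree search over summed quartet costs would try to reward.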
Visualizing differences in phylogenetic information content of alignments and distinction of three classes of long-branch effects
- BMC Evolutionary Biology, 2007
BMC Evolutionary Biology BioMed Central
, 2006
Research article Short-wavelength sensitive opsin (SWS1) as a new marker for vertebrate phylogenetics
Detecting the symplesiomorphy trap: a multigene phylogenetic analysis of terebelliform annelids
Background: For phylogenetic reconstructions, conflict in signal is a potential problem for tree reconstruction. For instance, molecular data from different cellular components, such as the mitochondrion and nucleus, may be inconsistent with each other. Mammalian studies provide one such case of conflict where mitochondrial data, which display compositional biases, support the Marsupionta hypothesis, but nuclear data confirm the Theria hypothesis. Most observations of compositional biases in tree reconstruction have focused on lineages with different composition than the majority of the lineages under analysis. However, in some situations, the position of taxa that lack compositional bias may be influenced rather than the position of taxa that possess compositional bias. This situation is due to apparent symplesiomorphic characters and is known as “the symplesiomorphy trap”. Results: Herein, we report an example of the symplesiomorphy trap and how to detect it. Worms within Terebelliformia (sensu Rouse & Pleijel 2001) are mainly tube-dwelling annelids comprising five ‘families’: Alvinellidae, Ampharetidae, Terebellidae, Trichobranchidae and Pectinariidae. Using mitochondrial genomic data, as well as data from the nuclear 18S, 28S rDNA and elongation factor-1α genes, we revealed incongruence between mitochondrial and nuclear data regarding the placement of Trichobranchidae. Mitochondrial data favored a sister relationship between Terebellidae and Trichobranchidae, but nuclear data placed Trichobranchidae as sister to an

