## A New Quartet Tree Heuristic for Hierarchical Clustering (2006)


Venue: EUPASCAL Statistics and Optimization of Clustering Workshop, 5-6 July 2005

Citations: 11 (3 self)

### BibTeX

@INPROCEEDINGS{Cilibrasi06anew,

author = {Rudi Cilibrasi and Paul M. B. Vitányi},

title = {A New Quartet Tree Heuristic for Hierarchical Clustering},

booktitle = {EUPASCAL Statistics and Optimization of Clustering Workshop, 5-6 July 2005},

year = {2006},

pages = {5--6}

}

### Abstract

We consider the problem of constructing an optimal-weight tree from the $3\binom{n}{4}$ weighted quartet topologies on $n$ objects, where optimality means that the summed weight of the embedded quartet topologies is optimal (so it can be the case that the optimal tree embeds all quartets as non-optimal topologies). We present a heuristic for reconstructing the optimal-weight tree, and a canonical manner to derive the quartet-topology weights from a given distance matrix. The method repeatedly transforms a bifurcating tree, with all objects involved as leaves, achieving a monotonic approximation to the exact single globally optimal tree. This contrasts with other heuristic search methods from biological phylogeny, like DNAML or quartet puzzling, which repeatedly and incrementally construct a solution from a random order of objects, and subsequently add agreement values. We do not assume that there exists a true bifurcating supertree that embeds each quartet in the optimal topology, or represents the distance matrix faithfully, not even under the assumption that the weights or distances are corrupted by a measuring process. Our aim is to hierarchically cluster the input data as faithfully as possible, both phylogenetic data and data of completely different types. In our experiments with natural data, like genomic data, texts or music, the global optimum appears to be reached. Our method is capable of handling over 100 objects, possibly up to 1000 objects, while no existing quartet heuristic can computationally approximate the exact optimal solution of a quartet tree of more than about 20–30 objects without running for years. The method is implemented and available as public software.
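To make the objective in the abstract concrete, here is a minimal Python sketch of scoring one candidate tree against a distance matrix. The abstract does not state the weight formula, so two things here are assumptions: the cost of a topology uv|wx is taken to be d(u,v) + d(w,x), and the tree is rated by a normalized score (M - C_T)/(M - m), where C_T is the summed cost of the embedded topologies and M, m are the worst and best achievable sums. All function names are hypothetical, and the search heuristic itself (the repeated tree transformations) is not shown.

```python
from collections import deque
from itertools import combinations
from math import comb


def num_quartet_topologies(n):
    """Each 4-subset of n objects has 3 possible pairings uv|wx."""
    return 3 * comb(n, 4)


def pairings(quartet):
    """The three topologies ab|cd, ac|bd, ad|bc of a 4-tuple."""
    a, b, c, d = quartet
    return [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]


def pair_sum(d, topology):
    """d(u, v) + d(w, x) for the topology uv|wx (assumed cost formula)."""
    (u, v), (w, x) = topology
    return d[u][v] + d[w][x]


def leaf_hop_distances(adj, leaves):
    """Edge-count distances from every leaf to every node, by BFS."""
    table = {}
    for start in leaves:
        seen = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in seen:
                    seen[nxt] = seen[node] + 1
                    queue.append(nxt)
        table[start] = seen
    return table


def normalized_score(dist, adj, leaves):
    """Assumed score (M - C_T) / (M - m); 1.0 means every quartet is optimal."""
    hops = leaf_hop_distances(adj, leaves)
    total = best = worst = 0.0
    for quartet in combinations(leaves, 4):
        tops = pairings(quartet)
        costs = [pair_sum(dist, t) for t in tops]
        # In a tree, the embedded pairing has the smallest path-length sum.
        embedded = min(range(3), key=lambda i: pair_sum(hops, tops[i]))
        total += costs[embedded]
        best += min(costs)
        worst += max(costs)
    return 1.0 if worst == best else (worst - total) / (worst - best)


# Two unrooted binary trees on leaves 0..4; "x", "y", "z" are internal nodes.
leaves = [0, 1, 2, 3, 4]
tree_a = {0: ["x"], 1: ["x"], 2: ["y"], 3: ["z"], 4: ["z"],
          "x": [0, 1, "y"], "y": ["x", 2, "z"], "z": ["y", 3, 4]}
tree_b = {0: ["x"], 2: ["x"], 1: ["y"], 3: ["z"], 4: ["z"],
          "x": [0, 2, "y"], "y": ["x", 1, "z"], "z": ["y", 3, 4]}

# Toy distance matrix generated from tree_a itself, so tree_a fits it exactly.
hops_a = leaf_hop_distances(tree_a, leaves)
dist = [[hops_a[i][j] for j in leaves] for i in leaves]

score_a = normalized_score(dist, tree_a, leaves)
score_b = normalized_score(dist, tree_b, leaves)
print(num_quartet_topologies(5), score_a, score_b)
```

Because the toy distances are read off `tree_a`, that tree embeds every quartet in its cheapest topology, while `tree_b` (leaves 1 and 2 swapped) does not. A search heuristic of the kind the abstract describes would compare such scores between a tree and its mutated variant.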

### Citations

2286 | The neighbor-joining method: A new method for reconstructing phylogenetic trees. Molecular Biology and Evolution 4: 406–425 - Saitou, Nei - 1987 |

1681 | An Introduction to Kolmogorov Complexity and its Applications
- Li, Vitányi
- 1993
Citation Context ...stribution $p(k) = c/(k(\log k)^2)$ is as close to the edge as is reasonable, and because the used coding $x \to x''$ is a prefix code we have $\sum 1/(k(\log k)^2) \le 1$ by the Kraft Inequality (see for example [29]) and therefore $c \ge 1$. Let us see what this means for our algorithm using the chosen distribution $p(k)$. For N = 64, say, we can change any tree in T to any other tree in T with a 64-mutation. The pro... |

955 |
Evolutionary trees from DNA sequences : A maximum likelihood approach
- Felsenstein
- 1981
Citation Context ...) Incrementally grow the tree in random order by stepwise addition of objects in the current optimal way, repeat this for different object orders, and add agreement values on the branches, like DNAML [15], or quartet puzzling [38]. (ii) Approximate the global optimum monotonically or compute it, using geometric algorithm or dynamic programming [3], and linear programming [41]. These methods, other met... |

201 |
Nonmetric multidimensional scaling: a numerical method
- Kruskal
- 1964
Citation Context ...tortion to be minimized, [14]. Let the original set of distances be $d_1, \ldots, d_k$ and the projected distances be $d'_1, \ldots, d'_k$. In Figure 13 we used the distortion measure Kruskal's stress-1, [22], which minimizes $\sqrt{\sum_{i \le k} (d_i - d'_i)^2 / \sum_{i \le k} d_i^2}$. Kruskal's stress-1 equal to 0 means no distortion, and the worst value is at most 1 (unless you have a really bad projection). In the projection... |

192 |
The recovery of trees from measures of dissimilarity
- Buneman
- 1971
Citation Context ...0) = 1. In that case, the MQTC Problem reduces to finding that $T_0$. However, the situation turns out to be more complex. Note first that the set of quartet topologies uniquely determines a tree in $\mathcal{T}$, [6]. Lemma 4.6 Let $T, T'$ be different labeled trees in $\mathcal{T}$ and let $Q_T, Q_{T'}$ be the sets of embedded quartet topologies, respectively. Then, $Q_T \neq Q_{T'}$. A complete set of quartet topologies on N is a set co... |

186 | The similarity metric
- Li, Chen, et al.
- 2003
Citation Context ...nce matrix using our novel quartet-based heuristic. To check the new quartet tree method in action we use a new compression-based distance, called NCD. This metric distance was co-developed by us in [26, 27, 28], as a normalized version of the “information metric” of [29, 1]. Roughly speaking, two objects are deemed close if we can significantly “compress” one given the information in the other, the idea bei... |

179 | Clustering by compression
- Cilibrasi, Vitanyi
- 2005
Citation Context ...hod with a more standard method of two-dimensional clustering (to show that our dendrogram method of depicting the clusters is more informative). The new method was developed as an auxiliary tool for [10, 11], since the available quartet tree methods were too slow when they were exact, and too inaccurate or uncertain when they were statistical incremental. Our new quartet tree heuristic runs orders of mag... |

136 |
Genome-scale approaches to resolving incongruence in molecular phylogenies. Nature 425: 798–804
- Rokas, Williams, et al.
- 2003
Citation Context ...ious species become available, it has become possible to do whole genome phylogeny (this overcomes the problem that using different targeted parts of the genome, or proteins, may give different trees [34]). Traditional phylogenetic methods on individual genes depended on multiple alignment of the related proteins and on the model of evolution of individual amino acids. Neither of these is practically ... |

118 | Towards parameter-free data mining
- Keogh, Lonardi, et al.
- 2004
Citation Context ...in of application are used. We believe that there is no other method known that can cluster data that is so heterogeneous this reliably. This is borne out by the massive experiments with the method in [18]. 9 Testing on Natural Data Like most hierarchical clustering methods for natural data, the quartet tree method has been developed in the biological setting to determine phylogeny trees from genomic d... |

114 | Hierarchical genetic algorithms operating on populations of computer programs - Koza - 1989 |

103 |
An information-based sequence distance and its application to whole mitochondrial genome phylogeny
- Li, Badger, et al.
Citation Context ...nce matrix using our novel quartet-based heuristic. To check the new quartet tree method in action we use a new compression-based distance, called NCD. This metric distance was co-developed by us in [26, 27, 28], as a normalized version of the “information metric” of [29, 1]. Roughly speaking, two objects are deemed close if we can significantly “compress” one given the information in the other, the idea bei... |

89 |
Quartet Puzzling: a quartet Maximum Likelihood method for reconstructing tree topologies
- Strimmer, von Haeseler
- 1996
Citation Context ...y good results in practice. Relation with Previous Work: The Minimum Quartet Tree Cost (MQTC) problem below for which we give a new computational heuristic is related to the Quartet Puzzling problem, [38]. There, the quartet topologies are provided with a probability value, and for each quartet the topology with the highest probability is selected (randomly, if there are more than one) as the maximum-... |

62 | Shared Information and Program Plagiarism Detection
- Chen, Francia, et al.
Citation Context ...based on whole mitochondrial genomes, [26, 27, 28], a completely automatic construction of a language tree for over 50 Euro-Asian languages [28], detects plagiarism in student programming assignments [8], gives phylogeny of chain letters [2], and clusters music [10], Analyzing network traffic and worms using compression [40], and many more topics [11]. The method turns out to be robust under change o... |

53 | Algorithmic clustering of music based on string compression - Cilibrasi, Vitanyi, et al. - 2004 |

53 |
A novel coronavirus associated with severe acute respiratory syndrome
- Ksiazek, Erdman, et al.
- 2003
Citation Context ...uted using the compressor bzip2. The relations in Figure 9 are very similar to the definitive tree based on medical-macrobio-genomics analysis, appearing later in the New England Journal of Medicine, [23]. We depicted the figure in the ternary tree style, rather than the genomics-dendrogram style, since the former is more precise for visual inspection of proximity relations. More recently, we download... |

46 | A practical algorithm for recovering the best supported edges of an evolutionary tree (extended abstract) - Berry, Bryant, et al. |

42 | Quartet cleaning: improved algorithms and simulations
- Berry, Jiang, et al.
- 1999
Citation Context ..., using geometric algorithm or dynamic programming [3], and linear programming [41]. These methods, other methods, as well as methods related to the MQT problem, cannot handle more than 15–30 objects [41, 30, 33, 4] directly, even while using farms of desktops. To handle more objects one needs to construct a supertree from the constituent quartet trees for subsets of the original data sets, [35], as in [30, 33].... |

36 |
Tree structure for proximity data
- Colonius, Schultze
- 1981
Citation Context ...n $\mathcal{T}$, and hence $S(T_0) < 1$ for $T_0 \in \mathcal{T}$ realizing the MQTC optimum. For an explicit example of this, we use that a complete set corresponding to a tree in $\mathcal{T}$ must satisfy certain transitivity properties, [12, 13]: Lemma 4.7 Let T be a tree in the considered class with leaves N, Q the set of quartet topologies and $Q_0 \subseteq Q$. Then $Q_0$ uniquely determines T if (i) $Q_0$ contains precisely one quartet topology for every... |

31 | A polynomial time approximation scheme for inferring evolutionary trees from quartet topologies and its application
- Jiang, Kearney, et al.
- 2000
Citation Context ...tet method is to find (or approximate as closely as possible) the tree that embeds the maximal number of consistent (possibly weighted) quartet topologies from a given set P ⊆ Q of quartet topologies [17] (Figure 2). A weight function W : P → R, with R the set of real numbers determines the weights. The unweighted case is when W(uv|wx) = 1 for all uv|wx ∈ P. Definition 3.2 The (weighted) Maximum Quart... |

30 |
Chain letters and evolutionary histories
- Bennett, Li, et al.
- 2003
Citation Context ...[26, 27, 28], a completely automatic construction of a language tree for over 50 Euro-Asian languages [28], detects plagiarism in student programming assignments [8], gives phylogeny of chain letters [2], and clusters music [10], Analyzing network traffic and worms using compression [40], and many more topics [11]. The method turns out to be robust under change of the underlying compressor-types: sta... |

29 |
Conflict among individual mitochondrial proteins in resolving the phylogeny of eutherian orders
- Cao, Janke, et al.
- 1998
Citation Context ...terial has become available, the debate in biology has intensified concerning which two of the three main groups of placental mammals are more closely related: Primates, Ferungulates, and Rodents. In [7], the maximum likelihood method of phylogeny tree reconstruction gave evidence for the (Ferungulates, (Primates, Rodents)) grouping for half of the proteins in the mitochondrial genomes investigated, a... |

24 | Constructing phylogenies from quartets: elucidation of Eutherian superordinal relationships
- Ben-Dor, Chor, et al.
- 1998
Citation Context ... and add agreement values on the branches, like DNAML [15], or quartet puzzling [38]. (ii) Approximate the global optimum monotonically or compute it, using geometric algorithm or dynamic programming [3], and linear programming [41]. These methods, other methods, as well as methods related to the MQT problem, cannot handle more than 15–30 objects [41, 30, 33, 4] directly, even while using farms of de... |

18 |
The CompLearn Toolkit
- Cilibrasi
- 2003
Citation Context ...in about a minute. Additionally, the space of trees is large, so the algorithm may slow down substantially. For larger experiments, we used the C program called partree (part of the CompLearn package [9]) with MPI (Message Passing Interface, a common standard used on massively parallel computers) on a cluster of workstations in parallel to find trees more rapidly. We can consider the graph mapping th... |

17 | Performance of supertree methods on various dataset decompositions
- Roshan, Moret, et al.
- 2004
Citation Context ...jects [41, 30, 33, 4] directly, even while using farms of desktops. To handle more objects one needs to construct a supertree from the constituent quartet trees for subsets of the original data sets, [35], as in [30, 33]. In 2003 in [10, 11] we considered a new approach, like [41], and possibly predating it. Our goal was to use a quartet method to obtain high-quality hierarchical clustering of data fr... |

14 | Hierarchical clustering based on mutual information - Kraskov, Stögbauer, et al. - 2003 |

14 | Algorithmic Complexity
- Li, Vitányi
Citation Context ...nce matrix using our novel quartet-based heuristic. To check the new quartet tree method in action we use a new compression-based distance, called NCD. This metric distance was co-developed by us in [26, 27, 28], as a normalized version of the “information metric” of [29, 1]. Roughly speaking, two objects are deemed close if we can significantly “compress” one given the information in the other, the idea bei... |

10 |
Marsupials and eutherians reunited: genetic evidence for the theria hypothesis of mammalian evolution
- Killian, Buckley, et al.
Citation Context ...ing pouches), and Eutheria (placental mammals: mammals that procreate using placentas). The sister relationship between these groups is viewed as the most fundamental question in mammalian evolution [19]. Phylogenetic comparison by either anatomy or mitochondrial genome has resulted in two conflicting hypotheses: the gene-isolation-supported Marsupionta hypothesis: ((Prototheria, Metatheria), Eutheri... |

10 | A philosophical essay on probabilities, 1819. English translation - Laplace - 1951 |

10 |
Analyzing network traffic and worms using compression
- Wehner
Citation Context ...o-Asian languages [28], detects plagiarism in student programming assignments [8], gives phylogeny of chain letters [2], and clusters music [10], Analyzing network traffic and worms using compression [40], and many more topics [11]. The method turns out to be robust under change of the underlying compressor-types: statistical (PPMZ), Lempel-Ziv based dictionary (gzip), block based (bzip2), or special ... |

9 |
Phylogenetic analysis of 18S rRNA and the mitochondrial genomes of wombat, Vombatus ursinus, and the spiny anteater, Tachyglossus aculeatus: increased support for the Marsupionta hypothesis
- Janke, Magnell, et al.
Citation Context ...a hypothesis was given in [19] by analyzing a large nuclear gene (M6P/IG2R), viewed as important across the species concerned, and even more recent support for the Marsupionta hypothesis was given in [16] by phylogenetic analysis of another sequence from the nuclear gene (18S rRNA) and by the whole mitochondrial genome. Experimental Evidence: To test the Eutherian orders simultaneously with the Marsup... |

7 | Trees constructed from empirical relations, Braunschweiger Berichte aus dem Institut fuer Psychologie 1 - Colonius, Schultze - 1977 |

7 |
Integer linear programming as a tool for constructing trees from quartet data. Preprint from the web submitted to Elsevier Science
- Weyer-Menkoff, Devauchelle, et al.
- 1991
Citation Context ... the branches, like DNAML [15], or quartet puzzling [38]. (ii) Approximate the global optimum monotonically or compute it, using geometric algorithm or dynamic programming [3], and linear programming [41]. These methods, other methods, as well as methods related to the MQT problem, cannot handle more than 15–30 objects [41, 30, 33, 4] directly, even while using farms of desktops. To handle more object... |

6 | Quartet methods for phylogeny reconstruction from gene orders
- Liu, Tang, et al.
- 2005
Citation Context ..., using geometric algorithm or dynamic programming [3], and linear programming [41]. These methods, other methods, as well as methods related to the MQT problem, cannot handle more than 15–30 objects [41, 30, 33, 4] directly, even while using farms of desktops. To handle more objects one needs to construct a supertree from the constituent quartet trees for subsets of the original data sets, [35], as in [30, 33].... |

5 |
A discipline of evolutionary programming, Theoret
- Vitanyi
- 2000
Citation Context ...hat there is no polynomial approximation scheme for MQTC optimization, and whether our scheme is expected polynomial time seems to require proving that the involved Metropolis chain is rapidly mixing [39], a notoriously hard and generally unsolved problem. In practice, in our experiments there is unanimous evidence that for the natural data and the weighting function we have used, convergence is alway... |

3 |
Quartet supertrees. Chapter 4
- Piaggio-Talice, Burleigh, et al.
- 2004
Citation Context ..., using geometric algorithm or dynamic programming [3], and linear programming [41]. These methods, other methods, as well as methods related to the MQT problem, cannot handle more than 15–30 objects [41, 30, 33, 4] directly, even while using farms of desktops. To handle more objects one needs to construct a supertree from the constituent quartet trees for subsets of the original data sets, [35], as in [30, 33].... |

3 |
The complexity of reconstructing trees from qualitative characters and subtrees
- Steel
Citation Context ...es and $Q_T$ be the set of quartet topologies embedded in T. Given a set of quartet topologies P ⊆ Q, and an integer k, the problem is to decide whether there is a binary tree T such that $|P \cap Q_T| > k$. In [37] it is shown that the MQC decision problem is NP-hard. We have formulated the NP-hardness of the so-called incomplete MQC decision problem, the less general complete MQC decision problem requires P to... |

1 |
Recombinomics website
- Niman
Citation Context ...t have been hypothesized or some other genetic intermediate. There is also throughout the tree substantial agreement with (and independent verification of) independent experts like Dr. Henry L. Niman [32] on every specific point regarding genetic similarity. The technique provides here an easy verification procedure without much work. 9.2 Music The amount of digitized music available on the internet h... |
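Several citation contexts above describe the compression-based distance NCD, under which two objects are close if one compresses well given the other. The following is a minimal sketch only: it assumes the usual formulation NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is compressed length, and uses Python's `bz2` module as a stand-in compressor. The function names and sample data are hypothetical and are not taken from the CompLearn package.

```python
import bz2
import random


def compressed_size(data: bytes) -> int:
    """Length in bytes of the bzip2-compressed input."""
    return len(bz2.compress(data))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance under the assumed formula."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Illustrative inputs: a repetitive text and seeded pseudo-random bytes.
rng = random.Random(0)
text = b"the quick brown fox jumps over the lazy dog. " * 50
noise = bytes(rng.getrandbits(8) for _ in range(2000))

near = ncd(text, text)   # an object compared with itself
far = ncd(text, noise)   # unrelated objects
print(near, far)
```

Applying `ncd` to every pair of objects gives a distance matrix of the kind the quartet method takes as input. Real compressors are imperfect, so values are only approximately in the range 0 to 1.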