## Clustering (2009)

Citations: 2 (0 self)

### BibTeX

@MISC{Cilibrasi09clustering,
  author = {Rudi Cilibrasi and Paul M. B. Vitányi},
  title = {Clustering},
  year = {2009}
}


### Abstract

The problem is to construct an optimal weight tree from the $3\binom{n}{4}$ weighted quartet topologies on $n$ objects, where optimality means that the summed weight of the embedded quartet topologies is optimal (so the optimal tree may embed every quartet as a nonoptimal topology). We present a Monte Carlo heuristic, based on randomized hill climbing, for approximating the optimal weight tree, given the quartet topology weights. The method repeatedly transforms a bifurcating tree, with all objects involved as leaves, achieving a monotonic approximation to the exact single globally optimal tree. The method has been used extensively for general hierarchical clustering of nontreelike (non-phylogeny) data in various domains and across domains with heterogeneous data, and an implementation is available as part of the CompLearn package. We compare performance and running time with those of UPGMA, BioNJ, and NJ, as implemented in the SplitsTree package, on genomic data for which the latter are optimized.
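As a rough illustration of the objective described in the abstract (not the CompLearn implementation), the sketch below enumerates the 3·C(n, 4) quartet topologies on n labels and sums the weights of the topologies embedded in a given bifurcating tree. The function names, the adjacency-dictionary tree representation, the five-leaf example tree, and the unit weights are all assumptions made for this example. The embedded topology is identified with the standard four-point property of tree path lengths. The randomized hill-climbing search itself is not shown.

```python
from collections import deque
from itertools import combinations


def topology(a, b, c, d):
    """Represent the quartet topology ab|cd as an unordered pair of unordered pairs."""
    return frozenset({frozenset({a, b}), frozenset({c, d})})


def all_topologies(labels):
    """List the 3 * C(n, 4) quartet topologies on the given labels."""
    result = []
    for a, b, c, d in combinations(sorted(labels), 4):
        result += [topology(a, b, c, d), topology(a, c, b, d), topology(a, d, b, c)]
    return result


def hop_distances(tree, source):
    """Breadth-first search: number of edges from source to every node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in tree[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def embedded_topology(dist, a, b, c, d):
    """Return the topology that the tree embeds for the quartet {a, b, c, d}.

    In a bifurcating tree, the pairing whose two connecting paths are
    disjoint has the strictly smallest sum of path lengths.
    """
    candidates = [
        (dist[a][b] + dist[c][d], topology(a, b, c, d)),
        (dist[a][c] + dist[b][d], topology(a, c, b, d)),
        (dist[a][d] + dist[b][c], topology(a, d, b, c)),
    ]
    return min(candidates, key=lambda item: item[0])[1]


def tree_cost(tree, labels, weights):
    """Sum the weights of the quartet topologies embedded in the tree."""
    dist = {leaf: hop_distances(tree, leaf) for leaf in labels}
    return sum(
        weights[embedded_topology(dist, *quartet)]
        for quartet in combinations(sorted(labels), 4)
    )


# Hypothetical example: leaves a..e, internal nodes x, y, z (each of degree 3).
EXAMPLE_TREE = {
    "a": ["x"], "b": ["x"], "c": ["y"], "d": ["z"], "e": ["z"],
    "x": ["a", "b", "y"], "y": ["x", "c", "z"], "z": ["y", "d", "e"],
}
LEAVES = ["a", "b", "c", "d", "e"]

if __name__ == "__main__":
    unit_weights = {t: 1.0 for t in all_topologies(LEAVES)}
    print(len(unit_weights), tree_cost(EXAMPLE_TREE, LEAVES, unit_weights))
```

With unit weights every tree scores the same, since each of the C(n, 4) quartets contributes exactly one embedded topology; the objective only discriminates between trees once the three topologies of a quartet carry different weights. A search heuristic of the kind the abstract describes would call something like `tree_cost` on each candidate tree.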

### Citations

2302 | The neighbor joining method: a new method for reconstructing phylogenetic trees
- Saitou, Nei
- 1987
Citation Context ...eral trials it failed to produce an answer at all (throwing an unhandled Java Exception), which may be due to an implementation problem. Therefore, attention was focussed on the other two methods. NJ =-=[37]-=- and BioNJ [19] are neighbor-joining type methods. In all tested cases they produced the same trees, therefore we will treat them as the same in this discussion. A. Testing on Artificial Data We first... |

1690 | Kolmogorov complexity and its applications
- Li, Vitanyi
- 1990
Citation Context ...tribution $p(k) = c/(k (\log k)^2)$ is as close to the edge as is reasonable, and because the used coding $x \to x''$ is a prefix code we have $\sum 1/(k (\log k)^2) \le 1$ by the Kraft Inequality (see for example =-=[31]-=-) and therefore c ≥ 1. Let us see what this means for our algorithm using the chosen distribution p(k). For N = 64, say, we can change any tree in T to any other tree in T with a 64-mutation. The prob... |

960 | Evolutionary trees from DNA sequences: a maximum likelihood approach
- Felsenstein
- 1981
Citation Context ...) Incrementally grow the tree in random order by stepwise addition of objects in the locally optimal way, repeat this for different object orders, and add agreement values on the branches, like DNAML =-=[18]-=-, or quartet puzzling [41]. These methods are fast, but a possible problem is as follows. Suppose we have just 32 items. With quartet puzzling we incrementally construct a quartet tree from a randoml... |

193 | The recovery of trees from measures of dissimilarity. In
- Buneman
- 1971
Citation Context ...at case, the MQTC optimization problem reduces to finding that $T_0$. However, the situation turns out to be more complex. Note first that the set of quartet topologies uniquely determines a tree in T, =-=[6]-=-. Lemma 3.9: Let $T, T'$ be different labeled trees in T and let $Q_T, Q_{T'}$ be the sets of embedded quartet topologies, respectively. Then, $Q_T \neq Q_{T'}$. A complete set of quartet topologies on N is a set ... |

187 | The similarity metric
- Li, Chen, et al.
Citation Context ...iT Programme, and the EU NoE PASCAL, the Netherlands BSIK/BRICKS project. May 22, 2009 DRAFT does not address the problem of how to obtain the quartet topology weights from sequence data [22], [28], =-=[30]-=-, but takes as input the weights of all quartet topologies and executes the step of how to reconstruct the phylogeny from there. The algorithm produces a sequence of candidate trees with the objects a... |

152 | BIONJ: an improved version of the NJ algorithm based on a simple model of sequence data
- Gascuel
- 1997
Citation Context ...failed to produce an answer at all (throwing an unhandled Java Exception), which may be due to an implementation problem. Therefore, attention was focussed on the other two methods. NJ [37] and BioNJ =-=[19]-=- are neighbor-joining type methods. In all tested cases they produced the same trees, therefore we will treat them as the same in this discussion. A. Testing on Artificial Data We first test whether t... |

118 | Towards parameter-free data mining
- Keogh, Lonardi, et al.
- 2004
Citation Context ...mains [11]. It is in fact a parameter-free, feature-free, datamining tool. It has been experimentally tested on all time sequence data used in all the major data-mining conferences in the last decade =-=[23]-=-. Comparing the compression method with all major methods used in those conferences they established clear superiority of the compression method for clustering heterogeneous data, and for anomaly dete... |

115 | Hierarchical genetic algorithms operating on populations of computer programs - Koza - 1989 |

103 | An information-based sequence distance and its application to whole mitochondrial genome phylogeny
- Li, Badger, et al.
Citation Context ... ESF QiT Programme, and the EU NoE PASCAL, the Netherlands BSIK/BRICKS project. May 22, 2009 DRAFT does not address the problem of how to obtain the quartet topology weights from sequence data [22], =-=[28]-=-, [30], but takes as input the weights of all quartet topologies and executes the step of how to reconstruct the phylogeny from there. The algorithm produces a sequence of candidate trees with the obj... |

89 | Quartet Puzzling: a quartet maximum likelihood method for reconstructing tree topologies
- Strimmer, von Haeseler
- 1996
Citation Context ...eveloped a new approach. A. Relation with Previous Work: The Minimum Quartet Tree Cost (MQTC) problem below for which we give a new computational heuristic is related to the Quartet Puzzling problem, =-=[41]-=-. There, the quartet topologies are provided with a probability value, and for each quartet the topology with the highest probability is selected (randomly, if there are more than one) as the maximum-... |

62 | Shared Information and Program Plagiarism Detection
- Chen, Francia, et al.
Citation Context ...dard Biological packages have been applied to, among others, alignment-free whole genome phylogeny, [28], [29], [30], chain letter phylogeny [2], constructing language trees [30], plagiarism detection =-=[7]-=-. The NCD method is also used for general clustering and classification of natural data in arbitrary domains, for clustering of heterogeneous data, and for anomaly detection across domains [11]. It is... |

53 | Algorithmic clustering of music based on string compression
- Cilibrasi, Vitanyi, et al.
- 2004
Citation Context ...rformance and running time with those of UPGMA, BioNJ, and NJ, as implemented in the SplitsTree package. The method was developed as part of the CompLearn software [8], and used in, among many others, =-=[10]-=-, [11], [12]. We focussed on a quartet method for tree reconstruction believing it to be more sensitive and objective than other methods. Since the available quartet tree methods were too slow when th... |

53 | A novel coronavirus associated with severe acute respiratory syndrome
- Ksiazek, Erdman, et al.
- 2003
Citation Context ...ing the compressor bzip2. The relations with S(T) = 0.988 were very similar to the definitive tree based on medical-macrobio-genomics analysis, appearing later in the New England Journal of Medicine, =-=[25]-=-. In [9], 100 different H5N1 sample genomes were downloaded from the NCBI/NIH database online, to analyze the geographical spreading of the Bird Flu H5N1 Virus in a large example. In general hierarchi... |

46 | A practical algorithm for recovering the best supported edges of an evolutionary tree (extended abstract - Berry, Bryant, et al. |

42 | Quartet cleaning: improved algorithms and simulations
- Berry, Jiang, et al.
- 1999
Citation Context ...linear programming [44], or semi-definite programming [39]. These latter methods, other methods, as well as methods related to the MQT problem, cannot handle more than 15–30 objects [44], [32], [34], =-=[4]-=-, [39] directly, even while using farms of desktops. To handle more objects one needs to construct a supertree from the constituent quartet trees for subsets of the original data sets, [36], as in [32... |

36 | Tree structure for proximity data
- Colonius, Schultze
- 1981
Citation Context ...and hence S(T0) < 1 for T0 ∈ T realizing the MQTC optimum. For an explicit example of this, we use that a complete set corresponding to a tree in T must satisfy certain transitivity properties, [13], =-=[14]-=-: Lemma 3.10: Let T be a tree in the considered class with leaves N, Q the set of quartet topologies and Q0 ⊆ Q. Then Q0 uniquely determines T if (i) Q0 contains precisely one quartet topology for eve... |

31 | A polynomial time approximation scheme for inferring evolutionary trees from quartet topologies and its application
- Jiang, Kearney, et al.
- 2000
Citation Context ...tet method is to find (or approximate as closely as possible) the tree that embeds the maximal number of consistent (possibly weighted) quartet topologies from a given set P ⊆ Q of quartet topologies =-=[21]-=- (Figure 2). A weight function W : P → R, with R the set of real numbers determines the weights. The unweighted case is when W(uv|wx) = 1 for all uv|wx ∈ P . Definition 2.2: The (weighted) Maximum Qua... |

30 | Chain letters and evolutionary histories
- Bennett, Li, et al.
- 2003
Citation Context ... of the NCD using phylogeny reconstruction methods from standard Biological packages have been applied to, among others, alignment-free whole genome phylogeny, [28], [29], [30], chain letter phylogeny =-=[2]-=-, constructing language trees [30], plagiarism detection [7]. The NCD method is also used for general clustering and classification of natural data in arbitrary domains, for clustering of heterogeneou... |

26 | Ordinal quartet method, in
- Kearney
- 1998
Citation Context ...4, the ESF QiT Programme, and the EU NoE PASCAL, the Netherlands BSIK/BRICKS project. May 22, 2009 DRAFT does not address the problem of how to obtain the quartet topology weights from sequence data =-=[22]-=-, [28], [30], but takes as input the weights of all quartet topologies and executes the step of how to reconstruct the phylogeny from there. The algorithm produces a sequence of candidate trees with t... |

24 | Constructing phylogenies from quartets: elucidation of Eutherian superordinal relationships
- Ben-Dor, Chor, et al.
- 1998
Citation Context ...ntative sample from which we can conclude anything about the globally optimal tree? (ii) Approximate the global optimum monotonically or compute it, using a geometric algorithm or dynamic programming =-=[3]-=-, linear programming [44], or semi-definite programming [39]. These latter methods, other methods, as well as methods related to the MQT problem, cannot handle more than 15–30 objects [44], [32], [34]... |

21 | Analyzing worms and network traffic using compression
- Wehner
Citation Context ...out by the massive experiments with the method in [23]. In [10] we used MIDI data to cluster classical music, distinguish between genres like pop, rock, and classical, and do music classification. In =-=[43]-=-, the CompLearn package was used to analyze network traffic and to cluster computer worms and viruses. CompLearn was used to analyze medical clinical data in clustering fetal heart rate tracings [16]... |

18 | The CompLearn Toolkit
- Cilibrasi
- 2003
Citation Context ... and natural data sets. We compare performance and running time with those of UPGMA, BioNJ, and NJ, as implemented in the SplitsTree package. The method was developed as part of the CompLearn software =-=[8]-=-, and used in, among many others, [10], [11], [12]. We focussed on a quartet method for tree reconstruction believing it to be more sensitive and objective than other methods. Since the available quar... |

17 | Performance of supertree methods on various dataset decompositions
- Roshan, Moret, et al.
- 2004
Citation Context ...[32], [34], [4], [39] directly, even while using farms of desktops. To handle more objects one needs to construct a supertree from the constituent quartet trees for subsets of the original data sets, =-=[36]-=-, as in [32], [34]. B. This Work In 2003 in [10], [11] we considered a new approach, like [44], and possibly predating it. Our goal was to use a quartet method to obtain high-quality hierarchical clus... |

14 | Algorithmic Complexity
- Li, Vitányi
Citation Context ...omparisons between phylogeny reconstruction algorithms that take distance matrices as input, we use a new compression-based distance, called NCD . This metric distance was co-developed by us in [28], =-=[29]-=-, [30], as a normalized version of the “information metric” of [31], [1]. The mathematics used is based on Kolmogorov complexity theory [31], which is approximated using real-world compression softwar... |

10 | Statistical Inference through Data Compression
- Cilibrasi
- 2007
Citation Context ...ompressor bzip2. The relations with S(T) = 0.988 were very similar to the definitive tree based on medical-macrobio-genomics analysis, appearing later in the New England Journal of Medicine, [25]. In =-=[9]-=-, 100 different H5N1 sample genomes were downloaded from the NCBI/NIH database online, to analyze the geographical spreading of the Bird Flu H5N1 Virus in a large example. In general hierarchical clus... |

10 | Clustering fetal heart rate tracings by compression
- Santos, Bernardes, et al.
Citation Context ...[43], the CompLearn package was used to analyze network traffic and to cluster computer worms and viruses. CompLearn was used to analyze medical clinical data in clustering fetal heart rate tracings =-=[16]-=-. Other applications by different authors are in software metrics and obfuscation, web page authorship, topic and domain identification, protein sequence/structure classification, phylogenetic reconst... |

10 | A philosophical essay on probabilities, 1819. English translation - Laplace - 1951 |

7 | Trees constructed from empirical relations, Braunschweiger Berichte aus dem Institut fuer Psychologie 1
- Colonius, Schultze
- 1977
Citation Context ...n T , and hence S(T0) < 1 for T0 ∈ T realizing the MQTC optimum. For an explicit example of this, we use that a complete set corresponding to a tree in T must satisfy certain transitivity properties, =-=[13]-=-, [14]: Lemma 3.10: Let T be a tree in the considered class with leaves N, Q the set of quartet topologies and Q0 ⊆ Q. Then Q0 uniquely determines T if (i) Q0 contains precisely one quartet topology f... |

7 | Integer linear programming as a tool for constructing trees from quartet data. Preprint from the web submitted to Elsevier Science
- Weyer-Menkoff, Devauchelle, et al.
- 1991
Citation Context ...h we can conclude anything about the globally optimal tree? (ii) Approximate the global optimum monotonically or compute it, using a geometric algorithm or dynamic programming [3], linear programming =-=[44]-=-, or semi-definite programming [39]. These latter methods, other methods, as well as methods related to the MQT problem, cannot handle more than 15–30 objects [44], [32], [34], [4], [39] directly, eve... |

6 | Quartet methods for phylogeny reconstruction from gene orders
- Liu, Tang, et al.
- 2005
Citation Context ...amming [3], linear programming [44], or semi-definite programming [39]. These latter methods, other methods, as well as methods related to the MQT problem, cannot handle more than 15–30 objects [44], =-=[32]-=-, [34], [4], [39] directly, even while using farms of desktops. To handle more objects one needs to construct a supertree from the constituent quartet trees for subsets of the original data sets, [36]... |

6 | Short quartet puzzling: a new quartet-based phylogeny reconstruction algorithm
- Snir, Warnow, et al.
- 2008
Citation Context ...e globally optimal tree? (ii) Approximate the global optimum monotonically or compute it, using a geometric algorithm or dynamic programming [3], linear programming [44], or semi-definite programming =-=[39]-=-. These latter methods, other methods, as well as methods related to the MQT problem, cannot handle more than 15–30 objects [44], [32], [34], [4], [39] directly, even while using farms of desktops. To... |

5 | A discipline of evolutionary programming, Theoret
- Vitanyi
- 2000
Citation Context ...hat there is no polynomial approximation scheme for MQTC optimization, and whether our scheme is expected polynomial time seems to require proving that the involved Metropolis chain is rapidly mixing =-=[42]-=-, a notoriously hard and generally unsolved problem. In practice, in our experiments there is unanimous evidence that for the natural data and the cost function we have used, convergence is always fas... |

4 | Heuristic approaches for the quartet method of hierarchical clustering
- Consoli, Darby-Dowman, et al.
Citation Context ..., since in this case the costs of different quartet topologies depend on one another, leading to an $O(n^2)$ per generation algorithm. This way, one can attack problems of up to 200 objects. Recently, =-=[15]-=- has used various other heuristics different from the one presented here to obtain a method that is both faster and yields better results than the initial heuristic in this paper. D. Tree Building Sta... |

3 | Quartet supertrees. Chapter 4
- Piaggio-Talice, Burleigh, et al.
- 2004
Citation Context ... [3], linear programming [44], or semi-definite programming [39]. These latter methods, other methods, as well as methods related to the MQT problem, cannot handle more than 15–30 objects [44], [32], =-=[34]-=-, [4], [39] directly, even while using farms of desktops. To handle more objects one needs to construct a supertree from the constituent quartet trees for subsets of the original data sets, [36], as i... |

3 | The complexity of reconstructing trees from qualitative characters and subtrees
- Steel
Citation Context ...logies embedded in T. Definition 3.6: The MQC decision problem is the following: GIVEN: A set of quartet topologies P ⊆ Q, and an integer k. DECIDE: Is there a binary tree T such that $|P \cap Q_T| > k$. In =-=[40]-=- it is shown that the MQC decision problem is NP-hard. Sometimes this problem is called the incomplete MQC decision problem. The less general complete MQC decision problem requires P to contain precis... |