Results 1-10 of 46
Clustering by compression
IEEE Transactions on Information Theory, 2005
Cited by 296 (26 self)
We present a new method for clustering based on compression. The method does not use subject-specific features or background knowledge, and works as follows: First, we determine a parameter-free, universal, similarity distance, the normalized compression distance or NCD, computed from the lengths of compressed data files (singly and in pairwise concatenation). Second, we apply a hierarchical clustering method. The NCD is not restricted to a specific application area, and works across application area boundaries. A theoretical precursor, the normalized information distance, co-developed by one of the authors, is provably optimal. However, the optimality comes at the price of using the noncomputable notion of Kolmogorov complexity. We propose axioms to capture the real-world setting, and show that the NCD approximates optimality. To extract a hierarchy of clusters from the distance matrix, we determine a dendrogram (ternary tree) by a new quartet method and a fast heuristic to implement it. The method is implemented and available as public software, and is robust under choice of different compressors. To substantiate our claims of universality and robustness, we report evidence of successful application in areas as diverse as genomics, virology, languages, literature, music, handwritten digits, astronomy, and combinations of objects from completely different domains, using statistical, dictionary, and block sorting compressors. In genomics, we presented new evidence for major questions in Mammalian evolution, based on whole-mitochondrial genomic analysis: the Eutherian orders and the Marsupionta hypothesis against the Theria hypothesis. Index Terms: Heterogeneous data analysis, hierarchical unsupervised clustering, Kolmogorov complexity, normalized compression distance, parameter-free data mining, quartet tree method, universal dissimilarity distance.
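The computation of the NCD from compressed lengths can be sketched as follows. This is a minimal illustration and not the authors' software: it assumes the usual definition NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(.) is the compressed length in bytes, and it uses Python's zlib as a stand-in compressor. The function names are hypothetical.

```python
import zlib


def compressed_len(data: bytes, compress=zlib.compress) -> int:
    """Length in bytes of `data` after compression (the C(.) in the formula)."""
    return len(compress(data))


def ncd(x: bytes, y: bytes, compress=zlib.compress) -> float:
    """Normalized compression distance between two byte strings.

    Assumed definition: (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    with xy the concatenation of x and y.
    """
    cx = compressed_len(x, compress)
    cy = compressed_len(y, compress)
    cxy = compressed_len(x + y, compress)
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    a = b"the quick brown fox " * 200
    b = bytes((i * i * 31 + i * 7) % 251 for i in range(4000))
    # A string should be closer to itself than to unrelated data.
    print(ncd(a, a), ncd(a, b))
```

Any other compressor with the same bytes-in, bytes-out interface (for example `bz2.compress` or `lzma.compress` from the standard library) can be passed as `compress`, which is one way to probe the robustness to compressor choice that the abstract mentions.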
Algorithmic clustering of music based on string compression
Computer Music Journal, 2004
Cited by 67 (19 self)
All musical pieces are similar, but some are more similar than others. Apart from serving as an infinite source of discussion ("Haydn is just like Mozart." "No, he's not!"), such similarities are also crucial for the design of efficient music information retrieval systems. The amount of digitized music available on the Internet has grown dramatically in recent years, both in the public domain and on commercial sites; Napster and its clones are prime examples. Web sites offering musical content in some form like MP3, MIDI, or other, need a way to organize their wealth of material; they need to somehow classify their files according to musical genres and subgenres, putting similar pieces together. The purpose of such organization is to enable users to navigate to pieces of music they already know and like, but also to give them advice and recommendations ("If you like this, you might also like..."). Currently, such organization is mostly done manually by humans, or based on patterns in the purchasing behaviors of customers. However, some recent research has been examining the possibilities of automating music classification. A human expert, comparing different pieces of music with the goal of clustering similar works together, will generally look for certain specific similarities. Previous attempts to automate this process do the same. Generally speaking, they take a file containing a piece of music and extract from it various specific numerical features, related to pitch, rhythm, harmony, etc. One can extract such features using, for instance, Fourier transforms (Tzanetakis and Cook 2002) or wavelet transforms
Performance study of phylogenetic methods: (unweighted) quartet methods and neighbor-joining
2003
Algorithmic clustering of music
Computer Music Journal, 2004
Cited by 20 (4 self)
We present a method for hierarchical music clustering, based on compression of strings that represent the music pieces. The method uses no background knowledge about music whatsoever: it is completely general and can, without change, be used in different areas like linguistic classification, literature, and genomics. Indeed, it can be used to simultaneously cluster objects from completely different domains, like with like. It is based on an ideal theory of the information content in individual objects (Kolmogorov complexity), information distance, and a universal similarity metric. The approximation to the universal similarity metric obtained using standard data compressors is called "normalized compression distance" (NCD). Experiments using our CompLearn software tool show that the method distinguishes between various musical genres and can even cluster pieces by composer.
A Fixed-Parameter Algorithm for Minimum Quartet Inconsistency
Journal of Computer and System Sciences, 2003
Cited by 18 (2 self)
Given n taxa, exactly one topology for every subset of four taxa, and a positive integer k (the parameter), the Minimum Quartet Inconsistency (MQI) problem is the question whether we can find an evolutionary tree inducing a set of quartet topologies that differs from the given set in only k quartet topologies. The more general problem where we are not necessarily given a topology for every subset of four taxa appears to be fixed-parameter intractable. For MQI, however, which is also NP-complete, we can compute the required tree in time O(4^k n + n^4). This means that the problem is fixed-parameter tractable and that in the case of a small number k of "errors" the tree reconstruction can be done efficiently. In particular, for minimal k, our algorithm can produce all solutions that resolve k errors. Additionally, we discuss significant heuristic improvements.
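The quantity being bounded by k, the number of given quartet topologies that a candidate tree fails to induce, can be made concrete with a small sketch. This is not the fixed-parameter algorithm from the paper; it only evaluates one candidate tree. It assumes an unrooted, fully resolved tree given as an adjacency dictionary, and all function and variable names are hypothetical. The induced topology of a quartet is read off from path lengths: the pairing with the smallest sum of within-pair distances is the one the tree displays.

```python
from collections import deque
from itertools import combinations


def path_lengths(adj, source):
    """Breadth-first search: number of edges from `source` to every node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def induced_topology(dist, a, b, c, d):
    """Topology the tree induces on {a, b, c, d}, as a set of two pairs."""
    pairings = [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]
    best = min(pairings,
               key=lambda p: dist[p[0][0]][p[0][1]] + dist[p[1][0]][p[1][1]])
    return frozenset(frozenset(pair) for pair in best)


def quartet_inconsistency(adj, leaves, given):
    """Count quartets whose given topology differs from the induced one.

    `given` maps frozenset({a, b, c, d}) to a topology in the same
    representation that `induced_topology` returns.
    """
    dist = {leaf: path_lengths(adj, leaf) for leaf in leaves}
    errors = 0
    for quad in combinations(sorted(leaves), 4):
        if induced_topology(dist, *quad) != given[frozenset(quad)]:
            errors += 1
    return errors


if __name__ == "__main__":
    # Tree on five taxa: (a, b) hang off u, c off v, (d, e) off w.
    adj = {
        "a": ["u"], "b": ["u"], "c": ["v"], "d": ["w"], "e": ["w"],
        "u": ["a", "b", "v"], "v": ["u", "c", "w"], "w": ["v", "d", "e"],
    }
    topo = lambda p, q: frozenset([frozenset(p), frozenset(q)])
    given = {
        frozenset("abcd"): topo("ab", "cd"),
        frozenset("abce"): topo("ab", "ce"),
        frozenset("abde"): topo("ab", "de"),
        frozenset("acde"): topo("ac", "de"),
        frozenset("bcde"): topo("bc", "de"),
    }
    print(quartet_inconsistency(adj, "abcde", given))
```

In MQI terms, a tree is a solution for parameter k when this count is at most k. The sketch enumerates all quartets, so it takes on the order of n^4 steps for a single tree.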
Constructing optimal trees from quartets
Journal of Algorithms, 2001
Cited by 14 (1 self)
We present fast new algorithms for constructing phylogenetic trees from quartets (resolved trees on four leaves). The problem is central to divide-and-conquer approaches to phylogenetic analysis and has been receiving considerable attention from the computational biology community. Most formulations of the problem are NP-hard. Here we consider a number of constrained versions that have polynomial time solutions. The main result is an algorithm for determining bounded degree trees with optimal quartet weight, subject to the constraint that the splits in the tree come from a given collection, for example, the splits in the aligned sequence data. The algorithm can search an exponentially large number of phylogenetic trees in polynomial time. We present applications of this algorithm to a number of problems in phylogenetics, including sequence analysis, construction of trees from phylogenetic networks, and consensus methods. © 2001 Academic Press. Key Words: quartets; phylogenetic trees; algorithms; consensus; networks.
P.M.B.: A New Quartet Tree Heuristic for Hierarchical Clustering
arXiv:cs/0606048, 2006
Cited by 13 (3 self)
We consider the problem of constructing an optimal-weight tree from the 3·(n choose 4) weighted quartet topologies on n objects, where optimality means that the summed weight of the embedded quartet topologies is optimal (so it can be the case that the optimal tree embeds all quartets as nonoptimal topologies). We present a heuristic for reconstructing the optimal-weight tree, and a canonical manner to derive the quartet-topology weights from a given distance matrix. The method repeatedly transforms a bifurcating tree, with all objects involved as leaves, achieving a monotonic approximation to the exact single globally optimal tree. This contrasts with other heuristic search methods from biological phylogeny, like DNAML or quartet puzzling, which repeatedly and incrementally construct a solution from a random order of objects, and subsequently add agreement values. We do not assume that there exists a true bifurcating supertree that embeds each quartet in the optimal topology, or represents the distance matrix faithfully, not even under the assumption that the weights or distances are corrupted by a measuring process. Our aim is to hierarchically cluster the input data as faithfully as possible, both phylogenetic data and data of completely different types. In our experiments with natural data, like genomic data, texts or music, the global optimum appears to be reached. Our method is capable of handling over 100 objects, possibly up to 1000 objects, while no existing quartet heuristic can computationally approximate the exact optimal solution of a quartet tree of more than about 20–30 objects without running for years. The method is implemented and available as public software.
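The count of quartet topologies and the step from a distance matrix to topology weights can be illustrated with a short sketch. Each set of four objects {u, v, w, x} admits three pairings (uv|wx, uw|vx, ux|vw), which gives 3·(n choose 4) topologies in total. The weighting used here, cost(uv|wx) = d(u,v) + d(w,x), is an assumption for illustration, since the abstract does not spell out the rule; the function names are likewise hypothetical.

```python
from itertools import combinations
from math import comb


def count_quartet_topologies(n: int) -> int:
    """Three pairings for each of the (n choose 4) four-element subsets."""
    return 3 * comb(n, 4)


def quartet_topology_costs(names, dist):
    """Map each topology ((a, b), (c, d)) to an assumed cost d(a,b) + d(c,d).

    `dist` is a nested mapping with dist[p][q] the distance between p and q.
    """
    costs = {}
    for u, v, w, x in combinations(names, 4):
        for (a, b), (c, d) in (((u, v), (w, x)),
                               ((u, w), (v, x)),
                               ((u, x), (v, w))):
            costs[((a, b), (c, d))] = dist[a][b] + dist[c][d]
    return costs


if __name__ == "__main__":
    # Toy distances: objects placed on a line, distance = gap between them.
    position = {"a": 0, "b": 1, "c": 5, "d": 6, "e": 10}
    names = sorted(position)
    dist = {p: {q: abs(position[p] - position[q]) for q in names}
            for p in names}
    costs = quartet_topology_costs(names, dist)
    print(len(costs), count_quartet_topologies(len(names)))
    print(min(costs, key=costs.get))
```

Under this assumed weighting a lower cost marks a pairing that keeps close objects together, so a tree search would try to embed as many low-cost topologies as possible. The table grows as n^4, which is why a search heuristic rather than exhaustive enumeration of trees is needed for more than a few dozen objects.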
Quartet Supertrees
2004
Cited by 13 (1 self)
We introduce two supertree methods that produce unrooted supertrees from unrooted input trees. The methods assemble supertrees from a weighted quartet (four-taxon) tree representation of the input trees. The first method, QLI, extends Willson's local inconsistency quartet method to construct supertrees. This method, which was designed originally to produce a tree from a taxon-character matrix, is not well suited for building accurate supertrees when there is little taxonomic overlap among the input trees. The second method, QILI, builds additionally on Willson's quartet-rectifying process and infers missing phylogenetic information from the input trees. We examined the effectiveness of the quartet-supertree methods using simulated and empirical data sets. These studies suggest that QILI is relatively accurate when compared with the matrix representation with parsimony (MRP) supertree method.
Quartet based phylogeny reconstruction with answer set programming
In Proceedings of the 16th ICTAI, 2004
Cited by 9 (7 self)
Evolution is an important subarea of study in biological science, where given a set of species, the goal is to reconstruct their evolutionary history, or phylogeny. Many kinds of data associated with the species can be deployed for this task and many reconstruction methods have been proposed and examined in the literature. One very recent approach is to build a local phylogeny for every subset of 4 species, which is called a quartet for these 4 species, and then to assemble a phylogeny for the whole set of species satisfying these predicted quartets. In general, the predicted quartets might not always agree with each other, and thus the objective becomes to satisfy a maximum number of predicted quartets. This is the well-known Maximum Quartet Consistency (MQC) problem, which has been studied by many researchers over the last two decades. In this paper, we present a new equivalent representation for the MQC problem, that is, to search for an ultrametric matrix that satisfies the maximum number of predicted quartets. We examine a few structural properties of the MQC problem in this new representation by formulating it in Answer Set Programming (ASP), a recent powerful logic programming tool for modeling and solving search problems. The efficiency and usefulness of our approach are confirmed by our computational experiments on artificial data as well as two real datasets.