Clustering by compression
IEEE Transactions on Information Theory, 2005
Cited by 182 (24 self)
Abstract: We present a new method for clustering based on compression. The method does not use subject-specific features or background knowledge, and works as follows: First, we determine a parameter-free, universal, similarity distance, the normalized compression distance or NCD, computed from the lengths of compressed data files (singly and in pairwise concatenation). Second, we apply a hierarchical clustering method. The NCD is not restricted to a specific application area, and works across application area boundaries. A theoretical precursor, the normalized information distance, co-developed by one of the authors, is provably optimal. However, the optimality comes at the price of using the noncomputable notion of Kolmogorov complexity. We propose axioms to capture the real-world setting, and show that the NCD approximates optimality. To extract a hierarchy of clusters from the distance matrix, we determine a dendrogram (ternary tree) by a new quartet method and a fast heuristic to implement it. The method is implemented and available as public software, and is robust under choice of different compressors. To substantiate our claims of universality and robustness, we report evidence of successful application in areas as diverse as genomics, virology, languages, literature, music, handwritten digits, astronomy, and combinations of objects from completely different domains, using statistical, dictionary, and block sorting compressors. In genomics, we presented new evidence for major questions in Mammalian evolution, based on whole-mitochondrial genomic analysis: the Eutherian orders and the Marsupionta hypothesis against the Theria hypothesis.
Index Terms: Heterogenous data analysis, hierarchical unsupervised clustering, Kolmogorov complexity, normalized compression distance, parameter-free data mining, quartet tree method, universal dissimilarity distance.
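As a rough illustration of the distance described in this abstract, the sketch below computes an NCD from the compressed lengths of two byte strings and of their concatenation. It assumes the commonly cited form NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), which the abstract itself does not spell out, and it uses Python's zlib as a stand-in compressor. The function names are illustrative and are not taken from the authors' software.

```python
import zlib


def compressed_len(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (stand-in for C(.))."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance under the assumed formula.

    Uses the compressed lengths of x, y (singly) and of x + y
    (pairwise concatenation), as the abstract describes.
    """
    cx = compressed_len(x)
    cy = compressed_len(y)
    cxy = compressed_len(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)
```

Values near 0 suggest the two inputs share most of their structure; values near 1 suggest they share little. Results with a real compressor are approximate, and very short inputs are dominated by compressor overhead.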
P.M.B.: A New Quartet Tree Heuristic for Hierarchical Clustering
arXiv:cs/0606048, 2006
Cited by 13 (5 self)
We consider the problem of constructing an optimal-weight tree from the $3\binom{n}{4}$ weighted quartet topologies on $n$ objects, where optimality means that the summed weight of the embedded quartet topologies is optimal (so it can be the case that the optimal tree embeds all quartets as nonoptimal topologies). We present a heuristic for reconstructing the optimal-weight tree, and a canonical manner to derive the quartet-topology weights from a given distance matrix. The method repeatedly transforms a bifurcating tree, with all objects involved as leaves, achieving a monotonic approximation to the exact single globally optimal tree. This contrasts with other heuristic search methods from biological phylogeny, like DNAML or quartet puzzling, which repeatedly and incrementally construct a solution from a random order of objects, and subsequently add agreement values. We do not assume that there exists a true bifurcating supertree that embeds each quartet in the optimal topology, or represents the distance matrix faithfully, not even under the assumption that the weights or distances are corrupted by a measuring process. Our aim is to hierarchically cluster the input data as faithfully as possible, both phylogenetic data and data of completely different types. In our experiments with natural data, like genomic data, texts or music, the global optimum appears to be reached. Our method is capable of handling over 100 objects, possibly up to 1000 objects, while no existing quartet heuristic can computationally approximate the exact optimal solution of a quartet tree of more than about 20–30 objects without running for years. The method is implemented and available as public software.
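To make the count of quartet topologies concrete, the sketch below enumerates the three pairings of every 4-element subset of n objects, which gives 3 * C(n, 4) topologies in total. The cost function is an assumption: the abstract only says the weights are derived from the distance matrix in a canonical manner, and here the cost of a topology ab|cd is taken to be d(a, b) + d(c, d). All names are illustrative.

```python
from itertools import combinations
from math import comb


def quartet_topologies(n):
    """Yield every quartet topology ((a, b), (c, d)) on objects 0..n-1.

    Each 4-subset {u, v, w, x} admits exactly three pairings.
    """
    for u, v, w, x in combinations(range(n), 4):
        yield (u, v), (w, x)
        yield (u, w), (v, x)
        yield (u, x), (v, w)


def topology_cost(dist, topology):
    """Assumed cost of topology ab|cd: dist[a][b] + dist[c][d]."""
    (a, b), (c, d) = topology
    return dist[a][b] + dist[c][d]


def count_topologies(n):
    """Closed-form count 3 * C(n, 4), for comparison with the enumeration."""
    return 3 * comb(n, 4)
```

Under this assumed cost, the cheapest of the three topologies for a quartet is the one that pairs the closest objects together. The count grows as n^4, which is why exhaustive search over trees becomes impractical for more than a few dozen objects.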
Automatic Extraction of Meaning from the Web
IEEE International Symposium on Information Theory, 2006
Cited by 10 (2 self)
Abstract: We consider similarity distances for two types of objects: literal objects that as such contain all of their meaning, like genomes or books, and names for objects. The latter may have literal embodiments like the first type, but may also be abstract like “red” or “christianity.” For the first type we consider a family of computable distance measures corresponding to parameters expressing similarity according to particular features between pairs of literal objects. For the second type we consider similarity distances generated by web users corresponding to particular semantic relations between the (names for) the designated objects. For both families we give universal similarity distance measures, incorporating all particular distance measures in the family. In the first case the universal distance is based on compression and in the second case it is based on Google page counts related to search terms. In both cases experiments on a massive scale give evidence of the viability of the approaches.
The Normalized Compression Distance Is Resistant to Noise
2007
Cited by 6 (2 self)
This correspondence studies the influence of noise on the normalized compression distance (NCD), a measure that uses compressors to compute the degree of similarity of two files. This influence is approximated by a first-order differential equation which gives rise to a complex effect, which explains the fact that the NCD may give values greater than 1, as observed by other authors. The model is tested experimentally with good adjustment. Finally, the influence of noise on the clustering of files of different types is explored, finding that the NCD performs well even in the presence of quite high noise levels.
Similarity of objects and the meaning of words
In Proc. 3rd Annual Conference on Theory and Applications of Models of Computation (TAMC’06), volume 3959 of LNCS, 2006
Cited by 6 (0 self)
Abstract. We survey the emerging area of compression-based, parameter-free, similarity distance measures useful in data mining, pattern recognition, learning and automatic semantics extraction. Given a family of distances on a set of objects, a distance is universal up to a certain precision for that family if it minorizes every distance in the family between every two objects in the set, up to the stated precision (we do not require the universal distance to be an element of the family). We consider similarity distances for two types of objects: literal objects that as such contain all of their meaning, like genomes or books, and names for objects. The latter may have literal embodiments like the first type, but may also be abstract like “red” or “christianity.” For the first type we consider a family of computable distance measures corresponding to parameters expressing similarity according to particular features between pairs of literal objects. For the second type we consider similarity distances generated by web users corresponding to particular semantic relations between the (names for) the designated objects. For both families we give universal similarity distance measures, incorporating all particular distance measures in the family. In the first case the universal distance is based on compression and in the second case it is based on Google page counts related to search terms. In both cases experiments on a massive scale give evidence of the viability of the approaches.
Clustering
2009
Cited by 2 (0 self)
The problem is to construct an optimal-weight tree from the $3\binom{n}{4}$ weighted quartet topologies on $n$ objects, where optimality means that the summed weight of the embedded quartet topologies is optimal (so it can be the case that the optimal tree embeds all quartets as nonoptimal topologies). We present a Monte Carlo heuristic, based on randomized hill climbing, for approximating the optimal-weight tree, given the quartet topology weights. The method repeatedly transforms a bifurcating tree, with all objects involved as leaves, achieving a monotonic approximation to the exact single globally optimal tree. The method has been extensively used for general hierarchical clustering of non-tree-like (non-phylogeny) data in various domains and across domains with heterogenous data, and is implemented and available as part of the CompLearn package. We compare performance and running time with those of UPGMA, BioNJ, and NJ, as implemented in the SplitsTree package, on genomic data for which the latter are optimized.
Universal similarity
In Proc. IEEE ISOC ITW2005 on Coding and Complexity, 2005
Cited by 2 (0 self)
Abstract: We survey a new area of parameter-free similarity distance measures useful in data mining, pattern recognition, learning and automatic semantics extraction. Given a family of distances on a set of objects, a distance is universal up to a certain precision for that family if it minorizes every distance in the family between every two objects in the set, up to the stated precision (we do not require the universal distance to be an element of the family). We consider similarity distances for two types of objects: literal objects that as such contain all of their meaning, like genomes or books, and names for objects. The latter may have literal embodiments like the first type, but may also be abstract like “red” or “christianity.” For the first type we consider a family of computable distance measures corresponding to parameters expressing similarity according to particular features between pairs of literal objects. For the second type we consider similarity distances generated by web users corresponding to particular semantic relations between the (names for) the designated objects. For both families we give universal similarity distance measures, incorporating all particular distance measures in the family. In the first case the universal distance is based on compression and in the second case it is based on Google page counts related to search terms. In both cases experiments on a massive scale give evidence of the viability of the approaches.
Information Distance: New Developments
Cited by 2 (0 self)
In pattern recognition, learning, and data mining one obtains information from information-carrying objects. This involves an objective definition of the information in a single object, the information to go from one object to another object in a pair of objects, the information to go from one object to any other object in a multiple of objects, and the shared information between objects. This is called “information distance.” We survey a selection of new developments in information distance.
I. The Case n = 2
The clustering we use is hierarchical clustering in dendrograms based on a new fast heuristic for the quartet method [5]. If we consider $n$ objects, then we find $n^2$ pairwise distances. These distances are between natural data. We let the data decide for themselves, and construct a hierarchical clustering of the $n$ objects concerned. For details see the cited reference. The method takes the $n \times n$ distance matrix as input, and yields a dendrogram with the $n$ objects as leaves (so the dendrogram contains $n$ external nodes or leaves and $n-2$ internal nodes). We assume $n \geq 4$. The method is available as an open-source software tool [2].
Our aim is to capture, in a single similarity metric, every effective distance: effective versions of Hamming distance, Euclidean distance, edit distances, alignment distance, Lempel-Ziv distance, and so on. This metric should be so general that it works in every domain: music,
Compression-based Similarity
First we consider pairwise distances for literal objects consisting of finite binary files. These files are taken to contain all of their meaning, like genomes or books. The distances are based on compression of the objects concerned, normalized, and can be viewed as similarity distances. Second, we consider pairwise distances between names of objects, like “red” or “christianity.” In this case the distances are based on searches of the Internet. Such a search can be performed by any search engine that returns aggregate page counts. We can extract a code length from the numbers returned, use the same formula as before, and derive a similarity or relative semantics between names for objects. The theory is based on Kolmogorov complexity. We test both similarities extensively in experiments.
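The step from page counts to a distance can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the code length of a term is taken to be log2(N / f), where f is the page count and N is a hypothetical total number of indexed pages, and "the same formula as before" is taken to be the NCD-style ratio (G(x, y) - min(G(x), G(y))) / max(G(x), G(y)). The function names and the sample numbers are made up, and no real search API is called.

```python
import math


def code_length(page_count: int, total_pages: int) -> float:
    """Assumed code length in bits for a term seen on page_count pages."""
    if page_count <= 0 or page_count > total_pages:
        raise ValueError("page_count must be in 1..total_pages")
    return math.log2(total_pages / page_count)


def page_count_distance(fx: int, fy: int, fxy: int, total_pages: int) -> float:
    """NCD-style distance from page counts.

    fx, fy: pages containing each term alone; fxy: pages containing both.
    All counts are caller-supplied (hypothetical) search-engine numbers.
    """
    gx = code_length(fx, total_pages)
    gy = code_length(fy, total_pages)
    gxy = code_length(fxy, total_pages)
    return (gxy - min(gx, gy)) / max(gx, gy)
```

With these definitions, two terms that always occur together (fx = fy = fxy) get distance 0, and the distance grows as the joint count shrinks relative to the individual counts.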