Clustering by compression
IEEE Transactions on Information Theory, 2005
Cited by 179
Abstract: We present a new method for clustering based on compression. The method does not use subject-specific features or background knowledge, and works as follows: First, we determine a parameter-free, universal, similarity distance, the normalized compression distance or NCD, computed from the lengths of compressed data files (singly and in pairwise concatenation). Second, we apply a hierarchical clustering method. The NCD is not restricted to a specific application area, and works across application area boundaries. A theoretical precursor, the normalized information distance, co-developed by one of the authors, is provably optimal. However, the optimality comes at the price of using the non-computable notion of Kolmogorov complexity. We propose axioms to capture the real-world setting, and show that the NCD approximates optimality. To extract a hierarchy of clusters from the distance matrix, we determine a dendrogram (ternary tree) by a new quartet method and a fast heuristic to implement it. The method is implemented and available as public software, and is robust under choice of different compressors. To substantiate our claims of universality and robustness, we report evidence of successful application in areas as diverse as genomics, virology, languages, literature, music, handwritten digits, astronomy, and combinations of objects from completely different domains, using statistical, dictionary, and block sorting compressors. In genomics, we presented new evidence for major questions in Mammalian evolution, based on whole-mitochondrial genomic analysis: the Eutherian orders and the Marsupionta hypothesis against the Theria hypothesis. Index Terms: Heterogeneous data analysis, hierarchical unsupervised clustering, Kolmogorov complexity, normalized compression distance, parameter-free data mining, quartet tree method, universal dissimilarity distance.
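As a rough illustration of how a distance can be computed "from the lengths of compressed data files (singly and in pairwise concatenation)", the sketch below uses the commonly cited form NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is compressed length. The choice of zlib as compressor and the helper names `compressed_len` and `ncd` are assumptions made for this example, not the authors' public software.

```python
import zlib


def compressed_len(data: bytes) -> int:
    """Length in bytes of zlib-compressed data; stands in for C(x)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings.

    Uses C(x), C(y) and C(xy), the compressed lengths of each input
    alone and of their concatenation.
    """
    cx, cy = compressed_len(x), compressed_len(y)
    cxy = compressed_len(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    # Hypothetical sample inputs: two similar texts and one unrelated blob.
    a = b"the quick brown fox jumps over the lazy dog " * 40
    b = b"the quick brown fox leaps over the lazy cat " * 40
    c = bytes(range(256)) * 7
    print("ncd(a, b) =", ncd(a, b))
    print("ncd(a, c) =", ncd(a, c))
```

With a real compressor the values are only approximate: the distance of an object to itself is small but usually not exactly zero, and results depend on the compressor and on input size relative to its window.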
Content-Based Music Information Retrieval: Current Directions and Future Challenges
2008
P.: Automatic meaning discovery using Google
Centrum Wiskunde & Informatica (CWI), 2004
Cited by 36
We have found a method to automatically extract the meaning of words and phrases from the world-wide-web using Google page counts. The approach is novel in its unrestricted problem domain, simplicity of implementation, and manifestly ontological underpinnings. The world-wide-web is the largest database on earth, and the latent semantic context information entered by millions of independent users averages out to provide automatic meaning of useful quality. We demonstrate positive correlations, evidencing an underlying semantic structure, in both numerical symbol notations and number-name words in a variety of natural languages and contexts. Next, we demonstrate the ability to distinguish between colors and numbers and between 17th-century Dutch painters; the ability to understand electrical terms, religious terms, and emergency incidents; and the ability to do a simple automatic English-Spanish translation. We also conduct a massive experiment in understanding WordNet categories.
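The abstract does not spell out how page counts become a distance. A minimal sketch, assuming the form usually given for the normalized Google distance, NGD(x, y) = (max(log f(x), log f(y)) - log f(x, y)) / (log N - min(log f(x), log f(y))), is shown below. Here f(x) is the page count for a term, f(x, y) the count for both terms together, and N a normalizing total; the function name `ngd` and all counts used are hypothetical, and no search API is called.

```python
import math


def ngd(fx: int, fy: int, fxy: int, n: int) -> float:
    """Normalized Google distance from page counts (assumed formula).

    fx, fy : pages containing each term on its own
    fxy    : pages containing both terms
    n      : normalizing total, must exceed min(fx, fy)
    """
    if min(fx, fy, fxy) <= 0:
        # A term (or the pair) never occurs: treat as maximally distant.
        return math.inf
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))


if __name__ == "__main__":
    # Hypothetical counts, chosen only to show the direction of the measure.
    print(ngd(1000, 1000, 1000, 10**6))  # terms always co-occur
    print(ngd(1000, 1000, 10, 10**6))    # terms rarely co-occur
```

Terms that always appear together get distance 0, and the distance grows as the joint count shrinks relative to the individual counts.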
P.M.B.: A New Quartet Tree Heuristic for Hierarchical Clustering
arXiv:cs/0606048, 2006
Cited by 11
We consider the problem of constructing an optimal-weight tree from the $3\binom{n}{4}$ weighted quartet topologies on $n$ objects, where optimality means that the summed weight of the embedded quartet topologies is optimal (so it can be the case that the optimal tree embeds all quartets as non-optimal topologies). We present a heuristic for reconstructing the optimal-weight tree, and a canonical manner to derive the quartet-topology weights from a given distance matrix. The method repeatedly transforms a bifurcating tree, with all objects involved as leaves, achieving a monotonic approximation to the exact single globally optimal tree. This contrasts with other heuristic search methods from biological phylogeny, like DNAML or quartet puzzling, which repeatedly and incrementally construct a solution from a random order of objects, and subsequently add agreement values. We do not assume that there exists a true bifurcating supertree that embeds each quartet in the optimal topology, or represents the distance matrix faithfully, not even under the assumption that the weights or distances are corrupted by a measuring process. Our aim is to hierarchically cluster the input data as faithfully as possible, both phylogenetic data and data of completely different types. In our experiments with natural data, like genomic data, texts or music, the global optimum appears to be reached. Our method is capable of handling over 100 objects, possibly up to 1000 objects, while no existing quartet heuristic can computationally approximate the exact optimal solution of a quartet tree of more than about 20–30 objects without running for years. The method is implemented and available as public software.
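To make the count of quartet topologies concrete: each set of four objects {u, v, w, x} can be split into two pairs in three ways (uv|wx, uw|vx, ux|vw), giving 3 * C(n, 4) topologies on n objects. The sketch below enumerates them and scores each from a distance matrix. The cost d(u, v) + d(w, x) for topology uv|wx is an assumption made for this example, as are the names `quartet_topologies` and `topology_cost`; the tree-search heuristic itself is not reproduced.

```python
from itertools import combinations
from math import comb


def quartet_topologies(n: int):
    """Yield every quartet topology ((u, v), (w, x)) on objects 0..n-1."""
    for a, b, c, d in combinations(range(n), 4):
        # The three ways to split four objects into two pairs.
        yield (a, b), (c, d)
        yield (a, c), (b, d)
        yield (a, d), (b, c)


def topology_cost(dist, topology) -> float:
    """Assumed cost of topology uv|wx: d(u, v) + d(w, x); lower is better."""
    (u, v), (w, x) = topology
    return dist[u][v] + dist[w][x]


if __name__ == "__main__":
    # Hypothetical symmetric distance matrix: {0, 1} and {2, 3} are close pairs.
    dist = [
        [0, 1, 9, 9],
        [1, 0, 9, 9],
        [9, 9, 0, 1],
        [9, 9, 1, 0],
    ]
    best = min(quartet_topologies(4), key=lambda t: topology_cost(dist, t))
    print("best topology:", best)
    print("topologies on 5 objects:", 3 * comb(5, 4))
```

Because the number of topologies grows as n^4, scoring a candidate tree against all of them is already costly for a few hundred objects, which is why a heuristic search over trees is needed.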
Information Distance in Multiples
2009
Cited by 9
Information distance is a parameter-free similarity measure based on compression, used in pattern recognition, data mining, phylogeny, clustering, and classification. The notion of information distance is extended from pairs to multiples (finite lists). We study maximal overlap, metricity, universality, minimal overlap, additivity, and normalized information distance in multiples. We use the theoretical notion of Kolmogorov complexity, which for practical purposes is approximated by the length of the compressed version of the file involved, using a real-world compression program.
Automatic Extraction of Meaning from the Web
IEEE International Symposium on Information Theory, 2006
Cited by 9
Abstract: We consider similarity distances for two types of objects: literal objects that as such contain all of their meaning, like genomes or books, and names for objects. The latter may have literal embodiments like the first type, but may also be abstract like “red” or “christianity.” For the first type we consider a family of computable distance measures corresponding to parameters expressing similarity according to particular features between pairs of literal objects. For the second type we consider similarity distances generated by web users corresponding to particular semantic relations between the (names for the) designated objects. For both families we give universal similarity distance measures, incorporating all particular distance measures in the family. In the first case the universal distance is based on compression and in the second case it is based on Google page counts related to search terms. In both cases experiments on a massive scale give evidence of the viability of the approaches.
Exploring the feasibility of proactive reputations
In: Proc. of the 5th Int’l Workshop on Peer-to-Peer Systems, 2006
Cited by 7
Reputation mechanisms help peers in a peer-to-peer (P2P) system avoid unreliable or malicious peers. In application-level networks, however, short peer lifetimes mean reputations are often generated from a small number of past transactions. These reputation values are less “reliable,” and more vulnerable to bad-mouthing or collusion attacks. We address this issue by introducing proactive reputations, a firsthand history of transactions initiated to augment incomplete or short-term reputation values. We present several mechanisms for generating proactive reputations, along with a statistical similarity metric to measure their effectiveness.
Similarity of objects and the meaning of words
In Proc. 3rd Annual Conference on Theory and Applications of Models of Computation (TAMC’06), volume 3959 of LNCS, 2006
Cited by 5
Abstract. We survey the emerging area of compression-based, parameter-free, similarity distance measures useful in data mining, pattern recognition, learning and automatic semantics extraction. Given a family of distances on a set of objects, a distance is universal up to a certain precision for that family if it minorizes every distance in the family between every two objects in the set, up to the stated precision (we do not require the universal distance to be an element of the family). We consider similarity distances for two types of objects: literal objects that as such contain all of their meaning, like genomes or books, and names for objects. The latter may have literal embodiments like the first type, but may also be abstract like “red” or “christianity.” For the first type we consider a family of computable distance measures corresponding to parameters expressing similarity according to particular features between pairs of literal objects. For the second type we consider similarity distances generated by web users corresponding to particular semantic relations between the (names for the) designated objects. For both families we give universal similarity distance measures, incorporating all particular distance measures in the family. In the first case the universal distance is based on compression and in the second case it is based on Google page counts related to search terms. In both cases experiments on a massive scale give evidence of the viability of the approaches.
MUSIC GENRE CLASSIFICATION USING SIMILARITY FUNCTIONS
Cited by 4
We consider music classification problems. A typical machine learning approach is to use support vector machines with some kernels. This approach, however, does not seem to be successful enough for classifying music data in our experiments. In this paper, we follow an alternative approach. We employ a (dis)similarity-based learning framework proposed by Wang et al. This (dis)similarity-based approach has a theoretical guarantee that one can obtain accurate classifiers using (dis)similarity measures under a natural assumption. We demonstrate the effectiveness of our approach in computational experiments using Japanese MIDI data.