Results 1-10 of 17
Clustering by compression
IEEE Transactions on Information Theory, 2005
Cited by 182 (24 self)
Abstract: We present a new method for clustering based on compression. The method does not use subject-specific features or background knowledge, and works as follows: First, we determine a parameter-free, universal, similarity distance, the normalized compression distance or NCD, computed from the lengths of compressed data files (singly and in pairwise concatenation). Second, we apply a hierarchical clustering method. The NCD is not restricted to a specific application area, and works across application area boundaries. A theoretical precursor, the normalized information distance, co-developed by one of the authors, is provably optimal. However, the optimality comes at the price of using the noncomputable notion of Kolmogorov complexity. We propose axioms to capture the real-world setting, and show that the NCD approximates optimality. To extract a hierarchy of clusters from the distance matrix, we determine a dendrogram (ternary tree) by a new quartet method and a fast heuristic to implement it. The method is implemented and available as public software, and is robust under choice of different compressors. To substantiate our claims of universality and robustness, we report evidence of successful application in areas as diverse as genomics, virology, languages, literature, music, handwritten digits, astronomy, and combinations of objects from completely different domains, using statistical, dictionary, and block sorting compressors. In genomics, we presented new evidence for major questions in Mammalian evolution, based on whole-mitochondrial genomic analysis: the Eutherian orders and the Marsupionta hypothesis against the Theria hypothesis.
Index Terms: Heterogeneous data analysis, hierarchical unsupervised clustering, Kolmogorov complexity, normalized compression distance, parameter-free data mining, quartet tree method, universal dissimilarity distance.
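The abstract above describes the NCD as computed from compressed lengths of files taken singly and in pairwise concatenation. A minimal sketch of that computation follows, assuming the usual form NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length. The choice of zlib as the compressor and the helper names `compressed_len` and `ncd` are illustrative assumptions, not the authors' public software.

```python
import zlib


def compressed_len(data: bytes) -> int:
    """Compressed length in bytes, used as a stand-in for C(x).

    zlib at level 9 is an arbitrary choice for illustration; any
    real-world compressor could be substituted here.
    """
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx = compressed_len(x)
    cy = compressed_len(y)
    cxy = compressed_len(x + y)  # pairwise concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)
```

Values near 0 suggest the two inputs share most of their information, and values near 1 suggest they share little. With a real compressor the result is only an approximation and can slightly exceed 1.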
P.M.B.: The Google similarity distance
IEEE Transactions on Knowledge and Data Engineering, 2007
Cited by 177 (8 self)
Abstract: Words and phrases acquire meaning from the way they are used in society, from their relative semantics to other words and phrases. For computers, the equivalent of “society” is “database,” and the equivalent of “use” is “a way to search the database.” We present a new theory of similarity between words and phrases based on information distance and Kolmogorov complexity. To fix thoughts, we use the World Wide Web (WWW) as the database, and Google as the search engine. The method is also applicable to other search engines and databases. This theory is then applied to construct a method to automatically extract similarity, the Google similarity distance, of words and phrases from the WWW using Google page counts. The WWW is the largest database on earth, and the context information entered by millions of independent users averages out to provide automatic semantics of useful quality. We give applications in hierarchical clustering, classification, and language translation. We give examples to distinguish between colors and numbers, cluster names of paintings by 17th century Dutch masters and names of books by English novelists, the ability to understand emergencies and primes, and we demonstrate the ability to do a simple automatic English-Spanish translation. Finally, we use the WordNet database as an objective baseline against which to judge the performance of our method. We conduct a massive randomized trial in binary classification using support vector machines to learn categories based on our Google distance, resulting in a mean agreement of 87 percent with the expert-crafted WordNet categories.
Index Terms: Accuracy comparison with WordNet categories, automatic classification and clustering, automatic meaning discovery using Google, automatic relative semantics, automatic translation, dissimilarity semantic distance, Google search, Google distribution via page hit counts, Google code, Kolmogorov complexity, normalized compression distance (NCD), normalized information distance (NID), normalized Google distance (NGD), meaning of words and phrases extracted from the Web, parameter-free data mining, universal similarity metric.
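The Google similarity distance above is derived from page counts. A minimal sketch follows, assuming the commonly stated form NGD(x, y) = (max(log f(x), log f(y)) - log f(x, y)) / (log N - min(log f(x), log f(y))). Here f(x) is the number of pages containing x, f(x, y) the number containing both terms, and N the total number of indexed pages. The function name `ngd` and its parameter names are hypothetical, and the counts must be supplied by the caller since no search API is called.

```python
import math


def ngd(fx: int, fy: int, fxy: int, n: int) -> float:
    """Normalized Google distance from raw page counts.

    fx, fy : page counts for each term alone (assumed > 0)
    fxy    : page count for both terms together
    n      : total number of pages indexed (assumed > fx and > fy)
    """
    if fxy == 0:
        # Terms never co-occur: treat the distance as infinite.
        return math.inf
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))
```

Under this formula, two terms that always appear together get distance 0, and the distance grows as the co-occurrence count shrinks relative to the individual counts.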
Shared Information and Program Plagiarism Detection
IEEE TRANS. INFORM. TH
Cited by 63 (2 self)
A fundamental question in information theory and in computer science is how to measure similarity or the amount of shared information between two sequences. We have proposed a metric based on Kolmogorov complexity to answer this question, and have proven it to be universal. We apply this metric in measuring the amount of shared information between two computer programs, to enable plagiarism detection. We have
P.: Automatic meaning discovery using Google
Centrum Wiskunde & Informatica (CWI), 2004
Cited by 37 (4 self)
We have found a method to automatically extract the meaning of words and phrases from the world-wide-web using Google page counts. The approach is novel in its unrestricted problem domain, simplicity of implementation, and manifestly ontological underpinnings. The world-wide-web is the largest database on earth, and the latent semantic context information entered by millions of independent users averages out to provide automatic meaning of useful quality. We demonstrate positive correlations, evidencing an underlying semantic structure, in both numerical symbol notations and number-name words in a variety of natural languages and contexts. Next, we demonstrate the ability to distinguish between colors and numbers, and to distinguish between 17th century Dutch painters; the ability to understand electrical terms, religious terms, and emergency incidents; and the ability to do a simple automatic English-Spanish translation. We also conduct a massive experiment in understanding WordNet categories.
Algorithmic clustering of music
Computer Music Journal, 2004
Cited by 15 (4 self)
We present a method for hierarchical music clustering, based on compression of strings that represent the music pieces. The method uses no background knowledge about music whatsoever: it is completely general and can, without change, be used in different areas like linguistic classification, literature, and genomics. Indeed, it can be used to simultaneously cluster objects from completely different domains, like with like. It is based on an ideal theory of the information content in individual objects (Kolmogorov complexity), information distance, and a universal similarity metric. The approximation to the universal similarity metric obtained using standard data compressors is called “normalized compression distance (NCD).” Experiments using our CompLearn software tool show that the method distinguishes between various musical genres and can even cluster pieces by composer.
P.M.B.: A New Quartet Tree Heuristic for Hierarchical Clustering
arXiv:cs/0606048, 2006
Cited by 13 (5 self)
We consider the problem of constructing an optimal-weight tree from the $3\binom{n}{4}$ weighted quartet topologies on n objects, where optimality means that the summed weight of the embedded quartet topologies is optimal (so it can be the case that the optimal tree embeds all quartets as non-optimal topologies). We present a heuristic for reconstructing the optimal-weight tree, and a canonical manner to derive the quartet-topology weights from a given distance matrix. The method repeatedly transforms a bifurcating tree, with all objects involved as leaves, achieving a monotonic approximation to the exact single globally optimal tree. This contrasts with other heuristic search methods from biological phylogeny, like DNAML or quartet puzzling, which repeatedly and incrementally construct a solution from a random order of objects, and subsequently add agreement values. We do not assume that there exists a true bifurcating supertree that embeds each quartet in the optimal topology, or represents the distance matrix faithfully, not even under the assumption that the weights or distances are corrupted by a measuring process. Our aim is to hierarchically cluster the input data as faithfully as possible, both phylogenetic data and data of completely different types. In our experiments with natural data, like genomic data, texts or music, the global optimum appears to be reached. Our method is capable of handling over 100 objects, possibly up to 1000 objects, while no existing quartet heuristic can computationally approximate the exact optimal solution of a quartet tree of more than about 20–30 objects without running for years. The method is implemented and available as public software.
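The count of $3\binom{n}{4}$ quartet topologies arises because each set of four objects {u, v, w, x} can be split into two pairs in exactly three ways: uv|wx, uw|vx, and ux|vw. The sketch below enumerates them and assigns each a cost from a distance matrix. The abstract does not spell out the weighting rule, so the pair-sum cost d(u, v) + d(w, x) used here is an assumption, and the function name `quartet_topology_costs` is hypothetical.

```python
from itertools import combinations


def quartet_topology_costs(dist):
    """Map each quartet topology ((a, b), (c, d)) to an assumed cost.

    dist is a symmetric n x n matrix (list of lists) of pairwise
    distances. The cost of topology ab|cd is taken to be
    dist[a][b] + dist[c][d]; this rule is an assumption.
    """
    n = len(dist)
    costs = {}
    for u, v, w, x in combinations(range(n), 4):
        # The three ways of pairing up four objects.
        pairings = (
            ((u, v), (w, x)),
            ((u, w), (v, x)),
            ((u, x), (v, w)),
        )
        for (a, b), (c, d) in pairings:
            costs[((a, b), (c, d))] = dist[a][b] + dist[c][d]
    return costs
```

For n objects the returned dictionary has 3 * C(n, 4) entries, which grows as n^4. That growth is one reason a heuristic search over trees is needed in place of exhaustive evaluation.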
Causal inference using the algorithmic Markov condition
2008
Cited by 11 (11 self)
Inferring the causal structure that links n observables is usually based upon detecting statistical dependences and choosing simple graphs that make the joint measure Markovian. Here we argue why causal inference is also possible when only single observations are present. We develop a theory of how to generate causal graphs explaining similarities between single objects. To this end, we replace the notion of conditional stochastic independence in the causal Markov condition with the vanishing of conditional algorithmic mutual information and describe the corresponding causal inference rules. We explain why a consistent reformulation of causal inference in terms of algorithmic complexity implies a new inference principle that also takes into account the complexity of conditional probability densities, making it possible to select among Markov equivalent causal graphs. This insight provides a theoretical foundation of a heuristic principle proposed in earlier work. We also discuss how to replace Kolmogorov complexity with decidable complexity criteria. This can be seen as an algorithmic analog of replacing the empirically undecidable question of statistical independence with practical independence tests that are based on implicit or explicit assumptions on the underlying distribution.
Automatic Extraction of Meaning from the Web
IEEE International Symposium on Information Theory, 2006
Cited by 10 (2 self)
Abstract: We consider similarity distances for two types of objects: literal objects that as such contain all of their meaning, like genomes or books, and names for objects. The latter may have literal embodiments like the first type, but may also be abstract like “red” or “christianity.” For the first type we consider a family of computable distance measures corresponding to parameters expressing similarity according to particular features between pairs of literal objects. For the second type we consider similarity distances generated by web users corresponding to particular semantic relations between the (names for) the designated objects. For both families we give universal similarity distance measures, incorporating all particular distance measures in the family. In the first case the universal distance is based on compression and in the second case it is based on Google page counts related to search terms. In both cases experiments on a massive scale give evidence of the viability of the approaches.
A simple and fast DNA compressor
Software: Practice and Experience, 2004
Cited by 6 (0 self)
In this paper we consider the problem of DNA compression. It is well known that one of the main features of DNA sequences is that they contain substrings which are duplicated except for a few random mutations. For this reason most DNA compressors work by searching and encoding approximate repeats. We depart from this strategy by searching and encoding only exact repeats. However, we use an encoding designed to take advantage of the possible presence of approximate repeats. Our approach leads to an algorithm which is an order of magnitude faster than any other algorithm and achieves a compression ratio very close to the best DNA compressors. Another important feature of our algorithm is its small space occupancy which makes it possible to compress sequences hundreds of megabytes long, well beyond the range of any previous DNA compressor.
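To make "searching only exact repeats" concrete, here is a naive illustration that scans a sequence left to right and reports each position where the text repeats an earlier substring exactly. It is not the paper's algorithm or encoding: the k-mer index, the greedy parse, the minimum length `k`, and the name `find_exact_repeats` are all assumptions made for this sketch.

```python
def find_exact_repeats(seq: str, k: int = 4):
    """Greedy left-to-right search for exact repeats of length >= k.

    Returns a list of (position, source, length) triples meaning
    seq[position:position + length] == seq[source:source + length],
    with source < position.
    """
    index = {}    # k-mer -> earliest start position seen so far
    repeats = []
    n = len(seq)
    i = 0
    while i + k <= n:
        kmer = seq[i:i + k]
        src = index.get(kmer)
        if src is None:
            # First occurrence: remember it and move on one symbol.
            index[kmer] = i
            i += 1
            continue
        # Extend the match as far as the two copies agree.
        length = k
        while i + length < n and seq[src + length] == seq[i + length]:
            length += 1
        repeats.append((i, src, length))
        # Index the k-mers that start inside the repeated region.
        for j in range(i, min(i + length, n - k + 1)):
            index.setdefault(seq[j:j + k], j)
        i += length
    return repeats
```

For example, `find_exact_repeats("ACGTTGCAACGTTGCA")` reports a single repeat of length 8 starting at position 8 and copying from position 0. A real compressor would then encode each triple compactly, and that encoding step is where the abstract says approximate repeats are exploited.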
Similarity of objects and the meaning of words
In Proc. 3rd Annual Conference on Theory and Applications of Models of Computation (TAMC’06), volume 3959 of LNCS, 2006
Cited by 6 (0 self)
Abstract. We survey the emerging area of compression-based, parameter-free, similarity distance measures useful in data mining, pattern recognition, learning and automatic semantics extraction. Given a family of distances on a set of objects, a distance is universal up to a certain precision for that family if it minorizes every distance in the family between every two objects in the set, up to the stated precision (we do not require the universal distance to be an element of the family). We consider similarity distances for two types of objects: literal objects that as such contain all of their meaning, like genomes or books, and names for objects. The latter may have literal embodiments like the first type, but may also be abstract like “red” or “christianity.” For the first type we consider a family of computable distance measures corresponding to parameters expressing similarity according to particular features between pairs of literal objects. For the second type we consider similarity distances generated by web users corresponding to particular semantic relations between the (names for) the designated objects. For both families we give universal similarity distance measures, incorporating all particular distance measures in the family. In the first case the universal distance is based on compression and in the second case it is based on Google page counts related to search terms. In both cases experiments on a massive scale give evidence of the viability of the approaches.