## Similarity of objects and the meaning of words (2006)

### Download Links

- [www.cwi.nl]
- [homepages.cwi.nl]
- [eprints.pascal-network.org]
- DBLP

### Other Repositories/Bibliography

Venue: Proc. 3rd Annual Conference on Theory and Applications of Models of Computation (TAMC’06), volume 3959 of LNCS

Citations: 6 (0 self)

### BibTeX

@INPROCEEDINGS{Cilibrasi06similarityof,
  author    = {Rudi Cilibrasi and Paul Vitanyi},
  title     = {Similarity of objects and the meaning of words},
  booktitle = {Proc. 3rd Annual Conference on Theory and Applications of Models of Computation (TAMC'06), volume 3959 of LNCS},
  year      = {2006},
  pages     = {21--45},
  publisher = {Springer}
}

### Abstract

We survey the emerging area of compression-based, parameter-free similarity distance measures useful in data mining, pattern recognition, learning, and automatic semantics extraction. Given a family of distances on a set of objects, a distance is universal up to a certain precision for that family if it minorizes every distance in the family between every two objects in the set, up to the stated precision (we do not require the universal distance to be an element of the family). We consider similarity distances for two types of objects: literal objects that as such contain all of their meaning, like genomes or books, and names for objects. The latter may have literal embodiments like the first type, but may also be abstract, like “red” or “Christianity.” For the first type we consider a family of computable distance measures corresponding to parameters expressing similarity according to particular features between pairs of literal objects. For the second type we consider similarity distances generated by web users, corresponding to particular semantic relations between the (names for the) designated objects. For both families we give universal similarity distance measures, incorporating all particular distance measures in the family. In the first case the universal distance is based on compression, and in the second case it is based on Google page counts related to search terms. In both cases, experiments on a massive scale give evidence of the viability of the approaches.

### Citations

2420 | A Tutorial on Support Vector Machines for Pattern Recognition
- Burges
- 1998
Citation Context: ...so for the boundary case where C = K. In practice, real-world compressors appear to satisfy this weaker distributivity property up to the required precision. Definition 6. Define C(y|x) = C(xy) − C(x). (5) This number C(y|x) of bits of information in y, relative to x, can be viewed as the excess number of bits in the compressed version of xy compared to the compressed version of x, and is called the am...
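Definition 6 in the snippet above can be tried directly with an off-the-shelf compressor. A minimal sketch, assuming Python's zlib as the stand-in for the reference compressor C (the function names are illustrative):

```python
import zlib

def C(s: bytes) -> int:
    """Approximate the compressed size C(s) by the zlib-compressed length."""
    return len(zlib.compress(s, 9))

def cond_C(y: bytes, x: bytes) -> int:
    """C(y|x) = C(xy) - C(x): the excess number of bits (here, bytes) in the
    compressed version of xy compared to the compressed version of x alone."""
    return C(x + y) - C(x)
```

When y largely repeats material already in x, the excess is near zero; unrelated material costs roughly its own compressed size.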

1735 | An introduction to Kolmogorov complexity and its applications
- Li, Vitányi
- 1993
Citation Context: ... The distances determined between objects are justified by ad-hoc plausibility arguments and represent a partially independent development (although they refer to the information distance approach of [27,3]). Altogether, it appears that the notion of compression-based similarity metric is so powerful that its performance is robust under considerable variations. 2 Similarity Distance We briefly outline a...

1177 | A solution to Plato’s problem: The Latent Semantic Analysis theory of the acquisition, induction, and representation of knowledge
- Landauer, Dumais
- 1997
Citation Context: ...robabilities of search terms and the computed NGDs should stabilize (become scale invariant) with a growing Google database. Related Work: There is a great deal of work in both cognitive psychology [22], linguistics, and computer science, about using word (phrases) frequencies in text corpora to develop measures for word similarity or word association, partially surveyed in [37,38], going back to at...
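The NGD mentioned in this snippet is ordinary arithmetic on page counts. A minimal sketch of the normalized Google distance as defined in the paper, where the hit counts f_x, f_y, f_xy and the total-page normalizer n are assumed inputs that would come from a search engine:

```python
from math import log

def ngd(f_x: int, f_y: int, f_xy: int, n: int) -> float:
    """Normalized Google Distance from raw page counts.

    f_x, f_y : hit counts for each search term alone
    f_xy     : hit count for a query containing both terms
    n        : normalizing total page count (an assumed parameter here)
    """
    lx, ly, lxy = log(f_x), log(f_y), log(f_xy)
    return (max(lx, ly) - lxy) / (log(n) - min(lx, ly))
```

Terms that always co-occur get distance 0; the distance grows as joint hits become rarer relative to the individual hits.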

717 | CYC: A Large-scale investment in knowledge infrastructure
- Lenat
- 1995
Citation Context: ...f Health (NIH). 4 Google-Based Similarity To make computers more intelligent one would like to represent meaning in computer-digestible form. Long-term and labor-intensive efforts like the Cyc project [23] and the WordNet project [36] try to establish semantic relations between common objects, or, more precisely, names for those objects. The idea is to create a semantic web of such vast proportions tha...

639 | Musical Genre Classification of Audio Signals
- Tzanetakis, Cook
- 2002
Citation Context: ...four-letter alphabet); from music files one can extract various specific numerical features, related to pitch, rhythm, harmony etc. One can extract such features using for instance Fourier transforms [39] or wavelet transforms [18], to quantify parameters expressing similarity. The resulting vectors corresponding to the various files are then classified or clustered using existing classification softw...

545 | Three approaches to the quantitative definition of information - Kolmogorov - 1968

199 | Selecting the right interestingness measure for association patterns
- Tan, Kumar, et al.
Citation Context: ...cognitive psychology [22], linguistics, and computer science, about using word (phrases) frequencies in text corpora to develop measures for word similarity or word association, partially surveyed in [37,38], going back to at least [24]. One of the most successful is Latent Semantic Analysis (LSA) [22] that has been applied in various forms in a great number of applications. As with LSA, many other previ...

196 | The similarity metric
- Li, Chen, et al.
Citation Context: ...general gives good results on natural data sets. The early use of the “sum” distance was replaced by the “max” distance in [30] in 2001 and applied to mammal phylogeny in 2001 in the early version of [31] and in later versions also to the language tree. In [31] it was shown that an appropriately normalized “max” distance is metric, and minorizes all normalized computable distances satisfying a certain...

186 | Clustering by compression
- Cilibrasi, Vitányi
Citation Context: ... universal metric distance zooms in on the dominant similarity between those two objects out of a wide class of admissible similarity features. Hence it may be called “the” similarity metric. In 2003 [12] it was realized that the method could be used for hierarchical clustering of natural data sets from arbitrary (also heterogeneous) domains, and the theory related to the application of real-world comp...

119 | Towards parameter-free data mining
- Keogh, Lonardi, et al.
- 2004
Citation Context: ...from arbitrary (also heterogeneous) domains, and the theory related to the application of real-world compressors was developed, and numerous applications in different domains were given, Section 3. In [19] the authors use a simplified version of the similarity metric, which also performs well. In [2], and follow-up work, a closely related notion of compression-based distances is proposed. There the pur...

106 | An information-based sequence distance and its application to whole mitochondrial genome phylogeny
- Li, Badger, et al.
- 2001
Citation Context: ...ion of the “sum” information distance (K(x|y)+K(y|x))/K(xy) was introduced as a similarity distance and applied to construct a phylogeny of bacteria in [28], and subsequently mammal phylogeny in 2001 [29], followed by plagiarism detection in student programming assignments [6], and phylogeny of chain letters in [4]. In [29] it was shown that the normalized sum distance is a metric, and minorizes certa...

91 | Quartet puzzling: A quartet maximum-likelihood method for reconstructing tree topologies
- Strimmer, von Haeseler
- 1996
Citation Context: ...smallest differences in distances, and therefore use the most sensitive method to get greatest accuracy. Here, we use a new quartet-method (actually a new version [12] of the quartet puzzling variant [35]), which is a heuristic based on randomized parallel hill-climbing genetic programming. In this paper we do not describe this method in any detail; the reader is referred to [12], or the full descript...

86 | Frequency estimates for statistical word similarity measures
- Terra, Clarke
- 2003
Citation Context: ...cognitive psychology [22], linguistics, and computer science, about using word (phrases) frequencies in text corpora to develop measures for word similarity or word association, partially surveyed in [37,38], going back to at least [24]. One of the most successful is Latent Semantic Analysis (LSA) [22] that has been applied in various forms in a great number of applications. As with LSA, many other previ...

83 | Language trees and zipping
- Benedetto, Caglioti, et al.
- 2002
Citation Context: ...ld compressors was developed, and numerous applications in different domains were given, Section 3. In [19] the authors use a simplified version of the similarity metric, which also performs well. In [2], and follow-up work, a closely related notion of compression-based distances is proposed. There the purpose was initially to infer a language tree from different-language text corpora, as well as do...

65 | Folk Music Classification Using Hidden Markov Models
- Chai, Vercoe
- 2001
Citation Context: ...es are then classified or clustered using existing classification software, based on various standard statistical pattern recognition classifiers [39], Bayesian classifiers [15], hidden Markov models [9], ensembles of nearest-neighbor classifiers [18] or neural networks [15,34]. For example, in music one feature would be to look for rhythm in the sense of beats per minute. One can make a histogram wh...

64 | Shared information and program plagiarism detection
- Chen, Francia, et al.
- 2004
Citation Context: ... as a similarity distance and applied to construct a phylogeny of bacteria in [28], and subsequently mammal phylogeny in 2001 [29], followed by plagiarism detection in student programming assignments [6], and phylogeny of chain letters in [4]. In [29] it was shown that the normalized sum distance is a metric, and minorizes certain computable distances up to a multiplicative factor of 2 with high prob...

64 | A Machine Learning Approach to Musical Style Recognition
- Dannenberg, Thom, et al.
- 1997
Citation Context: ...sponding to the various files are then classified or clustered using existing classification software, based on various standard statistical pattern recognition classifiers [39], Bayesian classifiers [15], hidden Markov models [9], ensembles of nearest-neighbor classifiers [18] or neural networks [15,34]. For example, in music one feature would be to look for rhythm in the sense of beats per minute. O...

64 | Word-word associations in document retrieval systems
- Lesk
- 1969
Citation Context: ...istics, and computer science, about using word (phrases) frequencies in text corpora to develop measures for word similarity or word association, partially surveyed in [37,38], going back to at least [24]. One of the most successful is Latent Semantic Analysis (LSA) [22] that has been applied in various forms in a great number of applications. As with LSA, many other previous approaches of extracting...

64 | WordNet: A Lexical Database for English Language. cogsci.princeton.edu/wn
- Miller
Citation Context: ...d Similarity To make computers more intelligent one would like to represent meaning in computer-digestible form. Long-term and labor-intensive efforts like the Cyc project [23] and the WordNet project [36] try to establish semantic relations between common objects, or, more precisely, names for those objects. The idea is to create a semantic web of such vast proportions that rudimentary intelligence an...

53 | Algorithmic clustering of music based on string compression
- Cilibrasi, Vitanyi, et al.
- 2004
Citation Context: ... a universal similarity metric (10) among objects given as finite binary strings, and, apart from what was mentioned in the Introduction, has been applied to objects like music pieces in MIDI format, [11], computer programs, genomics, virology, language tree of non-Indo-European languages, literature in Russian Cyrillic and English translation, optical character recognition of handwritten digits in s...

48 | Mapping Ontologies into Cyc - Reed, Lenat - 2002

47 | Combinatorial foundations of information theory and the calculus of probabilities - Kolmogorov - 1983

38 | Automatic meaning discovery using Google
- Cilibrasi, Vitanyi
- 2004
Citation Context: ...o 6-dimensional vectors in the same manner, as the training examples. This results in an accuracy score of correctly classified test examples. We ran 100 experiments. The actual data are available at [10]. A histogram of agreement accuracies is shown in Figure 7 (Fig. 7: Histogram of accuracies). On average, our method...

34 | Reversibility and adiabatic computation: Trading time and space for energy
- Li, Vitanyi
- 1996
Citation Context: ...ecisely, from which the object can be recovered by a fixed algorithm. The “sum” version of information distance, K(x|y)+K(y|x), arose from thermodynamical considerations about reversible computations [25,26] in 1992. It is a metric and minorizes all computable distances satisfying a given density condition up to a multiplicative factor of 2. Subsequently, in 1993, the “max” version of information distanc...

30 | Chain letters and evolutionary histories, Scientific American
- Bennett, Li, et al.
- 2003
Citation Context: ...onger distributivity property C(xyz)+C(z) ≤ C(xz)+C(yz) (3) holds (with K = C). However, to prove the desired properties of NCD below, only the weaker distributivity property C(xy)+C(z) ≤ C(xz)+C(yz) (4) above is required, also for the boundary case where C = K. In practice, real-world compressors appear to satisfy this weaker distributivity property up to the required precision. Definition 6. Define...

18 | The CompLearn Toolkit
- Cilibrasi
- 2003
Citation Context: ...type of compressor used. The clustering we use is hierarchical clustering in dendrograms based on a new fast heuristic for the quartet method. The method is available as an open-source software tool, [7]. Feature-Based Similarities: We are presented with unknown data and the question is to determine the similarities among them and group like with like together. Commonly, the data are of a certain typ...

18 | Classifying music by genre using the wavelet packet transform and a round-robin ensemble
- Grimaldi, Kokaram, et al.
- 2002
Citation Context: ... music files one can extract various specific numerical features, related to pitch, rhythm, harmony etc. One can extract such features using for instance Fourier transforms [39] or wavelet transforms [18], to quantify parameters expressing similarity. The resulting vectors corresponding to the various files are then classified or clustered using existing classification software, based on various stand...

16 | Music classification using neural networks
- Scott
- 2001
Citation Context: ...are, based on various standard statistical pattern recognition classifiers [39], Bayesian classifiers [15], hidden Markov models [9], ensembles of nearest-neighbor classifiers [18] or neural networks [15,34]. For example, in music one feature would be to look for rhythm in the sense of beats per minute. One can make a histogram where each histogram bin corresponds to a particular tempo in beats-per-minut...

14 | Algorithmic Complexity
- Li, Vitányi
Citation Context: ...ximation, imprecise though it is, but guided by an ideal provable theory, in general gives good results on natural data sets. The early use of the “sum” distance was replaced by the “max” distance in [30] in 2001 and applied to mammal phylogeny in 2001 in the early version of [31] and in later versions also to the language tree. In [31] it was shown that an appropriately normalized “max” distance is m...

12 | A new quartet tree heuristic for hierarchical clustering, arXiv preprint cs/0606048
- Cilibrasi, Vitanyi
- 2006
Citation Context: ...h is a heuristic based on randomized parallel hill-climbing genetic programming. In this paper we do not describe this method in any detail; the reader is referred to [12], or the full description in [14]. It is implemented in the CompLearn package [7]. We describe the idea of the algorithm, and the interpretation of the accuracy of the resulting tree representation of the data clustering. To cluster...
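The quartet method referred to in this snippet scores candidate trees against all 4-sets of items. For each 4-set {u,v,w,x} there are three possible topologies uv|wx, uw|vx, ux|vw, with costs d(u,v)+d(w,x) and so on; summing the per-quartet minimum and maximum costs yields the bounds m and M that normalize a candidate tree's cost C_T into the score S(T) = (M − C_T)/(M − m). A minimal sketch of just the bound computation (the function name and the distance-matrix representation are illustrative):

```python
from itertools import combinations

def quartet_bounds(d):
    """Given a symmetric distance matrix d (list of lists), return (m, M):
    the sums over all 4-sets of the cheapest and costliest of the three
    possible quartet topology costs."""
    n = len(d)
    m_total, M_total = 0.0, 0.0
    for u, v, w, x in combinations(range(n), 4):
        costs = (d[u][v] + d[w][x],   # topology uv|wx
                 d[u][w] + d[v][x],   # topology uw|vx
                 d[u][x] + d[v][w])   # topology ux|vw
        m_total += min(costs)
        M_total += max(costs)
    return m_total, M_total
```

A tree embedding the cheapest topology for every quartet would reach S(T) = 1; the heuristic hill-climbs toward that ideal.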

10 | Analyzing network traffic and worms using compression
- Wehner
Citation Context: ...he other methods in the simpler tasks. We developed the CompLearn Toolkit, [7], and performed experiments in vastly different application fields to test the quality and universality of the method. In [40], the method is used to analyze network traffic and cluster computer worms and viruses. Currently, a plethora of new applications of the method arise around the world, in many areas, as the reader ca...

6 | On the Google-fame of scientists and other populations
- Bagrow, ben-Avraham
Citation Context: ...nce, d(x,y), based on D relative to a reference compressor C, is defined by d(x,y) = D(x,y)/D+(x,y). It follows from the definitions that a normalized admissible distance is a function d : Ω × Ω → [0,1] that is symmetric: d(x,y) = d(y,x). Lemma 1. For every x ∈ Ω, and constant e ∈ [0,1], a normalized admissible distance satisfies the density constraint |{y : d(x,y) ≤ e, C(y) ≤ C(x)}| < 2^{eD+(x)+1}....

4 | Learning by
- Cimiano, Staab
Citation Context: ...sense. Approximation of the denominator of (7) by a given compressor C is straightforward: it is max{C(x),C(y)}. The numerator is more tricky. It can be rewritten as max{K(x,y) − K(x), K(x,y) − K(y)}, (8) within logarithmic additive precision, by the additive property of Kolmogorov complexity [27]. The term K(x,y) represents the length of the shortest program for the pair (x,y). In compression practic...
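Combining the approximated numerator max{C(xy) − C(x), C(xy) − C(y)} with the denominator max{C(x),C(y)} gives the practical form NCD(x,y) = (C(xy) − min{C(x),C(y)}) / max{C(x),C(y)}. A minimal sketch, with Python's bz2 as an assumed stand-in for the reference compressor:

```python
import bz2

def C(s: bytes) -> int:
    """Compressed length under the reference compressor (bz2 here)."""
    return len(bz2.compress(s))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance: near 0 for (near-)identical inputs,
    approaching 1 for unrelated ones."""
    cx, cy, cxy = C(x), C(y), C(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)
```

Because real-world compressors are imperfect, ncd(x, x) comes out slightly above 0 rather than exactly 0, which is the precision caveat the surrounding text discusses.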

4 | Corpus Colossal: How Well Does the World Wide Web Represent Human Language
- 2005
Citation Context: ... mathematical techniques like singular value decomposition and dimensionality reduction, and that are in local storage, and on assumptions that are more restricted, than what we propose. In contrast, [41,8,1] and the many references cited there, use the web and Google counts to identify lexico-syntactic patterns or other data. Again, the theory, aim, feature analysis, and execution are different from ours...

3 | Theory of thermodynamics of computation
- Li, Vitányi
- 1992
Citation Context: ...ecisely, from which the object can be recovered by a fixed algorithm. The “sum” version of information distance, K(x|y)+K(y|x), arose from thermodynamical considerations about reversible computations [25,26] in 1992. It is a metric and minorizes all computable distances satisfying a given density condition up to a multiplicative factor of 2. Subsequently, in 1993, the “max” version of information distanc...

3 | A compression algorithm for DNA sequences based on approximate matching - Chen, Kwong, et al. - 2000

1 | A New Quartet Tree Heuristic for Hierarchical
- Cilibrasi, Vitanyi
Citation Context: ...h is a heuristic based on randomized parallel hill-climbing genetic programming. In this paper we do not describe this method in any detail; the reader is referred to [12], or the full description in [14]. It is implemented in the CompLearn package [7]. We describe the idea of the algorithm, and the interpretation of the accuracy of the resulting tree representation of the data clustering. To cluster...