## Clustering by compression (2005)

Venue: IEEE Transactions on Information Theory

Citations: 179 (23 self)

### BibTeX

@ARTICLE{Cilibrasi05clusteringby,
    author = {Rudi Cilibrasi and Paul M. B. Vitányi},
    title = {Clustering by compression},
    journal = {IEEE Transactions on Information Theory},
    year = {2005},
    volume = {51},
    pages = {1523--1545}
}

### Abstract

We present a new method for clustering based on compression. The method does not use subject-specific features or background knowledge, and works as follows: First, we determine a parameter-free, universal, similarity distance, the normalized compression distance or NCD, computed from the lengths of compressed data files (singly and in pairwise concatenation). Second, we apply a hierarchical clustering method. The NCD is not restricted to a specific application area, and works across application area boundaries. A theoretical precursor, the normalized information distance, co-developed by one of the authors, is provably optimal. However, the optimality comes at the price of using the noncomputable notion of Kolmogorov complexity. We propose axioms to capture the real-world setting, and show that the NCD approximates optimality. To extract a hierarchy of clusters from the distance matrix, we determine a dendrogram (ternary tree) by a new quartet method and a fast heuristic to implement it. The method is implemented and available as public software, and is robust under choice of different compressors. To substantiate our claims of universality and robustness, we report evidence of successful application in areas as diverse as genomics, virology, languages, literature, music, handwritten digits, astronomy, and combinations of objects from completely different domains, using statistical, dictionary, and block sorting compressors. In genomics, we presented new evidence for major questions in Mammalian evolution, based on whole-mitochondrial genomic analysis: the Eutherian orders and the Marsupionta hypothesis against the Theria hypothesis.

Index Terms: Heterogeneous data analysis, hierarchical unsupervised clustering, Kolmogorov complexity, normalized compression distance, parameter-free data mining, quartet tree method, universal dissimilarity distance.
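The abstract describes the NCD as computed from compressed lengths taken singly and in pairwise concatenation. A minimal Python sketch of that idea follows, assuming the usual form NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length. The function names and the choice of the standard-library `zlib` and `bz2` compressors are illustrative assumptions, not the authors' released software.

```python
import bz2
import zlib

# Illustrative compressor table (an assumption of this sketch): any function
# mapping bytes to compressed bytes can stand in for the compressor.
COMPRESSORS = {"zlib": zlib.compress, "bz2": bz2.compress}


def compressed_len(data: bytes, compressor=zlib.compress) -> int:
    """Return the length in bytes of the compressed form of `data`."""
    return len(compressor(data))


def ncd(x: bytes, y: bytes, compressor=zlib.compress) -> float:
    """Normalized compression distance from single and concatenated lengths."""
    cx = compressed_len(x, compressor)
    cy = compressed_len(y, compressor)
    cxy = compressed_len(x + y, compressor)
    return (cxy - min(cx, cy)) / max(cx, cy)
```

Calling `ncd(a, b, compressor=COMPRESSORS["bz2"])` swaps in a block-sorting compressor. With real compressors the value is only approximately within [0, 1], so small excursions above 1 are expected.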

### Citations

8595 |
Elements of Information Theory
- Cover, Thomas
- 1990
Citation Context ...ORY, VOL. 51, NO. 4, APRIL 2005 If $D$ is an admissible distance, then for every $x$, the set $\{D(x, y) : y \in \{0, 1\}^*\}$ is the length set of a prefix code. Hence it satisfies, by the Kraft inequality, see =-=[12]-=-, $\sum_y 2^{-D(x, y)} \le 1$ (II.1). Example 2.2: In representing the Hamming distance $d$ between two strings of equal length $n$ differing in positions $i_1, \ldots, i_d$, we can use a simple prefix-free encoding of (n, d, ... |
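The Kraft inequality mentioned in this context says that the code word lengths of any prefix code satisfy sum 2^(-length) <= 1. The sketch below checks this for a small prefix-free code. The Elias gamma code is used purely as a convenient, well-known self-delimiting integer code; it is an assumption of this illustration and not the encoding used in the paper, and all function names are hypothetical.

```python
from fractions import Fraction


def kraft_sum(lengths):
    """Exact value of sum(2**-l) over the given code word lengths."""
    return sum(Fraction(1, 2 ** length) for length in lengths)


def elias_gamma(n: int) -> str:
    """Elias gamma code word for a positive integer, as a bit string."""
    if n < 1:
        raise ValueError("Elias gamma is defined for positive integers only")
    binary = bin(n)[2:]
    # Prefix the binary form with (len - 1) zeros so it is self-delimiting.
    return "0" * (len(binary) - 1) + binary


def is_prefix_free(codewords) -> bool:
    """True if no code word is a prefix of another (or a duplicate)."""
    words = sorted(codewords)
    # In sorted order, a word that prefixes any other also prefixes its successor.
    return all(not nxt.startswith(cur) for cur, nxt in zip(words, words[1:]))
```

For the code words of 1 through 63 the Kraft sum stays below 1, while a length set such as [1, 1, 2] exceeds 1 and therefore cannot belong to any prefix code.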

2302 |
The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular biology and evolution
- Saitou, Nei
- 1987
Citation Context ...cause of computational limitations one uses only parts of the genome, or certain proteins that are viewed as significant [21]. These are run through a tree reconstruction method like neighbor joining =-=[38]-=-, maximum likelihood, maximum evolution, maximum parsimony as in [21], or quartet hypercleaning [6], many times. The percentage-wise agreement on certain branches arising are called “bootstrap values.... |

1687 | Introduction to Kolmogorov Complexity and Its Applications. Springer-Verlag, 2nd edition
- Li, Vitanyi
- 1997
Citation Context ...the length of the shortest binary program, for the reference universal prefix Turing machine, that on input y outputs x; it is denoted as K(x|y). For precise definitions, theory and applications, see =-=[32]-=-. The Kolmogorov complexity of x is the length of the shortest binary program with no input that outputs x; it is denoted as K(x) = K(x|λ) where λ denotes the empty input. Essentially, the Kolmogorov ... |

606 | Musical Genre Classification of Audio Signals
- Tzanetakis, Cook
Citation Context ...r-letter alphabet); from music files one can extract various specific numerical features, related to pitch, rhythm, harmony, etc. One can extract such features using, for instance, Fourier transforms =-=[43]-=- or wavelet transforms [17], to quantify parameters expressing similarity. The resulting vectors corresponding to the various files are then classified or clustered using existing classification softw... |

201 |
Nonmetric multidimensional scaling: a numerical method
- Kruskal
- 1964
Citation Context ...ure of distortion to be minimized, [16]. Let the original set of distances be $d_1, \ldots, d_k$ and the projected distances be $d'_1, \ldots, d'_k$. In Fig. 8, we used the distortion measure Kruskal's stress-1, =-=[24]-=-, which minimizes $\sqrt{\left(\sum_{i \le k} (d_i - d'_i)^2\right) / \sum_{i \le k} d_i^2}$. Kruskal's stress-1 equal to 0 means no distortion, and the worst value is at most 1 (unless you have a really bad projection). In the projection of th... |
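The distortion measure in this context can be sketched in a few lines. The version below assumes the stress-1 form sqrt(sum (d_i - d'_i)^2 / sum d_i^2), with d_i the original distances and d'_i the projected ones; the function name is hypothetical.

```python
import math


def kruskal_stress1(original, projected) -> float:
    """Stress-1 between original distances d_i and projected distances d'_i."""
    if len(original) != len(projected):
        raise ValueError("both distance lists must have the same length")
    numerator = sum((d - p) ** 2 for d, p in zip(original, projected))
    denominator = sum(d ** 2 for d in original)
    if denominator == 0:
        raise ValueError("original distances must not all be zero")
    return math.sqrt(numerator / denominator)
```

A projection that reproduces every distance gives 0, and one that collapses all points together (every projected distance 0) gives exactly 1.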

187 | The similarity metric
- Li, Chen, et al.
Citation Context ...le to simultaneously detect all similarities between pieces that other effective distances can detect separately. Compression-Based Similarity: Such a “universal” metric was co-developed by us in [29]–=-=[31]-=-, as a normalized version of the “information metric” of [4], [32]. Roughly speaking, two objects are deemed close if we can significantly “compress” one given the information in the other, the idea be... |

175 | Feature Extraction Methods for Character Recognition- A survey
- Trier, Jain, et al.
- 1996
Citation Context ...ingle decimal digit recognition accuracy of 85%. The current state-of-the-art for this problem, after half a century of interactive feature-driven classification research, is in the upper 90% level =-=[32, 14]-=-. All experiments are benchmarked on the standard NIST Special Data Base 19 (optical character recognition database). 5.6 Astronomy As a proof of principle we clustered data from unknown objects, for... |

135 |
Genome-scale approaches to resolving incongruence in molecular phylogenies. Nature
- Rokas, Williams, et al.
- 2003
Citation Context ...ap values on the branches that are viewed as supporting the theory tested. Different choices of proteins result in different best trees. One way to avoid this ambiguity is to use the full genome, [31], =-=[36]-=-, leading to whole-genome phylogeny. With our method we do whole-genome phylogeny, and end up with a single overall best tree, not optimizing selected parts of it. The quality of the results depends o... |

118 | Towards parameter-free data mining
- Keogh, Lonardi, et al.
- 2004
Citation Context ...eported here, the clustering by compression method reported in this correspondence has recently been used to analyze network traffic and cluster computer worms and viruses [44]. Finally, recent work =-=[20]-=- reports experiments with our method on all time sequence data used in all the major data-mining conferences in the last decade. Comparing the compression method with all major methods used in those c... |

107 |
information and Kolmogorov complexity
- Grunwald, Vitanyi
Citation Context ...ositions $i_1, \ldots, i_d$, we can use a simple prefix-free encoding of $(n, d, i_1, \ldots, i_d)$ in $2 \log n + 4 \log \log n + 2 + d \log n$ bits. We encode $n$ and $d$ prefix-free in $\log n + 2 \log \log n + 1$ bits each, see, e.g., =-=[32]-=-, and then the literal indexes of the actual flipped-bit positions. Adding an $O(1)$-bit program to interpret these data, with the strings concerned being $x$ and $y$, we have defined $H_n(x, y) = 2 \log n + 4$... |

103 |
An information-based sequence distance and its application to whole mitochondrial genome phylogeny
- Li, Badger, et al.
Citation Context ...be able to simultaneously detect all similarities between pieces that other effective distances can detect separately. Compression-Based Similarity: Such a “universal” metric was co-developed by us in =-=[29]-=-–[31], as a normalized version of the “information metric” of [4], [32]. Roughly speaking, two objects are deemed close if we can significantly “compress” one given the information in the other, the id... |

84 |
A Philosophical Essay on Probabilities
- Laplace
- 1951
Citation Context ...eing interchanged, then six items get misplaced.) The probability that this happens by chance is extremely small. The reason why we think the method does something remarkable is concisely put by Laplace =-=[28]-=-: “If we seek a cause wherever we perceive symmetry, it is not that we regard the symmetrical event as less possible than the others, but, since this event ought to be the effect of a regular cause or... |

82 |
Language Trees and Zipping
- Benedetto, Caglioti, et al.
- 2002
Citation Context ...more similar, then we can more succinctly describe one given the other. The mathematics used is based on Kolmogorov complexity theory [30]. In [29] we defined a new class of metrics, taking values in [0, 1] and appropriate for measuring effective similarity relations between sequences, say one type of similarity per metric, and vice versa. It was shown that an appropriately "normalized" information dist... |

62 | Shared Information and Program Plagiarism Detection
- Chen, Francia, et al.
Citation Context ...c construction of a language tree for over 50 Euro-Asian languages [31], detects plagiarism in student programming assignments =-=[8]-=-, gives phylogeny of chain letters [5], and clusters music [10]. Moreover, the method turns out to be robust under change of the underlying compressor-types: statistical (PPMZ), Lempel–Ziv based dicti... |

59 | Folk music classification using hidden Markov models
- Chai, Vercoe
- 2001
Citation Context ...es are then classified or clustered using existing classification software, based on various standard statistical pattern recognition classifiers [43], Bayesian classifiers [15], hidden Markov models =-=[13]-=-, ensembles of nearest neighbor classifiers [17], or neural networks [15], [39]. For example, in music, one feature would be to look for rhythm in the sense of beats per minute. One can make a histogra... |

57 | A machine learning approach to musical style recognition
- Dannenberg, Thom, et al.
- 1997
Citation Context ...sponding to the various files are then classified or clustered using existing classification software, based on various standard statistical pattern recognition classifiers [43], Bayesian classifiers =-=[15]-=-, hidden Markov models [13], ensembles of nearest neighbor classifiers [17], or neural networks [15], [39]. For example, in music, one feature would be to look for rhythm in the sense of beats per minu... |

54 | Communication complexity of document exchange
- Cormode, Paterson, et al.
- 2000
Citation Context ...es: Our aim is to capture, in a single similarity metric, every effective distance: effective versions of Hamming distance, Euclidean distance, edit distances, alignment distance, Lempel–Ziv distance =-=[11]-=-, and so on. This metric should be so general that it works in every domain: music, text, literature, programs, genomes, executables, natural language determination, equally and simultaneously. It wou... |

53 | Algorithmic clustering of music based on string compression
- Cilibrasi, Vitanyi, et al.
- 2004
Citation Context ...Euro-Asian languages [31], detects plagiarism in student programming assignments [8], gives phylogeny of chain letters [5], and clusters music =-=[10]-=-. Moreover, the method turns out to be robust under change of the underlying compressor-types: statistical (PPMZ), Lempel–Ziv based dictionary (gzip), block based (bzip2), or special purpose (Gencompr... |

53 | Automatic music summarization via similarity analysis - Cooper, Foote - 2002 |

53 |
A novel coronavirus associated with severe acute respiratory syndrome
- Ksiazek, Erdman, et al.
- 2003
Citation Context ...ted using the compressor bzip2. The relations in Figure 10 are very similar to the definitive tree based on medical-macrobio-genomics analysis, appearing later in the New England Journal of Medicine, =-=[23]-=-. We depicted the figure in the ternary tree style, rather than the genomics-dendrogram style, since the former is more precise for visual inspection of proximity relations. Analysis of Mitochondrial ... |

48 | Automatic Recognition of Handwritten Numerical Strings, Doctoral Dissertation, Montréal, École de technologie supérieure
- Oliveira
- 2003
Citation Context ... single decimal digit recognition accuracy of 87%. The current state-of-the-art for this problem, after half a century of interactive feature-driven classification research, is in the upper 90% level =-=[34]-=-, [40]. All experiments are benchmarked on the standard NIST Special Data Base 19. Using the NCD for general classification by compression is the subject of a future paper. F. Astronomy As a proof of p... |

46 | A practical algorithm for recovering the best supported edges of an evolutionary tree (extended abstract)
- Berry, Bryant, et al.
Citation Context ...iewed as significant [21]. These are run through a tree reconstruction method like neighbor joining [38], maximum likelihood, maximum evolution, maximum parsimony as in [21], or quartet hypercleaning =-=[6]-=-, many times. The percentage-wise agreement on certain branches arising are called “bootstrap values.” Trees are depicted with the best bootstrap values on the branches that are viewed as supporting t... |

33 | Data Compression - Salomon - 2007 |

31 | A polynomial time approximation scheme for inferring evolutionary trees from quartet topologies and its application
- Jiang, Kearney, et al.
- 2000
Citation Context ... the quartet method is to find (or approximate as closely as possible) the tree that embeds the maximal number of consistent (possibly weighted) quartet topologies from a given set of quartet topologies =-=[19]-=- (Fig. 2). This is called the (weighted) maximum quartet consistency (MQC) problem. We propose a new optimization problem: the minimum quartet tree cost (MQTC), as follows. The cost of a quartet topolog... |
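The context above is cut off just as it begins to define the cost of a quartet topology. As a rough illustration only, the sketch below assumes the cost of a topology uv|wx is d(u, v) + d(w, x), the sum of distances within each sibling pair; that definition does not appear in the excerpt and is an assumption here, as are all function names. It enumerates the three possible topologies of one quartet and picks the cheapest, and is not the tree-search heuristic of the paper.

```python
def quartet_topologies(u, v, w, x):
    """The three ways to split four items into two sibling pairs."""
    return [((u, v), (w, x)), ((u, w), (v, x)), ((u, x), (v, w))]


def topology_cost(topology, dist) -> float:
    """Assumed cost: sum of the distances inside each sibling pair."""
    (a, b), (c, d) = topology
    return dist[frozenset((a, b))] + dist[frozenset((c, d))]


def cheapest_topology(quartet, dist):
    """Return the topology of `quartet` with the lowest assumed cost."""
    return min(quartet_topologies(*quartet), key=lambda t: topology_cost(t, dist))
```

Here `dist` is a dictionary keyed by `frozenset` pairs, so the lookup is symmetric in its two items; in practice the values would come from a pairwise distance matrix such as the NCD.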

30 |
Conflict among individual mitochondrial proteins in resolving the phylogeny of Eutherian orders
- Cao, Janke, et al.
- 1998
Citation Context ...c material has become available, the debate in biology has intensified concerning which two of the three main groups of placental mammals are more closely related: Primates, Ferungulates, and Rodents. In =-=[7]-=-, the maximum-likelihood method of phylogeny tree reconstruction gave evidence for the (Ferungulates, (Primates, Rodents)) grouping for half of the proteins in the mitochondrial genomes investigated, an... |

30 |
Chain letters and evolutionary histories
- Bennett, Li, et al.
- 2003
Citation Context ...[29, 30, 31], a completely automatic construction of a language tree for over 50 Euro-Asian languages [31], detects plagiarism in student programming assignments [8], gives phylogeny of chain letters =-=[5]-=-, and clusters music [10]. Moreover, the method turns out to be robust under change of the underlying compressor-types: statistical (PPMZ), Lempel-Ziv based dictionary (gzip), block based (bzip2), or ... |

20 |
Phylogenetic circumscription of Saccharomyces, Kluyveromyces and other members of the Saccharomycetaceae, and the proposal of the new genera
- Kurtzman
Citation Context ...ent of the Saccharomycetaceae, S. servazii, S. castellii, and C. glabrata were all proposed to belong to genera different from Saccharomyces, and this is supported by the topology of our tree as well (=-=[27]-=-).” To compare the veracity of the NCD clustering with a more feature-based clustering, we also calculated the pairwise distances as fol... |

18 |
The CompLearn Toolkit
- Cilibrasi
- 2003
Citation Context ...automatically classifies the objects concerned. The method has been released into the public domain as open-source software. The CompLearn Toolkit =-=[9]-=- is a suite of simple utilities that one can use to apply compression techniques to the process of discovering and learning patterns in completely different domains. In fact, this method is so general... |

18 | Classifying music by genre using the wavelet packet transform and a round-robin ensemble
- Grimaldi, Kokaram, et al.
- 2002
Citation Context ...sic files one can extract various specific numerical features, related to pitch, rhythm, harmony, etc. One can extract such features using, for instance, Fourier transforms [43] or wavelet transforms =-=[17]-=-, to quantify parameters expressing similarity. The resulting vectors corresponding to the various files are then classified or clustered using existing classification software, based on various stand... |

16 |
Music classification using neural networks
- Scott
- 2001
Citation Context ...sed on various standard statistical pattern recognition classifiers [43], Bayesian classifiers [15], hidden Markov models [13], ensembles of nearest neighbor classifiers [17], or neural networks [15], =-=[39]-=-. For example, in music, one feature would be to look for rhythm in the sense of beats per minute. One can make a histogram where each histogram bin corresponds to a particular tempo in beats-per-minu... |

15 | Algorithmic clustering of music
- Cilibrasi, Wolf, et al.
- 2004
Citation Context ...ly automatic construction of a language tree for over 50 Euro-Asian languages [31], detects plagiarism in student programming assignments [8], gives phylogeny of chain letters [5], and clusters music =-=[10]-=-. Moreover, the method turns out to be robust under change of the underlying compressor-types: statistical (PPMZ), Lempel-Ziv based dictionary (gzip), block based (bzip2), or special purpose (Gencompr... |

14 | Algorithmic Complexity
- Li, Vitányi
Citation Context ..., orangutan (Pongo pygmaeus), Sumatran orangutan (Pongo pygmaeus abelii), using opossum (Didelphis virginiana), wallaroo (Macropus robustus), and the platypus (Ornithorhynchus anatinus) as outgroup. In =-=[30]-=-, [31], we used the whole mitochondrial genomes of the same 20 species, computing the NCD distances (or a closely related distance in [30]), using the GenCompress compressor, followed by tree reconstr... |

14 | Hierarchical clustering based on mutual information
- Kraskov, Stögbauer, et al.
- 2003
Citation Context ...ad hoc arguments about empirical Shannon entropy and Kullback-Leibler distance. This approach is used to cluster music MIDI files by Kohonen maps in [31]. Another recent offshoot based on our work is =-=[21]-=- hierarchical clustering based on mutual information. In a related, but considerably simpler feature-based approach, one can compare the word frequencies in text files to assess similarity. In [40] th... |

10 |
Marsupials and eutherians reunited: genetic evidence for the theria hypothesis of mammalian evolution
- Killian, Buckley, et al.
Citation Context ...orithm. This is a big difference compared to previous phylogeny methods, where because of computational limitations one uses only parts of the genome, or certain proteins that are viewed as significant =-=[21]-=-. These are run through a tree reconstruction method like neighbor joining [38], maximum likelihood, maximum evolution, maximum parsimony as in [21], or quartet hypercleaning [6], many times. The perc... |

10 |
A philosophical essay on probabilities, 1819. English translation
- Laplace
- 1951
Citation Context ...s get dislodged then six items get misplaced.) The probability that this happens by chance is extremely small. The reason why we think the method does something remarkable is concisely put by Laplace =-=[26]-=-: "If we seek a cause wherever we perceive symmetry, it is not that we regard the symmetrical event as less possible than the others, but, since this event ought to be the effect of a regular cause or... |

9 |
Phylogenetic analysis of 18S rRNA and the mitochondrial genomes of wombat, Vombatus ursinus, and the spiny anteater, Tachyglossus aculeatus: increased support for the Marsupionta hypothesis
- Janke, Magnell, et al.
Citation Context ...ia hypothesis was given in [21] by analyzing a large nuclear gene (M6P/IG2R), viewed as important across the species concerned, and even more recent support for the Marsupionta hypothesis was given in =-=[18]-=- by phylogenetic analysis of another sequence from the nuclear gene (18S rRNA) and by the whole mitochondrial genome. Experimental evidence: To test the Eutherian orders simultaneously with the Marsup... |

8 |
Music style and authorship categorization by informative compressors
- Londei, Loreto, et al.
Citation Context ...n in [31, Appendix I]. This approach is used also to cluster music MIDI (Musical Instrument Digital Interface, a versatile digital music format available on the World-Wide Web) files by Kohonen maps in =-=[33]-=-. Another recent offshoot based on our work is hierarchical clustering based on mutual information, [23]. In a related, but considerably simpler feature-based approach, one can compare the word freque... |

8 |
Algorithm makes tongue tree
- Ball
- 2002
Citation Context ... it is perhaps surprising that compression based clustering and classification approaches did not arise before. But recently there have been several partially independent proposals in that direction: =-=[1, 2]-=- for building language trees—while citing [29, 4]—is by ad hoc arguments about empirical Shannon entropy and Kullback-Leibler distance resulting in non-metric distances. This approach is used to clust... |

7 |
Information distance
- Bennett, Gács, et al.
- 1998
Citation Context ...that other effective distances can detect separately. Compression-Based Similarity: Such a “universal” metric was co-developed by us in [29]–[31], as a normalized version of the “information metric” of =-=[4]-=-, [32]. Roughly speaking, two objects are deemed close if we can significantly “compress” one given the information in the other, the idea being that if two pieces are more similar, then we can more su... |

6 |
General Assembly resolution 217 A (III) of 10 December 1948: Universal Declaration of Human Rights
- Nations
Citation Context ..., agreed more precisely with accepted biological knowledge. 5.2 Language Trees Our method improves the results of [1], using a linguistic corpus of "The Universal Declaration of Human Rights (UDoHR)" =-=[33]-=- in 52 languages. Previously, [1] used an asymmetric measure based on relative entropy, and the full matrix of the pair-wise distances between all 52 languages, to build a language classification tree... |

5 |
A model-independent analysis of the variability of GRS 1915+105, Astronomy and Astrophysics
- Belloni, Klein-Wolt, et al.
Citation Context ... NCD for general classification by compression is the subject of a future paper. F. Astronomy As a proof of principle, we clustered data from unknown objects, for example objects that are far away. In =-=[3]-=-, observations of the microquasar GRS 1915+105 made with the Rossi X-ray Timing Explorer were analyzed. The interest in this microquasar stems from the fact that it was the first galactic object to sh... |

5 |
A novel coronavirus associated with severe acute respiratory syndrome
- Ksiazek, Erdman, et al.
Citation Context ...puted using the compressor bzip2. The relations in Fig. 10 are very similar to the definitive tree based on medical-macrobio-genomics analysis, appearing later in the New England Journal of Medicine, =-=[25]-=-. We depicted the figure in the ternary tree style, rather than the genomics-dendrogram style, since the former is more precise for visual inspection of proximity relations. 3) Analysis of Mitochondri... |

5 | Normalized forms for two common metrics
- Yianilos
Citation Context ...like ours) is often called a “dissimilarity” distance or a “disparity” distance. Remark 2.7: As far as the authors know, the idea of normalized metric is, surprisingly, not well studied. An exception is =-=[41]-=-, which investigates normalized metrics to account for relative distances rather than absolute ones, and it does so for much the same reasons as in the present work. An example is the normalized Eucli... |

4 |
Information categorization approach to literary authorship disputes
- Yang, Peng, et al.
Citation Context ...s hierarchical clustering based on mutual information, [23]. In a related, but considerably simpler feature-based approach, one can compare the word frequencies in text files to assess similarity. In =-=[42]-=-, the word frequencies of words common to a pair of text files are used as entries in two vectors, and the similarity of the two files is based on the distance between those vectors. The authors attri... |
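The feature-based approach described in this context, comparing the frequencies of words that two text files share, can be sketched as follows. The excerpt does not say which distance is used between the two vectors, so the Euclidean distance, the relative-frequency normalization, the tokenizer, and the function names are all assumptions of this sketch.

```python
import math
import re
from collections import Counter


def word_counts(text: str) -> Counter:
    """Count lowercase word tokens (letters and apostrophes only)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def common_word_distance(text_a: str, text_b: str) -> float:
    """Euclidean distance between relative frequencies of the shared words."""
    counts_a, counts_b = word_counts(text_a), word_counts(text_b)
    common = sorted(set(counts_a) & set(counts_b))
    if not common:
        raise ValueError("the two texts share no words")
    total_a, total_b = sum(counts_a.values()), sum(counts_b.values())
    return math.sqrt(
        sum((counts_a[w] / total_a - counts_b[w] / total_b) ** 2 for w in common)
    )
```

Unlike the NCD, this measure depends on an explicit feature choice (here, words), which is the contrast the surrounding text draws with compression-based similarity.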

4 |
Automatically categorizing written texts by author gender, Literary and Linguistic Computing
- Koppel, Argamon, et al.
Citation Context ...nt biological wisdom). The possibly new feature in the cited work is that it uses statistics of only the words that the files being compared have in common. A related, opposite, approach was taken in =-=[20]-=-, where literary texts are clustered by author gender or fact versus fiction, essentially by first identifying distinguishing features, like gender dependent word usage, and then classifying according... |

3 |
Ascomycetous yeasts and yeast-like taxa. In: The mycota VII, Systematics and evolution, part A
- Kurtzman, Sugiyama
- 2001
Citation Context ...f the fungi researchers is “the tree clearly clustered the ascomycetous yeasts versus the two filamentous ascomycetes, thus supporting the current hypothesis on their classification (for example, see =-=[26]-=-). Interestingly, in a recent treatment of the Saccharomycetaceae, S. servazii, S. castellii, and C. glabrata were all proposed to belong to genera different from Saccharomyces, and this is supported ... |

3 |
Chain letters and evolutionary histories, Scientific American
- Bennett, Li, et al.
- 2003
Citation Context ...27, 28, 29], a completely automatic construction of a language tree for over 50 Euro-Asian languages [29], detects plagiarism in student programming assignments [38], gives phylogeny of chain letters =-=[5]-=-, and clusters music [8]. Moreover, the method turns out to be robust under change of the underlying compressor-types: statistical (PPMZ), Lempel-Ziv based dictionary (gzip), block based (bzip2), or s... |

3 |
Algorithmic Complexity, pp. 376--382 in
- Li, Vitanyi
Citation Context ...y. It would be able to simultaneously detect all similarities between pieces that other effective metrics can detect. Compression-based Similarity: Such a "universal" metric was co-developed by us in =-=[27, 28, 29]-=-, as a normalized version of the "information metric" of [30, 4]. Roughly speaking, two objects are deemed close if we can significantly "compress" one given the information in the other, the idea bei... |

2 |
General Assembly Resolution 217 A (III) of 10
- Nations
- 1948
Citation Context ...s, agreed more precisely with accepted biological knowledge. B. Language Trees Our method improves the results of [1], using a linguistic corpus of “The Universal Declaration of Human Rights (UDoHR)” =-=[35]-=- in 52 languages. Previously, [1] used an asymmetric measure based on relative entropy, and the full matrix of the pair-wise distances between all 52 languages, to build a language classification tree... |

2 |
Algorithmic clustering of music, http://xxx.lanl.gov/abs/cs.SD/0303025
- Cilibrasi, Vitanyi, et al.
Citation Context ...y automatic construction of a language tree for over 50 Euro-Asian languages [29], detects plagiarism in student programming assignments [38], gives phylogeny of chain letters [5], and clusters music =-=[8]-=-. Moreover, the method turns out to be robust under change of the underlying compressor-types: statistical (PPMZ), Lempel-Ziv based dictionary (gzip), block based (bzip2), or special purpose (Gencompr... |