Clustering by compression (2005)

by Rudi Cilibrasi, Paul M. B. Vitányi
Venue: IEEE Transactions on Information Theory
Citations: 295 (25 self)

BibTeX

@ARTICLE{Cilibrasi05clusteringby,
    author = {Rudi Cilibrasi and Paul M. B. Vitányi},
    title = {Clustering by compression},
    journal = {IEEE Transactions on Information Theory},
    year = {2005},
    volume = {51},
    pages = {1523--1545}
}


Abstract

We present a new method for clustering based on compression. The method does not use subject-specific features or background knowledge, and works as follows. First, we determine a parameter-free, universal similarity distance, the normalized compression distance or NCD, computed from the lengths of compressed data files (singly and in pairwise concatenation). Second, we apply a hierarchical clustering method. The NCD is not restricted to a specific application area, and works across application area boundaries. A theoretical precursor, the normalized information distance, co-developed by one of the authors, is provably optimal. However, the optimality comes at the price of using the noncomputable notion of Kolmogorov complexity. We propose axioms to capture the real-world setting, and show that the NCD approximates optimality. To extract a hierarchy of clusters from the distance matrix, we determine a dendrogram (ternary tree) by a new quartet method and a fast heuristic to implement it. The method is implemented and available as public software, and is robust under choice of different compressors. To substantiate our claims of universality and robustness, we report evidence of successful application in areas as diverse as genomics, virology, languages, literature, music, handwritten digits, astronomy, and combinations of objects from completely different domains, using statistical, dictionary, and block sorting compressors. In genomics, we present new evidence for major questions in Mammalian evolution, based on whole-mitochondrial genomic analysis: the Eutherian orders and the Marsupionta hypothesis against the Theria hypothesis.

Index Terms: Heterogeneous data analysis, hierarchical unsupervised clustering, Kolmogorov complexity, normalized compression distance, parameter-free data mining, quartet tree method, universal dissimilarity distance.
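The first step described above, computing a distance from compressed lengths of files singly and in pairwise concatenation, can be sketched as follows. This is a minimal illustration, not the authors' public software: it assumes the commonly stated form NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length, and it uses Python's `zlib` as a stand-in compressor. The function names `compressed_len`, `ncd`, and `distance_matrix` are hypothetical, and the paper's quartet-tree clustering step is not shown.

```python
import zlib


def compressed_len(data: bytes) -> int:
    """Length in bytes of `data` after zlib compression (stand-in for C)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance, assuming the form
    (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx = compressed_len(x)
    cy = compressed_len(y)
    cxy = compressed_len(x + y)  # pairwise concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


def distance_matrix(items: list[bytes]) -> list[list[float]]:
    """Pairwise NCD values; this matrix is what a hierarchical
    clustering method would take as input."""
    n = len(items)
    return [[ncd(items[i], items[j]) for j in range(n)] for i in range(n)]
```

With a real compressor the values are only approximately symmetric and `ncd(x, x)` is small but usually not exactly zero, which is one reason the abstract speaks of the NCD approximating optimality rather than attaining it.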

Keyphrases

normalized compression distance, new evidence, eutherian order, new method, application area boundary, subject-specific feature, specific application area, block sorting compressor, new quartet method, handwritten digit, distance matrix, different compressor, data file, similarity distance, hierarchical clustering method, marsupionta hypothesis, background knowledge, heterogeneous data analysis, public software, whole-mitochondrial genomic analysis, ternary tree, theria hypothesis, different domain, normalized information distance, successful application, parameter-free data mining, hierarchical unsupervised clustering, noncomputable notion, real-world setting, theoretical precursor, mammalian evolution, quartet tree method, major question, pairwise concatenation, universal dissimilarity distance
