Results 1 -
5 of
5
P.M.B.: The Google similarity distance
- IEEE Transactions on Knowledge and Data Engineering
, 2007
"... Abstract—Words and phrases acquire meaning from the way they are used in society, from their relative semantics to other words and phrases. For computers, the equivalent of “society ” is “database, ” and the equivalent of “use ” is “a way to search the database.” We present a new theory of similarit ..."
Abstract
-
Cited by 98 (4 self)
- Add to MetaCart
Abstract—Words and phrases acquire meaning from the way they are used in society, from their relative semantics to other words and phrases. For computers, the equivalent of “society ” is “database, ” and the equivalent of “use ” is “a way to search the database.” We present a new theory of similarity between words and phrases based on information distance and Kolmogorov complexity. To fix thoughts, we use the World Wide Web (WWW) as the database, and Google as the search engine. The method is also applicable to other search engines and databases. This theory is then applied to construct a method to automatically extract similarity, the Google similarity distance, of words and phrases from the WWW using Google page counts. The WWW is the largest database on earth, and the context information entered by millions of independent users averages out to provide automatic semantics of useful quality. We give applications in hierarchical clustering, classification, and language translation. We give examples to distinguish between colors and numbers, cluster names of paintings by 17th century Dutch masters and names of books by English novelists, the ability to understand emergencies and primes, and we demonstrate the ability to do a simple automatic English-Spanish translation. Finally, we use the WordNet database as an objective baseline against which to judge the performance of our method. We conduct a massive randomized trial in binary classification using support vector machines to learn categories based on our Google distance, resulting in an a mean agreement of 87 percent with the expert crafted WordNet categories. Index Terms—Accuracy comparison with WordNet categories, automatic classification and clustering, automatic meaning discovery using Google, automatic relative semantics, automatic translation, dissimilarity semantic distance, Google search, Google distribution via page hit counts, Google code, Kolmogorov complexity, normalized compression distance (NCD), normalized information distance (NID), normalized Google distance (NGD), meaning of words and phrases extracted from the Web, parameter-free data mining, universal similarity metric. Ç 1
Clustering
, 2009
"... The problem is to construct an optimal weight tree from the 3 () n 4 weighted quartet topologies on n objects, where optimality means that the summed weight of the embedded quartet topologies is optimal (so it can be the case that the optimal tree embeds all quartets as nonoptimal topologies). We pr ..."
Abstract
-
Cited by 2 (0 self)
- Add to MetaCart
The problem is to construct an optimal weight tree from the 3 () n 4 weighted quartet topologies on n objects, where optimality means that the summed weight of the embedded quartet topologies is optimal (so it can be the case that the optimal tree embeds all quartets as nonoptimal topologies). We present a Monte Carlo heuristic, based on randomized hill climbing, for approximating the optimal weight tree, given the quartet topology weights. The method repeatedly transforms a bifurcating tree, with all objects involved as leaves, achieving a monotonic approximation to the exact single globally optimal tree. The method has been extensively used for general hierarchical clustering of nontreelike (non-phylogeny) data in various domains and across domains with heterogenous data, and is implemented and available, as part of the CompLearn package. We compare performance and running time with those of UPGMA, BioNJ, and NJ, as implemented in the SplitsTree package on genomic data for which the latter are optimized.
1 Reducing the Loss of Information through Annealing Text Distortion
"... Abstract — Compression distances have been widely used in knowledge discovery and data mining. They are parameter-free, widely applicable, and very effective in several domains. However, little has been done to interpret their results or to explain their behavior. In this paper we take a step toward ..."
Abstract
- Add to MetaCart
Abstract — Compression distances have been widely used in knowledge discovery and data mining. They are parameter-free, widely applicable, and very effective in several domains. However, little has been done to interpret their results or to explain their behavior. In this paper we take a step towards understanding compression distances by performing an experimental evaluation of the impact of several kinds of information distortion on compression-based text clustering. We show how progressively removing words in such a way that the complexity of a document is slowly reduced helps the compression-based text clustering and improves its accuracy. In fact, we show how the non-distorted text clustering can be improved by means of annealing text distortion. The experimental results shown in this paper are consistent using different data sets, and different compression algorithms belonging to the most important compression families: Lempel-
Information Distance in Multiples 1
"... Information distance is a parameter-free similarity measure based on compression, used in pattern recognition, data mining, phylogeny, clustering, and classification. The notion of information distance is extended from pairs to multiples (finite lists). We study maximal overlap, metricity, universal ..."
Abstract
- Add to MetaCart
Information distance is a parameter-free similarity measure based on compression, used in pattern recognition, data mining, phylogeny, clustering, and classification. The notion of information distance is extended from pairs to multiples (finite lists). We study maximal overlap, metricity, universality, minimal overlap, additivity, and normalized information distance in multiples. We use the theoretical notion of Kolmogorov complexity which for practical purposes is approximated by the length of the compressed version of the file involved, using a real-world compression program.
Normalized Web Distance and Word Similarity
, 2009
"... Objects can be given literally, like the literal four-letter genome of a mouse, or the literal text of War and Peace by Tolstoy. For simplicity we take it that all meaning of the object is represented by the literal object itself. Objects can also be given by name, like “the four-letter genome of a ..."
Abstract
- Add to MetaCart
Objects can be given literally, like the literal four-letter genome of a mouse, or the literal text of War and Peace by Tolstoy. For simplicity we take it that all meaning of the object is represented by the literal object itself. Objects can also be given by name, like “the four-letter genome of a mouse, ” or “the text of War and Peace by Tolstoy. ” There are also objects that cannot be given literally, but

