Results 1 - 10
of
40
P.M.B.: The Google similarity distance
- IEEE Transactions on Knowledge and Data Engineering
, 2007
"... Abstract—Words and phrases acquire meaning from the way they are used in society, from their relative semantics to other words and phrases. For computers, the equivalent of “society ” is “database, ” and the equivalent of “use ” is “a way to search the database.” We present a new theory of similarit ..."
Abstract
-
Cited by 98 (4 self)
- Add to MetaCart
Abstract—Words and phrases acquire meaning from the way they are used in society, from their relative semantics to other words and phrases. For computers, the equivalent of “society ” is “database, ” and the equivalent of “use ” is “a way to search the database.” We present a new theory of similarity between words and phrases based on information distance and Kolmogorov complexity. To fix thoughts, we use the World Wide Web (WWW) as the database, and Google as the search engine. The method is also applicable to other search engines and databases. This theory is then applied to construct a method to automatically extract similarity, the Google similarity distance, of words and phrases from the WWW using Google page counts. The WWW is the largest database on earth, and the context information entered by millions of independent users averages out to provide automatic semantics of useful quality. We give applications in hierarchical clustering, classification, and language translation. We give examples to distinguish between colors and numbers, cluster names of paintings by 17th century Dutch masters and names of books by English novelists, the ability to understand emergencies and primes, and we demonstrate the ability to do a simple automatic English-Spanish translation. Finally, we use the WordNet database as an objective baseline against which to judge the performance of our method. We conduct a massive randomized trial in binary classification using support vector machines to learn categories based on our Google distance, resulting in an a mean agreement of 87 percent with the expert crafted WordNet categories. Index Terms—Accuracy comparison with WordNet categories, automatic classification and clustering, automatic meaning discovery using Google, automatic relative semantics, automatic translation, dissimilarity semantic distance, Google search, Google distribution via page hit counts, Google code, Kolmogorov complexity, normalized compression distance (NCD), normalized information distance (NID), normalized Google distance (NGD), meaning of words and phrases extracted from the Web, parameter-free data mining, universal similarity metric. Ç 1
Similarity of semantic relations
- Computational Linguistics
, 2006
"... There are at least two kinds of similarity. Relational similarity is correspondence between relations, in contrast with attributional similarity, which is correspondence between attributes. When two words have a high degree of attributional similarity, we call them synonyms. When two pairs of words ..."
Abstract
-
Cited by 41 (2 self)
- Add to MetaCart
There are at least two kinds of similarity. Relational similarity is correspondence between relations, in contrast with attributional similarity, which is correspondence between attributes. When two words have a high degree of attributional similarity, we call them synonyms. When two pairs of words have a high degree of relational similarity, we say that their relations are analogous. For example, the word pair mason:stone is analogous to the pair carpenter:wood. This paper introduces Latent Relational Analysis (LRA), a method for measuring relational similarity. LRA has potential applications in many areas, including information extraction, word sense disambiguation, and information retrieval. Recently the Vector Space Model (VSM) of information retrieval has been adapted to measuring relational similarity, achieving a score of 47 % on a collection of 374 college-level multiple-choice word analogy questions. In the VSM approach, the relation between a pair of words is characterized by a vector of frequencies of predefined patterns in a large corpus. LRA extends the VSM approach in three ways: (1) the patterns are derived automatically from the corpus, (2) the Singular Value Decomposition (SVD) is used to smooth the frequency data, and (3) automatically generated synonyms are used to explore variations of the word pairs. LRA achieves 56 % on the 374 analogy questions, statistically equivalent to the average human score of 57%. On the related problem of classifying semantic relations, LRA achieves similar gains over the VSM. 1.
Combining Independent Modules to Solve Multiple-choice Synonym and Analogy Problems
- In Proceedings of the International Conference on Recent Advances in Natural Language Processing
, 2003
"... Existing statistical approaches to natural language problems are very coarse approximations to the true complexity of language processing. ..."
Abstract
-
Cited by 40 (8 self)
- Add to MetaCart
Existing statistical approaches to natural language problems are very coarse approximations to the true complexity of language processing.
Name disambiguation in author citations using a K-way spectral clustering method
- INTERNATIONAL CONFERENCE ON DIGITAL LIBRARIES
, 2005
"... An author may have multiple names and multiple authors may share the same name simply due to name abbreviations, identical names, or name misspellings in publications or bibliographies 1. This can produce name ambiguity which can affect the performance of document retrieval, web search, and database ..."
Abstract
-
Cited by 30 (6 self)
- Add to MetaCart
An author may have multiple names and multiple authors may share the same name simply due to name abbreviations, identical names, or name misspellings in publications or bibliographies 1. This can produce name ambiguity which can affect the performance of document retrieval, web search, and database integration, and may cause improper attribution of credit. Proposed here is an unsupervised learning approach using K-way spectral clustering that disambiguates authors in citations. The approach utilizes three types of citation attributes: co-author names, paper titles, and publication venue titles 2. The approach is illustrated with 16 name datasets with citations collected from the DBLP database bibliography and author home pages and shows that name disambiguation can be achieved using these citation attributes.
Automatic Meaning Discovery Using Google
- Manuscript, CWI, 2004; http://arxiv.org/abs/cs.CL/0412098
, 2004
"... We have found a method to automatically extract the meaning of words and phrases from the world-wide-web using Google page counts. The approach is novel in its unrestricted problem domain, simplicity of implementation, and manifestly ontological underpinnings. The world-wide-web is the largest dat ..."
Abstract
-
Cited by 29 (2 self)
- Add to MetaCart
We have found a method to automatically extract the meaning of words and phrases from the world-wide-web using Google page counts. The approach is novel in its unrestricted problem domain, simplicity of implementation, and manifestly ontological underpinnings. The world-wide-web is the largest database on earth, and the latent semantic context information entered by millions of independent users averages out to provide automatic meaning of useful quality. We demonstrate positive correlations, evidencing an underlying semantic structure, in both numerical symbol notations and number-name words in a variety of natural languages and contexts. Next, we demonstrate the ability to distinguish between colors and numbers, and to distinguish between 17th century Dutch painters; the ability to understand electrical terms, religious terms, and emergency incidents; we conduct a massive experiment in understanding WordNet categories; and finally we demonstrate the ability to do a simple automatic English-Spanish translation.
Extracting semantic representations from word co-occurrence statistics: A computational study
- Behavior Research Methods
, 2007
"... Abstract: In a previous paper we presented a systematic computational study of the extraction of semantic representations from the word-word co-occurrence statistics of large text corpora. The conclusion was that semantic vectors of Pointwise Mutual Information (PMI) values from very small co-occurr ..."
Abstract
-
Cited by 25 (2 self)
- Add to MetaCart
Abstract: In a previous paper we presented a systematic computational study of the extraction of semantic representations from the word-word co-occurrence statistics of large text corpora. The conclusion was that semantic vectors of Pointwise Mutual Information (PMI) values from very small co-occurrence windows, together with a cosine distance measure, consistently resulted in the best representations across a range of psychologically relevant semantic tasks. This paper extends that study by investigating the use of three further factors, namely the application of stop-lists, word stemming, and dimensionality reduction using Singular Value Decomposition (SVD), that have been used to provide improved performance elsewhere. It also introduces an additional semantic task and explores the advantages of using a much larger corpus. This leads to the discovery and analysis of improved SVD based methods for generating semantic representations (that provide new state-of-the-art performance on a standard TOEFL task) and the identification and discussion of problems and misleading results that can arise without a full systematic study.
Expressing implicit semantic relations without supervision
- In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL-2006
, 2006
"... We present an unsupervised learning algorithm that mines large text corpora for patterns that express implicit semantic relations. For a given input word pair X: Y with some unspecified semantic relations, the corresponding output list of patterns P 1, , Pm is ranked according to how well each patte ..."
Abstract
-
Cited by 19 (1 self)
- Add to MetaCart
We present an unsupervised learning algorithm that mines large text corpora for patterns that express implicit semantic relations. For a given input word pair X: Y with some unspecified semantic relations, the corresponding output list of patterns P 1, , Pm is ranked according to how well each pattern P i expresses the relations between X and Y. For example, given X = ostrich and Y = bird, the two highest ranking output patterns are “X is the largest Y ” and “Y such as the X”. The output patterns are intended to be useful for finding further pairs with the same relations, to support the construction of lexicons, ontologies, and semantic networks. The patterns are sorted by pertinence, where the pertinence of a pattern P i for a word pair X: Y is the expected relational similarity between the given pair and typical pairs for P i. The algorithm is empirically evaluated on two tasks, solving multiple-choice SAT word analogy questions and classifying semantic relations in noun-modifier pairs. On both tasks, the algorithm achieves stateof-the-art results, performing significantly better than several alternative pattern ranking algorithms, based on tf-idf. 1
Web Text Corpus for Natural Language Processing
, 2006
"... Web text has been successfully used as training data for many NLP applications. ..."
Abstract
-
Cited by 10 (0 self)
- Add to MetaCart
Web text has been successfully used as training data for many NLP applications.
The hiding virtues of ambiguity: Quantifiably resilient watermarking of natural language text through synonym substitutions
- In Proceedings of ACM Multimedia and Security Workshop
"... Information-hiding in natural language text has mainly consisted of carrying out approximately meaning-preserving modifications on the given cover text until it encodes the intended mark. A major technique for doing so has been synonym-substitution. In these previous schemes, synonym substitutions w ..."
Abstract
-
Cited by 9 (7 self)
- Add to MetaCart
Information-hiding in natural language text has mainly consisted of carrying out approximately meaning-preserving modifications on the given cover text until it encodes the intended mark. A major technique for doing so has been synonym-substitution. In these previous schemes, synonym substitutions were done until the text “confessed”, i.e., carried the intended mark message. We propose here a better way to use synonym substitution, one that is no longer entirely guided by the mark-insertion process: It is also guided by a resilience requirement, subject to a maximum allowed distortion constraint. Previous schemes for information hiding in natural language text did not use numeric quantification of the distortions introduced by transformations, they mainly used heuristic measures of quality based on conformity to a language model (and not in reference to the original cover text). When there are many alternatives to carry out a substitution on a word, we prioritize these alternatives according to a quantitative resilience criterion and use them in that order. In a nutshell, we favor the more ambiguous alternatives. In fact not only do we attempt to achieve the maximum ambiguity, but we want to simultaneously be as close as possible to the above-mentioned distortion limit, as that prevents the adversary from doing further transformations without exceeding the damage threshold; that is, we continue to modify the document even after the text has “confessed ” to the mark, for the dual purpose of maximizing ambiguity while deliberately getting as close as possible to the distortion limit. The quantification we use makes possible an application of the existing information-
Identifying Subjective Adjectives through Web-based Mutual Information
- In Proceedings of the 7th Konferenz zur Verarbeitung Natürlicher Sprache (German Conference on Natural Language Processing – KONVENS’04
, 2004
"... This paper describes a method for ranking a large list of adjectives according to a subjectivity score without resorting to any knowledge-intensive external resources (such as lexical databases, parsers or manual annotation). The method only requires a list of adjectives to be ranked and a small set ..."
Abstract
-
Cited by 9 (0 self)
- Add to MetaCart
This paper describes a method for ranking a large list of adjectives according to a subjectivity score without resorting to any knowledge-intensive external resources (such as lexical databases, parsers or manual annotation). The method only requires a list of adjectives to be ranked and a small set of "seeds" (manually selected subjective adjectives). The subjectivity score is obtained by computing the mutual information of pairs of adjectives taken from each set, using frequency and cooccurrence frequency counts on the World Wide Web, collected through queries to the AltaVista search engine. The obtained results improve significantly over a comparable low-resource acquisition algorithm.

