P.M.B.: The Google similarity distance
- IEEE Transactions on Knowledge and Data Engineering, 2007
- Cited by 98 (4 self)
Abstract: Words and phrases acquire meaning from the way they are used in society, from their relative semantics to other words and phrases. For computers, the equivalent of “society” is “database,” and the equivalent of “use” is “a way to search the database.” We present a new theory of similarity between words and phrases based on information distance and Kolmogorov complexity. To fix thoughts, we use the World Wide Web (WWW) as the database, and Google as the search engine. The method is also applicable to other search engines and databases. This theory is then applied to construct a method to automatically extract similarity, the Google similarity distance, of words and phrases from the WWW using Google page counts. The WWW is the largest database on earth, and the context information entered by millions of independent users averages out to provide automatic semantics of useful quality. We give applications in hierarchical clustering, classification, and language translation. We give examples that distinguish between colors and numbers, cluster names of paintings by 17th century Dutch masters and names of books by English novelists, and show the ability to understand emergencies and primes, and we demonstrate the ability to do a simple automatic English-Spanish translation. Finally, we use the WordNet database as an objective baseline against which to judge the performance of our method. We conduct a massive randomized trial in binary classification using support vector machines to learn categories based on our Google distance, resulting in a mean agreement of 87 percent with the expert crafted WordNet categories.
Index Terms: Accuracy comparison with WordNet categories, automatic classification and clustering, automatic meaning discovery using Google, automatic relative semantics, automatic translation, dissimilarity semantic distance, Google search, Google distribution via page hit counts, Google code, Kolmogorov complexity, normalized compression distance (NCD), normalized information distance (NID), normalized Google distance (NGD), meaning of words and phrases extracted from the Web, parameter-free data mining, universal similarity metric.
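As a rough illustration of how a distance can be derived from page counts alone, the sketch below uses the normalized Google distance in the form it is usually stated: NGD(x, y) = (max(log f(x), log f(y)) - log f(x, y)) / (log N - min(log f(x), log f(y))), where f gives page counts and N is the number of indexed pages. The function name `ngd`, its parameters, the handling of zero counts, and all counts in the usage lines are assumptions made for illustration; they are not taken from the listing above, and real counts would have to come from a search engine.

```python
import math


def ngd(fx, fy, fxy, n):
    """Normalized Google distance from page counts.

    fx, fy : pages containing term x, term y (hypothetical inputs)
    fxy    : pages containing both terms
    n      : total number of indexed pages (must exceed fx and fy)
    """
    # Choice made by this sketch: terms that never occur, or never
    # co-occur, are treated as infinitely far apart.
    if fx <= 0 or fy <= 0 or fxy <= 0:
        return math.inf
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))


# Made-up counts: two terms that always co-occur, and two that rarely do.
always_together = ngd(1000, 1000, 1000, 10**6)
rarely_together = ngd(100, 100, 1, 10**4)
```

Smaller values indicate terms that tend to appear on the same pages; the first call yields a distance of zero because the joint count equals both individual counts.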
Learning Concept Hierarchies from Text Corpora Using Formal Concept Analysis
- Journal of Artificial Intelligence Research, 2005
- Cited by 73 (4 self)
We present a novel approach to the automatic acquisition of taxonomies or concept hierarchies from a text corpus. The approach is based on Formal Concept Analysis (FCA), a method mainly used for the analysis of data, i.e. for investigating and processing explicitly given information. We follow Harris' distributional hypothesis and model the context of a certain term as a vector representing syntactic dependencies which are automatically acquired from the text corpus with a linguistic parser. On the basis of this context information, FCA produces a lattice that we convert into a special kind of partial order constituting a concept hierarchy. The approach is evaluated by comparing the resulting concept hierarchies with hand-crafted taxonomies for two domains: tourism and finance. We also directly compare our approach with hierarchical agglomerative clustering as well as with Bi-Section-KMeans as an instance of a divisive clustering algorithm. Furthermore, we investigate the impact of using different measures weighting the contribution of each attribute as well as of applying a particular smoothing technique to cope with data sparseness.
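To make the lattice idea concrete, here is a minimal sketch of computing formal concepts from a small binary context in which terms are objects and the verbs they appear with are attributes. The context, the function name `formal_concepts`, and the brute-force enumeration over object subsets are all assumptions chosen for readability; the paper derives its contexts from parsed corpora, and a practical system would use a dedicated FCA algorithm rather than this exponential loop.

```python
from itertools import combinations


def formal_concepts(context):
    """Return all (extent, intent) pairs of a formal context.

    context maps each object to the set of attributes it has.
    Brute force: close every subset of objects. Exponential in the
    number of objects, so only suitable for toy examples.
    """
    objects = list(context)
    all_attrs = set().union(*context.values())
    concepts = set()
    for size in range(len(objects) + 1):
        for subset in combinations(objects, size):
            # Intent: attributes shared by every object in the subset.
            intent = set(all_attrs)
            for obj in subset:
                intent &= context[obj]
            # Extent: every object that has all attributes of the intent.
            extent = frozenset(o for o in objects if intent <= context[o])
            concepts.add((extent, frozenset(intent)))
    return concepts


# Hypothetical tourism-style context (not data from the paper).
toy_context = {
    "hotel": {"bookable"},
    "apartment": {"bookable", "rentable"},
    "car": {"bookable", "rentable", "driveable"},
}
toy_concepts = formal_concepts(toy_context)
```

Ordering the resulting concepts by inclusion of their extents gives the lattice; in this toy case the three concepts form a single chain from "everything bookable" down to "car".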
Gimme’ The Context: Context-driven Automatic Semantic Annotation with C-PANKOW
- 2005
- Cited by 60 (2 self)
Without the proliferation of formal semantic annotations, the Semantic Web is certainly doomed to failure. In earlier work we presented a new paradigm to avoid this: the ‘Self Annotating Web’, in which globally available knowledge is used to annotate resources such as web pages. In particular, we presented a concrete method instantiating this paradigm, called PANKOW (Pattern-based ANnotation through Knowledge On the Web). In PANKOW, a named entity to be annotated is put into several linguistic patterns that convey competing semantic meanings. The patterns that are matched most often on the Web indicate the meaning of the named entity, leading to automatic or semi-automatic annotation. In this paper we present C-PANKOW (Context-driven PANKOW), which alleviates several shortcomings of PANKOW. First, by downloading abstracts and processing them off-line, we avoid the generation of a large number of linguistic patterns and a correspondingly large number of Google queries. Second, by linguistically analyzing and normalizing the downloaded abstracts, we increase the coverage of our pattern matching mechanism and overcome several limitations of the earlier pattern generation process. Third, we use the annotation context in order to distinguish the significance of a pattern match for the given annotation task. Our experiments show that C-PANKOW inherits all the advantages of PANKOW (no training required etc.), but in addition it is far more efficient and effective.
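The pattern-counting idea described for PANKOW can be sketched roughly as follows: instantiate a few patterns for each candidate concept, look up how often each instantiated phrase occurs, and pick the concept with the highest total. The two patterns, the function names, and the hit counts are hypothetical stand-ins; the abstract does not list the actual patterns, and a real system would query a search engine instead of a dictionary.

```python
# Hypothetical patterns; naive "s" pluralization is a simplification.
PATTERNS = [
    "{concept}s such as {entity}",
    "{entity} is a {concept}",
]


def annotate(entity, concepts, hit_count):
    """Score each candidate concept by summed pattern hits.

    hit_count is any callable mapping a phrase to a count, which
    keeps the sketch independent of a particular search API.
    """
    scores = {}
    for concept in concepts:
        phrases = [p.format(concept=concept, entity=entity) for p in PATTERNS]
        scores[concept] = sum(hit_count(phrase) for phrase in phrases)
    best = max(scores, key=scores.get)
    return best, scores


# Made-up counts standing in for search-engine results.
FAKE_COUNTS = {
    "rivers such as Niger": 40,
    "Niger is a river": 25,
    "towns such as Niger": 3,
    "Niger is a town": 5,
}


def fake_hit_count(phrase):
    return FAKE_COUNTS.get(phrase, 0)


best_concept, concept_scores = annotate("Niger", ["river", "town"], fake_hit_count)
```

Passing the counting function in as a parameter is a design choice of this sketch: it allows the same scoring code to run against offline text, as C-PANKOW is described as doing with downloaded abstracts, or against live queries.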
Automatic Meaning Discovery Using Google
- Manuscript, CWI, 2004; http://arxiv.org/abs/cs.CL/0412098
- Cited by 29 (2 self)
We have found a method to automatically extract the meaning of words and phrases from the world-wide-web using Google page counts. The approach is novel in its unrestricted problem domain, simplicity of implementation, and manifestly ontological underpinnings. The world-wide-web is the largest database on earth, and the latent semantic context information entered by millions of independent users averages out to provide automatic meaning of useful quality. We demonstrate positive correlations, evidencing an underlying semantic structure, in both numerical symbol notations and number-name words in a variety of natural languages and contexts. Next, we demonstrate the ability to distinguish between colors and numbers, and to distinguish between 17th century Dutch painters; the ability to understand electrical terms, religious terms, and emergency incidents; we conduct a massive experiment in understanding WordNet categories; and finally we demonstrate the ability to do a simple automatic English-Spanish translation.
A Method to Combine Linguistic Ontology-Mapping Techniques
- In International Semantic Web Conference, 2005
- Cited by 21 (2 self)
Abstract. We discuss four linguistic ontology-mapping techniques and evaluate them on real-life ontologies in the domain of food. Furthermore we propose a method to combine ontology-mapping techniques with high Precision and Recall to reduce the necessary amount of manual labor and computation.
2006a. Learning effective surface text patterns for information extraction
- Proceedings of the EACL Workshop on Adaptive Text Extraction and Mining, pp. 1–8
- Cited by 12 (5 self)
We present a novel method to identify effective surface text patterns using an internet search engine. Precision is only one of the criteria to identify the most effective patterns among the candidates found. Another aspect is frequency of occurrence. Also, a pattern has to relate diverse instances if it expresses a non-functional relation. The learned surface text patterns are applied in an ontology population algorithm, which not only learns new instances of classes but also new instance pairs of relations. We present some first experiments with these methods.
Entity extraction via ensemble semantics
- In Proc. of EMNLP, 2009
- Cited by 11 (1 self)
Combining information extraction systems yields significantly higher quality resources than each system in isolation. In this paper, we generalize such a mixing of sources and features in a framework called Ensemble Semantics. We show very large gains in entity extraction by combining state-of-the-art distributional and pattern-based systems with a large set of features from a web crawl, query logs, and Wikipedia. Experimental results on a web-scale extraction of actors, athletes and musicians show significantly higher mean average precision scores (29% gain) compared with the current state of the art.
Using the Web to resolve coreferent bridging in German newspaper text
- 2007
- Cited by 6 (2 self)
Abstract. We adopt Markert and Nissim (2005)’s approach of using the World Wide Web to resolve cases of coreferent bridging for German and discuss the strengths and weaknesses of this approach. As the general approach of using surface patterns to get information on ontological relations between lexical items has only been tried on English, it is also interesting to see whether the approach works for German as well as it does for English and what differences between these languages need to be accounted for. We also present a novel approach for combining several patterns that yields an ensemble that outperforms the best-performing single patterns in terms of both precision and recall.
Coreference Resolution Using Semantic Relatedness Information from Automatically Discovered Patterns
- Cited by 6 (1 self)
Semantic relatedness is a very important factor for the coreference resolution task. To obtain this semantic information, corpus-based approaches commonly leverage patterns that can express a specific semantic relation. The patterns, however, are designed manually and thus are not necessarily the most effective ones in terms of accuracy and breadth. To deal with this problem, in this paper we propose an approach that can automatically find the effective patterns for coreference resolution. We explore how to automatically discover and evaluate patterns, and how to exploit the patterns to obtain the semantic relatedness information. The evaluation on the ACE data set shows that the pattern-based semantic information is helpful for coreference resolution.
Automatic Ontology Population by Googling
- In: Proceedings of the 17th Belgium-Netherlands Conference on Artificial Intelligence (BNAIC), 2005
- Cited by 5 (1 self)
We discuss a method to populate ontologies with the use of googled text fragments. We populate an ontology by the use of hand-crafted domain-specific relation patterns, which can be seen as a generalization of Hearst patterns. The algorithm described uses instances of some class returned by Google to find instances of other classes. A case study on populating an ontology on the movie domain is presented as an illustration of the method. We present the algorithm in detail and discuss the results of our work.
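For readers unfamiliar with Hearst patterns, the following minimal sketch shows the classic "X such as A, B and C" pattern applied to a text fragment with a regular expression. The regular expressions, the function name `such_as_instances`, and the example sentence are illustrative assumptions; the paper's hand-crafted, domain-specific relation patterns are not given in the abstract and are not reproduced here.

```python
import re

# Class noun, the cue phrase "such as", then a list ending at '.' or ';'.
SUCH_AS = re.compile(r"(?P<cls>\w+) such as (?P<items>[^.;]+)")
# List separators: commas and the words "and" / "or".
SEPARATOR = re.compile(r"\s*(?:,|\band\b|\bor\b)\s*")


def such_as_instances(text):
    """Return (class, instance) pairs found by the 'such as' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(text):
        cls = match.group("cls")
        for item in SEPARATOR.split(match.group("items")):
            item = item.strip()
            if item:  # skip empty pieces produced by ", and"
                pairs.append((cls, item))
    return pairs


# Made-up fragment standing in for a search-result snippet.
snippet = "He admires directors such as Alfred Hitchcock, Stanley Kubrick and Orson Welles."
found = such_as_instances(snippet)
```

In an ontology population loop of the kind the abstract describes, the instances found for one class could then be used to form new queries that surface instances of related classes.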

