Results 1 - 10 of 45
Using the Web to Obtain Frequencies for Unseen Bigrams
- Computational Linguistics, 2003
- Cited by 104 (2 self)
This article shows that the Web can be employed to obtain frequencies for bigrams that are unseen in a given corpus. We describe a method for retrieving counts for adjective-noun, noun-noun, and verb-object bigrams from the Web by querying a search engine. We evaluate this method by demonstrating: (a) a high correlation between Web frequencies and corpus frequencies; (b) a reliable correlation between Web frequencies and plausibility judgments; (c) a reliable correlation between Web frequencies and frequencies recreated using class-based smoothing; (d) a good performance of Web frequencies in a pseudodisambiguation task.
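As a rough illustration of evaluation step (a), the sketch below computes a Pearson correlation between Web counts and corpus counts for the bigrams that appear in both tables. It is only a sketch under stated assumptions: the log transformation, the function names, and the toy counts are illustrative choices and are not taken from the article.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def log_count_correlation(web_counts, corpus_counts):
    """Correlate log counts over bigrams present in both tables.

    Assumption: counts are log-transformed before correlating, so every
    shared bigram must have a strictly positive count in both tables.
    """
    shared = sorted(set(web_counts) & set(corpus_counts))
    web_logs = [math.log(web_counts[b]) for b in shared]
    corpus_logs = [math.log(corpus_counts[b]) for b in shared]
    return pearson(web_logs, corpus_logs)


# Hypothetical counts, invented for illustration only.
web = {("red", "car"): 90000, ("big", "house"): 120000, ("tall", "idea"): 40}
corpus = {("red", "car"): 35, ("big", "house"): 60, ("tall", "idea"): 1}
r = log_count_correlation(web, corpus)
```

In this sketch a value of `r` close to 1 would indicate that the two count sources rank and scale the bigrams similarly.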
WordNet 2 - A Morphologically and Semantically Enhanced Resource
- University of Maryland, 1999
- Cited by 55 (3 self)
This paper presents an ongoing project intended to enhance WordNet morphologically and semantically. The motivation for this work stems from the current limitations of WordNet when used as a linguistic knowledge base. We envision a software tool that automatically parses the conceptual defining glosses, attributing part-of-speech tags and phrasal brackets. The nouns, verbs, adjectives and adverbs from every definition are then disambiguated and linked to the corresponding synsets. This increases the connectivity between synsets, allowing the retrieval of topically related concepts. Furthermore, the tool transforms the glosses, first into logical forms, and then into semantic forms. Using derivational morphology, new links are added between the synsets. 1 Motivation WordNet has already been recognized as a valuable resource in the human language technology and knowledge processing communities. Its applicability has been cited in more than 200 papers and systems have been...
Web-based models for natural language processing
- ACM Transactions on Speech and Language Processing, 2005
- Cited by 48 (0 self)
Previous work demonstrated that Web counts can be used to approximate bigram counts, suggesting that Web-based frequencies should be useful for a wide variety of Natural Language Processing (NLP) tasks. However, only a limited number of tasks have so far been tested using Web-scale data sets. The present article overcomes this limitation by systematically investigating the performance of Web-based models for several NLP tasks, covering both syntax and semantics, both generation and analysis, and a wider range of n-grams and parts of speech than have been previously explored. For the majority of our tasks, we find that simple, unsupervised models perform better when n-gram counts are obtained from the Web rather than from a large corpus. In some cases, performance can be improved further by using backoff or interpolation techniques that combine Web counts and corpus counts. However, unsupervised Web-based models generally fail to outperform supervised state-of-the-art models trained on smaller corpora. We argue that Web-based models should therefore be used as a baseline for, rather than an alternative to, standard supervised models.
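The abstract mentions backoff and interpolation without giving formulas, so the following is only a generic sketch of how two count sources could be combined. The function names, the relative-frequency estimates, the mixing weight `lam`, the `threshold` parameter, and the direction of the backoff (corpus first, Web as fallback) are all assumptions made for illustration, not the article's definitions.

```python
def relative_frequency(counts, item):
    """Maximum-likelihood estimate of an item's probability from raw counts."""
    total = sum(counts.values())
    return counts.get(item, 0) / total if total else 0.0


def interpolate(web_counts, corpus_counts, item, lam=0.5):
    """Linear interpolation: lam * P_web + (1 - lam) * P_corpus."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    p_web = relative_frequency(web_counts, item)
    p_corpus = relative_frequency(corpus_counts, item)
    return lam * p_web + (1.0 - lam) * p_corpus


def backoff(corpus_counts, web_counts, item, threshold=1):
    """Use the corpus estimate when the item was seen at least `threshold`
    times in the corpus; otherwise fall back to the Web estimate.
    (The fallback direction is an assumption of this sketch.)"""
    if corpus_counts.get(item, 0) >= threshold:
        return relative_frequency(corpus_counts, item)
    return relative_frequency(web_counts, item)


# Hypothetical counts, invented for illustration only.
web = {"strong tea": 3, "powerful tea": 1}
corpus = {"strong tea": 1, "powerful tea": 1}
p_mix = interpolate(web, corpus, "strong tea", lam=0.5)
```

In practice `lam` and `threshold` would be tuned on held-out data; the values here are placeholders.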
Word Translation Disambiguation Using Bilingual Bootstrapping
- Computational Linguistics, 2002
- Cited by 29 (2 self)
This paper proposes a new method for word translation disambiguation using a machine learning technique called 'Bilingual Bootstrapping'. Bilingual Bootstrapping makes use of, in learning, a small number of classified data and a large number of unclassified data in the source and the target languages in translation. It constructs classifiers in the two languages in parallel and repeatedly boosts the performance of the classifiers by further classifying data in each of the two languages and by exchanging information regarding the classified data between the two languages. Experimental results indicate that word translation disambiguation based on Bilingual Bootstrapping consistently and significantly outperforms the existing methods based on 'Monolingual Bootstrapping'.
TopCat: Data Mining for Topic Identification in a Text Corpus
- In Proceedings of the 3rd European Conference of Principles and Practice of Knowledge Discovery in Databases, 2002
- Cited by 27 (4 self)
TopCat (Topic Categories) is a technique for identifying topics that recur in articles in a text corpus. Natural language processing techniques are used to identify key entities in individual articles, allowing us to represent an article as a set of items. This allows us to view the problem in a database/data mining context: Identifying related groups of items. This paper presents a novel method for identifying related items based on "traditional" data mining techniques. Frequent itemsets are generated from the groups of items, followed by clusters formed with a hypergraph partitioning scheme. We present an evaluation against a manually-categorized "ground truth" news corpus showing this technique is effective in identifying "topics" in collections of news articles.
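The frequent-itemset step described above can be sketched with a small level-wise (Apriori-style) search over articles represented as sets of entities. This is a minimal sketch, not TopCat's implementation: the function name, the `min_support` and `max_size` parameters, and the toy articles are assumptions, and the hypergraph partitioning step that forms the final clusters is not shown.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=3):
    """Return {itemset: support} for itemsets that occur in at least
    `min_support` transactions, searching level by level up to `max_size`."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([item]) for item in items]
    result = {}
    size = 1
    while current and size <= max_size:
        # Count how many transactions contain each candidate itemset.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        frequent = {c: n for c, n in counts.items() if n >= min_support}
        result.update(frequent)
        # Join step: merge frequent itemsets whose union has one more item.
        size += 1
        candidates = {a | b for a, b in combinations(frequent, 2)
                      if len(a | b) == size}
        # Prune step: keep a candidate only if all its subsets are frequent.
        current = [c for c in candidates
                   if all(frozenset(s) in frequent
                          for s in combinations(c, size - 1))]
    return result


# Hypothetical articles, each reduced to a set of named entities.
articles = [
    {"Paris", "France", "Louvre"},
    {"Paris", "Louvre"},
    {"Paris", "France"},
    {"Tokyo", "Japan"},
]
itemsets = frequent_itemsets(articles, min_support=2)
```

With these toy articles, the pair `{"Paris", "France"}` is frequent while `{"France", "Louvre"}` is not, so the three-entity set is pruned before it is ever counted.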
Using the Web to Overcome Data Sparseness
- In Proceedings of EMNLP-02, 2002
- Cited by 25 (1 self)
This paper shows that the web can be employed to obtain frequencies for bigrams that are unseen in a given corpus. We describe a method for retrieving counts for adjective-noun, noun-noun, and verb-object bigrams from the web by querying a search engine. We evaluate this method by demonstrating that web frequencies correlate with frequencies obtained from a carefully edited, balanced corpus.
Experiments in Word Domain Disambiguation for Parallel Texts
- In Proceedings of the ACL Workshop on Word Senses and Multilinguality, 2000
- Cited by 23 (2 self)
This paper describes some preliminary results about Word Domain Disambiguation, a variant of Word Sense Disambiguation where words in a text are tagged with a domain label in place of a sense label. The English WORDNET and its aligned Italian version, MULTIWORDNET, both augmented with domain labels, are used as the main information repositories. A baseline algorithm for Word Domain Disambiguation is presented and then compared with a mutual help disambiguation strategy, which takes advantage of the shared senses of parallel, bilingual texts.
Semantic knowledge construction from annotated image collections
- Proceedings of IEEE International Conference on Multimedia, 2002
- Cited by 19 (3 self)
This paper presents new methods for extracting semantic knowledge from collections of annotated images. The proposed methods include novel automatic techniques for extracting semantic concepts by disambiguating the senses of words in the annotations using the lexical database WordNet, and both the images and their annotations, and for discovering semantic relations among the detected concepts based on WordNet. Another contribution of this paper is the evaluation of several techniques for visual feature descriptor extraction and data clustering in the extraction of semantic concepts. Experiments show the potential of integrating the analysis of both images and annotations for improving the performance of the word-sense disambiguation process. In particular, the accuracy improves by 4-15% with respect to the baseline systems for nature images.
An Iterative Approach to Word Sense Disambiguation
- In Proceedings of FLAIRS-2000, 2000
- Cited by 19 (5 self)
In this paper, we present an iterative algorithm for Word Sense Disambiguation. It combines two sources of information: WordNet and a semantic tagged corpus, for the purpose of identifying the correct sense of the words in a given text. It differs from other standard approaches in that the disambiguation process is performed in an iterative manner: starting from free text, a set of disambiguated words is built, using various methods; new words are sense tagged based on their relation to the already disambiguated words, and then added to the set. This iterative process allows us to identify, in the original text, a set of words which can be disambiguated with high precision; 55% of the verbs and nouns are disambiguated with an accuracy of 92%. Introduction Word Sense Disambiguation (WSD) is an open problem in Natural Language Processing (NLP). Its solution impacts other tasks such as information retrieval, machine translation, discourse, reference resolution and others....

