Results 1 - 10 of 77
Using the web to obtain frequencies for unseen bigrams
- Computational Linguistics, 2003
"... This paper shows that the web can be employed to obtain frequencies for bigrams that are unseen in a given corpus. We describe a method for retrieving counts for adjective-noun, noun-noun, and verb-object bigrams from the web by querying a search engine. We evaluate this method by demonstrating: (a) ..."
Abstract
-
Cited by 169 (2 self)
- Add to MetaCart
This paper shows that the web can be employed to obtain frequencies for bigrams that are unseen in a given corpus. We describe a method for retrieving counts for adjective-noun, noun-noun, and verb-object bigrams from the web by querying a search engine. We evaluate this method by demonstrating: (a) a high correlation between web frequencies and corpus frequencies; (b) a reliable correlation between web frequencies and plausibility judgments; (c) a reliable correlation between web frequencies and frequencies recreated using class-based smoothing; (d) a good performance of web frequencies in a pseudo-disambiguation task.
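The count-retrieval step described above can be sketched in a few lines. This is an illustration of the idea, not the authors' implementation: `search_hit_count` is a hypothetical stand-in for whatever search-engine API is available (the paper queried commercial engines of its day with quoted phrases).

```python
# Sketch of the web-count idea: use a search engine's reported page count
# as a stand-in frequency for a bigram unseen in the corpus.

def search_hit_count(query: str) -> int:
    """Hypothetical: return the engine's reported hit count for `query`."""
    raise NotImplementedError("plug in a real search-engine API here")

def web_bigram_frequency(word1: str, word2: str) -> int:
    # Quoting the pair makes the engine match the words as an adjacent
    # phrase, e.g. '"hungry gardener"' for an unseen adjective-noun bigram.
    return search_hit_count(f'"{word1} {word2}"')
```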
WordNet 2 - A Morphologically and Semantically Enhanced Resource
- University of Maryland, 1999
"... This paper presents an on-going project intended to enhance WordNet morphologically and semantically. The motivation for this work steams from the current limitations of WordNet when used as a linguistic knowledge base. We envision a software tool that automatically parses the conceptual defining gl ..."
Abstract
-
Cited by 84 (6 self)
- Add to MetaCart
(Show Context)
This paper presents an ongoing project intended to enhance WordNet morphologically and semantically. The motivation for this work stems from the current limitations of WordNet when used as a linguistic knowledge base. We envision a software tool that automatically parses the conceptual defining glosses, attributing part-of-speech tags and phrasal brackets. The nouns, verbs, adjectives, and adverbs from every definition are then disambiguated and linked to the corresponding synsets. This increases the connectivity between synsets, allowing the retrieval of topically related concepts. Furthermore, the tool transforms the glosses, first into logical forms, and then into semantic forms. Using derivational morphology, new links are added between the synsets.
Web-based models for natural language processing
- ACM Transactions on Speech and Language Processing, 2005
"... Previous work demonstrated that Web counts can be used to approximate bigram counts, suggesting that Web-based frequencies should be useful for a wide variety of Natural Language Processing (NLP) tasks. However, only a limited number of tasks have so far been tested using Web-scale data sets. The pr ..."
Abstract
-
Cited by 76 (0 self)
- Add to MetaCart
Previous work demonstrated that Web counts can be used to approximate bigram counts, suggesting that Web-based frequencies should be useful for a wide variety of Natural Language Processing (NLP) tasks. However, only a limited number of tasks have so far been tested using Web-scale data sets. The present article overcomes this limitation by systematically investigating the performance of Web-based models for several NLP tasks, covering both syntax and semantics, both generation and analysis, and a wider range of n-grams and parts of speech than have been previously explored. For the majority of our tasks, we find that simple, unsupervised models perform better when n-gram counts are obtained from the Web rather than from a large corpus. In some cases, performance can be improved further by using backoff or interpolation techniques that combine Web counts and corpus counts. However, unsupervised Web-based models generally fail to outperform supervised state-of-the-art models trained on smaller corpora. We argue that Web-based models should therefore be used as a baseline for, rather than an alternative to, standard supervised models.
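As a rough illustration of the interpolation idea mentioned in the abstract, one can mix log-scaled web and corpus counts with a tunable weight. The weight and the add-one smoothing below are illustrative assumptions, not the article's settings.

```python
import math

def interpolated_log_count(web_count: int, corpus_count: int,
                           lam: float = 0.5) -> float:
    # Mix the two evidence sources on a log scale; `lam` is an illustrative
    # weight, and add-one keeps the logs defined when a count is zero.
    return (lam * math.log(web_count + 1)
            + (1 - lam) * math.log(corpus_count + 1))
```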
TopCat: Data Mining for Topic Identification in a Text Corpus
- In Proceedings of the 3rd European Conference on Principles and Practice of Knowledge Discovery in Databases, 2002
"... TopCat (Topic Categories) is a technique for identifying topics that recur in articles in a text corpus. Natural language processing techniques are used to identify key entities in individual articles, allowing us to represent an article as a set of items. This allows us to view the problem in a dat ..."
Abstract
-
Cited by 53 (5 self)
- Add to MetaCart
(Show Context)
TopCat (Topic Categories) is a technique for identifying topics that recur in articles in a text corpus. Natural language processing techniques are used to identify key entities in individual articles, allowing us to represent an article as a set of items. This allows us to view the problem in a database/data mining context: Identifying related groups of items. This paper presents a novel method for identifying related items based on "traditional" data mining techniques. Frequent itemsets are generated from the groups of items, followed by clusters formed with a hypergraph partitioning scheme. We present an evaluation against a manually-categorized "ground truth" news corpus showing this technique is effective in identifying "topics" in collections of news articles.
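TopCat's full pipeline goes on to cluster itemsets with hypergraph partitioning; the sketch below covers only the first stage, counting frequent entity pairs in the set-of-items view of articles, with a minimum-support threshold chosen by the caller.

```python
from collections import Counter
from itertools import combinations

def frequent_entity_pairs(articles: list[set[str]],
                          min_support: int) -> list[tuple[str, str]]:
    """First stage only: find entity pairs that co-occur in at least
    `min_support` articles (each article is its set of named entities)."""
    pair_counts = Counter()
    for entities in articles:
        # Sorting makes (a, b) and (b, a) count as the same pair.
        for pair in combinations(sorted(entities), 2):
            pair_counts[pair] += 1
    return [pair for pair, n in pair_counts.items() if n >= min_support]
```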
Word Translation Disambiguation Using Bilingual Bootstrapping
- Computational Linguistics, 2002
"... This paper proposes a new method for word translation disambiguation using a machine learning technique called `Bilingual Bootstrapping'. Bilingual Bootstrapping makes use of # in learning# a small number of classified data and a large number of unclassified data in the source and th ..."
Abstract
-
Cited by 46 (2 self)
- Add to MetaCart
This paper proposes a new method for word translation disambiguation using a machine learning technique called 'Bilingual Bootstrapping'. Bilingual Bootstrapping makes use of, in learning, a small number of classified data and a large number of unclassified data in the source and target languages of the translation. It constructs classifiers in the two languages in parallel and repeatedly boosts the performance of the classifiers by further classifying data in each of the two languages and by exchanging information about the classified data between the two languages. Experimental results indicate that word translation disambiguation based on Bilingual Bootstrapping consistently and significantly outperforms the existing methods based on 'Monolingual Bootstrapping'.
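The loop structure described in the abstract can be sketched as follows. Every helper here is a hypothetical stub standing in for one of the paper's components (per-sense classifiers, a confidence filter, and cross-language transfer via a bilingual dictionary), so this shows the control flow only.

```python
# Skeleton of the bilingual bootstrapping loop; all helpers are hypothetical
# stubs, not the paper's actual components.

def train(labeled):
    raise NotImplementedError  # fit a word-sense classifier on labeled data

def confident_labels(clf, unlabeled):
    raise NotImplementedError  # keep only high-confidence predictions

def translate(examples):
    raise NotImplementedError  # map labeled examples to the other language

def bilingual_bootstrap(lab_src, lab_tgt, unlab_src, unlab_tgt, rounds=10):
    clf_src = clf_tgt = None
    for _ in range(rounds):
        # Train one classifier per language on its current labeled data.
        clf_src, clf_tgt = train(lab_src), train(lab_tgt)
        new_src = confident_labels(clf_src, unlab_src)
        new_tgt = confident_labels(clf_tgt, unlab_tgt)
        # The step that distinguishes bilingual from monolingual
        # bootstrapping: each side also receives the other's new labels.
        lab_src += new_src + translate(new_tgt)
        lab_tgt += new_tgt + translate(new_src)
    return clf_src, clf_tgt
```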
Automatic Assignment of Wikipedia Encyclopedic Entries to WordNet Synsets
- In Proceedings of the Atlantic Web Intelligence Conference (AWIC-2005), Lecture Notes in Computer Science, Volume 3528, 2005
"... Abstract. We describe an approach taken for automatically associating entries from an on-line encyclopedia with concepts in an ontology or a lexical semantic network. It has been tested with the Simple English Wikipedia and WordNet, although it can be used with other resources. The accuracy in disam ..."
Abstract
-
Cited by 44 (3 self)
- Add to MetaCart
(Show Context)
We describe an approach for automatically associating entries from an on-line encyclopedia with concepts in an ontology or a lexical semantic network. It has been tested with the Simple English Wikipedia and WordNet, although it can be used with other resources. The accuracy in disambiguating the sense of the encyclopedia entries reaches 91.11% (83.89% for polysemous words). It will be applied to enriching ontologies with encyclopedic knowledge.
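As a toy version of the linking problem, a Lesk-style overlap between the entry text and each candidate synset's gloss already gives the flavor. The code below uses NLTK's real WordNet interface, but it is not the paper's method, which the abstract only summarizes.

```python
# Toy Lesk-style linker: pick the WordNet synset whose gloss shares the
# most words with the encyclopedia entry. Requires NLTK with the wordnet
# corpus downloaded (nltk.download('wordnet')).

from nltk.corpus import wordnet as wn

def link_entry(title: str, entry_text: str):
    entry_words = set(entry_text.lower().split())
    candidates = wn.synsets(title)
    if not candidates:
        return None
    return max(candidates,
               key=lambda s: len(set(s.definition().lower().split())
                                 & entry_words))

# e.g. link_entry("jaguar", "large spotted cat of the Americas ...")
```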
Web-Scale N-gram Models for Lexical Disambiguation
"... Web-scale data has been used in a diverse range of language research. Most of this research has used web counts for only short, fixed spans of context. We present a unified view of using web counts for lexical disambiguation. Unlike previous approaches, our supervised and unsupervised systems combin ..."
Abstract
-
Cited by 42 (5 self)
- Add to MetaCart
Web-scale data has been used in a diverse range of language research. Most of this research has used web counts for only short, fixed spans of context. We present a unified view of using web counts for lexical disambiguation. Unlike previous approaches, our supervised and unsupervised systems combine information from multiple and overlapping segments of context. On the tasks of preposition selection and context-sensitive spelling correction, the supervised system reduces disambiguation error by 20-24% over the current state of the art.
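The "multiple and overlapping segments" idea can be sketched as scoring each candidate by the summed log counts of every n-gram window that covers its position. `ngram_count` is a hypothetical lookup into a web-scale n-gram collection; the 5-gram cap and add-one smoothing are illustrative choices.

```python
import math

def ngram_count(ngram: tuple[str, ...]) -> int:
    raise NotImplementedError  # hypothetical web-scale n-gram lookup

def candidate_score(tokens: list[str], pos: int, candidate: str,
                    max_n: int = 5) -> float:
    """Sum log counts of every 2..max_n-gram covering position `pos` with
    `candidate` substituted in -- the overlapping-segments idea above."""
    filled = tokens[:pos] + [candidate] + tokens[pos + 1:]
    score = 0.0
    for n in range(2, max_n + 1):
        # An n-gram covers `pos` when it starts within n-1 tokens before it.
        for start in range(pos - n + 1, pos + 1):
            if start >= 0 and start + n <= len(filled):
                score += math.log(
                    ngram_count(tuple(filled[start:start + n])) + 1)
    return score
```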
Using the Web to Overcome Data Sparseness
- In Proceedings of EMNLP-02, 2002
"... This paper shows that the web can be employed to obtain frequencies for bigrams that are unseen in a given corpus. We describe a method for retrieving counts for adjective-noun, noun-noun, and verbobject bigrams from the web by querying a search engine. We evaluate this method by demonstratin ..."
Abstract
-
Cited by 38 (1 self)
- Add to MetaCart
This paper shows that the web can be employed to obtain frequencies for bigrams that are unseen in a given corpus. We describe a method for retrieving counts for adjective-noun, noun-noun, and verb-object bigrams from the web by querying a search engine. We evaluate this method by demonstrating that web frequencies correlate with frequencies obtained from a carefully edited, balanced corpus.
Web as corpus
- Lancaster University, 2001
"... The web, teeming as it is with language data, of all manner of varieties and languages, in vast quantity and freely available, is a fabulous linguists ’ playground. The Special Issue explores ways in which this dream is being explored. 1 ..."
Abstract
-
Cited by 33 (2 self)
- Add to MetaCart
(Show Context)
The web, teeming as it is with language data, of all manner of varieties and languages, in vast quantity and freely available, is a fabulous linguists' playground. The Special Issue explores ways in which this dream is being realized.