Results 1 - 7 of 7
Finding Predominant Word Senses in Untagged Text
- In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, 2004
Cited by 40 (2 self)
In word sense disambiguation (WSD), the heuristic of choosing the most common sense is extremely powerful because the distribution of the senses of a word is often skewed. The problem with using the predominant, or first sense heuristic, aside from the fact that it does not take surrounding context into account, is that it assumes some quantity of hand-tagged data. Whilst there are a few hand-tagged corpora available for some languages, one would expect the frequency distribution of the senses of words, particularly topical words, to depend on the genre and domain of the text under consideration. We present work on the use of a thesaurus acquired from raw textual corpora and the WordNet similarity package to find predominant noun senses automatically. The acquired predominant senses give a precision of 64% on the nouns of the SENSEVAL-2 English all-words task. This is a very promising result given that our method does not require any hand-tagged text, such as SemCor. Furthermore, we demonstrate that our method discovers appropriate predominant senses for words from two domain-specific corpora.
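For orientation, the first sense heuristic that this abstract takes as its starting point can be sketched as below. This is the hand-tagged baseline, not the thesaurus-based method the paper proposes, and the function name, the sense labels, and the toy sample are all hypothetical.

```python
from collections import Counter, defaultdict

def predominant_senses(tagged_tokens):
    """Map each word to the sense it carries most often in a sense-tagged sample."""
    counts = defaultdict(Counter)
    for word, sense in tagged_tokens:
        counts[word][sense] += 1
    # most_common(1) returns a one-element list of (sense, count) pairs.
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

# Hypothetical hand-tagged sample; the sense labels are invented for illustration.
sample = [
    ("bank", "bank.finance"),
    ("bank", "bank.finance"),
    ("bank", "bank.river"),
    ("star", "star.celebrity"),
]
first_sense = predominant_senses(sample)
```

The sketch makes the limitation raised in the abstract concrete: the mapping can only be built from text that already carries sense tags, and it would change if the sample came from a different domain.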
Unsupervised Large-Vocabulary Word Sense Disambiguation with Graph-Based Algorithms for Sequence Data Labeling
- In HLT/EMNLP 2005, 2005
Cited by 20 (0 self)
This paper introduces a graph-based algorithm for sequence data labeling, using random walks on graphs encoding label dependencies. The algorithm is illustrated and tested in the context of an unsupervised word sense disambiguation problem, and shown to significantly outperform the accuracy achieved through individual label assignment, as measured on standard sense-annotated data sets.
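The general idea of labeling a sequence by a random walk over a graph of candidate labels can be sketched as follows. The abstract does not specify the walk or the edge weights, so the PageRank-style update, the damping value, and the toy dependency weights here are assumptions made for illustration only.

```python
def random_walk_scores(edges, damping=0.85, iterations=50):
    """Score the vertices of an undirected weighted graph with a PageRank-style walk.

    edges maps (u, v) pairs to a positive dependency weight (assumed given).
    """
    neighbors = {}
    for (u, v), w in edges.items():
        neighbors.setdefault(u, {})[v] = w
        neighbors.setdefault(v, {})[u] = w
    scores = {node: 1.0 for node in neighbors}
    for _ in range(iterations):
        updated = {}
        for node in neighbors:
            # Each neighbor passes on its score in proportion to the edge weight.
            incoming = sum(
                w / sum(neighbors[other].values()) * scores[other]
                for other, w in neighbors[node].items()
            )
            updated[node] = (1 - damping) + damping * incoming
        scores = updated
    return scores

def label_sequence(candidates, edges):
    """Pick, for each position, the candidate label with the highest walk score."""
    scores = random_walk_scores(edges)
    return [max(labels, key=lambda label: scores.get(label, 0.0)) for labels in candidates]

# Hypothetical two-word sequence with two candidate labels per word.
toy_edges = {("a1", "b1"): 0.9, ("a1", "b2"): 0.1, ("a2", "b1"): 0.1, ("a2", "b2"): 0.1}
chosen = label_sequence([["a1", "a2"], ["b1", "b2"]], toy_edges)
```

In this toy graph the strongly connected pair a1 and b1 reinforce each other, so both are selected. That mutual reinforcement is the contrast with individual label assignment that the abstract draws.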
Deriving Generalized Knowledge from Corpora using WordNet Abstraction
- Proc. EACL'09, 2009
Cited by 5 (3 self)
Existing work in the extraction of commonsense knowledge from text has been primarily restricted to factoids that serve as statements about what may possibly obtain in the world. We present an approach to deriving stronger, more general claims by abstracting over large sets of factoids. Our goal is to coalesce the observed nominals for a given predicate argument into a few predominant types, obtained as WordNet synsets. The results can be construed as generically quantified sentences restricting the semantic type of an argument position of a predicate.
The "Meaning" System on the English Allwords Task
Cited by 3 (2 self)
The "Meaning" system has been developed within the framework of the Meaning European research project. It is a combined system, which integrates several supervised machine learning word sense disambiguation modules, and several knowledge-based (unsupervised) modules. See section 2 for details. The supervised modules have been trained exclusively on the SemCor corpus, while the unsupervised modules use WordNet-based lexico-semantic resources integrated in the Multilingual Central Repository (MCR) of the Meaning project (Atserias et al., 2004). The architecture of the system is quite simple. Raw text is passed through a pipeline of linguistic processors (tokenizers, POS tagging, named entity extraction, and parsing) and then a Feature Extraction module codifies examples with features extracted from the linguistic annotation and MCR. The supervised modules have priority over the unsupervised and they are combined using a weighted voting scheme. For the words lacking
Automatic Identification of Infrequent Word Senses
- In Proceedings of the 20th International Conference on Computational Linguistics, COLING-2004, 2004
In this paper we show that an unsupervised method for ranking word senses automatically can be used to identify infrequently occurring senses. We demonstrate this using a ranking of noun senses derived from the BNC and evaluating on the sense-tagged text available in both SemCor and the SENSEVAL-2 English all-words task. We show that the method does well at identifying senses that do not occur in a corpus, and that those that are erroneously filtered but do occur typically have a lower frequency than the other senses. This method should be useful for word sense disambiguation systems, allowing effort to be concentrated on more frequent senses; it may also be useful for other tasks such as lexical acquisition. Whilst the results on balanced corpora are promising, our chief motivation for the method is for application to domain-specific text. For text within a particular domain many senses from a generic inventory will be rare, and possibly redundant. Since a large domain-specific corpus of sense-annotated data is not available, we evaluate our method on domain-specific corpora and demonstrate that sense types identified for removal are predominantly senses from outside the domain.
Using Semantic Distance to Automatically Suggest Transfer Course Equivalencies
Semantic distance is the degree of closeness between two pieces of text determined by their meaning. Semantic distance is typically measured by analyzing a set of documents or a list of terms and assigning a metric based on the likeness of their meaning or the concept they represent. Although related research provides some semantic-based algorithms, few applications exist. This work proposes a semantic-based approach for automatically identifying potential course equivalencies given their catalog descriptions. The method developed by Li et al. (2006) is extended in this paper to take a course description from one university as the input and suggest equivalent courses offered at another university. Results are evaluated and future work is discussed.
... of institutions, it is not always up to date and the data set is sparse and non-uniformed. This work proposes an approach to automatically identify course equivalencies by analyzing the course descriptions and comparing their semantic distance. The course descriptions are first pruned and unrelated contexts are removed. Given a course from another university, the algorithm measures word, sentence, and paragraph similarities to suggest a list of potentially equivalent courses offered by UML. This work has two goals: (1) to efficiently and accurately suggest equivalent courses to reduce the workload of transfer coordinators, and (2) to explore new applications using semantic distance to move toward the Semantic Web, i.e., to turn existing resources into knowledge structures.
Efficient Hierarchical Entity Classifier Using Conditional Random Fields
In this paper we develop an automatic classifier for a very large set of labels, the WordNet synsets. We employ Conditional Random Fields (CRFs) because of their flexibility to include a wide variety of non-independent features. Training CRFs on a big number of labels proved a problem because of the large training cost. By taking into account the hypernym/hyponym relation between synsets in WordNet, we reduced the complexity of training from $O(TM^2NG)$ to $O(T(\log M)^2NG)$ with only a limited loss in accuracy.
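To get a feel for the size of the reduction stated in this abstract, the two cost expressions can be compared numerically. This is only a rough sketch: the abstract does not define T, N, or G, so they are treated as opaque factors that cancel in the ratio. The logarithm base is assumed to be 2, and asymptotic notation hides constant factors in any case.

```python
import math

def flat_cost(T, M, N, G):
    """Cost expression for training over M labels without the hierarchy: T * M^2 * N * G."""
    return T * M**2 * N * G

def hierarchical_cost(T, M, N, G):
    """Cost expression using the hypernym/hyponym hierarchy: T * (log M)^2 * N * G."""
    return T * math.log2(M) ** 2 * N * G

def cost_ratio(M):
    """Ratio of the two expressions; T, N and G cancel, leaving M^2 / (log2 M)^2."""
    return flat_cost(1, M, 1, 1) / hierarchical_cost(1, M, 1, 1)

# Hypothetical label-set size chosen so that log2(M) is exactly 10.
ratio_1024 = cost_ratio(1024)
```

For M = 1024 the ratio is 1024^2 / 10^2, roughly 10,486, which shows why the label-count term dominates training cost when the label set is as large as the WordNet synsets.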

