The Reuters Corpus Volume 1 - from yesterday’s news to tomorrow’s language resources
- In Proceedings of the Third International Conference on Language Resources and Evaluation
, 2002
Cited by 64 (1 self)
Reuters, the global information, news and technology group, has for the first time made available, free of charge, large quantities of archived Reuters news stories for use by research communities around the world. The Reuters Corpus Volume 1 (RCV1) includes over 800,000 news stories, typical of the annual English language news output of Reuters. This paper describes the origins of RCV1, the motivations behind its creation, and how it differs from previous corpora. In addition, we discuss the system of category coding, whereby each story is annotated for topic, region and industry sector. We also discuss the process by which these codes were applied, and examine the issues involved in maintaining quality and consistency of coding in an operational, commercial environment.
Development and Use of a Gold-Standard Data Set for Subjectivity Classifications
, 1999
Cited by 48 (7 self)
and improving intercoder reliability in discourse tagging using statistical techniques. Bias-corrected tags are formulated and successfully used to guide a revision of the coding manual and develop an automatic classifier.
HyperLex: Lexical Cartography for Information Retrieval
- TO APPEAR IN COMPUTER SPEECH AND LANGUAGE SPECIAL ISSUE ON WORD SENSE DISAMBIGUATION
Cited by 24 (0 self)
This article describes an algorithm called HyperLex that is capable of automatically determining word uses in a textbase without recourse to a dictionary. The algorithm makes use of the specific properties of word cooccurrence graphs, which are shown as having "small world" properties. Unlike earlier dictionary-free methods based on word vectors, it can isolate highly infrequent uses (as rare as 1% of all occurrences) by detecting "hubs" and high-density components in the cooccurrence graphs. The algorithm is applied here to information retrieval on the Web, using a set of highly ambiguous test words. An evaluation of the algorithm showed that it only omitted a very small number of relevant uses. In addition, HyperLex offers automatic tagging of word uses in context with excellent precision (97%, compared to 73% for baseline tagging, with an 82% recall rate). Remarkably good precision (96%) was also achieved on a selection of the 25 most relevant pages for each use (including highly infrequent ones). Finally, HyperLex is combined with a graphic display technique that allows the user to navigate visually through the lexicon and explore the various domains detected for each word use.
Gold Standard Datasets for Evaluating Word Sense Disambiguation Programs
- In Computer and the Humanities
, 1998
Cited by 21 (2 self)
There are now many computer programs for automatically determining the sense in which a word is being used. One would like to be able to say which are better, which worse, and also which words, or varieties of language, present particular problems to which algorithms. An evaluation exercise is required, and such an exercise requires a `gold standard' dataset of correct answers. Producing this proves to be a difficult and challenging task. In this paper I discuss the background, challenges and strategies, and present a detailed methodology for ensuring that the gold standard is not fool's gold. A pilot (`SENSEVAL') is taking place under the auspices of ACL SIGLEX (the Le...
Automatic Sense Tagging Using Parallel Corpora
- In Proceedings of the 6th Natural Language Processing Pacific Rim Symposium
, 2001
Cited by 20 (10 self)
This article reports the results of an analysis of translation equivalents in six languages from different language families, automatically extracted from an on-line 7-way parallel corpus of George Orwell’s Nineteen Eighty-Four. The goal is to determine sense distinctions that can be used to automatically sense-tag the data. Our results show that sense distinctions derived from cross-lingual information correspond to those made by human annotators, especially at the coarse-grained level. We also show that the reliability of sense assignments at finer-grained levels is comparable for human annotators and those produced automatically with cross-lingual data.
Inter-annotator agreement on a multilingual semantic annotation task
- In Proceedings of LREC
, 2006
Building an annotated corpus in the molecular biology domain
- Proc. COLING SAIC Workshop
, 2000
Cited by 8 (2 self)
Corpus annotation is now a key topic for all areas of natural language processing (NLP) and information extraction (IE) which employ supervised learning. With the explosion of results in molecular biology there is an increased need for IE to extract knowledge to support database building and to search intelligently for information in online journal collections. To support this we are building a corpus of annotated abstracts taken from the National Library of Medicine’s MEDLINE database. In this paper we report on this new corpus, its ontological basis, and our experience in designing the annotation scheme. Experimental results are shown for inter-annotator agreement and comments are made on methodological considerations.
Sense tagging: does it make sense?
- Corpus Linguistics’2001 Conference
, 2001
Cited by 4 (0 self)
Sense tagging is probably one of the challenges that corpus linguists have to face in the near future. So far, computerisation of this task has yielded very modest results despite numerous efforts, and sense tagging is turning out to be a touchy task. Difficulties stem from various sources, extracting disambiguating information from the context. However, one of the main problems that lies upstream of the disambiguating process is the sense inventory itself. Most tagging efforts rely on traditional dictionaries to supply the reference senses, or on computer-oriented resources such as WordNet, which do not differ significantly from traditional dictionaries in terms of sense division. The present paper shows that human taggers perform very poorly when given a traditional dictionary as the reference, and that machines should therefore not be expected to perform any better if the same kind of resource is used. A detailed analysis reveals the lack of distributional criteria in dictionary entries: traditional dictionaries are chiefly concerned with meaning definition, and not with the surface clues (syntactic, collocational, etc.) that are required to match a given sense with a given corpus occurrence. It is argued that no fundamental progress can be made until large-scale lexical resources have been built that incorporate extensive distributional information, and that, until that time, any massive sense tagging efforts based on traditional dictionaries or computer-oriented resources such as WordNet would not only be premature but also questionable in terms of resource management.
Tree-cut and A Lexicon based on Systematic Polysemy
- In Proceedings of the North American Chapter of the Association for Computational Linguistics
, 2001
Cited by 4 (0 self)
This paper describes a lexicon organized around systematic polysemy: a set of word senses that are related in systematic and predictable ways. The lexicon is derived by a fully automatic extraction method which utilizes a clustering technique called tree-cut. We compare our lexicon to WordNet cousins, and the inter-annotator disagreement observed between WordNet Semcor and DSO corpora.
Performance Metrics for Word Sense Disambiguation
Cited by 2 (0 self)
This paper presents the area under the Receiver Operating Characteristics (ROC) curve as an alternative metric for evaluating word sense disambiguation performance. The current metrics – accuracy, precision and recall – while suitable for two-way classification, are shown to be inadequate when disambiguating between three or more senses. Specifically, these measures do not facilitate comparison with baseline performance nor are they sensitive to non-uniform misclassification costs. Both of these issues can be addressed using ROC analysis.
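As a rough illustration of the metric named in this abstract, the sketch below computes the area under the ROC curve from ranked classifier scores and applies it once per sense. The abstract does not say how the paper extends ROC analysis to three or more senses, so the one-vs-rest treatment here is an assumption, and the function names `auc` and `one_vs_rest_auc` are hypothetical.

```python
def auc(scores, labels):
    """Area under the ROC curve, computed as the Mann-Whitney statistic.

    scores: classifier confidence that each instance is positive.
    labels: True for positive instances, False for negative ones.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative instance")
    # Fraction of (positive, negative) pairs ranked correctly; ties count half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(score_rows, gold, senses):
    """Per-sense AUC, treating each sense in turn as the positive class.

    Assumption: one-vs-rest is used only for illustration; it is not
    taken from the paper.
    score_rows: one dict per instance mapping sense -> classifier score.
    gold: the correct sense for each instance.
    """
    return {
        sense: auc([row[sense] for row in score_rows],
                   [g == sense for g in gold])
        for sense in senses
    }
```

A random ranking gives an AUC of about 0.5 whatever the sense distribution, which is one way a ranking-based measure allows comparison with a baseline.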

