Improvements in automatic thesaurus extraction
- In Proceedings of the Workshop on Unsupervised Lexical Acquisition, 2002
Cited by 46 (3 self)
The use of semantic resources is common in modern NLP systems, but methods to extract lexical semantics have only recently begun to perform well enough for practical use. We evaluate existing and new similarity metrics for thesaurus extraction, and experiment with the tradeoff between extraction performance and efficiency. We propose an approximation algorithm, based on canonical attributes and coarse- and fine-grained matching, that reduces the time complexity and execution time of thesaurus extraction with only a marginal performance penalty.
Scaling context space
- In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002
Cited by 17 (5 self)
Context is used in many NLP systems as an indicator of a term’s syntactic and semantic function. The accuracy of the system is dependent on the quality and quantity of contextual information available to describe each term. However, the quantity variable is no longer fixed by limited corpus resources. Given fixed training time and computational resources, it makes sense for systems to invest time in extracting high quality contextual information from a fixed corpus. However, with an effectively limitless quantity of text available, extraction rate and representation size need to be considered. We use thesaurus extraction with a range of context extracting tools to demonstrate the interaction between context quantity, time and size on a corpus of 300 million words.
Measures and Applications of Lexical Distributional Similarity
- 2003
Cited by 14 (0 self)
This thesis is concerned with the measurement and application of lexical distributional similarity. Two words are said to be distributionally similar if they appear in similar contexts. This loose definition, however, has led to many measures being proposed or adopted from fields such as geometry, statistics, Information Retrieval (IR) and Information Theory. Our aim is to investigate the properties which make a good measure of lexical distributional similarity. We start by introducing the concept of lexical distributional similarity. We discuss potential applications, which can be roughly divided into distributional or language modelling applications and semantic applications, and methods of evaluation (Chapter 2). We look at existing measures of distributional similarity and carry out an empirical comparison of fifteen of these measures, paying particular attention to the effects of word frequency (Chapter 3). We propose a new general framework for distributional similarity based on the context of lexical substitutability, which we measure using the IR concepts of precision and recall. This framework allows us to investigate the key factors in similarity of asymmetry, the relative influence of different contexts and the extent to which words share a context (Chapter 4). Finally, we consider the application of distributional similarity in language modelling (Chapter 5) and as a predictor of semantic similarity using human judgements of similarity and a spelling correction task (Chapter 6).
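The idea of measuring similarity with precision and recall can be illustrated with a minimal sketch. This is not the framework defined in the thesis: it assumes each word is represented by a plain, unweighted set of contexts, and the function name `context_precision_recall` and the toy context sets are hypothetical. It only shows why such a measure is asymmetric.

```python
def context_precision_recall(contexts_a, contexts_b):
    """Set-based precision and recall of word A's contexts against word B's.

    Assumption: contexts are unweighted sets; a real measure would
    typically weight each context, e.g. by frequency or association.
    """
    a, b = set(contexts_a), set(contexts_b)
    shared = a & b
    # Precision: share of A's contexts that B also occurs in.
    precision = len(shared) / len(a) if a else 0.0
    # Recall: share of B's contexts that A also occurs in.
    recall = len(shared) / len(b) if b else 0.0
    return precision, recall


# Hypothetical toy data: a specific word and a more general one.
dog = {"obj_of:feed", "obj_of:walk", "subj_of:bark", "mod:big"}
animal = {
    "obj_of:feed", "mod:big", "subj_of:eat", "subj_of:sleep",
    "obj_of:see", "mod:wild", "obj_of:protect", "mod:small",
}

print(context_precision_recall(dog, animal))   # (0.5, 0.25)
print(context_precision_recall(animal, dog))   # (0.25, 0.5)
```

Swapping the arguments swaps precision and recall, so the two directions of comparison can be scored differently.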
Corpus-Based Thesaurus Construction for Image Retrieval in Specialist Domains
- In Proceedings of the 25th European Conference on Advances in Information Retrieval (ECIR), 2003
Cited by 13 (4 self)
This paper explores the use of texts that are related to an image collection, also known as collateral texts, for building thesauri in specialist domains to aid in image retrieval. Corpus linguistic and information extraction methods are used for identifying key terms and conceptual relationships in specialist texts that may be used for query expansion purposes. The specialist domain context imposes certain constraints on the language used in the texts, which makes the texts computationally more tractable. The effectiveness of such an approach is demonstrated through a prototype system that has been developed for the storage and retrieval of images and texts, applied in the forensic science domain.
Ensemble Methods for Automatic Thesaurus Extraction
- In Proc. Conference on Empirical Methods in Natural Language Processing, 2002
Cited by 10 (0 self)
Ensemble methods are state of the art for many NLP tasks. Recent work by Banko and Brill (2001) suggests that this would not necessarily be true if very large training corpora were available. However, their results are limited by the simplicity of their evaluation task and individual classifiers. Our work
Automatic acquisition of lexico-semantic knowledge for question answering
- In Proceedings of Ontolex 2005 – Ontologies and Lexical Resources, Jeju Island, South Korea, 2005
Cited by 8 (3 self)
We present an experiment for finding semantically similar words on the basis of a parsed corpus of Dutch text and show that the acquired information correlates with relations found in Dutch EuroWordNet. Next, we demonstrate how the acquired knowledge can be used to boost the performance of an open-domain question answering system for Dutch. Automatically acquired lexico-semantic information is used to improve the recall of a method for extracting function relations (such as Wim Kok is the prime minister of the Netherlands) from corpora, and to improve the precision of our QA system on general WH-questions and definition questions.
Finding Synonyms Using Automatic Word Alignment and Measures of Distributional Similarity
Cited by 7 (0 self)
There have been many proposals to extract semantically related words using measures of distributional similarity, but these typically are not able to distinguish between synonyms and other types of semantically related words such as antonyms, (co)hyponyms and hypernyms. We present a method based on automatic word alignment of parallel corpora consisting of documents translated into multiple languages and compare our method with a monolingual syntax-based method. The approach that uses aligned multilingual data to extract synonyms shows much higher precision and recall scores for the task of synonym extraction than the monolingual syntax-based approach.
Approximate Searching for Distributional Similarity
- 2005
Cited by 2 (1 self)
Distributional similarity requires large volumes of data to accurately represent infrequent words. However, the nearest-neighbour approach to finding synonyms suffers from poor scalability. The Spatial Approximation Sample Hierarchy (SASH), proposed by Houle (2003b), is a data structure for approximate nearest-neighbour queries that balances the efficiency/approximation trade-off. We have integrated this into an existing distributional similarity system, tripling efficiency with a minor accuracy penalty.
Syntactic contexts for finding semantically related words
- In CLIN
Cited by 1 (0 self)
Finding semantically related words is a first step in the direction of automatic ontology building. Guided by the view that similar words occur in similar contexts, we looked at the syntactic context of words to measure their semantic similarity. Words that occur in a direct object relation with the verb drink, for instance, have something in common (liquidity, ...). Co-occurrence data for common nouns and proper names, for several syntactic relations, was collected from an automatically parsed corpus of 78 million words of newspaper text. We used several vector-based methods to compute the distributional similarity between words. Using Dutch EuroWordNet as evaluation standard, we investigated which vector-based method and which combination of syntactic relations is the strongest predictor of semantic similarity.
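As a rough illustration of the general approach described in this abstract, the sketch below builds co-occurrence vectors from (word, relation, head) triples and compares them with cosine similarity. It is not the paper's method: cosine over raw counts is just one common vector-based measure, chosen here as an assumption, and the triples, relation label and function names are hypothetical.

```python
from collections import Counter, defaultdict
from math import sqrt


def build_vectors(triples):
    """Map each word to a count vector over its (relation, head) contexts."""
    vectors = defaultdict(Counter)
    for word, relation, head in triples:
        vectors[word][(relation, head)] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = sqrt(sum(c * c for c in u.values()))
    norm_v = sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical parser output: nouns seen as the direct object of a verb.
triples = [
    ("beer", "obj", "drink"),
    ("beer", "obj", "brew"),
    ("water", "obj", "drink"),
    ("water", "obj", "boil"),
    ("book", "obj", "read"),
]
vectors = build_vectors(triples)

# "beer" and "water" share the context (obj, drink); "book" shares none.
print(cosine(vectors["beer"], vectors["water"]))
print(cosine(vectors["beer"], vectors["book"]))
```

Words that share the direct-object context of drink score above zero against each other, while a word with no shared contexts scores zero.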

