Expansion of Multi-Word Terms for Indexing and Retrieval Using Morphology and Syntax
In Proceedings of the 35th Annual Meeting of the ACL, 1997
Cited by 33 (7 self)
A system for the automatic production of controlled index terms is presented using linguistically-motivated techniques. This includes a finite-state part-of-speech tagger, a derivational morphological processor for analysis and generation, and a unification-based shallow-level parser using transformational rules over syntactic patterns. The contribution of this research is the successful combination of parsing over a seed term list coupled with derivational morphology to achieve greater coverage of multi-word terms for indexing and retrieval. Final results are evaluated for precision and recall, and implications for indexing and retrieval are discussed.
NLP for Term Variant Extraction: Synergy between Morphology, Lexicon, and Syntax
Christian Jacquemin and Evelyne Tzoukermann, 1999
Cited by 22 (1 self)
We present a natural language processing (NLP) approach to automatic indexing over controlled vocabulary which accounts for term variation. The approach combines a part-of-speech tagger, a generator of morphologically related forms, and a shallow transformational parser. The system is applied to the French language; it is trained on newspaper articles and tested on scientific literature. Precision rate of indexing on term and variants is 97.2%. It is only slightly lower than indexing without accounting for term variation (99.7%). Recall rate of indexing on term and variants (93.4%) is much higher than recall of indexing on term occurrences only (72.4%). Conflation of term variants increases indexing coverage up to 30%. The system is a convincing example of the potential synergy between full-fledged morphological analysis and local syntactic analysis. Many details are provided on the implementation of the system. Illustrative examples of syntactic transformations for the French language are given together with the theoretical and empirical methods for their formulation.
On-Line New Event Detection, Clustering, And Tracking
1999
Cited by 16 (0 self)
In this work, we discuss and evaluate solutions to text classification problems associated with the events that are reported in on-line sources of news. We present solutions to three related classification problems: new event detection, event clustering, and event tracking. The primary focus of this thesis is new event detection, where the goal is to identify news stories that have not previously been reported, in a stream of broadcast news comprising radio, television, and newswire. We present an algorithm for new event detection, and analyze the effects of incorporating domain properties into the classification algorithm. We explore a solution that models the temporal relationship between news stories, and investigate the use of proper noun phrase
Document Classification using Multiword Features
In Proceedings of the Seventh International Conference on Information and Knowledge Management (CIKM’98), (ACM, 1998
Cited by 12 (1 self)
We investigate the use of multiword query features to improve the effectiveness of text-retrieval systems that accept natural-language queries. A relevance feedback process is explained that expands an initial query with single and multiword features. The multiword features are modelled as a set of words appearing within windows of varying sizes. Our experimental results suggest that windows of larger span yield improvements in retrieval over windows of smaller span. This result gives rise to a query contraction process that prunes 25% of the features in an expanded query with no loss in retrieval effectiveness.
1 Introduction
The following work investigates the representation for queries used in text-based information retrieval systems. The query representation described has applications in document filtering, routing, and clustering in addition to website searching. Our primary focus is the use of query features that represent concepts expressible in natural language by multiple word...
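The abstract above models a multiword feature as a set of words that co-occur within a window of some span. A minimal sketch of that idea follows, assuming whitespace tokenization and a simple sliding window; the function name `window_matches` and the sample text are hypothetical and are not taken from the paper.

```python
from typing import List, Set


def window_matches(tokens: List[str], feature: Set[str], span: int) -> int:
    """Count sliding windows of `span` tokens that contain every word in `feature`."""
    if span < len(feature):
        # A window shorter than the feature can never hold all of its words.
        return 0
    count = 0
    # If the text is shorter than the span, examine it once as a single window.
    for start in range(max(len(tokens) - span + 1, 1)):
        window = set(tokens[start:start + span])
        if feature <= window:  # subset test: all feature words are present
            count += 1
    return count


tokens = "the retrieval of text and the indexing of text documents".split()
feature = {"text", "retrieval"}

narrow = window_matches(tokens, feature, span=2)  # words must be adjacent
wide = window_matches(tokens, feature, span=4)    # words may be further apart
print(narrow, wide)
```

In this toy text the two words never fall inside a 2-token window but do fall inside two of the 4-token windows, which illustrates why a wider span can match features that a narrower one misses.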
Conflation-based Comparison of Stemming Algorithms
In Proceedings of the Third Australian Document Computing Symposium, 1998
Cited by 10 (0 self)
In text database systems, query terms are stemmed to allow them to be conflated with variant forms of the same word. On the one hand, stemming allows the query mechanism to find documents that would otherwise not contain matches to the query terms; on the other hand, automatic stemming is prone to error, and can lead to retrieval of inappropriate documents. In this paper we investigate several stemming algorithms, measuring their ability to correctly conflate terms from a large text collection. We show that stemming is indeed worthwhile, but that each of the stemming algorithms we consider has distinct advantages and disadvantages; choice of stemming algorithm affects the behaviour of the retrieval mechanism.
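As an illustration of conflation by stemming, and of how it can go wrong, here is a minimal sketch. The suffix list and the `toy_stem` function are deliberately crude assumptions made for this example; they are not one of the stemming algorithms evaluated in the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set

# Toy suffix list, checked in order (an assumption for illustration only).
SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def conflate(words: Iterable[str]) -> Dict[str, Set[str]]:
    """Group words into conflation classes keyed by their stem."""
    classes: Dict[str, Set[str]] = defaultdict(set)
    for word in words:
        lowered = word.lower()
        classes[toy_stem(lowered)].add(lowered)
    return dict(classes)


classes = conflate(["index", "indexes", "indexing", "indexed", "news", "new"])
for stem, members in sorted(classes.items()):
    print(stem, sorted(members))
```

The class for "index" groups genuine variants of one word, while "news" and "new" are merged incorrectly, the kind of conflation error the abstract warns can lead to retrieval of inappropriate documents.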
Probabilistic Term Variant Generator for Biomedical Terms
2003
Cited by 7 (0 self)
This paper presents an algorithm to generate possible variants for biomedical terms. The algorithm gives each variant its generation probability representing its plausibility, which is potentially useful for query and dictionary expansions. The probabilistic rules for generating variants are automatically learned from raw texts using an existing abbreviation extraction technique. Our method, therefore, requires no linguistic knowledge or labor-intensive natural language resource. We conducted an experiment using 83,142 MEDLINE abstracts for rule induction and 18,930 abstracts for testing. The results indicate that our method will significantly increase the number of retrieved documents for long biomedical terms.
Automatic Indexing: An Approach Using an Index Term Corpus and Combining Linguistic and Statistical Methods
2000
Cited by 6 (0 self)
This thesis discusses the problems and the methods of finding relevant information in large collections of documents. The contribution of this thesis to this problem is to develop better content analysis methods which can be used to describe document content with index terms. Index terms can be used as meta-information that describes documents, and that is used for seeking information. The main point of this thesis is to illustrate the process of developing an automatic indexer which analyses the content of documents by combining evidence from word frequencies and evidence from linguistic analysis provided by a syntactic parser. The indexer weights the expressions of a text according to their estimated importance for describing the content of a given document on the basis of the content analysis. The typical linguistic features of index terms were explored using a linguistically analysed text collection where the index terms are manually marked up. This text collection is referred to as an index term corpus. Specific features of the index terms provided the basis for a linguistic term-weighting scheme, which was then combined with a frequency-based term-weighting scheme. The use of an index term corpus like this as training material is a new method of developing an automatic indexer. The results of the experiments were promising.
Biomedical Text Retrieval in Languages with a Complex Morphology
Proceedings of the Workshop on Natural Language Processing in the Biomedical Domain, 2002
Cited by 5 (0 self)
Document retrieval in languages with a rich and complex morphology -- particularly in terms of derivation and (single-word) composition -- suffers from serious performance degradation with the stemming-only query-term-to-text-word matching paradigm. We propose an alternative approach in which morphologically complex word forms are segmented into relevant subwords (such as stems, named entities, acronyms), and subwords constitute the basic unit for indexing and retrieval. We evaluate our approach on a large biomedical document collection.
Automatic Multilingual Indexing and Natural Language Processing
Cited by 1 (0 self)
The number of documents being collected by information brokers such as bibliographic database producers, libraries and publishers increases rapidly. The consequence is a huge demand for indexing and classification. So far this has had to be carried out manually. The system AUTINDEX, which is described in this paper, offers tools for monolingual as well as multilingual automatic indexing and classification by taking advantage of sophisticated language processing technologies and already existing special-purpose language resources such as thesauri, classification schemes and large lexicons. It will be shown that the use of high-quality NLP can achieve appropriate results.
Phonetic Models for Generating Spelling Variants
Cited by 1 (0 self)
have several different spellings when transliterated from a non-English source language into English. Knowing the different variations can significantly improve the results of name-searches on various source texts, especially when recall is important. In this paper we propose two novel phonetic models to generate numerous candidate variant spellings of a name. Our methods show threefold improvement over the baseline and generate four times as many good name variants compared to a human while

