Results 1 - 10 of 28
Corpus-Based Stemming using Co-occurrence of Word Variants
- ACM Transactions on Information Systems, 1998
Cited by 76 (1 self)
Stemming is used in many information retrieval (IR) systems to reduce variant word forms to common roots. It is one of the simplest applications of natural language processing to IR, and one of the most effective in terms of user acceptance and consistent, though small, retrieval improvements. Current stemming techniques do not, however, reflect the language use in specific corpora and this can lead to occasional serious retrieval failures. We propose a technique for using corpus-based word variant co-occurrence statistics to modify or create a stemmer. The experimental results generated using English newspaper and legal text and Spanish text demonstrate the viability of this technique and its advantages relative to conventional approaches.
Categories and Subject Descriptors: H.3.1. [Information Storage and Retrieval]: Content Analysis and Indexing -- indexing methods; linguistic processing; H.3.3. [Information Storage and Retrieval]: Information Search and Retrieval -- query f...
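As a rough illustration of the idea in this abstract, the sketch below scores pairs of word variants by how often they co-occur in the same text window and then splits a candidate stem class wherever the association is weak. It is a minimal sketch under stated assumptions, not the paper's method: the Dice coefficient, the threshold, and the function names `variant_cooccurrence` and `split_class` are all hypothetical choices made for this example.

```python
from collections import Counter
from itertools import combinations


def variant_cooccurrence(windows, candidate_class):
    """Score each pair of words in candidate_class with a Dice coefficient.

    windows: iterable of token lists (e.g. fixed-size text windows).
    The Dice coefficient is an assumption of this sketch, not the
    statistic used in the paper.
    """
    members = set(candidate_class)
    word_count = Counter()   # number of windows containing each word
    pair_count = Counter()   # number of windows containing both words
    for window in windows:
        present = sorted(members.intersection(window))
        word_count.update(present)
        pair_count.update(combinations(present, 2))
    scores = {}
    for a, b in combinations(sorted(members), 2):
        denom = word_count[a] + word_count[b]
        scores[(a, b)] = 2 * pair_count[(a, b)] / denom if denom else 0.0
    return scores


def split_class(candidate_class, scores, threshold):
    """Keep words together only if linked by a pair scoring >= threshold."""
    parent = {w: w for w in candidate_class}

    def find(w):
        # Union-find lookup with path halving.
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    for (a, b), score in scores.items():
        if score >= threshold:
            parent[find(a)] = find(b)
    groups = {}
    for w in candidate_class:
        groups.setdefault(find(w), set()).add(w)
    return sorted(sorted(g) for g in groups.values())


windows = [
    ["policy", "policies", "reform"],
    ["police", "officer"],
    ["policy", "policies"],
    ["police", "arrest"],
]
stem_class = ["policy", "policies", "police"]
scores = variant_cooccurrence(windows, stem_class)
print(split_class(stem_class, scores, threshold=0.5))
```

In this toy corpus "police" never shares a window with "policy" or "policies", so an aggressive stem class that conflates all three would be split into two.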
Viewing Stemming as Recall Enhancement
- In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1996
Cited by 71 (7 self)
Previous research on stemming has shown both positive and negative effects on retrieval performance. This paper describes an experiment in which several linguistic and non-linguistic stemmers are evaluated on a Dutch test collection. Experiments especially focus on the measurement of Recall. Results show that linguistic stemming restricted to inflection yields a significant improvement over full linguistic and non-linguistic stemming, both in average Precision and R-Recall. Best results are obtained with a linguistic stemmer which is enhanced with compound analysis. This version has a significantly better Recall than a system without stemming, without a significant deterioration of Precision.
1 Introduction
One of the techniques employed in Information Retrieval (IR) to improve performance is stemming of document and query terms. By reducing morphological variance of terms (e.g. mapping singular and plural forms of the same word on a single stem) researchers hope to improve the query-...
Improving Stemming for Arabic Information Retrieval: Light Stemming and Co-occurrence Analysis
- In SIGIR 2002, 2002
Cited by 48 (5 self)
Arabic, a highly inflected language, requires good stemming for effective information retrieval, yet no standard approach to stemming has emerged. We developed several light stemmers based on heuristics and a statistical stemmer based on co-occurrence for Arabic retrieval. We compared the retrieval effectiveness of our stemmers and of a morphological analyzer on the TREC-2001 data. The best light stemmer was more effective for cross-language retrieval than a morphological stemmer which tried to find the root for each word. A repartitioning process consisting of vowel removal followed by clustering using co-occurrence analysis produced stem classes which were better than no stemming or very light stemming, but still inferior to good light stemming or morphological analysis.
Stylistic Experiments For Information Retrieval
- 2000
Cited by 47 (8 self)
Information retrieval systems are built to handle texts as topical items: texts are tabulated by occurrence frequencies of content words in them, under the assumption that text topic is reasonably well modeled by content word occurrence. But texts have several interesting characteristics beyond topic. The experiments described in this text investigate stylistic variation. Roughly put, style is the difference between two ways of saying the same thing -- and systematic stylistic variation can be used to characterize the genre of documents. These experiments investigate if stylistic information is distinguishable using simple language engineering methods, and if in that case this type of information can be used to improve information retrieval systems.
Expansion of Multi-Word Terms for Indexing and Retrieval Using Morphology and Syntax
- In Proceedings of the 35th Annual Meeting of the ACL, 1997
Cited by 33 (7 self)
A system for the automatic production of controlled index terms is presented using linguistically-motivated techniques. This includes a finite-state part of speech tagger, a derivational morphological processor for analysis and generation, and a unification-based shallow-level parser using transformational rules over syntactic patterns. The contribution of this research is the successful combination of parsing over a seed term list coupled with derivational morphology to achieve greater coverage of multi-word terms for indexing and retrieval. Final results are evaluated for precision and recall, and implications for indexing and retrieval are discussed.
Highlights: Language- and domain-independent automatic indexing terms for abstracting
- Journal of the American Society for Information Science, 1995
Cited by 27 (0 self)
A method of drawing index terms from text is presented. The approach uses no stop list, stemmer, or other language- and domain-specific component, allowing operation in any language or domain with only trivial modification. The method uses n-gram counts, achieving a function similar to, but more general than, a stemmer. The generated index terms, which the author calls “highlights,” are suitable for identifying the topic for perusal and selection. An extension is also described and demonstrated which selects index terms to represent a subset of documents, distinguishing them from the corpus. Some experimental results are presented, showing operation in English, Spanish, German, Georgian, Russian, and Japanese.
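To see why n-gram counts can stand in for a stemmer, the sketch below counts overlapping character n-grams and lists those shared by two word forms. It only illustrates the counting step. The n-gram length of 4 and the helper names `char_ngrams` and `shared_ngrams` are assumptions of this example, and the paper's scoring of n-grams against a background corpus is not reproduced here.

```python
from collections import Counter


def char_ngrams(text, n=4):
    """Count overlapping character n-grams after lowercasing and
    collapsing runs of whitespace. No stop list or stemmer is used."""
    normalized = " ".join(text.lower().split())
    return Counter(
        normalized[i:i + n] for i in range(len(normalized) - n + 1)
    )


def shared_ngrams(a, b, n=4):
    """Return the sorted n-grams that occur in both strings."""
    return sorted(set(char_ngrams(a, n)) & set(char_ngrams(b, n)))


print(shared_ngrams("stemming", "stemmer"))
```

The two forms share the n-grams "stem" and "temm", so counts for related word forms accumulate on the same n-grams without any language-specific rule.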
Improving Precision in Information Retrieval for Swedish using Stemming
- 2001
Cited by 24 (3 self)
We will in this paper present an evaluation of how much stemming improves precision in information retrieval for Swedish texts. To perform this, we built an information retrieval tool with optional stemming and created a tagged corpus in Swedish.
NLP for Term Variant Extraction: Synergy between Morphology, Lexicon, and Syntax
- 1999
Cited by 22 (1 self)
We present a natural language processing (NLP) approach to automatic indexing over controlled vocabulary which accounts for term variation. The approach combines a part of speech tagger, a generator of morphologically related forms, and a shallow transformational parser. The system is applied to the French language; it is trained on newspaper articles and tested on scientific literature. Precision rate of indexing on term and variants is 97.2%. It is only slightly lower than indexing without accounting for term variation (99.7%). Recall rate of indexing on term and variants (93.4%) is much higher than recall of indexing on term occurrences only (72.4%). Conflation of term variants increases indexing coverage up to 30%. The system is a convincing example of the potential synergy between full-fledged morphological analysis and local syntactic analysis. Many details are provided on the implementation of the system. Illustrative examples of syntactic transformations for the French language are given together with the theoretical and empirical methods for their formulation.
Authors (from the running head): Christian Jacquemin and Evelyne Tzoukermann
Effective Use of Natural Language Processing Techniques for Automatic Conflation of Multi-Word Terms: The Role of Derivational Morphology, Part of Speech Tagging, and Shallow Parsing
- In Research and Development in Information Retrieval
Cited by 20 (3 self)
We present a corpus-based system to expand multi-word index terms using a part-of-speech tagger and a full-fledged derivational morphological system, combined with a shallow parser. The system has been applied to French. The unique contribution of the research is in using these linguistically based tools with safety filters in order to avoid the problems of degradation typically associated with derivational analysis and generation. The successful expansion and thus conflation of terms, increases indexing coverage up to 30% with precision of nearly 90% for correct identification of related terms. The fully implemented system is described with particular attention on the role of derivational morphology and phrasal relations. Results and evaluation are presented in terms of precision and recall, with an analysis and discussion of errors. This paper illustrates how natural language processing tools, when combined effectively for tasks to which they are especially suited, indicates the pote...
Dictionary-Based Cross-Language Information Retrieval: Problems, Methods, and Research Findings
- Information Retrieval, 2001
Cited by 20 (3 self)
This paper reviews literature on dictionary-based cross-language information retrieval (CLIR) and presents CLIR research done at the University of Tampere (UTA). The main problems associated with dictionary-based CLIR, as well as appropriate methods to deal with the problems are discussed. We will present the structured query model by Pirkola and report findings for four different language pairs concerning the effectiveness of query structuring. The architecture of our automatic query translation and construction system is presented.

