Results 1 - 10 of 101
Context-Sensitive Learning Methods for Text Categorization
- ACM Transactions on Information Systems, 1996
Cited by 213 (12 self)
In this article, we will investigate the performance of two recently implemented machine-learning algorithms on a number of large text categorization problems. The two algorithms considered are set-valued RIPPER, a recent rule-learning algorithm [Cohen ... An earlier version of this article appeared in Proceedings of the 19th Annual International ACM Conference on Research and Development in Information Retrieval (SIGIR), pp. 307--315.
Modern information retrieval: a brief overview
- Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 2001
Cited by 101 (0 self)
For thousands of years people have realized the importance of archiving and finding information. With the advent of computers, it became possible to store large amounts of information, and finding useful information in such collections became a necessity. The field of Information Retrieval (IR) was born in the 1950s out of this necessity. Over the last forty years, the field has matured considerably. Several IR systems are used on an everyday basis by a wide variety of users. This article is a brief overview of the key advances in the field of Information Retrieval and a description of where the state of the art in the field currently stands.
Corpus-Based Stemming using Co-occurrence of Word Variants
- ACM Transactions on Information Systems, 1998
Cited by 76 (1 self)
Stemming is used in many information retrieval (IR) systems to reduce variant word forms to common roots. It is one of the simplest applications of natural language processing to IR, and one of the most effective in terms of user acceptance and consistent, though small, retrieval improvements. Current stemming techniques do not, however, reflect the language use in specific corpora and this can lead to occasional serious retrieval failures. We propose a technique for using corpus-based word variant co-occurrence statistics to modify or create a stemmer. The experimental results generated using English newspaper and legal text and Spanish text demonstrate the viability of this technique and its advantages relative to conventional approaches. Categories and Subject Descriptors: H.3.1. [Information Storage and Retrieval]: Content Analysis and Indexing -- indexing methods; linguistic processing; H.3.3. [Information Storage and Retrieval]: Information Search and Retrieval -- query f...
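A minimal sketch, in Python, of the general idea of scoring word variants by how often they co-occur. The Dice-style ratio, the fixed text windows, and the names `cooccurrence_scores`, `windows`, and `stem_class` are assumptions made for illustration; the abstract does not say which statistic the authors use.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_scores(windows, stem_class):
    """Score each pair of variants in one stem class by shared text windows.

    Uses a Dice-style ratio 2 * n_ab / (n_a + n_b). This is an illustrative
    choice, not necessarily the statistic used in the paper.
    """
    members = set(stem_class)
    occurrences = Counter()  # n_a: number of windows containing variant a
    joint = Counter()        # n_ab: number of windows containing both a and b
    for window in windows:
        present = sorted(members & set(window))
        occurrences.update(present)
        joint.update(combinations(present, 2))

    scores = {}
    for a, b in combinations(sorted(members), 2):
        total = occurrences[a] + occurrences[b]
        scores[(a, b)] = 2 * joint[(a, b)] / total if total else 0.0
    return scores


# Hypothetical toy corpus, already split into token windows.
windows = [
    ["the", "policy", "covers", "policies", "issued"],
    ["police", "arrested", "the", "suspect"],
    ["insurance", "policy", "and", "other", "policies"],
    ["police", "officers", "policing", "the", "area"],
]
scores = cooccurrence_scores(windows, ["policy", "policies", "police"])
for pair, score in scores.items():
    print(pair, round(score, 2))
```

In this toy data "policy" and "policies" always share a window while "police" never appears with either, so a threshold on the score would split "police" out of the class that a purely suffix-based stemmer might produce.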
Viewing Stemming as Recall Enhancement
- In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1996
Cited by 71 (7 self)
Previous research on stemming has shown both positive and negative effects on retrieval performance. This paper describes an experiment in which several linguistic and non-linguistic stemmers are evaluated on a Dutch test collection. Experiments especially focus on the measurement of Recall. Results show that linguistic stemming restricted to inflection yields a significant improvement over full linguistic and non-linguistic stemming, both in average Precision and R-Recall. Best results are obtained with a linguistic stemmer which is enhanced with compound analysis. This version has a significantly better Recall than a system without stemming, without a significant deterioration of Precision.
1 Introduction
One of the techniques employed in Information Retrieval (IR) to improve performance is stemming of document and query terms. By reducing morphological variance of terms (e.g. mapping singular and plural forms of the same word on a single stem) researchers hope to improve the query-...
Improving Stemming for Arabic Information Retrieval: Light Stemming and Co-occurrence Analysis
- In SIGIR 2002, 2002
Cited by 48 (5 self)
Arabic, a highly inflected language, requires good stemming for effective information retrieval, yet no standard approach to stemming has emerged. We developed several light stemmers based on heuristics and a statistical stemmer based on co-occurrence for Arabic retrieval. We compared the retrieval effectiveness of our stemmers and of a morphological analyzer on the TREC-2001 data. The best light stemmer was more effective for cross-language retrieval than a morphological stemmer which tried to find the root for each word. A repartitioning process consisting of vowel removal followed by clustering using co-occurrence analysis produced stem classes which were better than no stemming or very light stemming, but still inferior to good light stemming or morphological analysis.
Text mining at the term level
- Proc. of the 2nd European Symposium on Principles of Data Mining and Knowledge Discovery (PKDD'98), 1998
Cited by 40 (0 self)
Knowledge Discovery in Databases (KDD) focuses on the computerized exploration of large amounts of data and on the discovery of interesting patterns within them. While most work on KDD has been concerned with structured databases, there has been little work on handling the huge amount of information that is available only in unstructured textual form. Previous work in text mining has focused on the word or the tag level. This paper presents an approach to performing text mining at the term level. The mining process starts by preprocessing the document collection and extracting terms from the documents. Each document is then represented by a set of terms and annotations characterizing the document. Terms and additional higher-level entities are then organized in a hierarchical taxonomy. In this paper we will describe the Term Extraction module of the Document Explorer system, and provide an experimental evaluation performed on a set of 52,000 documents published by Reuters in the years 1995-1996.
Learning Random Walk Models for Inducing Word Dependency Distributions
- In ICML, 2004
Cited by 39 (0 self)
Many NLP tasks rely on accurately estimating word dependency probabilities $P(w_1 \mid w_2)$, where the words $w_1$ and $w_2$ have a particular relationship (such as verb-object). Because of the sparseness of counts of such dependencies, smoothing and the ability to use multiple sources of knowledge are important challenges. For example, if the probability $P(N \mid V)$ of noun $N$ being the subject of verb $V$ is high, and $V$ takes similar objects to $V'$, and $V'$ is synonymous to $V''$, then we want to conclude that $P(N \mid V'')$ should also be reasonably high, even when those words did not cooccur in the training data. To capture
Guessing Morphology from Terms and Corpora
- Proceedings of SIGIR 97
Cited by 38 (2 self)
This study proposes an algorithm for automatically acquiring morphological links between words. This algorithm relies on the concurrent use of a corpus and a list of multi-word terms, and does not require any prior linguistic knowledge. The four steps of the algorithm are (1) single-word truncation, (2) conflation of multi-word terms, (3) classification and filtering, and (4) clustering of conflation classes. At each step a precise evaluation is performed in order to choose the optimal parameters. The final results indicate a clustering of 45% of the classes with a precision of 87%. The derivational knowledge acquired through this method can be used for conceiving a domain-oriented stemmer for scientific and technical corpora. In Proceedings, 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'97), Philadelphia, PA. 27-31 July 1997.
From frequency to meaning: Vector space models of semantics
- Journal of Artificial Intelligence Research, 2010
Cited by 34 (0 self)
Computers understand very little of the meaning of human language. This profoundly limits our ability to give instructions to computers, the ability of computers to explain their actions to us, and the ability of computers to analyse and process text. Vector space models (VSMs) of semantics are beginning to address these limits. This paper surveys the use of VSMs for semantic processing of text. We organize the literature on VSMs according to the structure of the matrix in a VSM. There are currently three broad classes of VSMs, based on term–document, word–context, and pair–pattern matrices, yielding three classes of applications. We survey a broad range of applications in these three categories and we take a detailed look at a specific open source project in each category. Our goal in this survey is to show the breadth of applications of VSMs for semantics, to provide a new perspective on VSMs for those who are already familiar with the area, and to provide pointers into the literature for those who are less familiar with the field.
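As a generic illustration of the first matrix class named in this abstract, the sketch below builds a small term–document count matrix and compares documents by cosine similarity. The toy documents and the helper names `term_document_matrix`, `column`, and `cosine` are assumptions for illustration and are not taken from the survey.

```python
import math
from collections import Counter


def term_document_matrix(docs):
    """Return (vocabulary, matrix); rows are terms, columns are documents."""
    vocab = sorted({term for doc in docs for term in doc})
    counts = [Counter(doc) for doc in docs]
    matrix = [[c[term] for c in counts] for term in vocab]
    return vocab, matrix


def column(matrix, j):
    """Extract the vector for document j (one column of the matrix)."""
    return [row[j] for row in matrix]


def cosine(u, v):
    """Cosine of the angle between two count vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical, pre-tokenized toy documents.
docs = [
    ["stemming", "improves", "recall"],
    ["stemming", "improves", "retrieval", "recall"],
    ["vector", "space", "models"],
]
vocab, matrix = term_document_matrix(docs)
sim_01 = cosine(column(matrix, 0), column(matrix, 1))
sim_02 = cosine(column(matrix, 0), column(matrix, 2))
print(round(sim_01, 3), round(sim_02, 3))
```

Documents 0 and 1 share three of their terms and score high, while documents 0 and 2 share none and score zero. Raw counts are used here for brevity; weighting schemes such as tf-idf are a common refinement.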
The Impact of Query Structure and Query Expansion on Retrieval
- Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1998
Cited by 30 (8 self)
The effects of query structures and query expansion (QE) on retrieval performance were tested with a best match retrieval system (INQUERY). Query structure means the use of operators to express the relations between search keys. Eight different structures were tested, representing weak structures (averages and weighted averages of the weights of the keys) and strong structures (e.g., queries with more elaborate search key relations). QE was based on concepts, which were first selected from a conceptual model, and then expanded by semantic relationships given in the model. The expansion levels were (a) no expansion, (b) a synonym expansion, (c) a narrower concept expansion, (d) an associative concept expansion, and (e) a cumulative expansion of all other expansions. With weak structures and Boolean structured queries, QE was not very effective. The best performance was achieved with one of the strong structures at the largest expansion level.
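The "weak structures" in this abstract are described as averages and weighted averages of the key weights. A minimal Python sketch of those two combinations follows; the function names `average_score` and `weighted_average_score` and the example numbers are hypothetical, and this is not INQUERY's actual implementation.

```python
def average_score(key_weights):
    """Plain average of the per-key weights for one document."""
    return sum(key_weights) / len(key_weights) if key_weights else 0.0


def weighted_average_score(key_weights, importances):
    """Average of the per-key weights, each scaled by a query-side importance."""
    if len(key_weights) != len(importances):
        raise ValueError("one importance value is required per search key")
    total = sum(importances)
    if total == 0:
        return 0.0
    return sum(w * i for w, i in zip(key_weights, importances)) / total


# Hypothetical weights of three search keys in one document.
key_weights = [0.2, 0.4, 0.9]
plain = average_score(key_weights)
weighted = weighted_average_score(key_weights, [1, 1, 2])
print(round(plain, 3), round(weighted, 3))
```

Both combinations treat the search keys as a flat list, which is what makes the structure "weak": no operator records that two keys are synonyms or that a group of keys expresses one concept.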

