Results 1–10 of 15
Probabilistic Models in Information Retrieval
The Computer Journal, 1992
Abstract

Cited by 104 (4 self)
In this paper, an introduction to and survey of probabilistic information retrieval (IR) is given. First, the basic concepts of this approach are described: the probability ranking principle shows that optimum retrieval quality can be achieved under certain assumptions; a conceptual model for IR along with the corresponding event space clarifies the interpretation of the probabilistic parameters involved. For the estimation of these parameters, three different learning strategies are distinguished, namely query-related, document-related and description-related learning. As a representative of each of these strategies, a specific model is described. A new approach regards IR as uncertain inference; here, imaging is used as a new technique for estimating the probabilistic parameters, and probabilistic inference networks support more complex forms of inference. Finally, the more general problems of parameter estimation, query expansion and the development of models for advanced document representations are discussed.
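The probability ranking principle mentioned in this abstract can be illustrated with a minimal binary-independence-style ranking sketch. The term probabilities `p` and `q` below are hypothetical estimates chosen for illustration, not values from the survey:

```python
import math

def rsv(query_terms, doc_terms, p, q):
    """Retrieval status value: sum of log-odds weights for query terms
    present in the document, as in the binary independence model.
    p[t] ~ P(t occurs | relevant), q[t] ~ P(t occurs | non-relevant)."""
    score = 0.0
    for t in query_terms:
        if t in doc_terms and t in p and t in q:
            score += math.log((p[t] * (1 - q[t])) / (q[t] * (1 - p[t])))
    return score

# Hypothetical parameter estimates for two query terms.
p = {"probabilistic": 0.6, "retrieval": 0.8}
q = {"probabilistic": 0.2, "retrieval": 0.3}
query = ["probabilistic", "retrieval"]

doc_a = {"probabilistic", "retrieval", "models"}  # matches both query terms
doc_b = {"retrieval", "systems"}                  # matches one query term
```

Ranking documents by descending RSV then realizes the probability ranking principle under the model's independence assumptions: here `doc_a` outranks `doc_b`.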
Evaluating Text Categorization
In Proceedings of the Speech and Natural Language Workshop, 1991
Abstract

Cited by 94 (7 self)
While certain standard procedures are widely used for evaluating text retrieval systems and algorithms, the same is not true for text categorization. Omission of important data from reports is common, and methods of measuring effectiveness vary widely. This has made judging the relative merits of techniques for text categorization difficult and has disguised important research issues. In this paper I discuss a variety of ways of evaluating the effectiveness of text categorization systems, drawing both on reported categorization experiments and on methods used in evaluating query-driven retrieval. I also consider the extent to which the same evaluation methods may be used with systems for text extraction, a more complex task. In evaluating either kind of system, the purpose for which the output is to be used is crucial in choosing appropriate evaluation methods.

INTRODUCTION. Text classification systems, i.e. systems which can make distinctions between meaningful classes of texts, have ...
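The retrieval-derived effectiveness measures this paper discusses can be made concrete with a small sketch: per-category contingency counts reduced to precision, recall and F1, with micro-averaging pooling counts across categories. All counts below are invented for illustration:

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from one category's contingency counts
    (true positives, false positives, false negatives)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def micro_average(tables):
    """Pool (tp, fp, fn) counts over all categories, then compute P/R/F1.
    Micro-averaging weights categories by how often they occur."""
    tp = sum(t[0] for t in tables)
    fp = sum(t[1] for t in tables)
    fn = sum(t[2] for t in tables)
    return prf(tp, fp, fn)
```

Note how a category with no correct assignments, e.g. `(0, 0, 4)`, drags the micro-averaged recall down while leaving pooled precision untouched; reporting only one of these numbers is exactly the kind of omission the paper criticizes.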
Models for retrieval with probabilistic indexing
Information Processing and Management, 1989
Abstract

Cited by 86 (14 self)
In this article three retrieval models for probabilistic indexing are described, along with evaluation results for each. First is the binary independence indexing (BII) model, which is a generalized version of the Maron and Kuhns indexing model. In this model, the indexing weight of a descriptor in a document is an estimate of the probability of relevance of this document with respect to queries using this descriptor. Second is the retrieval-with-probabilistic-indexing (RPI) model, which is suited to different kinds of probabilistic indexing. For that we assume that each indexing scheme has its own concept of “correctness” to which the probabilities relate. In addition to the probabilistic indexing weights, the RPI model provides the possibility of relevance weighting of search terms. A third, similar model was proposed by Croft some years ago as an extension of the binary independence retrieval model, but it can be shown that this model is not based on the probability ranking principle. The probabilistic indexing weights required for any of these models can be provided by an application of the Darmstadt indexing approach (DIA) for indexing with descriptors from a controlled vocabulary. The experimental results show significant improvements over retrieval with binary indexing. Finally, suggestions are made regarding how the DIA can be applied to probabilistic indexing with free-text terms.
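The interplay of the two weight types the abstract describes can be sketched as follows. This is a simplified linear combination for illustration, not the exact RPI ranking formula from the paper, and all weights are invented:

```python
def rpi_like_score(relevance_weights, indexing_weights):
    """Hedged sketch: combine probabilistic indexing weights of a
    document's descriptors with relevance weights of the query's search
    terms. A simplified linear form, NOT the exact RPI formula."""
    return sum(c * indexing_weights.get(t, 0.0)
               for t, c in relevance_weights.items())

# Hypothetical query relevance weights and per-document indexing weights.
query_w = {"superconductivity": 1.5, "magnet": 0.8}
doc1 = {"superconductivity": 0.9, "magnet": 0.4}   # strongly indexed
doc2 = {"superconductivity": 0.2}                  # weakly indexed
```

The point of the sketch is the separation of concerns: indexing weights are properties of descriptor-document pairs (estimated at indexing time, e.g. by the DIA), while relevance weights are properties of the query's search terms.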
"Is This Document Relevant? ...Probably": A Survey of Probabilistic Models in Information Retrieval
2001
Abstract

Cited by 63 (14 self)
This article surveys probabilistic approaches to modeling information retrieval. The basic concepts of probabilistic approaches to information retrieval are outlined, and the principles and assumptions upon which the approaches are based are presented. The various models proposed in the development of IR are described, classified, and compared using a common formalism. New approaches that constitute the basis of future research are described.
Automatic Indexing Based on Bayesian Inference Networks
In Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '93), 1993
Abstract

Cited by 54 (0 self)
In this paper, a Bayesian inference network model for automatic indexing with index terms (descriptors) from a prescribed vocabulary is presented. It requires an indexing dictionary with rules mapping terms of the respective subject field onto descriptors, and inverted lists for terms occurring in a set of documents of the subject field and descriptors manually assigned to these documents. The indexing dictionary can be derived automatically from a set of manually indexed documents. An application of the network model is described, followed by an indexing example and some experimental results about the indexing performance of the network model.
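One common way an inference network combines several term-to-descriptor rules firing for the same descriptor is a noisy-OR link matrix. The sketch below uses that combination with made-up rule probabilities; the abstract does not specify this exact form, so treat it as an assumption:

```python
def descriptor_belief(rule_probs, terms_present):
    """Noisy-OR combination of evidence from dictionary rules.
    rule_probs: term -> P(descriptor applies | term observed), hypothetical.
    Each observed term independently 'fails to activate' the descriptor
    with probability (1 - p); the descriptor fires unless all fail."""
    miss = 1.0
    for term, p in rule_probs.items():
        if term in terms_present:
            miss *= (1.0 - p)
    return 1.0 - miss

# Hypothetical dictionary rules for one physics descriptor.
rules = {"quark": 0.9, "lepton": 0.5}
```

Noisy-OR has the property the indexing task needs: extra supporting terms only increase the belief in a descriptor, and a single strong rule suffices.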
AIR/X: a Rule-Based Multistage Indexing System for Large Subject Fields
In Proceedings of RIAO '91, 1991
Abstract

Cited by 51 (5 self)
AIR/X is a rule-based system for indexing with terms (descriptors) from a prescribed vocabulary. For this task, an indexing dictionary with rules for mapping terms from the text onto descriptors is required, which can be derived automatically from a set of manually indexed documents. Based on the Darmstadt Indexing Approach, the indexing task is divided into a description step and a decision step. First, terms (single words or phrases) are identified in the document text. With term-descriptor rules from the dictionary, descriptor indications are formed. The set of all indications from a document leading to the same descriptor is called a relevance description. A probabilistic classification procedure computes indexing weights for each relevance description. Since the whole system is rule-based, it can be adapted to different subject fields by appropriate modifications of the rule bases. A major application of AIR/X is the AIR/PHYS system developed for a large physics database. This application is described in more detail along with experimental results.
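The description step above can be sketched as grouping rule firings by descriptor. The rule tuples and strengths below are invented, and the simple averaging stands in for the probabilistic classification procedure the paper actually uses:

```python
from collections import defaultdict

def relevance_descriptions(doc_terms, rules):
    """Group all indications leading to the same descriptor.
    rules: iterable of (term, descriptor, indication_strength) tuples."""
    rd = defaultdict(list)
    for term, descriptor, strength in rules:
        if term in doc_terms:
            rd[descriptor].append(strength)
    return dict(rd)

def indexing_weight(indications):
    """Placeholder for the probabilistic classification procedure:
    here simply the mean indication strength."""
    return sum(indications) / len(indications)

# Hypothetical term-descriptor rules from the indexing dictionary.
rules = [
    ("x-ray", "RADIATION", 0.7),
    ("laser", "RADIATION", 0.6),
    ("laser", "OPTICS", 0.8),
    ("ion", "PLASMA", 0.5),
]
rd = relevance_descriptions({"x-ray", "laser"}, rules)
```

Adapting the system to a new subject field then means swapping the `rules` list (and retraining the classifier), which is the modularity the abstract emphasizes.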
Probabilistic Information Retrieval as Combination of Abstraction, Inductive Learning and Probabilistic Assumptions
1994
Abstract

Cited by 24 (1 self)
We show that previous approaches in probabilistic information retrieval are based on one or two of the three concepts of abstraction, inductive learning and probabilistic assumptions, and we propose a new approach which combines all three concepts. This approach is illustrated for the case of indexing with a controlled ...
A Statistical Learning Model of Text Classification for Support Vector Machines
In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2001
Abstract

Cited by 23 (0 self)
This paper develops a theoretical learning model of text classification for Support Vector Machines (SVMs). It connects the statistical properties of text-classification tasks with the generalization performance of an SVM in a quantitative way. Unlike conventional approaches to learning text classifiers, which rely primarily on empirical evidence, this model explains why and when SVMs perform well for text classification. In particular, it addresses the following questions: Why can support vector machines handle the large feature spaces in text classification effectively? How is this related to the statistical properties of text? What are sufficient conditions for applying SVMs to text-classification problems successfully?
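That linear SVMs cope well with sparse, high-dimensional text features can be illustrated with a self-contained Pegasos-style subgradient trainer on sparse term vectors. The toy documents, feature indices and hyperparameters below are invented; a real experiment would use a library implementation:

```python
import random

def train_linear_svm(X, y, dim, lam=0.01, epochs=200, seed=0):
    """Pegasos-style stochastic subgradient descent for a linear SVM.
    X: sparse feature dicts {feature_index: value}, y: labels in {-1, +1}."""
    w = [0.0] * dim
    t = 0
    rng = random.Random(seed)
    idx = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(idx)
        for i in idx:
            t += 1
            eta = 1.0 / (lam * t)
            margin = y[i] * sum(w[j] * v for j, v in X[i].items())
            w = [(1.0 - eta * lam) * wj for wj in w]  # regularization shrink
            if margin < 1.0:                          # hinge-loss subgradient
                for j, v in X[i].items():
                    w[j] += eta * y[i] * v
    return w

def predict(w, x):
    return 1 if sum(w[j] * v for j, v in x.items()) >= 0.0 else -1

# Toy 'documents': feature 0 marks class +1, feature 1 marks class -1,
# feature 2 is an uninformative term appearing in both classes.
X = [{0: 1.0}, {0: 1.0, 2: 1.0}, {1: 1.0}, {1: 1.0, 2: 1.0}]
y = [1, 1, -1, -1]
w = train_linear_svm(X, y, dim=3)
```

Because both the documents and the weight updates are sparse, the cost per step depends on the number of non-zero features, not on the vocabulary size, which is one intuition behind the paper's question about large feature spaces.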
Combining Model-Oriented and Description-Oriented Approaches for Probabilistic Indexing
Abstract

Cited by 12 (7 self)
We distinguish model-oriented and description-oriented approaches in probabilistic information retrieval. The former refer to certain representations of documents and queries and use additional independence assumptions, whereas the latter map documents and queries onto feature vectors which form the input to certain classification procedures or regression methods. Description-oriented approaches are more flexible with respect to the underlying representations, but the definition of the feature vector is a heuristic step. In this paper, we combine a probabilistic model for the Darmstadt Indexing Approach with logistic regression. Here the probabilistic model forms a guideline for the definition of the feature vector. Experiments with the purely theoretical approach and with several heuristic variations show that heuristic assumptions may yield significant improvements.
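The description-oriented half of this combination can be sketched as a logistic regression over a feature vector whose components the probabilistic model suggests. The feature definitions and coefficients below are invented placeholders; in the paper the coefficients are fitted from training data:

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def indexing_weight(features, coef, intercept):
    """P(descriptor assignment is correct | feature vector x),
    modeled as logistic(b + w.x). Coefficients would be fitted by
    logistic regression on manually indexed training documents."""
    z = intercept + sum(c * f for c, f in zip(coef, features))
    return logistic(z)

# Hypothetical features: (rule strength, within-document term frequency).
coef = [2.0, 0.5]      # invented; would come from training
intercept = -1.5
strong = indexing_weight([0.9, 3.0], coef, intercept)
weak = indexing_weight([0.1, 1.0], coef, intercept)
```

The division of labor matches the abstract: the probabilistic model says *which* quantities belong in the feature vector, while the regression replaces hand-tuned independence assumptions with fitted coefficients.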
Domain-independent automatic keyphrase indexing with small training sets
Journal of the American Society for Information Science and Technology, 2008
Abstract

Cited by 9 (1 self)
Keyphrases are widely used in both physical and digital libraries as a brief but precise summary of documents. They help organize material based on content, provide thematic access, represent search results, and assist with navigation. Manual assignment is expensive, because trained human indexers must reach an understanding of the document and select appropriate descriptors according to defined cataloguing rules. We propose a new method that enhances automatic keyphrase extraction by using semantic information about terms and phrases gleaned from a domain-specific thesaurus. The key advantage of the new approach is that it performs well with very little training data. We evaluate it on a large set of manually indexed documents in the domain of agriculture, compare its consistency with a group of six professional indexers, and explore its performance on smaller collections of documents in other domains and on French and Spanish documents.
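A crude stand-in for the extraction step: score candidate phrases by TF-IDF with a boost for thesaurus membership. The boost factor and all data below are invented, and the actual method uses richer thesaurus-derived features and a trained model rather than a fixed multiplier:

```python
import math

def rank_candidates(doc_terms, doc_freq, n_docs, thesaurus, boost=2.0):
    """Rank candidate keyphrases by TF-IDF, boosting thesaurus terms.
    doc_terms: candidate occurrences in one document (with repeats);
    doc_freq: term -> number of collection documents containing it."""
    tf = {}
    for t in doc_terms:
        tf[t] = tf.get(t, 0) + 1
    scores = {}
    for t, f in tf.items():
        idf = math.log(n_docs / (1 + doc_freq.get(t, 0)))
        scores[t] = f * idf * (boost if t in thesaurus else 1.0)
    return sorted(scores, key=scores.get, reverse=True)

# Invented toy data: frequent stopword vs. two domain terms.
candidates = ["soil", "soil", "erosion", "the", "the", "the"]
doc_freq = {"soil": 2, "erosion": 3, "the": 90}
ranking = rank_candidates(candidates, doc_freq, n_docs=100,
                          thesaurus={"soil", "erosion"})
```

Because the thesaurus signal does not depend on the training corpus, this kind of feature is one plausible reason such a method can work with very small training sets.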