Results 1  10
of
42
Machine Learning in Automated Text Categorization
 ACM COMPUTING SURVEYS
, 2002
"... The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last ten years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this p ..."
Abstract

Cited by 1636 (22 self)
 Add to MetaCart
The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last ten years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (consisting in the manual definition of a classifier by domain experts) are a very good effectiveness, considerable savings in terms of expert labor power, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We will discuss in detail issues pertaining to three different problems, namely document representation, classifier construction, and classifier evaluation.
Some simple effective approximations to the 2Poisson model for probabilistic weighted retrieval
 In Proceedings of SIGIR’94
, 1994
"... The 2–Poisson model for term frequencies is used to suggest ways of incorporating certain variables in probabilistic models for information retrieval. The variables concerned are withindocument term frequency, document length, and withinquery term frequency. Simple weighting functions are develope ..."
Abstract

Cited by 453 (14 self)
 Add to MetaCart
(Show Context)
The 2–Poisson model for term frequencies is used to suggest ways of incorporating certain variables in probabilistic models for information retrieval. The variables concerned are withindocument term frequency, document length, and withinquery term frequency. Simple weighting functions are developed, and tested on the TREC test collection. Considerable performance improvements (over simple inverse collection frequency weighting) are demonstrated. 1
2001. Models for metasearch
 In Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’01
"... Given the ranked lists of documents returned by multiple search engines in response to a given query, the problem of metasearch is to combine these lists in a way which optimizes the performance of the combination. This paper makes three contributions to the problem of metasearch: (1) We describe an ..."
Abstract

Cited by 238 (7 self)
 Add to MetaCart
(Show Context)
Given the ranked lists of documents returned by multiple search engines in response to a given query, the problem of metasearch is to combine these lists in a way which optimizes the performance of the combination. This paper makes three contributions to the problem of metasearch: (1) We describe and investigate a metasearch model based on an optimal democratic voting procedure, the Borda Count; (2) we describe and investigate a metasearch model based on Bayesian inference; and (3) we describe and investigate a model for obtaining upper bounds on the performance of metasearch algorithms. Our experimental results show that metasearch algorithms based on the Borda and Bayesian models usually outperform the best input system and are competitive with, and often outperform, existing metasearch strategies. Finally, our initial upper bounds demonstrate that there is much to learn about the limits of the performance of metasearch. 1.
Probabilistic Models in Information Retrieval
 The Computer Journal
, 1992
"... In this paper, an introduction and survey over probabilistic information retrieval (IR) is given. First, the basic concepts of this approach are described: the probability ranking principle shows that optimum retrieval quality can be achieved under certain assumptions; a conceptual model for IR alon ..."
Abstract

Cited by 119 (4 self)
 Add to MetaCart
In this paper, an introduction and survey over probabilistic information retrieval (IR) is given. First, the basic concepts of this approach are described: the probability ranking principle shows that optimum retrieval quality can be achieved under certain assumptions; a conceptual model for IR along with the corresponding event space clarify the interpretation of the probabilistic parameters involved. For the estimation of these parameters, three different learning strategies are distinguished, namely queryrelated, documentrelated and descriptionrelated learning. As a representative for each of these strategies, a specific model is described. A new approach regards IR as uncertain inference; here, imaging is used as a new technique for estimating the probabilistic parameters, and probabilistic inference networks support more complex forms of inference. Finally, the more general problems of parameter estimation, query expansion and the development of models for advanced document representations are discussed.
Automatic Essay Grading Using Text Categorization Techniques
 In Proceedings of SIGIR98, 21st ACM International Conference on Research and Development in Information Retrieval
, 1998
"... The commas are the most useful and usable of all the stops. It is highly important to put them in place as you go along. If you try to come back after doing a paragraph and stick them in the various spots that tempt you you will discover that they tend to swarm like minnows into all sorts of crevice ..."
Abstract

Cited by 93 (5 self)
 Add to MetaCart
(Show Context)
The commas are the most useful and usable of all the stops. It is highly important to put them in place as you go along. If you try to come back after doing a paragraph and stick them in the various spots that tempt you you will discover that they tend to swarm like minnows into all sorts of crevices whose existence you hadnt realized and before you know it the whole long sentence becomes immobilized and lashed up squirming in commas. Better to use them sparingly, and with affection precisely when the need for one arises, nicely, by itself.
"Is This Document Relevant? ...Probably": A Survey of Probabilistic Models in Information Retrieval
, 2001
"... This article surveys probabilistic approaches to modeling information retrieval. The basic concepts of probabilistic approaches to information retrieval are outlined and the principles and assumptions upon which the approaches are based are presented. The various models proposed in the developmen ..."
Abstract

Cited by 70 (15 self)
 Add to MetaCart
This article surveys probabilistic approaches to modeling information retrieval. The basic concepts of probabilistic approaches to information retrieval are outlined and the principles and assumptions upon which the approaches are based are presented. The various models proposed in the development of IR are described, classified, and compared using a common formalism. New approaches that constitute the basis of future research are described
A Risk Minimization Framework for Information Retrieval
 IN PROCEEDINGS OF THE ACM SIGIR 2003 WORKSHOP ON MATHEMATICAL/FORMAL METHODS IN IR. ACM
, 2003
"... This paper presents a novel probabilistic information retrieval framework in which the retrieval problem is formally treated as a statistical decision problem. In this framework, queries and documents are modeled using statistical language models (i.e., probabilistic models of text), user preference ..."
Abstract

Cited by 63 (2 self)
 Add to MetaCart
(Show Context)
This paper presents a novel probabilistic information retrieval framework in which the retrieval problem is formally treated as a statistical decision problem. In this framework, queries and documents are modeled using statistical language models (i.e., probabilistic models of text), user preferences are modeled through loss functions, and retrieval is cast as a risk minimization problem. We discuss how this framework can unify existing retrieval models and accommodate the systematic development of new retrieval models. As an example of using the framework to model nontraditional retrieval problems, we derive new retrieval models for subtopic retrieval, which is concerned with retrieving documents to cover many different subtopics of a general query topic. These new models differ from traditional retrieval models in that they go beyond independent topical relevance.
A New Probabilistic Model of Text Classification and Retrieval
, 1996
"... This paper introduces the multinomial model of text classification and retrieval. One important feature of the model is that the tf statistic, which usually appears in probabilistic IR models as a heuristic, is an integral part of the model. Another is that the variable length of documents is accoun ..."
Abstract

Cited by 37 (0 self)
 Add to MetaCart
This paper introduces the multinomial model of text classification and retrieval. One important feature of the model is that the tf statistic, which usually appears in probabilistic IR models as a heuristic, is an integral part of the model. Another is that the variable length of documents is accounted for, without either making a uniform length assumption or using length normalization. The multinomial model employs independence assumptions which are similar to assumptions made in previous probabilistic models, particularly the binary independence model and the 2Poisson model. The use of simulation to study the model is described. Performance of the model is evaluated on the TREC3 routing task. Results are compared with the binary independence model and with the simulation studies.
A Statistical Learning Model of Text Classification for Support Vector Machines
 In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
, 2001
"... This paper develops a theoretical learning model of text classification for Support Vector Machines (SVMs). It connects the statistical properties of textclassification tasks with the generalization performance of a SVM in a quantitative way. Unlike conventional approaches to learning text classifi ..."
Abstract

Cited by 36 (0 self)
 Add to MetaCart
(Show Context)
This paper develops a theoretical learning model of text classification for Support Vector Machines (SVMs). It connects the statistical properties of textclassification tasks with the generalization performance of a SVM in a quantitative way. Unlike conventional approaches to learning text classifiers, which rely primarily on empirical evidence, this model explains why and when SVMs perform well for text classification. In particular, it addresses the following questions: Why can support vector machines handle the large feature spaces in text classification effectively? How is this related to the statistical properties of text? What are sufficient conditions for applying SVMs to textclassification problems successfully?