Results 1–10 of 105
Machine Learning in Automated Text Categorization
ACM Computing Surveys, 2002
Abstract

Cited by 1659 (22 self)
The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last ten years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (consisting of the manual definition of a classifier by domain experts) are very good effectiveness, considerable savings in terms of expert labor, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We discuss in detail issues pertaining to three different problems, namely document representation, classifier construction, and classifier evaluation.
A Study of Smoothing Methods for Language Models Applied to Ad Hoc Information Retrieval
Probabilistic Models for Information Retrieval based on Divergence from Randomness
ACM Transactions on Information Systems, 2002
Abstract

Cited by 232 (5 self)
We introduce a framework for deriving probabilistic models of information retrieval. The models are nonparametric models of IR obtained in the language-model approach. We derive term-weighting models by measuring the divergence of the actual term distribution from that obtained under a random process. Among the random processes we study the binomial distribution and Bose–Einstein statistics. We define two types of term-frequency normalization for tuning term weights in the document–query matching process. The first normalization assumes that documents have the same length and measures the information gain with the observed term once it has been accepted as a good descriptor of the observed document. The second normalization is related to the document length and to other statistics. These two normalization methods are applied to the basic models in succession to obtain weighting formulae. Results show that our framework produces different nonparametric models forming baseline alternatives to the standard tf-idf model.
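The divergence-from-randomness idea can be illustrated with a minimal sketch under a binomial random process: a term whose within-document frequency is unlikely under a random scattering of its collection occurrences is informative. The function name and simplifications below are illustrative, not the paper's exact formulae.

```python
import math

def binomial_divergence_weight(tf, F, N):
    """Informative content -log2 P(tf | binomial): the probability of
    seeing tf occurrences of a term in one document if its F collection
    occurrences were scattered at random over N documents (p = 1/N).
    The less probable the observed tf, the higher the weight."""
    p = 1.0 / N
    prob = math.comb(F, tf) * (p ** tf) * ((1 - p) ** (F - tf))
    return -math.log2(prob)
```

As expected, a term occurring five times in a document gets a much higher weight than one occurring once, since the former is far less likely under the random model.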
A Decision-Theoretic Approach to Database Selection in Networked IR
ACM Transactions on Information Systems, 1996
Abstract

Cited by 146 (16 self)
In this paper, we address the resource discovery issue, which consists of two subtasks, namely database detection and database selection. Database detection can be performed relatively easily, either by exploiting the naming conventions used in the domain name service of the Internet (e.g. names of FTP servers typically start with 'ftp.', names of Web servers with 'www.') or by establishing central registries (e.g. the directory-of-servers for WAIS systems).
Statistical Language Models for Information Retrieval
Tutorial Presentation at the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2006
Abstract

Cited by 118 (8 self)
Statistical language models have recently been successfully applied to many information retrieval problems. A great deal of recent work has shown that statistical language models not only lead to superior empirical performance, but also facilitate parameter tuning and open up possibilities for modeling nontraditional retrieval problems. In general, statistical language models provide a principled way of modeling various kinds of retrieval problems. The purpose of this survey is to systematically and critically review the existing work in applying statistical language models to information retrieval, summarize their contributions, and point out outstanding challenges.
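As a concrete example of the language-modeling approach this tutorial covers, here is a minimal query-likelihood scorer with Jelinek-Mercer smoothing. This is a toy sketch, not any specific paper's model; the function name and the fixed interpolation weight are illustrative.

```python
import math
from collections import Counter

def query_likelihood(query, doc, collection, lam=0.5):
    """Log probability that a smoothed unigram model of `doc` generates
    `query`. Jelinek-Mercer smoothing interpolates the document model
    with a collection-wide model so that query terms absent from the
    document do not zero out the score."""
    doc_tf, col_tf = Counter(doc), Counter(collection)
    score = 0.0
    for w in query:
        p_doc = doc_tf[w] / len(doc)
        p_col = col_tf[w] / len(collection)
        score += math.log(lam * p_doc + (1 - lam) * p_col)
    return score
```

The sketch assumes every query term occurs somewhere in the collection; documents containing the query terms then score strictly higher than those that rely on the collection model alone.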
A Formal Study of Information Retrieval Heuristics
SIGIR '04, 2004
Abstract

Cited by 102 (18 self)
Empirical studies of information retrieval methods show that good retrieval performance is closely related to the use of various retrieval heuristics, such as TF-IDF weighting. A basic research question is thus what exactly these "necessary" heuristics are that seem to cause good retrieval performance. In this paper, we present a formal study of retrieval heuristics. We formally define a set of basic desirable constraints that any reasonable retrieval function should satisfy, and check these constraints on a variety of representative retrieval functions. We find that none of these retrieval functions satisfies all the constraints unconditionally. Empirical results show that when a constraint is not satisfied, it often indicates non-optimality of the method, and when a constraint is satisfied only for a certain range of parameter values, performance tends to be poor when the parameter falls outside that range. In general, we find that the empirical performance of a retrieval formula is tightly related to how well it satisfies these constraints. Thus the proposed constraints provide a good explanation of many empirical observations and make it possible to evaluate any existing or new retrieval formula analytically.
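The kind of constraint checking described above can be sketched numerically against a simple TF-IDF function. The constraint statements here are simplified paraphrases, not the paper's formal definitions, and this particular TF-IDF formulation is just one of many.

```python
import math

def tfidf(tf, df, N):
    """A simple tf-idf weight: log-scaled term frequency times inverse
    document frequency (N documents, df containing the term)."""
    return (1 + math.log(tf)) * math.log(N / df) if tf > 0 else 0.0

# Check two intuitive constraints of the kind the paper formalizes:
#  - more occurrences of a query term should not lower the score;
#  - a rarer term (smaller df) should not get a smaller weight.
assert tfidf(3, 10, 1000) > tfidf(2, 10, 1000)   # tf monotonicity
assert tfidf(2, 10, 1000) > tfidf(2, 100, 1000)  # idf monotonicity
```

A real analysis would state such constraints symbolically and check them for all parameter values rather than at sample points, which is exactly what makes the formal approach stronger than spot checks like these.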
Linguistically motivated probabilistic model of information retrieval
In Proceedings of the European Conference on Digital Libraries, 1998
Abstract

Cited by 79 (16 self)
This paper presents a new probabilistic model of information retrieval. The most important modeling assumption made is that documents and queries are defined by an ordered sequence of single terms. This assumption is not made in well-known existing models of information retrieval, but is essential in the field of statistical natural language processing. Advances already made in statistical natural language processing are used in this paper to formulate a probabilistic justification for using tf×idf term weighting. The paper shows that the new probabilistic interpretation of tf×idf term weighting might lead to better understanding of statistical ranking mechanisms, for example by explaining how they relate to coordination-level ranking. A pilot experiment on the Cranfield test collection indicates that the presented model outperforms the vector space model with classical tf×idf and cosine length normalisation.
"Is This Document Relevant? ...Probably": A Survey of Probabilistic Models in Information Retrieval
, 2001
Abstract

Cited by 71 (15 self)
This article surveys probabilistic approaches to modeling information retrieval. The basic concepts of probabilistic approaches to information retrieval are outlined, and the principles and assumptions upon which the approaches are based are presented. The various models proposed in the development of IR are described, classified, and compared using a common formalism. New approaches that constitute the basis of future research are also described.
A Risk Minimization Framework for Information Retrieval
In Proceedings of the ACM SIGIR 2003 Workshop on Mathematical/Formal Methods in IR, 2003
Abstract

Cited by 64 (2 self)
This paper presents a novel probabilistic information retrieval framework in which the retrieval problem is formally treated as a statistical decision problem. In this framework, queries and documents are modeled using statistical language models (i.e., probabilistic models of text), user preferences are modeled through loss functions, and retrieval is cast as a risk minimization problem. We discuss how this framework can unify existing retrieval models and accommodate the systematic development of new retrieval models. As an example of using the framework to model nontraditional retrieval problems, we derive new retrieval models for subtopic retrieval, which is concerned with retrieving documents to cover many different subtopics of a general query topic. These new models differ from traditional retrieval models in that they go beyond independent topical relevance.
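One concrete instance of casting retrieval as risk minimization is to take the loss to be the KL divergence between a query language model and each document model, then rank documents by ascending expected loss. The sketch below is a toy version of that idea under strong simplifications (unsmoothed, epsilon-floored document models), not the paper's full framework.

```python
import math
from collections import Counter

def kl_divergence_rank(query, docs, eps=1e-9):
    """Return document indices ordered by KL divergence between the
    query's unigram model and each document's unigram model (lower
    divergence = lower loss = better rank). Zero document probabilities
    are floored at eps instead of being properly smoothed."""
    q, qn = Counter(query), len(query)
    scores = []
    for d in docs:
        dm, dn = Counter(d), len(d)
        kl = sum((q[w] / qn) * math.log((q[w] / qn) / max(dm[w] / dn, eps))
                 for w in q)
        scores.append(kl)
    return sorted(range(len(docs)), key=lambda i: scores[i])
```

Swapping in a different loss function would yield a different ranking rule, which is the sense in which the framework unifies existing models.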
Probabilistic Datalog: Implementing Logical Information Retrieval for Advanced Applications
Journal of the American Society for Information Science, 1999
Abstract

Cited by 63 (8 self)
In the logical approach to information retrieval (IR), retrieval is considered as uncertain inference.