Text Categorization Based on Regularized Linear Classification Methods
- Information Retrieval
, 2000
Cited by 67 (2 self)
A number of linear classification methods such as the linear least squares fit (LLSF), logistic regression, and support vector machines (SVMs) have been applied to text categorization problems. These methods are similar in that they find hyperplanes that approximately separate a class of document vectors from its complement. However, support vector machines are so far considered special in that they have been demonstrated to achieve state-of-the-art performance. It is therefore worthwhile to understand whether such good performance is unique to the SVM design, or whether it can also be achieved by other linear classification methods. In this paper, we compare a number of known linear classification methods as well as some variants in the framework of regularized linear systems. We will discuss the statistical and numerical properties of these algorithms, with a focus on text categorization. We will also provide some numerical experiments to illustrate these algorithms on a number of datasets.
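As a rough illustration of what "regularized linear classification" can mean, the sketch below trains a linear classifier by gradient descent on an average loss plus an L2 penalty, once with the squared loss (a regularized least squares fit) and once with the logistic loss. The toy data, the function names `train_linear` and `predict`, and the values of `lam`, `lr`, and `epochs` are all assumptions made for this example; they are not taken from the paper, which may formulate and solve these problems differently.

```python
import math

def train_linear(X, y, loss="logistic", lam=0.01, lr=0.1, epochs=500):
    """Gradient descent on (1/n) * sum_i loss(w.x_i, y_i) + lam * ||w||^2.

    Labels y_i are +1 or -1. All hyperparameter defaults are illustrative.
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(epochs):
        # Gradient of the L2 penalty lam * ||w||^2.
        grad = [2.0 * lam * wj for wj in w]
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi))
            if loss == "logistic":
                # d/ds log(1 + exp(-y*s)) = -y / (1 + exp(y*s))
                g = -yi / (1.0 + math.exp(yi * score))
            elif loss == "squared":
                # d/ds (s - y)^2 = 2 * (s - y)
                g = 2.0 * (score - yi)
            else:
                raise ValueError(f"unknown loss: {loss}")
            for j in range(d):
                grad[j] += g * xi[j] / n
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

def predict(w, x):
    """Classify by the side of the hyperplane w.x = 0 on which x falls."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else -1

# Hypothetical "document vectors": two term counts plus a constant bias feature.
X = [[2, 0, 1], [3, 1, 1], [2, 1, 1], [0, 2, 1], [1, 3, 1], [0, 3, 1]]
y = [1, 1, 1, -1, -1, -1]

w_log = train_linear(X, y, loss="logistic")
w_sq = train_linear(X, y, loss="squared")
```

Both weight vectors define a separating hyperplane on this toy set; only the loss term differs, which is the kind of comparison the abstract describes.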
"Is This Document Relevant? ...Probably": A Survey of Probabilistic Models in Information Retrieval
, 2001
Cited by 55 (12 self)
This article surveys probabilistic approaches to modeling information retrieval. The basic concepts of probabilistic approaches to information retrieval are outlined, and the principles and assumptions upon which the approaches are based are presented. The various models proposed in the development of IR are described, classified, and compared using a common formalism. New approaches that constitute the basis of future research are described.
A Theory of Term Weighting Based on Exploratory Data Analysis
- Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
, 1998
Cited by 39 (1 self)
Techniques of exploratory data analysis are used to study the weight of evidence that the occurrence of a query term provides in support of the hypothesis that a document is relevant to an information need. In particular, the relationship between the document frequency and the weight of evidence is investigated. A correlation between document frequency normalized by collection size and the mutual information between relevance and term occurrence is uncovered. This correlation is found to be robust across a variety of query sets and document collections. Based on this relationship, a theoretical explanation of the efficacy of inverse document frequency for term weighting is developed which differs in both style and content from theories previously put forth. The theory predicts that a "flattening" of idf at both low and high frequency should result in improved retrieval performance. This altered idf formulation is tested on all TREC query sets. Retrieval results corroborate the predicti...
Probabilistic Information Retrieval as Combination of Abstraction, Inductive Learning and Probabilistic Assumptions
, 1994
Cited by 23 (1 self)
We show that former approaches in probabilistic information retrieval are based on one or two of the three concepts abstraction, inductive learning and probabilistic assumptions, and we propose a new approach which combines all three concepts. This approach is illustrated for the case of indexing with a controlled ...
Information Retrieval: A Survey
, 2000
Cited by 14 (0 self)
Information Retrieval (IR) is the discipline that deals with retrieval of unstructured data, especially textual documents, in response to a query or topic statement, which may itself be unstructured, e.g., a sentence or even another document, or which may be structured, e.g., a boolean expression. The need for effective methods of automated IR has grown in importance because of the tremendous explosion in the amount of unstructured data, both internal, corporate document collections, and the immense and growing number of document sources on the Internet. This report is a tutorial and survey of the state of the art, both research and commercial, in this dynamic field. The topics covered include: formulation of structured and unstructured queries and topic statements, indexing (including term weighting) of document collections, methods for computing the similarity of queries and documents, classification and routing of documents in an incoming stream to users on the basis of topic or nee...
Design and development of a network-based electronic library
- Proceedings of the ASIS Midyear Meeting
, 1994
Cited by 7 (1 self)
Among the proposed innovations in the Clinton Administration's plans to develop a National Information Infrastructure is the creation of, and support for, digital or electronic libraries to store and provide access to the vast amounts of information expected to be made available over the "information superhighway". Although the exact nature and future architecture of such libraries is still a matter for experimentation (and debate), there are several pioneering efforts underway to establish electronic libraries and to provide access to them. This paper describes one such effort underway at the University of California at Berkeley. In collaboration with four other universities, we are developing interoperable electronic library servers containing the Computer Science technical reports for each participant and making them available over the Internet using standard protocols.
Economics and Search
- SIGIR Forum
, 1999
Cited by 7 (0 self)
Economists define the (economic) value of information in the context of an optimal choice problem. A consumer is making a choice to maximize expected utility or minimize expected cost. The value of information is the increment in expected utility resulting from the improved choice made possible by better information. Often this can be translated into some monetary equivalent representing how much someone would pay to acquire a given piece of information. (See Laffont [1989], page 61.) To take a very simple example in an IR context, suppose that a user is given two sealed envelopes, one containing $100, the other containing $0. She is allowed to choose one, open it, and keep whatever is inside. To make things simple, suppose that she is risk-neutral, in the sense that she only cares about expected value. In the absence of any information, she would
From Retrieval Status Values to Probabilities of Relevance for Advanced IR Applications
- Information Retrieval
, 2003
Cited by 6 (2 self)
In this paper, we explore the use of linear and logistic mapping functions for different retrieval methods. In a series of upper-bound experiments, we compare the approximation quality of the different mapping functions. We also investigate the effect on the resulting retrieval quality in distributed retrieval (only merging, without resource selection). These experiments show that good estimates of the actual probability of relevance can be achieved, and that the logistic model outperforms the linear one. Retrieval quality for distributed retrieval is only slightly improved by using the logistic function.
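To make the two kinds of mapping function concrete, the following sketch fits a linear map (closed-form least squares, clipped to [0, 1]) and a logistic map (gradient descent on the log loss) from retrieval status values to estimated probabilities of relevance. The training pairs, the function names, and the learning-rate and epoch settings are hypothetical choices for illustration; the abstract does not say how the paper estimates its parameters.

```python
import math

def fit_linear(rsvs, rels):
    """Closed-form least squares for rel ~ a + b * rsv."""
    n = len(rsvs)
    mean_x = sum(rsvs) / n
    mean_y = sum(rels) / n
    sxx = sum((x - mean_x) ** 2 for x in rsvs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(rsvs, rels))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def fit_logistic(rsvs, rels, lr=0.5, epochs=2000):
    """Gradient descent on the average log loss of sigmoid(a + b * rsv)."""
    a = b = 0.0
    n = len(rsvs)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for x, y in zip(rsvs, rels):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += (p - y) / n
            grad_b += (p - y) * x / n
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b

def linear_prob(a, b, rsv):
    """Linear mapping, clipped so the result is a valid probability."""
    return min(1.0, max(0.0, a + b * rsv))

def logistic_prob(a, b, rsv):
    """Logistic mapping; always strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-(a + b * rsv)))

# Hypothetical training data: retrieval status values with binary relevance judgments.
rsvs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
rels = [0, 0, 0, 1, 0, 1, 1, 1, 1]

a_lin, b_lin = fit_linear(rsvs, rels)
a_log, b_log = fit_logistic(rsvs, rels)
```

One visible difference is that the linear map needs explicit clipping at low and high scores, whereas the logistic map stays inside (0, 1) by construction.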
Adaptive Relevance Feedback in Information Retrieval
Cited by 6 (0 self)
Relevance Feedback has proven very effective for improving retrieval accuracy. A difficult yet important problem in all relevance feedback methods is how to optimally balance the original query and feedback information. In the current feedback methods, the balance parameter is usually set to a fixed value across all the queries and collections. However, due to the difference in queries and feedback documents, this balance parameter should be optimized for each query and each set of feedback documents. In this paper, we present a learning approach to adaptively predict the optimal balance coefficient for each query and each collection. We propose three heuristics to characterize the balance between query and feedback information. Taking these three heuristics as a road map, we explore a number of features and combine them using a regression approach to predict the balance coefficient. Our experiments show that the proposed adaptive relevance feedback is more robust and effective than the regular fixed-coefficient feedback.
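The role of a balance coefficient can be sketched with a simple linear interpolation between a query term distribution and a feedback term distribution. This is an assumed, generic form of feedback interpolation; the abstract does not specify the underlying feedback model, and the function name `interpolate`, the term weights, and the value of `alpha` below are invented for the example.

```python
def interpolate(query_model, feedback_model, alpha):
    """Mix two term-weight dictionaries.

    alpha = 0 keeps only the original query; alpha = 1 keeps only the
    feedback information. Terms missing from one side get weight 0 there.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    terms = set(query_model) | set(feedback_model)
    return {
        t: (1.0 - alpha) * query_model.get(t, 0.0)
        + alpha * feedback_model.get(t, 0.0)
        for t in terms
    }

# Hypothetical term distributions (each sums to 1).
query = {"jaguar": 0.5, "speed": 0.5}
feedback = {"jaguar": 0.4, "cat": 0.3, "speed": 0.1, "habitat": 0.2}

# A fixed-coefficient baseline would use the same alpha for every query;
# an adaptive method would instead predict alpha per query.
updated = interpolate(query, feedback, alpha=0.5)
```

With `alpha` fixed, every query is expanded equally aggressively; predicting `alpha` per query, as the abstract proposes, amounts to replacing the constant with the output of a learned regression model.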
Cheshire II at TREC 6: Interactive Probabilistic Retrieval
- TREC 6 Proceedings (Notebook
, 1997
Cited by 5 (2 self)
This paper briefly describes the features of the Cheshire II system and how it was used in the TREC 6 Interactive track. The results of the interactive track are discussed and future improvements to the Cheshire II system are considered. The Cheshire II system was originally designed to apply probabilistic retrieval methods to searching in online library catalogs in order to help overcome the twin problems of topical searching that are pervasive in "second generation" Boolean online catalogs: search failure and information overload. It was originally intended to be a next-generation online catalog and full-text information retrieval system that would apply probabilistic retrieval methods to simple MARC records and clustered record surrogates (classification clusters) (Larson 1991c; Larson, et al. 1996). Over time the system has been expanded to include support for full-text SGML documents (ranging from simple document types as used in the TREC database to complex fullte...

