A comparison of classifiers and document representations for the routing problem
- ANNUAL ACM CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL - ACM SIGIR
, 1995
Cited by 147 (2 self)
In this paper, we compare learning techniques based on statistical classification to traditional methods of relevance feedback for the document routing problem. We consider three classification techniques which have decision rules that are derived via explicit error minimization: linear discriminant analysis, logistic regression, and neural networks. We demonstrate that the classifiers perform 10-15% better than relevance feedback via Rocchio expansion for the TREC-2 and TREC-3 routing tasks.
Error minimization is difficult in high-dimensional feature spaces because the convergence process is slow and the models are prone to overfitting. We use two different strategies, latent semantic indexing and optimal term selection, to reduce the number of features. Our results indicate that features based on latent semantic indexing are more effective for techniques such as linear discriminant analysis and logistic regression, which have no way to protect against overfitting. Neural networks perform equally well with either set of features and can take advantage of the additional information available when both feature sets are used as input.
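The Rocchio expansion baseline named in the abstract above can be sketched as follows. This is a minimal illustration of the textbook Rocchio update, not the configuration used in the paper: the function name `rocchio`, the sparse dict representation, the default weights `alpha`, `beta`, `gamma`, and the choice to drop non-positive weights are all assumptions made for the example.

```python
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return an expanded query as a dict mapping term -> weight.

    query, and each document in relevant / nonrelevant, is a sparse
    term-weight dict. The default weights are illustrative assumptions.
    """
    # Collect every term seen in the query or in any judged document.
    terms = set(query)
    for doc in relevant + nonrelevant:
        terms.update(doc)

    def centroid(docs, term):
        # Mean weight of `term` over a set of documents (0.0 if the set is empty).
        if not docs:
            return 0.0
        return sum(doc.get(term, 0.0) for doc in docs) / len(docs)

    expanded = {}
    for term in terms:
        weight = (alpha * query.get(term, 0.0)
                  + beta * centroid(relevant, term)
                  - gamma * centroid(nonrelevant, term))
        if weight > 0.0:  # assumed convention: keep only positive weights
            expanded[term] = weight
    return expanded


# Hypothetical toy data: one relevant and one non-relevant document.
new_query = rocchio(
    {"routing": 1.0},
    relevant=[{"routing": 1.0, "filter": 2.0}],
    nonrelevant=[{"sports": 4.0}],
)
```

The expanded query moves toward the centroid of the relevant documents and away from the non-relevant ones; the classifiers compared in the paper instead fit their decision rules by explicit error minimization.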
A survey of information retrieval and filtering methods
, 1995
Cited by 82 (0 self)
We survey the major techniques for information retrieval. In the first part, we provide an overview of the traditional ones (full text scanning, inversion, signature files and clustering). In the second part we discuss attempts to include semantic information (natural language processing, latent semantic indexing and neural networks).
Regularizing ad hoc retrieval scores
, 2005
Cited by 31 (1 self)
The cluster hypothesis states: closely related documents tend to be relevant to the same request. We exploit this hypothesis directly by adjusting ad hoc retrieval scores from an initial retrieval so that topically related documents receive similar scores. We refer to this process as score regularization. Score regularization can be presented as an optimization problem, allowing the use of results from semisupervised learning. We demonstrate that regularized scores consistently and significantly rank documents better than un-regularized scores, given a variety of initial retrieval algorithms. We evaluate our method on two large corpora across a substantial number of topics.
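The idea of pulling the scores of topically related documents toward each other can be sketched with a generic graph-smoothing iteration of the kind used in semi-supervised learning. This is an assumed, simplified stand-in and not necessarily the formulation in the paper: the function name `regularize_scores`, the affinity matrix, the mixing weight `alpha`, and the fixed iteration count are all hypothetical.

```python
def regularize_scores(initial, affinity, alpha=0.5, iterations=50):
    """Smooth retrieval scores over a document-affinity graph.

    initial  -- list of scores from the first retrieval pass
    affinity -- square matrix (list of lists) of non-negative document
                similarities; the diagonal is ignored
    alpha    -- assumed trade-off: 0 keeps the initial scores, values
                near 1 rely mostly on the neighbours
    """
    n = len(initial)
    scores = list(initial)
    for _ in range(iterations):
        updated = []
        for i in range(n):
            total = sum(affinity[i][j] for j in range(n) if j != i)
            if total > 0:
                # Affinity-weighted average of the neighbours' current scores.
                neighbour_avg = sum(
                    affinity[i][j] * scores[j] for j in range(n) if j != i
                ) / total
            else:
                # A document with no neighbours keeps its initial score.
                neighbour_avg = initial[i]
            updated.append((1 - alpha) * initial[i] + alpha * neighbour_avg)
        scores = updated
    return scores


# Hypothetical example: documents 0 and 1 are closely related, 2 is isolated.
smoothed = regularize_scores(
    [1.0, 0.0, 0.0],
    [[0, 1, 0],
     [1, 0, 0],
     [0, 0, 0]],
)
```

In this toy case the unscored document 1 gains score from its highly scored neighbour, while the isolated document 2 is left unchanged.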
Optimizing Ranking Functions: A Connectionist Approach to Adaptive Information Retrieval
- DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING, THE UNIVERSITY OF CALIFORNIA, SAN DIEGO
, 1994
Cited by 26 (5 self)
This dissertation examines the use of adaptive methods to automatically improve the performance of ranked text retrieval systems. The goal of a ranked retrieval system is to manage a large collection of text documents and to order documents for a user based on the estimated relevance of the documents to the user's information need (or query). The ordering enables the user to quickly find documents of interest. Ranked retrieval is a difficult problem because of the ambiguity of natural language, the large size of the collections, and because of the varying needs of users and varying collection characteristics. We propose and empirically validate general adaptive methods which improve the ability of a large class of retrieval systems to rank documents effectively. Our main adaptive method is to numerically optimize free parameters in a retrieval system by minimizing a non-metric criterion function. The criterion measures how well the system is ranking documents relative to a target ordering, defined by a set of training queries which include the users' desired document orderings. Thus, the system learns parameter settings which better enable it to rank relevant documents before irrelevant. The non-metric approach is interesting because it is a general adaptive method, an alternative to supervised methods for training neural networks in domains in which rank order or prioritization is important. A second adaptive method is also examined, which is applicable to a restricted class of retrieval systems but which permits an analytic solution. The adaptive methods are applied to a number of problems in text retrieval to validate their utility and practical efficiency. The applications include: A dimensionality reduction of vector-based document representations to a vector spa...
Learning in Intelligent Information Retrieval
- In Proceedings of the Eighth International Workshop on Machine Learning
, 1991
Cited by 26 (2 self)
Information retrieval (IR) systems are used for finding, within a large text database, those documents containing information needed by a user. The complex and poorly understood semantics of documents and user queries has made feedback and adaptation important characteristics of IR systems. In this paper we briefly survey previous research on machine learning in IR systems and discuss promising areas for future research at the intersection of these two fields.
1 Introduction
The goal of information retrieval (IR) techniques is to find, within a large database of documents, those documents which satisfy a user information need. Typically the stored documents are composed of natural language text, though IR techniques have also been applied to databases of stored speech, images, computer source code, and other forms of information. In contrast to conventional database techniques, IR techniques are most useful when the semantics of the objects to be retrieved is unclear, and the relation...
Adaptive Information Agents in Distributed Textual Environments
, 1998
Cited by 23 (10 self)
Hypertext environments such as the Web are rich with both word and link cues that can be exploited by autonomous agents performing distributed tasks on behalf of the user. This paper characterizes such environments and identifies the features that are most useful and readily available. We describe the adaptive representation of an ecology of retrieval agents who attempt to capture important features of their surroundings, and base their behaviors upon them. We discuss how such a representation allows the agents to interact with the environments where they are situated. Agents can internalize words that are locally correlated with fitness, based on user feedback. They are shown to outperform nonadaptive search by an order of magnitude. Furthermore, each agent learns new strategies at local time and space scales, while the population evolves at a global scale.
1 Introduction
Imagine that you just submitted a query to your favorite digital library or search engine on the Web, and receiv...
Long-Term Learning for Web Search Engines
- In Proceedings of the 6th European Conference on Principles of Data Mining and Knowledge Discovery (PKDD 2002)
, 2002
Cited by 19 (0 self)
This paper considers how web search engines can learn from the successful searches recorded in their user logs. Document Transformation is a feasible approach that uses these logs to improve document representations. Existing test collections do not allow an adequate investigation of Document Transformation, but we show how a rigorous evaluation of this method can be carried out using the referer logs kept by web servers. We also describe a new strategy for Document Transformation that is suitable for long-term incremental learning. Our experiments show that Document Transformation improves retrieval performance over a medium-sized collection of webpages. Commercial search engines may be able to achieve similar improvements by incorporating this approach.
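A basic form of the idea, moving a document's representation toward the queries that successfully led users to it, might look like the sketch below. It is only an assumed illustration: the additive update, the step size `beta`, and the names `transform_document` and `dot` are hypothetical, and the paper's own long-term strategy may differ.

```python
def dot(a, b):
    """Inner product of two sparse term-weight dicts."""
    return sum(weight * b.get(term, 0.0) for term, weight in a.items())


def transform_document(doc, query, beta=0.1):
    """Return a copy of `doc` nudged toward `query`.

    Intended to be called once per logged search in which `query` led the
    user to this document. `beta` is an assumed learning rate.
    """
    updated = dict(doc)
    for term, weight in query.items():
        updated[term] = updated.get(term, 0.0) + beta * weight
    return updated


# Hypothetical log entry: the query below led a user to this page even
# though the page shares no terms with it.
page = {"retrieval": 1.0}
logged_query = {"search": 1.0, "engine": 1.0}

before = dot(page, logged_query)
page = transform_document(page, logged_query, beta=0.5)
after = dot(page, logged_query)
```

After the update the page matches the logged query, so later users issuing similar queries would see it ranked higher. An unbounded additive update like this one lets frequently clicked pages grow without limit, which is one reason a strategy designed for long-term incremental learning is needed.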
Optimizing Parameters in a Ranked Retrieval System Using Multi-Query Relevance Feedback
- In Proceedings of the Symposium on Document Analysis and Information Retrieval, Las Vegas
, 1994
Cited by 17 (4 self)
A method is proposed by which parameters in ranked-output text retrieval systems can be automatically optimized to improve retrieval performance. A ranked-output text retrieval system implements a ranking function which orders documents, placing documents estimated to be more relevant to the user's query before less relevant ones. The proposed method is to adjust system parameters to maximize the match between the system's document ordering and the user's desired ordering, given by relevance feedback. The utility of the approach is demonstrated by estimating the similarity measure in a vector space model of information retrieval. The approach automatically finds a similarity measure which performs equivalent to or better than all "classic" similarity measures studied. It also performs within 1% of an estimated theoretically optimal measure.
1 Introduction
State of the art document retrieval systems have a large number of free parameters, such as the weights on terms in documents, para...
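Tuning a similarity measure so that the system's ordering matches the user's desired ordering can be illustrated with a small sketch. Everything specific here is an assumption made for the example: the one-parameter similarity family (`p = 0` gives the inner product, `p = 1` gives cosine-style length normalization), the pairwise-agreement criterion, and the grid search, which stands in for the paper's own criterion and numerical optimizer.

```python
import math


def similarity(query, doc, p):
    """Inner product divided by the document length raised to the power p."""
    inner = sum(w * doc.get(t, 0.0) for t, w in query.items())
    norm = math.sqrt(sum(w * w for w in doc.values()))
    return inner / (norm ** p) if norm > 0 else 0.0


def order_agreement(query, docs, preferences, p):
    """Fraction of (better, worse) document pairs ranked in the desired order."""
    correct = sum(
        1 for better, worse in preferences
        if similarity(query, docs[better], p) > similarity(query, docs[worse], p)
    )
    return correct / len(preferences)


def tune(training, docs, grid):
    """Pick the p in `grid` with the best mean agreement over training queries.

    training -- list of (query, preferences) pairs from relevance feedback
    """
    def mean_agreement(p):
        return sum(
            order_agreement(q, docs, prefs, p) for q, prefs in training
        ) / len(training)
    return max(grid, key=mean_agreement)


# Hypothetical collection: a short on-topic document and a long one that
# mentions the query term more often but is mostly about something else.
docs = {
    "short": {"ranking": 1.0},
    "long": {"ranking": 2.0, "weather": 5.0, "sports": 5.0},
}
training = [({"ranking": 1.0}, [("short", "long")])]
best_p = tune(training, docs, grid=[0.0, 0.5, 1.0])
```

With these toy preferences, the plain inner product ranks the long document first and disagrees with the user, whereas the length-normalized settings agree, so the search selects one of them.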
Machine Learning of User Profiles: Representational Issues
- In Proceedings of the Thirteenth National Conference on Artificial Intelligence
, 1996
Cited by 16 (0 self)
As more information becomes available electronically, tools for finding information of interest to users become increasingly important. The goal of the research described here is to build a system for generating comprehensible user profiles that accurately capture user interest with minimum user interaction. The research described here focuses on the importance of a suitable generalization hierarchy and representation for learning profiles which are predictively accurate and comprehensible. In our experiments we evaluated both traditional features based on weighted term vectors and subject features corresponding to categories which could be drawn from a thesaurus. Our experiments, conducted in the context of a content-based profiling system for on-line newspapers on the World Wide Web (the IDD News Browser), demonstrate the importance of a generalization hierarchy and the promise of combining natural language processing techniques with machine learning (ML) to address an inform...

