Combining Approaches to Information Retrieval
Cited by 76 (1 self)
The combination of different text representations and search strategies has become a standard technique for improving the effectiveness of information retrieval. Combination, for example, has been studied extensively in the TREC evaluations and is the basis of the “meta-search” engines used on the Web. This paper examines the development of this technique, including both experimental results and the retrieval models that have been proposed as formal frameworks for combination. We show that combining approaches for information retrieval can be modeled as combining the outputs of multiple classifiers based on one or more representations, and that this simple model can provide explanations for many of the experimental results. We also show that this view of combination is very similar to the inference net model, and that a new approach to retrieval based on language models supports combination and can be integrated with the inference net model.
Text Categorization Based on Regularized Linear Classification Methods
Information Retrieval, 2000
Cited by 67 (2 self)
A number of linear classification methods such as the linear least squares fit (LLSF), logistic regression, and support vector machines (SVMs) have been applied to text categorization problems. These methods are similar in that they find hyperplanes that approximately separate a class of document vectors from its complement. However, support vector machines are so far considered special in that they have been demonstrated to achieve state-of-the-art performance. It is therefore worthwhile to understand whether such good performance is unique to the SVM design, or if it can also be achieved by other linear classification methods. In this paper, we compare a number of known linear classification methods as well as some variants in the framework of regularized linear systems. We will discuss the statistical and numerical properties of these algorithms, with a focus on text categorization. We will also provide some numerical experiments to illustrate these algorithms on a number of datasets.
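As a rough illustration of what "regularized linear classification" can mean, the sketch below fits a separating hyperplane by gradient descent on a mean loss plus an L2 penalty, with either a squared loss (a least-squares fit) or a logistic loss. It is not the paper's algorithm: the function names, the step size, the penalty weight `lam`, and the toy data are all assumptions made for this example.

```python
import math

def train_linear(X, y, loss="logistic", lam=0.01, lr=0.1, epochs=500):
    """Fit weights w by gradient descent on mean loss + (lam / 2) * ||w||^2.

    X is a list of feature lists, y a list of labels in {-1, +1}.
    All hyperparameter defaults are illustrative assumptions.
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(epochs):
        grad = [lam * wj for wj in w]          # gradient of the L2 penalty
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi))
            if loss == "logistic":
                # derivative of log(1 + exp(-y * s)) with respect to s
                g = -yi / (1.0 + math.exp(yi * score))
            elif loss == "squared":
                # derivative of 0.5 * (s - y)^2 with respect to s
                g = score - yi
            else:
                raise ValueError("unknown loss: " + loss)
            for j in range(d):
                grad[j] += g * xi[j] / n
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

def predict(w, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else -1

# Hypothetical toy data: [bias, feature_1, feature_2] per document.
X = [[1.0, 2.0, 0.0], [1.0, 1.5, 0.2], [1.0, 0.0, 2.0], [1.0, 0.3, 1.5]]
y = [1, 1, -1, -1]

w_logistic = train_linear(X, y, loss="logistic")
w_squared = train_linear(X, y, loss="squared")
```

Both losses share the same linear scoring function and differ only in how misclassification is penalized, which is the sense in which such methods can be compared within one framework.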
Discriminative Models for Information Retrieval
SIGIR '04, 2004
Cited by 66 (1 self)
Discriminative models have been preferred over generative models in many machine learning problems in the recent past owing to some of their attractive theoretical properties. In this paper, we explore the applicability of discriminative classifiers for IR. We have compared the performance of two popular discriminative models, namely the maximum entropy model and support vector machines with that of language modeling, the state-of-the-art generative model for IR. Our experiments on ad-hoc retrieval indicate that although maximum entropy is significantly worse than language models, support vector machines are on par with language models. We argue that the main reason to prefer SVMs over language models is their ability to learn arbitrary features automatically as demonstrated by our experiments on the home-page finding task of TREC-10.
A Theory of Term Weighting Based on Exploratory Data Analysis
Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1998
Cited by 39 (1 self)
Techniques of exploratory data analysis are used to study the weight of evidence that the occurrence of a query term provides in support of the hypothesis that a document is relevant to an information need. In particular, the relationship between the document frequency and the weight of evidence is investigated. A correlation between document frequency normalized by collection size and the mutual information between relevance and term occurrence is uncovered. This correlation is found to be robust across a variety of query sets and document collections. Based on this relationship, a theoretical explanation of the efficacy of inverse document frequency for term weighting is developed which differs in both style and content from theories previously put forth. The theory predicts that a "flattening" of idf at both low and high frequency should result in improved retrieval performance. This altered idf formulation is tested on all TREC query sets. Retrieval results corroborate the predicti...
Report on the TREC-5 Experiment: Data Fusion and Collection Fusion
1988
Cited by 19 (1 self)
This paper describes and evaluates a retrieval model that considers the problem of data fusion and collection fusion as two faces of the same coin. To establish a clear theoretical foundation for combining various sources of evidence provided either by different search schemes (data fusion) or by distributed information services (collection fusion), we have implemented a retrieval model based on the logistic regression methodology.
Participation: Category B, ad-hoc automatic
Introduction
There exist many reasons for considering multiple sources of evidence in information retrieval (Katzer et al., 1982), (Saracevic & Kantor, 1988), (Harman, 1995), and their integration is usually studied in two distinct contexts. Various retrieval strategies or query formulations may operate on the same collection (the data fusion problem) (Belkin et al., 1995), (Lee, 1995), which is the subject of the first part. The second part deals with the collection fusion problem, that is, how distributed information servers may collaborate to answer a given request (Callan et al., 1995), (Voorhees et al., 1995).
1. Data Fusion Problem
To combine different retrieval schemes (or different query formulations), a retrieval engine might first find the retrieved set associated with each search scheme, and then merge them into a single effective ranked list. To define this underlying merging function, we may consider, for each retrieved record, its rank and / or its retrieval status value. However, the retrieval status values obtained by various weighting schemes may not share a similar range of possible values, which complicates the combination. Section 1.1 outlines our test-collection and some evaluations of individual retrieval schemes based on two distinct query ...
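The merging step described above, including the problem of retrieval status values with incompatible ranges, can be sketched as follows. This is not the logistic regression model of the paper. It is a simpler, commonly used baseline that rescales each run to [0, 1] with min-max normalization and then sums the normalized scores per document (often called CombSUM). The function names, document ids, and scores are hypothetical.

```python
def min_max_normalize(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: treat every document as equally good.
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def fuse(runs):
    """Merge several runs into one ranked list of document ids.

    Each run is a {doc_id: retrieval_status_value} mapping produced by one
    search scheme. Documents missing from a run contribute 0 for that run.
    Ties are broken by document id so the output is deterministic.
    """
    fused = {}
    for run in runs:
        for doc, s in min_max_normalize(run).items():
            fused[doc] = fused.get(doc, 0.0) + s
    return sorted(fused, key=lambda doc: (-fused[doc], doc))

# Two hypothetical runs whose raw scores live on different scales.
run_a = {"d1": 12.0, "d2": 9.0, "d3": 3.0}
run_b = {"d2": 0.9, "d3": 0.8, "d4": 0.1}

merged = fuse([run_a, run_b])
```

Here `d2` ranks first in the merged list because both schemes score it highly, even though neither raw score is directly comparable to the other.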
Probabilistic Models for Combining Diverse Knowledge Sources in Multimedia Retrieval
Ph.D. Thesis, 2006
Cited by 18 (2 self)
In recent years, the multimedia retrieval community has been gradually shifting its emphasis from analyzing one media source at a time to exploring the opportunities of combining diverse knowledge sources from correlated media types and context. This thesis presents a conditional probabilistic retrieval model as a principled framework to combine diverse knowledge sources. An efficient rank-based learning approach has been developed to explicitly model the ranking relations in the learning process. Under this retrieval framework, we overview and develop a number of state-of-the-art approaches for extracting ranking features from multimedia knowledge sources. To incorporate query information in the combination model, this thesis develops a number of query analysis models that can automatically discover the mixing structure of the query space based on previous retrieval results. To adapt the combination function on a per-query basis, this thesis also presents a probabilistic local context analysis (pLCA) model to automatically leverage additional retrieval sources to improve initial retrieval outputs. All the proposed approaches are evaluated on multimedia retrieval tasks with large-scale video collections as well as meta-search tasks with large-scale text collections.
Ranking function optimization for effective web search by genetic programming: An empirical study
Proceedings of the 37th Hawaii International Conference on System Sciences, 2004
Cited by 12 (8 self)
Web search engines have become indispensable in our daily life to help us find the information we need. Although search engines are very fast in search response time, their effectiveness in finding useful and relevant documents at the top of the search hit list needs to be improved. In this paper, we report our experience applying Genetic Programming (GP) to the ranking function discovery problem, leveraging the structural information of HTML documents. Our empirical experiments using the web track data from recent TREC conferences show that we can discover better ranking functions than existing well-known ranking strategies from IR, such as Okapi and Ptfidf. The performance is even comparable to that obtained by a Support Vector Machine.
From uncertain inference to probability of relevance for advanced IR applications
25th European Conference on Information Retrieval Research (ECIR 2003), 2003
Cited by 10 (4 self)
Uncertain inference is a probabilistic generalisation of the logical view on databases, ranking documents according to the probability that they logically imply the query. For tasks other than ad-hoc retrieval, estimates of the actual probability of relevance are required. In this paper, we investigate mapping functions between these two types of probability. For this purpose, we consider linear and logistic functions. The former have been proposed before, whereas we give a new theoretic justification for the latter. In a series of upper-bound experiments, we compare the goodness of fit of the two models. A second series of experiments investigates the effect on the resulting retrieval quality in the fusion step of distributed retrieval. These experiments show that good estimates of the actual probability of relevance can be achieved, and the logistic model outperforms the linear one. However, retrieval quality for distributed retrieval (only merging, without resource selection) is only slightly improved by using the logistic function.
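To make the two families of mapping functions concrete, the sketch below maps a probability of inference `p` to an estimated probability of relevance with a linear and a logistic function, and measures goodness of fit by mean squared error. The parameterizations, the clipping of the linear map to [0, 1], the error measure, and all parameter values are assumptions for illustration. The abstract does not specify them.

```python
import math

def linear_map(p, c0, c1):
    """Linear mapping c0 + c1 * p, clipped to [0, 1] (clipping is an assumption)."""
    return min(1.0, max(0.0, c0 + c1 * p))

def logistic_map(p, b0, b1):
    """Logistic mapping 1 / (1 + exp(-(b0 + b1 * p))), always inside (0, 1)."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * p)))

def mean_squared_error(mapping, pairs):
    """Average squared gap between mapped values and observed relevance rates."""
    return sum((mapping(p) - r) ** 2 for p, r in pairs) / len(pairs)

# Hypothetical (probability of inference, observed relevance rate) pairs.
pairs = [(0.0, 0.05), (0.2, 0.10), (0.5, 0.50), (0.8, 0.90), (1.0, 0.95)]

linear_error = mean_squared_error(lambda p: linear_map(p, 0.0, 1.0), pairs)
logistic_error = mean_squared_error(lambda p: logistic_map(p, -4.0, 8.0), pairs)
```

One practical difference visible here is that the logistic function never leaves the open interval (0, 1), while an unclipped linear function can produce values that are not valid probabilities.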
Evaluation of Learning Schemes Used in Information Retrieval
1996
Cited by 9 (1 self)
Searching within the context of information retrieval may be viewed as a communication process between the users and the indexers (or the authors). It is known that in expressing the same concept or idea, different people tend to use different words or phrases, and also that the meaning of words attached to document surrogates tends to change over time. To overcome these phenomena, various learning schemes have been designed so as to automatically infer knowledge about document content from the relevance assessments of past queries. Thus, in contrast to most retrieval models that represent the semantic content of documents as static entities, these adaptive search models might change the descriptions of documents through an inductive learning scheme. The evaluation of such dynamic document space strategies may be based on retrospective tests within which the same set of queries is applied to train and test the system. Based on cross-validation principles, this paper suggests a more "ho...

