Results 1 - 10
of
42
Optimizing Search Engines using Clickthrough Data
, 2002
"... This paper presents an approach to automatically optimizing the retrieval quality of search engines using clickthrough data. Intuitively, a good information retrieval system should present relevant documents high in the ranking, with less relevant documents following below. While previous approaches ..."
Abstract
-
Cited by 568 (20 self)
- Add to MetaCart
This paper presents an approach to automatically optimizing the retrieval quality of search engines using clickthrough data. Intuitively, a good information retrieval system should present relevant documents high in the ranking, with less relevant documents following below. While previous approaches to learning retrieval functions from examples exist, they typically require training data generated from relevance judgments by experts. This makes them difficult and expensive to apply. The goal of this paper is to develop a method that utilizes clickthrough data for training, namely the query-log of the search engine in connection with the log of links the users clicked on in the presented ranking. Such clickthrough data is available in abundance and can be recorded at very low cost. Taking a Support Vector Machine (SVM) approach, this paper presents a method for learning retrieval functions. From a theoretical perspective, this method is shown to be well-founded in a risk minimization framework. Furthermore, it is shown to be feasible even for large sets of queries and features. The theoretical results are verified in a controlled experiment. It shows that the method can effectively adapt the retrieval function of a meta-search engine to a particular group of users, outperforming Google in terms of retrieval quality after only a couple of hundred training examples.
Retrieval Evaluation with Incomplete Information
, 2004
"... This paper examines whether the Cranfield evaluation methodology is robust to gross violations of the completeness assumption (i.e., the assumption that all relevant documents within a test collection have been identified and are present in the collection). We show that current evaluation measures a ..."
Abstract
-
Cited by 121 (3 self)
- Add to MetaCart
This paper examines whether the Cranfield evaluation methodology is robust to gross violations of the completeness assumption (i.e., the assumption that all relevant documents within a test collection have been identified and are present in the collection). We show that current evaluation measures are not robust to substantially incomplete relevance judgments. A new measure is introduced that is both highly correlated with existing measures when complete judgments are available and more robust to incomplete judgment sets. This finding suggests that substantially larger or dynamic test collections built using current pooling practices should be viable laboratory tools, despite the fact that the relevance information will be incomplete and imperfect.
Minimal test collections for retrieval evaluation
- In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
, 2006
"... Accurate estimation of information retrieval evaluation metrics such as average precision require large sets of relevance judgments. Building sets large enough for evaluation of realworld implementations is at best inefficient, at worst infeasible. In this work we link evaluation with test collectio ..."
Abstract
-
Cited by 56 (8 self)
- Add to MetaCart
Accurate estimation of information retrieval evaluation metrics such as average precision require large sets of relevance judgments. Building sets large enough for evaluation of realworld implementations is at best inefficient, at worst infeasible. In this work we link evaluation with test collection construction to gain an understanding of the minimal judging effort that must be done to have high confidence in the outcome of an evaluation. A new way of looking at average precision leads to a natural algorithm for selecting documents to judge and allows us to estimate the degree of confidence by defining a distribution over possible document judgments. A study with annotators shows that this method can be used by a small group of researchers to rank a set of systems in under three hours with 95 % confidence.
Evaluating retrieval performance using clickthrough data
, 2003
"... This paper proposes a new method for evaluating the quality of retrieval functions. Unlike traditional methods that require relevance judgments by experts or explicit user feedback, it is based entirely on clickthrough data. This is a key advantage, since clickthrough data can be collected at very l ..."
Abstract
-
Cited by 44 (6 self)
- Add to MetaCart
This paper proposes a new method for evaluating the quality of retrieval functions. Unlike traditional methods that require relevance judgments by experts or explicit user feedback, it is based entirely on clickthrough data. This is a key advantage, since clickthrough data can be collected at very low cost and without overhead for the user. Taking an approach from experiment design, the paper proposes an experiment setup that generates unbiased feedback about the relative quality of two search results without explicit user feedback. A theoretical analysis shows that the method gives the same results as evaluation with traditional relevance judgments under mild assumptions. An empirical analysis verifies that the assumptions are indeed justified and that the new method leads to conclusive results in a WWW retrieval study. 1
How Does Clickthrough Data Reflect Retrieval Quality?
"... Automatically judging the quality of retrieval functions based on observable user behavior holds promise for making retrieval evaluation faster, cheaper, and more user centered. However, the relationship between observable user behavior and retrieval quality is not yet fully understood. We present a ..."
Abstract
-
Cited by 31 (4 self)
- Add to MetaCart
Automatically judging the quality of retrieval functions based on observable user behavior holds promise for making retrieval evaluation faster, cheaper, and more user centered. However, the relationship between observable user behavior and retrieval quality is not yet fully understood. We present a sequence of studies investigating this relationship for an operational search engine on the arXiv.org e-print archive. We find that none of the eight absolute usage metrics we explore (e.g., number of clicks, frequency of query reformulations, abandonment) reliably reflect retrieval quality for the sample sizes we consider. However, we find that paired experiment designs adapted from sensory analysis produce accurate and reliable statements about the relative quality of two retrieval functions. In particular, we investigate two paired comparison tests that analyze clickthrough data from an interleaved presentation of ranking pairs, and we find that both give accurate and consistent results. We conclude that both paired comparison tests give substantially more accurate and sensitive evaluation results than absolute usage metrics in our domain.
Evaluating search engines by modeling the relationship between relevance and clicks
- In Proceedings of the Advances in Neural Information Processing Systems (NIPS
, 2007
"... We propose a model that leverages the millions of clicks received by web search engines to predict document relevance. This allows the comparison of ranking functions when clicks are available but complete relevance judgments are not. After an initial training phase using a set of relevance judgment ..."
Abstract
-
Cited by 25 (1 self)
- Add to MetaCart
We propose a model that leverages the millions of clicks received by web search engines to predict document relevance. This allows the comparison of ranking functions when clicks are available but complete relevance judgments are not. After an initial training phase using a set of relevance judgments paired with click data, we show that our model can predict the relevance score of documents that have not been judged. These predictions can be used to evaluate the performance of a search engine, using our novel formalization of the confidence of the standard evaluation metric discounted cumulative gain (DCG), so comparisons can be made across time and datasets. This contrasts with previous methods which can provide only pair-wise relevance judgments between results shown for the same query. When no relevance judgments are available, we can identify the better of two ranked lists up to 82 % of the time, and with only two relevance judgments for each query, we can identify the better ranking up to 94 % of the time. While our experiments are on sponsored search results, which is the financial backbone of web search, our method is general enough to be applicable to algorithmic web search results as well. Furthermore, we give an algorithm to guide the selection of additional documents to judge to improve confidence. 1
Forming Test Collections with No System Pooling
- In Proceedings of SIGIR
, 2004
"... Forming test collection relevance judgments from the pooled output of multiple retrieval systems has become the standard process for creating resources such as the TREC, CLEF, and NTCIR test collections. This paper presents a series of experiments examining three different ways of building test coll ..."
Abstract
-
Cited by 22 (1 self)
- Add to MetaCart
Forming test collection relevance judgments from the pooled output of multiple retrieval systems has become the standard process for creating resources such as the TREC, CLEF, and NTCIR test collections. This paper presents a series of experiments examining three different ways of building test collections where no system pooling is used. First, a collection formation technique combining manual feedback and multiple systems is adapted to work with a single retrieval system. Second, an existing method based on pooling the output of multiple manual searches is re-examined: testing a wider range of searchers and retrieval systems than has been examined before. Third, a new approach is explored where the ranked output of a single automatic search on a single retrieval system is assessed for relevance: no pooling whatsoever. Using established techniques for evaluating the quality of relevance judgments, in all three cases, test collections are formed that are as good as TREC.
Methods for ranking information retrieval systems without relevance judgments
- In Proceedings of the 2003 ACM Symposium on applied computing (SAC
, 2003
"... In this paper we present some new methods of ranking information retrieval systems without relevance judgement. The common ground of these methods is using a measure we called reference count. An extensive experimentation was conducted to evaluate the effectiveness of the proposed methods using vari ..."
Abstract
-
Cited by 13 (1 self)
- Add to MetaCart
In this paper we present some new methods of ranking information retrieval systems without relevance judgement. The common ground of these methods is using a measure we called reference count. An extensive experimentation was conducted to evaluate the effectiveness of the proposed methods using various different standards Information Retrieval evaluation measures for the ranking, like average precision, R-precision, and precision and different document levels. We also compared the effectiveness of the proposed methods with the method proposed by Soboroff et al. The experimental results showed that the proposed methods are effective, and in many cases are more effective than Soboroff at al.’s method.
Hits hits TREC: exploring IR evaluation results with network analysis
- In Proc. 30th Ann. Int. ACM SIGIR Conf. on Research and Development in Information Retrieval
, 2007
"... We propose a novel method of analysing data gathered from TREC or similar information retrieval evaluation experiments. We define two normalized versions of average precision, that we use to construct a weighted bipartite graph of TREC systems and topics. We analyze the meaning of well known — and s ..."
Abstract
-
Cited by 11 (1 self)
- Add to MetaCart
We propose a novel method of analysing data gathered from TREC or similar information retrieval evaluation experiments. We define two normalized versions of average precision, that we use to construct a weighted bipartite graph of TREC systems and topics. We analyze the meaning of well known — and somewhat generalized — indicators from social network analysis on the Systems-Topics graph. We apply this method to an analysis of TREC 8 data; among the results, we find that authority measures systems performance, that hubness of topics reveals that some topics are better than others at distinguishing more or less effective systems, that with current measures a system that wants to be effective in TREC needs to be effective on easy topics, and that by using different effectiveness measures this is no longer the case.

