Results 1 - 10 of 156
Retrieval evaluation with incomplete information
In SIGIR, 2004
"... This paper examines whether the Cranfield evaluation methodology is robust to gross violations of the completeness assumption (i.e., the assumption that all relevant documents within a test collection have been identified and are present in the collection). We show that current evaluation measures a ..."
Cited by 249 (4 self)
This paper examines whether the Cranfield evaluation methodology is robust to gross violations of the completeness assumption (i.e., the assumption that all relevant documents within a test collection have been identified and are present in the collection). We show that current evaluation measures are not robust to substantially incomplete relevance judgments. A new measure is introduced that is both highly correlated with existing measures when complete judgments are available and more robust to incomplete judgment sets. This finding suggests that substantially larger or dynamic test collections built using current pooling practices should be viable laboratory tools, despite the fact that the relevance information will be incomplete and imperfect.
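The new measure is not named in the abstract; it is the measure commonly known as bpref, which scores a ranking using only judged documents and ignores unjudged ones. A minimal per-topic sketch in Python, assuming the usual min(R, N) normalisation (the paper's exact formulation may differ slightly):

```python
def bpref(ranking, qrels):
    """bpref for a single topic.  ranking: doc ids in rank order.
    qrels: doc id -> 1 (judged relevant) or 0 (judged nonrelevant);
    unjudged documents are simply absent and are ignored by the measure."""
    R = sum(1 for v in qrels.values() if v == 1)   # judged relevant count
    N = sum(1 for v in qrels.values() if v == 0)   # judged nonrelevant count
    if R == 0:
        return 0.0
    score, nonrel_seen = 0.0, 0
    for doc in ranking:
        if doc not in qrels:          # unjudged: neither rewarded nor penalised
            continue
        if qrels[doc] == 0:
            nonrel_seen += 1
        else:
            denom = min(R, N)
            penalty = min(nonrel_seen, denom) / denom if denom else 0.0
            score += 1.0 - penalty
    return score / R
```

System-level bpref is then the mean of this value over topics.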
Evaluating Evaluation Measure Stability
2000
"... This paper presents a novel way of examining the accuracy of the evaluation measures commonly used in information retrieval experiments. It validates several of the rules-of-thumb experimenters use, such as the number of queries needed for a good experiment is at least 25 and 50 is better, while cha ..."
Cited by 224 (6 self)
This paper presents a novel way of examining the accuracy of the evaluation measures commonly used in information retrieval experiments. It validates several of the rules of thumb experimenters use, such as that a good experiment needs at least 25 queries and that 50 is better, while challenging other beliefs, such as that the common evaluation measures are equally reliable. As an example, we show that Precision at 30 documents has about twice the average error rate of Average Precision. These results can help information retrieval researchers design experiments that provide a desired level of confidence in their results. In particular, we suggest that researchers using Web measures such as Precision at 10 documents will need to use many more than 50 queries, or will have to require two methods to have a very large difference in evaluation scores before concluding that the two methods are actually different.
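The paper's own methodology compares measures by how often disjoint topic subsets disagree about which of two systems is better. A rough Python sketch of that error-rate idea (the function name and resampling scheme are illustrative, not the paper's exact procedure):

```python
import random

def swap_rate(scores_a, scores_b, trials=10000, seed=0):
    """Estimate how often two disjoint topic subsets disagree about which
    system is better.  scores_a / scores_b map topic id -> per-topic score
    (e.g. Average Precision or Precision at 30) for systems A and B."""
    rng = random.Random(seed)
    topics = list(scores_a)
    flips = 0
    for _ in range(trials):
        rng.shuffle(topics)
        half = len(topics) // 2
        first, second = topics[:half], topics[half:2 * half]
        d1 = sum(scores_a[t] - scores_b[t] for t in first)
        d2 = sum(scores_a[t] - scores_b[t] for t in second)
        if d1 * d2 < 0:          # the two halves rank the systems differently
            flips += 1
    return flips / trials
```

A measure with a lower swap rate at a given topic-set size is more stable in the sense studied here.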
Information retrieval system evaluation: Effort, sensitivity, and reliability
In Proceedings of SIGIR, 2005
"... The effectiveness of information retrieval systems is measured by comparing performance on a common set of queries and documents. Significance tests are often used to evaluate the reliability of such comparisons. Previous work has examined such tests, but produced results with limited application. O ..."
Cited by 115 (11 self)
The effectiveness of information retrieval systems is measured by comparing performance on a common set of queries and documents. Significance tests are often used to evaluate the reliability of such comparisons. Previous work has examined such tests, but produced results with limited application. Other work established an alternative benchmark for significance, but the resulting test was too stringent. In this paper, we revisit the question of how such tests should be used. We find that the t-test is highly reliable (more so than the sign or Wilcoxon test), and is far more reliable than simply showing a large percentage difference in effectiveness measures between IR systems. Our results show that past empirical work on significance tests over-estimated the error of such tests. We also re-consider comparisons between the reliability of precision at rank 10 and mean average precision, arguing that past comparisons did not consider the assessor effort required to compute such measures. This investigation shows that assessor effort would be better spent building test collections with more topics, each assessed in less detail.
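The three tests discussed are standard paired tests over per-topic scores. A minimal sketch using SciPy (assuming SciPy 1.7 or later for binomtest; the paper's simulation-based reliability analysis is not reproduced here):

```python
from scipy import stats

def compare_systems(scores_a, scores_b):
    """Paired significance tests over per-topic effectiveness scores
    (e.g. average precision) for two systems on the same topics."""
    t_stat, t_p = stats.ttest_rel(scores_a, scores_b)   # paired t-test
    w_stat, w_p = stats.wilcoxon(scores_a, scores_b)    # Wilcoxon signed-rank
    # Sign test: binomial test on the number of topics where A beats B,
    # with tied topics excluded.
    wins = sum(a > b for a, b in zip(scores_a, scores_b))
    ties = sum(a == b for a, b in zip(scores_a, scores_b))
    sign_p = stats.binomtest(wins, len(scores_a) - ties, 0.5).pvalue
    return {"t-test": t_p, "wilcoxon": w_p, "sign": sign_p}
```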
User performance versus precision measures for simple search tasks
In Noriko Kando, Wessel Kraaij, and Arjen P. de Vries (editors), Proceedings of the ACM SIGIR International Conference on Research and Development in Information Retrieval, 2006
"... Several recent studies have demonstrated that the type of improvements in information retrieval system effectiveness reported in forums such as SIGIR and TREC do not trans-late into a benefit for users. Two of the studies used an instance recall task, and a third used a question answering task, so p ..."
Cited by 100 (8 self)
Several recent studies have demonstrated that the type of improvements in information retrieval system effectiveness reported in forums such as SIGIR and TREC do not translate into a benefit for users. Two of the studies used an instance recall task, and a third used a question answering task, so perhaps it is unsurprising that the precision-based measures of IR system effectiveness on one-shot query evaluation do not correlate with user performance on these tasks. In this study, we evaluate two different information retrieval tasks on TREC Web-track data: a precision-based user task, measured by the length of time that users need to find a single document that is relevant to a TREC topic; and a simple recall-based task, represented by the total number of relevant documents that users can identify within five minutes. Users employ search engines with controlled mean average precision (MAP) of between 55% and 95%. Our results show that there is no significant relationship between system effectiveness measured by MAP and the precision-based task. A significant, but weak, relationship is present for the precision at one document returned metric. A weak relationship is present between MAP and the simple recall-based task.
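For reference, the two system-side measures controlled in this study, mean average precision and precision at one document returned, reduce to simple per-topic computations. A Python sketch (MAP is the mean of average_precision over topics):

```python
def average_precision(ranking, relevant):
    """Average precision for one topic.  ranking: doc ids in rank order;
    relevant: set of judged-relevant doc ids for the topic."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def precision_at_k(ranking, relevant, k=1):
    """Fraction of the top k retrieved documents that are relevant."""
    return sum(doc in relevant for doc in ranking[:k]) / k
```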
The Philosophy of Information Retrieval Evaluation
In Proceedings of the Second Workshop of the Cross-Language Evaluation Forum on Evaluation of Cross-Language Information Retrieval Systems
"... Evaluation conferences such as TREC, CLEF, and NTCIR are modern examples of the Cran eld evaluation paradigm. In the Cran- eld paradigm, researchers perform experiments on test collections to compare the relative eectiveness of dierent retrieval approaches. The test collections allow the resear ..."
Cited by 94 (2 self)
Evaluation conferences such as TREC, CLEF, and NTCIR are modern examples of the Cranfield evaluation paradigm. In the Cranfield paradigm, researchers perform experiments on test collections to compare the relative effectiveness of different retrieval approaches. The test collections allow the researchers to control the effects of different system parameters, increasing the power and decreasing the cost of retrieval experiments as compared to user-based evaluations. This paper reviews the fundamental assumptions and appropriate uses of the Cranfield paradigm, especially as they apply in the context of the evaluation conferences.
Efficient Construction of Large Test Collections
1998
"... Test collections with a million or more documents are needed for the evaluation of modern information retrieval systems. Yet their construction requires a great deal of effort. Judgements must be rendered as to whether or not documents are relevant to each of a set of queries. Exhaustive judging, in ..."
Cited by 93 (5 self)
Test collections with a million or more documents are needed for the evaluation of modern information retrieval systems. Yet their construction requires a great deal of effort. Judgements must be rendered as to whether or not documents are relevant to each of a set of queries. Exhaustive judging, in which every document is examined and a judgement rendered, is infeasible for collections of this size. Current practice is represented by the "pooling method", as used in the TREC conference series, in which only the first k documents from each of a number of sources are judged. We propose two methods, Interactive Searching and Judging and Move-to-Front Pooling, that yield effective test collections while requiring many fewer judgements. Interactive Searching and Judging selects documents to be judged using an interactive search system, and may be used by a small research team to develop an effective test collection using minimal resources. Move-to-Front Pooling directly improves on the standard pooling method by using a variable number of documents from each source depending on its retrieval performance. Move-to-Front Pooling would be an appropriate replacement for the standard pooling method in future collection development efforts involving many independent groups.
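A rough sketch of the move-to-front idea described here: runs that keep supplying relevant documents stay at the front of the queue and contribute more judged documents, while a run that supplies a nonrelevant document is moved to the back. The data structures, budget handling, and stopping rule below are illustrative, not the paper's exact procedure:

```python
from collections import deque

def move_to_front_pool(runs, judge, budget):
    """Simplified move-to-front pooling sketch.
    runs:   list of rankings, each a list of doc ids in rank order
    judge:  callable doc_id -> bool, standing in for a human assessor
    budget: total number of relevance judgments to spend"""
    queue = deque(iter(run) for run in runs)   # one cursor per run
    judged = {}                                # doc id -> judged relevance
    while queue and len(judged) < budget:
        run = queue.popleft()
        for doc in run:
            if doc in judged:                  # already judged via another run
                continue
            judged[doc] = relevant = judge(doc)
            # A run that just supplied a relevant document keeps the front
            # of the queue; otherwise it is moved to the back.
            if relevant:
                queue.appendleft(run)
            else:
                queue.append(run)
            break
        # If the run was exhausted, it is simply dropped from the queue.
    return judged
```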
Effective ranking with arbitrary passages
Journal of the American Society for Information Science and Technology, 2001
"... Text retrieval systems store agreat variety of documents, from abstracts, newspaper articles, and Web pages to journal articles, books, court transcripts, and legislation. Collections of diverse types of documents expose shortcomings in current approaches to ranking. Use of short fragments of docume ..."
Cited by 64 (1 self)
Text retrieval systems store a great variety of documents, from abstracts, newspaper articles, and Web pages to journal articles, books, court transcripts, and legislation. Collections of diverse types of documents expose shortcomings in current approaches to ranking. Use of short fragments of documents, called passages, instead of whole documents can overcome these shortcomings: passage ranking provides convenient units of text to return to the user, can avoid the difficulties of comparing documents of different length, and enables identification of short blocks of relevant material among otherwise irrelevant text. In this article, we compare several kinds of passage in an extensive series of experiments. We introduce a new type of passage, overlapping fragments of either fixed or variable length. We show that ranking with these arbitrary passages gives substantial improvements in retrieval effectiveness over traditional document ranking schemes, particularly for queries on collections of long documents. Ranking with arbitrary passages shows consistent improvements compared to ranking with whole documents, and to ranking with previous passage types that depend on document structure or topic shifts in documents.
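A minimal sketch of the new passage type described here, overlapping fixed-length word windows. The window length and step below are illustrative, and the paper also considers variable-length passages:

```python
def overlapping_passages(text, length=150, step=75):
    """Split a document into overlapping fixed-length word-window passages.
    length and step are in words; step < length makes consecutive passages overlap."""
    words = text.split()
    if len(words) <= length:
        return [" ".join(words)]
    return [" ".join(words[i:i + length])
            for i in range(0, len(words) - length + step, step)]
```

Each passage can then be scored by the ranking function in place of the whole document, with a document inheriting the score of its best passage.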
Liberal Relevance Criteria of TREC - Counting on Negligible Documents?
2002
"... Most test collections (like TREC and CLEF) for experimental research in information retrieval apply binary relevance assessments. This paper introduces a four-point relevance scale and reports the findings of a project in which TREC-7 and TREC8 document pools on 38 topics were reassessed. The goal o ..."
Cited by 59 (1 self)
Most test collections (like TREC and CLEF) for experimental research in information retrieval apply binary relevance assessments. This paper introduces a four-point relevance scale and reports the findings of a project in which TREC-7 and TREC-8 document pools on 38 topics were reassessed. The goal of the reassessment was to build a subcollection of TREC for experiments on highly relevant documents and to learn about the assessment process as well as the characteristics of a multigraded relevance corpus.
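A small sketch of how a multigraded corpus of this kind is typically used: graded judgments (assumed here to be 0 to 3, with 3 the most relevant) are thresholded into binary qrels so that evaluation can be restricted to, for example, highly relevant documents only. The function name and grade encoding are illustrative:

```python
def binarize_qrels(graded_qrels, threshold):
    """Map graded judgments (0..3) to binary relevance: a document counts as
    relevant iff its grade is at least `threshold`.  A low threshold mimics a
    liberal (TREC-style) criterion; threshold=3 keeps only highly relevant docs."""
    return {
        topic: {doc: int(grade >= threshold) for doc, grade in docs.items()}
        for topic, docs in graded_qrels.items()
    }
```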
Predicting information seeker satisfaction in community question answering
In Proceedings of SIGIR, 2008
"... Question answering communities such as Naver and Yahoo! Answers have emerged as popular, and often effective, means of information seeking on the web. By posting questions for other participants to answer, information seekers can obtain specific answers to their questions. Users of popular portals s ..."
Cited by 57 (4 self)
Question answering communities such as Naver and Yahoo! Answers have emerged as popular, and often effective, means of information seeking on the web. By posting questions for other participants to answer, information seekers can obtain specific answers to their questions. Users of popular portals such as Yahoo! Answers already have submitted millions of questions and received hundreds of millions of answers from other participants. However, it may also take hours, and sometimes days, until a satisfactory answer is posted. In this paper we introduce the problem of predicting information seeker satisfaction in collaborative question answering communities, where we attempt to predict whether a question author will be satisfied with the answers submitted by the community participants. We present a general prediction model, and develop a variety of content, structure, and community-focused features for this task. Our experimental results, obtained from a large-scale evaluation over thousands of real questions and user ratings, demonstrate the feasibility of modeling and predicting asker satisfaction. We complement our results with a thorough investigation of the interactions and information seeking patterns in question answering communities that correlate with information seeker satisfaction. Our models and predictions could be useful for a variety of applications such as user intent inference, answer ranking, interface design, and query suggestion and routing.
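The paper's actual feature set and models are far richer than can be shown here; a minimal scikit-learn sketch of the prediction setup, with illustrative (not the paper's) content, structure, and community features and an assumed question-thread representation:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def question_features(q):
    """Illustrative content / structure / community features for one question;
    `q` is assumed to be a dict describing the question thread."""
    return [
        len(q["title"].split()),              # content: title length in words
        len(q["body"].split()),               # content: body length in words
        q["num_answers"],                     # structure: answers received
        q["time_to_first_answer_minutes"],    # structure: responsiveness
        q["asker_prior_questions"],           # community: asker's history
        q["asker_prior_satisfied_fraction"],  # community: past satisfaction rate
    ]

def evaluate(questions, labels):
    """labels[i] is 1 if the asker marked question i as satisfactorily answered."""
    X = [question_features(q) for q in questions]
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    return cross_val_score(clf, X, labels, cv=5, scoring="accuracy").mean()
```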