Using Statistical Testing in the Evaluation of Retrieval Experiments
, 1993
Cited by 149 (0 self)
The standard strategies for evaluation based on precision and recall are examined and their relative advantages and disadvantages are discussed. In particular, it is suggested that relevance feedback be evaluated from the perspective of the user. A number of different statistical tests are described for determining if differences in performance between retrieval methods are significant. These tests have often been ignored in the past because most are based on an assumption of normality which is not strictly valid for the standard performance measures. However, one can test this assumption using simple diagnostic plots, and if it is a poor approximation, there are a number of non-parametric alternatives.
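As a rough illustration of the kind of non-parametric alternative the abstract mentions, the following is a minimal sketch of an exact two-sided sign test on paired per-query scores. The choice of the sign test, the function name `sign_test_p_value`, and the sample scores are assumptions made for this example; the abstract does not say which tests the paper describes.

```python
from math import comb


def sign_test_p_value(scores_a, scores_b):
    """Exact two-sided sign test on paired per-query scores.

    Hypothetical helper: counts the queries on which each method wins,
    drops ties, and compares the split against Binomial(n, 0.5).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must be paired, one entry per query")

    wins_a = sum(1 for a, b in zip(scores_a, scores_b) if a > b)
    wins_b = sum(1 for a, b in zip(scores_a, scores_b) if a < b)
    n = wins_a + wins_b  # tied queries carry no sign and are dropped
    if n == 0:
        return 1.0

    k = min(wins_a, wins_b)
    # Probability of k or fewer wins under the null, doubled for two sides.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Made-up average precision values for six queries (illustrative only).
method_a = [0.50, 0.60, 0.70, 0.80, 0.90, 0.40]
method_b = [0.45, 0.55, 0.65, 0.75, 0.85, 0.35]
p_value = sign_test_p_value(method_a, method_b)
```

The sign test uses only the direction of each per-query difference, so it makes no normality assumption about the scores, at the cost of ignoring how large the differences are.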
Evaluating Evaluation Measure Stability
, 2000
Cited by 131 (5 self)
This paper presents a novel way of examining the accuracy of the evaluation measures commonly used in information retrieval experiments. It validates several of the rules-of-thumb experimenters use, such as the number of queries needed for a good experiment is at least 25 and 50 is better, while challenging other beliefs, such as the common evaluation measures are equally reliable. As an example, we show that Precision at 30 documents has about twice the average error rate as Average Precision has. These results can help information retrieval researchers design experiments that provide a desired level of confidence in their results. In particular, we suggest researchers using Web measures such as Precision at 10 documents will need to use many more than 50 queries or will have to require two methods to have a very large difference in evaluation scores before concluding that the two methods are actually different.
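For readers unfamiliar with the two measures compared in this abstract, here is a minimal sketch of how Precision at k documents and (non-interpolated) Average Precision are commonly computed for a single query. The function names and the 0/1 relevance-list representation are assumptions for illustration, not the paper's own code.

```python
def precision_at_k(ranked_relevance, k):
    """Fraction of the top-k ranked documents that are relevant.

    ranked_relevance: list of 0/1 flags in rank order (assumed format).
    Rankings shorter than k are treated as padded with non-relevant items.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(ranked_relevance[:k]) / k


def average_precision(ranked_relevance, total_relevant=None):
    """Mean of the precision values at each rank where a relevant item occurs.

    total_relevant: number of relevant documents for the query; defaults to
    the number found in the ranking, which is an assumption of this sketch.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    if total_relevant is None:
        total_relevant = hits
    return precision_sum / total_relevant if total_relevant else 0.0


# Illustrative ranking: relevant documents at ranks 1 and 3.
ranking = [1, 0, 1, 0, 0]
p_at_3 = precision_at_k(ranking, 3)
ap = average_precision(ranking)
```

Precision at k depends only on a fixed-size prefix of the ranking, while Average Precision takes the rank of every relevant document into account.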
How Reliable are the Results of Large-Scale Information Retrieval Experiments?
- Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
, 1998
Cited by 100 (3 self)
Two stages in measurement of techniques for information retrieval are gathering of documents for relevance assessment and use of the assessments to numerically evaluate effectiveness. We consider both of these stages in the context of the TREC experiments, to determine whether they lead to measurements that are trustworthy and fair. Our detailed empirical investigation of the TREC results shows that the measured relative performance of systems appears to be reliable, but that recall is overestimated: it is likely that many relevant documents have not been found. We propose a new pooling strategy that can significantly increase the number of relevant documents found for given effort, without compromising fairness.
Content-based query of image databases, inspirations from text retrieval: inverted files, frequency-based weights and relevance feedback
, 1998
The challenge problem for automated detection of 101 semantic concepts in multimedia
- In Proceedings of the ACM International Conference on Multimedia
, 2006
Cited by 89 (18 self)
We introduce the challenge problem for generic video indexing to gain insight in intermediate steps that affect performance of multimedia analysis methods, while at the same time fostering repeatability of experiments. To arrive at a challenge problem, we provide a general scheme for the systematic examination of automated concept detection methods, by decomposing the generic video indexing problem into 2 unimodal analysis experiments, 2 multimodal analysis experiments, and 1 combined analysis experiment. For each experiment, we evaluate generic video indexing performance on 85 hours of international broadcast news data, from the TRECVID 2005/2006 benchmark, using a lexicon of 101 semantic concepts. By establishing a minimum performance on each experiment, the challenge problem allows for component-based optimization of the generic indexing issue, while simultaneously offering other researchers a reference for comparison during indexing methodology development. To stimulate further investigations in intermediate analysis steps that influence video indexing performance, the challenge offers to the research community a manually annotated concept lexicon, pre-computed low-level multimedia features, trained classifier models, and five experiments together with baseline performance, which are all available at
A lattice conceptual clustering system and its application to browsing retrieval
- Machine Learning
, 1996
Cited by 66 (6 self)
The theory of concept (or Galois) lattices provides a simple and formal approach to conceptual clustering. In this paper we present GALOIS, a system that automates and applies this theory. The algorithm utilized by GALOIS to build a concept lattice is incremental and efficient, each update being done in time at most quadratic in the number of objects in the lattice. Also, the algorithm may incorporate background information into the lattice, and through clustering, extend the scope of the theory. The application we present is concerned with information retrieval via browsing, for which we argue that concept lattices may represent major support structures. We describe a prototype user interface for browsing through the concept lattice of a document-term relation, possibly enriched with a thesaurus of terms. An experimental evaluation of the system performed on a medium-sized bibliographic database shows good retrieval performance and a significant improvement after the introduction of background knowledge.
Variations in relevance assessments and the measurement of retrieval effectiveness
- Journal of the American Society for Information Science
, 1996
Cited by 52 (0 self)
The purpose of this article is to bring attention to the problem of variations in relevance assessments and the effects that these may have on measures of retrieval effectiveness. Through an analytical review of the literature, I show that despite known wide variations in relevance assessments in experimental test collections, their effects on the measurement of retrieval performance are almost completely unstudied. I will further argue that what we know about the many variables that have been found to affect relevance assessments under experimental conditions, as well as our new understanding of psychological, situational, user-based relevance, point to a single conclusion. We can no longer rest the evaluation of information retrieval systems on the assumption that such variations do not significantly affect the measurement of information retrieval performance. A series of thorough, rigorous, and extensive tests is needed, of precisely how, and under what conditions, variations in relevance assessments do, and do not, affect measures of retrieval performance. We need to develop approaches to evaluation that are sensitive to these variations and to human factors and individual differences more generally. Our approaches to evaluation must reflect the real world of real users.
Evaluating User Interfaces to Information Retrieval Systems: A Case Study on User Support
- SIGIR'96
, 1996
Cited by 49 (13 self)
Designing good user interfaces to information retrieval systems is a complex activity. The design space is large and evaluation methodologies that go beyond the classical precision and recall figures are not well established. In this paper we present an evaluation of an intelligent interface that covers also the user-system interaction and measures user's satisfaction. More specifically, we describe an experiment that evaluates: (i) the added value of the semiautomatic query reformulation implemented in a prototype system; (ii) the importance of technical, terminological, and strategic supports and (iii) the best way to provide them. The interpretation of results leads to guidelines for the design of user interfaces to information retrieval systems and to some observations on the evaluation issue.
Evaluation of a Simple and Effective Music Information Retrieval Method
Cited by 41 (1 self)
We developed, and then evaluated, a music information retrieval (MIR) system based upon the intervals found within the melodies of a collection of 9354 folksongs. The songs were converted to an interval-only representation of monophonic melodies and then fragmented into length-n subsections called n-grams. The length of these n-grams and the degree to which we precisely represent the intervals are variables analyzed in this paper. We constructed a collection of "musical word" databases using the text-based, SMART information retrieval system. A group of simulated queries, some of which contained simulated errors, was run against these databases. The results were evaluated using the normalized precision and normalized recall measures. Our concept of "musical words" shows great merit thus implying that useful MIR systems can be constructed simply and efficiently using pre-existing text-based information retrieval software. Second, this study is a formal and comprehensive evaluation of ...
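The interval n-gram idea described in this abstract can be sketched as follows. This is an illustrative reading only: representing pitches as MIDI note numbers, using overlapping (sliding-window) n-grams, and the function names are all assumptions, since the abstract does not specify the pitch encoding or how the melodies were fragmented.

```python
def melody_to_intervals(pitches):
    """Convert a monophonic pitch sequence to successive intervals.

    pitches: list of integers, assumed here to be MIDI note numbers.
    Returns signed semitone differences between consecutive notes.
    """
    return [later - earlier for earlier, later in zip(pitches, pitches[1:])]


def interval_ngrams(intervals, n):
    """Split an interval sequence into overlapping length-n windows.

    Overlapping windows are an assumption of this sketch.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return [tuple(intervals[i:i + n]) for i in range(len(intervals) - n + 1)]


# Illustrative five-note melody: C4 D4 E4 D4 C4 as MIDI numbers.
melody = [60, 62, 64, 62, 60]
intervals = melody_to_intervals(melody)   # [2, 2, -2, -2]
bigrams = interval_ngrams(intervals, 2)   # [(2, 2), (2, -2), (-2, -2)]
```

Because only the differences between notes are kept, a transposed copy of the same melody produces identical n-grams, which is what lets the fragments be indexed like words in a text retrieval system.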
Comparing Interactive Information Retrieval Systems Across Sites: The TREC-6 Interactive Track Matrix Experiment
, 1998
Cited by 35 (0 self)
This is a case study in the design and analysis of a 9-site TREC-6 experiment aimed at comparing the performance of 12 interactive information retrieval (IR) systems on a shared problem: a question-answering task, 6 statements of information need, and a collection of 210,158 articles from the Financial Times of London 1991-1994. The study discusses the application of experimental design principles and the use of a shared control IR system in addressing the problems of comparing experimental interactive IR systems across sites: isolating the effects of topics, human searchers, and other site-specific factors within an affordable design. The results confirm the dominance of the topic effect, show the searcher effect is almost as often absent as present, and indicate that for several sites the 2-factor interactions are negligible. An analysis of variance found the system effect to be significant, but a multiple comparisons test found no significant pairwise differences.

