Results 1 - 10 of 67
Modern information retrieval: a brief overview
BULLETIN OF THE IEEE COMPUTER SOCIETY TECHNICAL COMMITTEE ON DATA ENGINEERING, 2001
Cited by 101 (0 self)
For thousands of years people have realized the importance of archiving and finding information. With the advent of computers, it became possible to store large amounts of information, and finding useful information in such collections became a necessity. The field of Information Retrieval (IR) was born in the 1950s out of this necessity. Over the last forty years, the field has matured considerably. Several IR systems are used on an everyday basis by a wide variety of users. This article is a brief overview of the key advances in the field of Information Retrieval, and a description of the current state of the art in the field.
COMBINING APPROACHES TO INFORMATION RETRIEVAL
Cited by 76 (1 self)
The combination of different text representations and search strategies has become a standard technique for improving the effectiveness of information retrieval. Combination, for example, has been studied extensively in the TREC evaluations and is the basis of the “meta-search” engines used on the Web. This paper examines the development of this technique, including both experimental results and the retrieval models that have been proposed as formal frameworks for combination. We show that combining approaches for information retrieval can be modeled as combining the outputs of multiple classifiers based on one or more representations, and that this simple model can provide explanations for many of the experimental results. We also show that this view of combination is very similar to the inference net model, and that a new approach to retrieval based on language models supports combination and can be integrated with the inference net model.
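As a rough illustration of what combining the outputs of several retrieval systems can look like in practice, the sketch below fuses ranked runs with CombSUM and CombMNZ, two widely used score-fusion rules. This is one common instance of the general idea and is not claimed to be the model developed in the paper; the function names, the min-max normalization step, and the example scores are all assumptions made for illustration.

```python
from collections import defaultdict


def min_max_normalize(scores):
    """Rescale one system's scores to [0, 1] so that runs are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: treat every retrieved document as equally good.
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def comb_sum(runs):
    """CombSUM: add up the normalized scores each document received."""
    fused = defaultdict(float)
    for run in runs:
        for doc, score in min_max_normalize(run).items():
            fused[doc] += score
    return dict(fused)


def comb_mnz(runs):
    """CombMNZ: CombSUM multiplied by the number of runs that retrieved the document."""
    sums = comb_sum(runs)
    hits = defaultdict(int)
    for run in runs:
        for doc in run:
            hits[doc] += 1
    return {doc: total * hits[doc] for doc, total in sums.items()}


def rank(fused):
    """Order documents by descending fused score, breaking ties by document id."""
    return [doc for doc, _ in sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))]


# Two hypothetical systems scoring overlapping sets of documents.
runs = [
    {"d1": 3.0, "d2": 1.0, "d3": 2.0},
    {"d2": 10.0, "d3": 30.0},
]
print(rank(comb_sum(runs)))
print(rank(comb_mnz(runs)))
```

Normalizing before summing matters because raw scores from different systems are rarely on the same scale; other normalization choices are possible and may behave differently.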
Engineering a multi-purpose test collection for Web retrieval experiments
2001
Cited by 73 (3 self)
Past research into text retrieval methods for the Web has been restricted by the lack of a test collection capable of supporting experiments which are both realistic and reproducible. The 1.69 million document WT10g collection is proposed as a multi-purpose testbed for experiments with these attributes, in distributed IR, hyperlink algorithms and conventional ad hoc retrieval. WT10g was constructed by selecting from a superset of documents in such a way that desirable corpus properties were preserved or optimised. These properties include: a high degree of inter-server connectivity, integrity of server holdings, inclusion of documents related to a very wide spread of likely queries, and a realistic distribution of server holding sizes. We confirm that WT10g contains exploitable link information using a site (homepage) finding experiment. Our results show that, on this task, Okapi BM25 works better on propagated link anchor text than on full text.
Keywords: Web retrieval, Link-based ranking, Distributed information retrieval, Test collections
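For readers unfamiliar with the ranking function mentioned in this abstract, the following is a minimal sketch of BM25 scoring over tokenized text. The same function could be applied to either representation of a page (its full text, or the anchor text of links pointing to it). The parameter values `k1=1.2` and `b=0.75`, the particular non-negative IDF variant, and the toy documents are assumptions for illustration and are not taken from the paper.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against `query_terms` with BM25.

    `docs` is a non-empty list of token lists containing at least one token.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs

    # Document frequency: number of documents containing each term.
    df = Counter()
    for d in docs:
        df.update(set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue
            # One common IDF variant that never goes negative (an assumption here).
            idf = math.log(1 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
            # Term-frequency saturation with document-length normalization.
            denom = tf[t] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[t] * (k1 + 1) / denom
        scores.append(score)
    return scores


# Hypothetical anchor-text "documents" for three pages.
docs = [
    ["acme", "home", "page"],
    ["acme", "products", "list", "acme", "prices"],
    ["unrelated", "text"],
]
print(bm25_scores(["acme", "home"], docs))
```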
QProber: A system for automatic classification of hidden-web databases
ACM TOIS, 2003
Cited by 53 (11 self)
The contents of many valuable web-accessible databases are only available through search interfaces and are hence invisible to traditional web “crawlers.” Recently, commercial web sites have started to manually organize web-accessible databases into Yahoo!-like hierarchical classification schemes. Here, we introduce QProber, a modular system that automates this classification process by using a small number of query probes, generated by document classifiers. QProber can use a variety of types of classifiers to generate the probes. To classify a database, QProber does not retrieve or inspect any documents or pages from the database, but rather just exploits the number of matches that each query probe generates at the database in question. We have conducted an extensive experimental evaluation of QProber over collections of real documents, experimenting with different types of document classifiers and retrieval models. We have also tested our system with over one hundred web-accessible databases. Our experiments show that our system has low overhead and achieves high classification accuracy across a variety of databases.
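The core idea of classifying a database from match counts alone can be sketched in a few lines. The code below is a simplified illustration and not QProber itself: the function name, the thresholds `min_share` and `min_matches`, the probe lists, and the stand-in `count_matches` callable are all hypothetical, and the way the published system derives probes from classifiers is not reproduced here.

```python
def classify_by_probe_counts(count_matches, probes_by_category,
                             min_share=0.4, min_matches=10):
    """Assign categories to a database using only probe match counts.

    `count_matches(query)` is assumed to return the number of matches the
    database's search interface reports for `query`; no documents are fetched.
    """
    totals = {
        category: sum(count_matches(q) for q in probes)
        for category, probes in probes_by_category.items()
    }
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    # Keep categories that account for enough matches in absolute
    # terms and as a share of all probe matches.
    return sorted(
        category
        for category, n in totals.items()
        if n >= min_matches and n / grand_total >= min_share
    )


# A stand-in for a real search interface: fixed match counts per query.
fake_counts = {"cancer": 120, "diabetes": 80, "soccer": 3, "nba": 2}
probes = {"Health": ["cancer", "diabetes"], "Sports": ["soccer", "nba"]}

print(classify_by_probe_counts(lambda q: fake_counts.get(q, 0), probes))
```

In a real setting `count_matches` would issue the query against the database's search form and parse the reported number of hits, which is the only signal this approach relies on.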
Measuring Search Engine Quality
2001
Cited by 47 (8 self)
The effectiveness of twenty public search engines is evaluated using TREC-inspired methods and a set of 54 queries taken from real Web search logs. The World Wide Web is taken as the test collection and a combination of crawler and text retrieval system is evaluated. The engines are compared on a range of measures derivable from binary relevance judgments of the first seven live results returned. Statistical testing reveals a significant difference between engines and high inter-correlations between measures. Surprisingly, given the dynamic nature of the Web and the time elapsed, there is also a high correlation between results of this study and a previous study by Gordon and Pathak. For nearly all engines, there is a gradual decline in precision at increasing cutoff after some initial fluctuation. Performance of the engines as a group is found to be inferior to the group of participants in the TREC-8 Large Web task, although the best engines approach the median of those systems. Shortcomings of current Web search evaluation methodology are identified and recommendations are made for future improvements. In particular, the present study and its predecessors deal with queries which are assumed to derive from a need to find a selection of documents relevant to a topic. By contrast, real Web search reflects a range of other information need types which require different judging and different measures. The authors wish to acknowledge that this work was carried out partly within the Cooperative Research Centre for Advanced Computational Systems established under the Australian Government's Cooperative Research Centres Program.
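One of the simplest measures derivable from binary judgments of the first few results is precision at a cutoff. The sketch below computes it for a single query and averages it over queries; the function names and the example judgments are assumptions for illustration, and the study's full set of measures is not reproduced.

```python
def precision_at_k(judgments, k):
    """Fraction of the top-k results judged relevant.

    `judgments` is a list of 0/1 relevance labels in rank order. Results
    missing because fewer than k were returned count as non-relevant.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(judgments[:k]) / k


def mean_precision_at_k(judgments_per_query, k):
    """Average precision-at-k over a set of queries."""
    values = [precision_at_k(j, k) for j in judgments_per_query]
    return sum(values) / len(values)


# Hypothetical judgments for the first seven results of one query.
judged = [1, 0, 1, 1, 0, 0, 1]
for cutoff in (1, 3, 5, 7):
    print(cutoff, precision_at_k(judged, cutoff))
```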
The Philosophy of Information Retrieval Evaluation
In Proceedings of the Second Workshop of the Cross-Language Evaluation Forum on Evaluation of Cross-Language Information Retrieval Systems
Cited by 43 (1 self)
Evaluation conferences such as TREC, CLEF, and NTCIR are modern examples of the Cranfield evaluation paradigm. In the Cranfield paradigm, researchers perform experiments on test collections to compare the relative effectiveness of different retrieval approaches. The test collections allow the researchers to control the effects of different system parameters, increasing the power and decreasing the cost of retrieval experiments as compared to user-based evaluations. This paper reviews the fundamental assumptions and appropriate uses of the Cranfield paradigm, especially as they apply in the context of the evaluation conferences.
Document Expansion for Speech Retrieval
1999
Cited by 42 (1 self)
Advances in automatic speech recognition allow us to search large speech collections using traditional information retrieval methods. The problem of "aboutness" for documents --- is a document about a certain concept --- has been at the core of document indexing for the entire history of IR. This problem is more difficult for speech indexing since automatic speech transcriptions often contain mistakes. In this study we show that document expansion can be successfully used to alleviate the effect of transcription mistakes on speech retrieval. The loss
Probe, Count, and Classify: Categorizing Hidden-Web Databases
2001
Cited by 41 (4 self)
The contents of many valuable web-accessible databases are only accessible through search interfaces and are hence invisible to traditional web "crawlers." Recent studies have estimated the size of this "hidden web" to be 500 billion pages, while the size of the "crawlable" web is only an estimated two billion pages. Recently, commercial web sites have started to manually organize web-accessible databases into Yahoo!-like hierarchical classification schemes. In this paper, we introduce a method for automating this classification process by using a small number of query probes. To classify a database, our algorithm does not retrieve or inspect any documents or pages from the database, but rather just exploits the number of matches that each query probe generates at the database in question. We have conducted an extensive experimental evaluation of our technique over collections of real documents, including over one hundred web-accessible databases. Our experiments show that our system has low overhead and achieves high classification accuracy across a variety of databases.
The CLEF Cross Language Image Retrieval Track (ImageCLEF) 2004
MULTILINGUAL INFORMATION ACCESS FOR TEXT, SPEECH AND IMAGES: RESULT OF THE FIFTH CLEF EVALUATION CAMPAIGN, LECTURE NOTES IN COMPUTER SCIENCE, 2005
Cited by 40 (15 self)
In this paper we describe ImageCLEF, the cross language image retrieval track of the Cross Language Evaluation Forum (CLEF). We instigated and ran a pilot experiment in 2003 where participants submitted entries for an ad hoc bilingual image retrieval task on a collection of historic photographs from St. Andrews University Library. This was designed to simulate the situation in which users would express their search request in natural language but require visual documents in return. For 2004 we have extended the tasks to include a medical image retrieval task and a user-centred evaluation.
Relevance: A review of the literature and a framework for thinking on the notion in information science
Eds.), Advances in Librarianship 6, 1976
Cited by 31 (1 self)
Relevance is a, if not even the, key notion in information science in general and information retrieval in particular. This two-part critical review traces and synthesizes the scholarship on relevance over the past 30 years or so and provides an updated framework within which the still widely dissonant ideas and works about relevance might be interpreted and related. It is a continuation and update of a similar review that appeared in 1975 under the same title, considered here as being Part I. The present review is organized in two parts: Part II addresses the questions related to nature and manifestations of relevance, and Part III addresses questions related to relevance behavior and effects. In Part II, the nature of relevance is discussed in terms of meaning ascribed to relevance, theories used or proposed, and models that have been developed. The

