Results 1 - 10
of
44
Query-Based Sampling of Text Databases
- ACM TRANSACTIONS ON INFORMATION SYSTEMS
, 1999
"... ... This paper presents query-based sampling, a new technique for acquiring accurate resource descriptions. Query-based sampling does not require the cooperationof resource providers nor does it require that resource providers use a particular search engine or representation technique. An extensive ..."
Abstract
-
Cited by 134 (13 self)
- Add to MetaCart
... This paper presents query-based sampling, a new technique for acquiring accurate resource descriptions. Query-based sampling does not require the cooperationof resource providers nor does it require that resource providers use a particular search engine or representation technique. An extensive set of experimental results demonstrates that accurate resource descriptions are created, that computation and communication costs are reasonable, and that the resource descriptions do in fact enable accurate automatic database selection.
Distributed Information Retrieval
- In: Advances in Information Retrieval
, 2000
"... A multi-database model of distributed information retrieval is presented, in which people are assumed to have access to many searchable text databases. In such an environment, full-text information retrieval consists of discovering database contents, ranking databases by their expected ability to sa ..."
Abstract
-
Cited by 116 (18 self)
- Add to MetaCart
A multi-database model of distributed information retrieval is presented, in which people are assumed to have access to many searchable text databases. In such an environment, full-text information retrieval consists of discovering database contents, ranking databases by their expected ability to satisfy the query, searching a small number of databases, and merging results returned by different databases. This paper presents algorithms for each task. It also discusses how to reorganize conventional test collections into multi-database testbeds, and evaluation methodologies for multi-database experiments. A broad and diverse group of experimental results is presented to demonstrate that the algorithms are effective, efficient, robust, and scalable. 1. INTRODUCTION Wide area networks, particularly the Internet, have transformed how people interact with information. Much of the routine information access by the general public is now based on full-text information retrieval, as opposed t...
Searching the Web
- ACM TRANSACTIONS ON INTERNET TECHNOLOGY
, 2001
"... We offer an overview of current Web search engine design. After introducing a generic search engine architecture, we examine each engine component in turn. We cover crawling, local Web page storage, indexing, and the use of link analysis for boosting search performance. The most common design and im ..."
Abstract
-
Cited by 108 (1 self)
- Add to MetaCart
We offer an overview of current Web search engine design. After introducing a generic search engine architecture, we examine each engine component in turn. We cover crawling, local Web page storage, indexing, and the use of link analysis for boosting search performance. The most common design and implementation techniques for each of these components are presented. For this presentation we draw from the literature and from our own experimental search engine testbed. Emphasis is on introducing the fundamental concepts and the results of several performance analyses we conducted to compare different designs.
Effective Retrieval with Distributed Collections
, 1998
"... This paper evaluates the retrieval effectiveness of distributed information retrieval systems in realistic environments. We find that when a large number of collections are available, the retrieval effectiveness is significantly worse than that of centralized systems, mainly because typical queries ..."
Abstract
-
Cited by 96 (13 self)
- Add to MetaCart
This paper evaluates the retrieval effectiveness of distributed information retrieval systems in realistic environments. We find that when a large number of collections are available, the retrieval effectiveness is significantly worse than that of centralized systems, mainly because typical queries are not adequate for the purpose of choosing the right collections. We propose two techniques to address the problem. One is to use phrase information in the collection selection index and the other is query expansion. Both techniques enhance the discriminatory power of typical queries for choosing the right collections and hence significantly improve retrieval results. Query expansion, in particular, brings the effectiveness of searching a large set of distributed collections close to that of searching a centralized collection. 1 Introduction In today's network environments, information is highly distributed. The Internet or World Wide Web, for example, contains thousands of collections. ...
Comparing the Performance of Database Selection Algorithms
, 1999
"... We compare the performance of two database selection algorithms reported in the literature. Their performance is compared using a common testbed designed specifically for database selection techniques. The testbed is a decomposition of the TREC/- TIPSTER data into 236 subcollections. We present resu ..."
Abstract
-
Cited by 89 (23 self)
- Add to MetaCart
We compare the performance of two database selection algorithms reported in the literature. Their performance is compared using a common testbed designed specifically for database selection techniques. The testbed is a decomposition of the TREC/- TIPSTER data into 236 subcollections. We present results of a recent investigation of the performance of the CORI algorithm and compare the performance with earlier work that examined the performance of gGlOSS. The databases from our testbed were ranked using both the gGlOSS and CORI techniques and compared to the RBR baseline, a baseline derived from TREC relevance judgements. We examined the degree to which CORI and gGlOSS approximate this baseline. Our results confirm our earlier observation that the gGlOSS Ideal(l) ranks do not estimate relevance- This work supported in part by DARPA contract N6600197 -C-8542 and NASA GSRP NGT5-50062. y This work supported in part by NSF, the Library of Congress, and the Department of Commerce under agre...
Server Ranking for Distributed Text Retrieval Systems on the Internet
- In Proceedings of the Fifth International Conference on Database Systems for Advanced Applications
, 1997
"... Keyword-based search services have become necessary tools for finding information resources on the Internet today. The rapid growth of information on the Internet renders centralized keyword index services incapable of collecting comprehensive resource meta-data in a timely manner. We argue that del ..."
Abstract
-
Cited by 68 (4 self)
- Add to MetaCart
Keyword-based search services have become necessary tools for finding information resources on the Internet today. The rapid growth of information on the Internet renders centralized keyword index services incapable of collecting comprehensive resource meta-data in a timely manner. We argue that delegating the task of meta-data collection to local index servers is a more scalable approach. We propose a mechanism for integrating distributed autonomous index servers into a cooperative resource discovery system. Focusing on the retrieval effectiveness of the system, we propose a set of methods, called CVV-based methods, for ranking and selecting index servers with respect to a query, and merging the results returned by the index servers. Through experiments, we evaluate the effectiveness of the CVV-based methods, and compare our server ranking method with methods proposed by other researchers. Keywords information retrieval, internet data-- bases. 1 Introduction With the rapid growth ...
Evaluating Database Selection Techniques: A Testbed and Experiment
, 1998
"... We describe a testbed for database selection techniques and an experiment conducted using this testbed. The testbed is a decomposition of the TREC/TIPSTER data that allows analysis of the data along multiple dimensions, including collection-based and temporal-based analysis. We characterize the subc ..."
Abstract
-
Cited by 64 (14 self)
- Add to MetaCart
We describe a testbed for database selection techniques and an experiment conducted using this testbed. The testbed is a decomposition of the TREC/TIPSTER data that allows analysis of the data along multiple dimensions, including collection-based and temporal-based analysis. We characterize the subcollections in this testbed in terms of number of documents, queries against which the documents have been evaluated for relevance, and distribution of relevant documents. We then present initial results from a study conducted using this testbed that examines the effectiveness of the gGlOSS approach to database selection. The databases from our testbed were ranked using the gGlOSS techniques and compared to the gGlOSS Ideal(l) baseline and a baseline derived from TREC relevance judgements. We have examined the degree to which several gGlOSS estimate functions approximate these baselines. Our initial results confirm that the gGlOSS estimators are excellent predictors of the Ideal(l) ranks but...
Building a distributed full-text index for the web
- ACM Trans. Inf. Syst
, 2001
"... We identify crucial design issues in building a distributed inverted index for a large collection of Web pages. We introduce a novel pipelining technique for structuring the core index-building system that substantially reduces the index construction time. We also propose a storage scheme for creati ..."
Abstract
-
Cited by 63 (3 self)
- Add to MetaCart
We identify crucial design issues in building a distributed inverted index for a large collection of Web pages. We introduce a novel pipelining technique for structuring the core index-building system that substantially reduces the index construction time. We also propose a storage scheme for creating and managing inverted files using an embedded database system. We suggest and compare different strategies for collecting global statistics from distributed inverted indexes. Finally, we present performance results from experiments on a testbed distributed Web indexing system that we have implemented.
SETS: Search Enhanced by Topic Segmentation
, 2003
"... We present SETS, an architecture for building topic-segmented networks for efficient search. The key idea is to arrange participants in a topic-segmented topology where most of the links are short-distance links joining pairs of sites with similar content. The resulting topically focused regions are ..."
Abstract
-
Cited by 61 (4 self)
- Add to MetaCart
We present SETS, an architecture for building topic-segmented networks for efficient search. The key idea is to arrange participants in a topic-segmented topology where most of the links are short-distance links joining pairs of sites with similar content. The resulting topically focused regions are joined together into a single network by long-distance links. Queries are then matched and routed to only the topically closest regions. We draw on ideas from machine learning and social network theory to build an efficient search network. We discuss a variety of design issues and tradeoffs that an implementor of SETS would face. We show that SETS is ecient in network traffic and query processing load.
The Impact of Database Selection on Distributed Searching
- SIGIR
, 2000
"... The proliferation of online information resources increases the importance of effective and efficient distributed searching. Distributed searching is cast in three parts -- database selection, query processing, and results merging. In this paper we examine the effect of database selection on retriev ..."
Abstract
-
Cited by 53 (12 self)
- Add to MetaCart
The proliferation of online information resources increases the importance of effective and efficient distributed searching. Distributed searching is cast in three parts -- database selection, query processing, and results merging. In this paper we examine the effect of database selection on retrieval performance. We look at retrieval performance in three different distributed retrieval testbeds and distill some general results. First we find that good database selection can result in better retrieval effectiveness than can be achieved in a centralized database. Second we find that good performance can be achieved when only a few sites are selected and that the performance generally increases as more sites are selected. Finally we find that when database selection is employed, it is not necessary to maintain collection wide information (CWI), e.g. global idf. Local information can be used to achieve superior performance. This means that distributed systems can be engineered with more autonomy and less cooperation. This work suggests that improvements in database selection can lead to broader improvements in retrieval performance, even in centralized (i.e. single database) systems. Given a centralized database and a good selection mechanism, retrieval performance can be improved by decomposing that database conceptually and employing a selection step.

