Searching the Web
- ACM Transactions on Internet Technology, 2001
Cited by 108 (1 self)
We offer an overview of current Web search engine design. After introducing a generic search engine architecture, we examine each engine component in turn. We cover crawling, local Web page storage, indexing, and the use of link analysis for boosting search performance. The most common design and implementation techniques for each of these components are presented. For this presentation we draw from the literature and from our own experimental search engine testbed. Emphasis is on introducing the fundamental concepts and the results of several performance analyses we conducted to compare different designs.
Building a distributed full-text index for the web
- ACM Trans. Inf. Syst., 2001
Cited by 63 (3 self)
We identify crucial design issues in building a distributed inverted index for a large collection of Web pages. We introduce a novel pipelining technique for structuring the core index-building system that substantially reduces the index construction time. We also propose a storage scheme for creating and managing inverted files using an embedded database system. We suggest and compare different strategies for collecting global statistics from distributed inverted indexes. Finally, we present performance results from experiments on a testbed distributed Web indexing system that we have implemented.
Distributed query processing using partitioned inverted files
- In Proc. of the 9th String Processing and Information Retrieval Symposium (SPIRE), 2001
Cited by 35 (4 self)
In this paper, we study query processing in a distributed text database. The novelty is a real implementation of a distributed architecture that offers concurrent query service. The distributed system adopts a network-of-workstations model and the client-server paradigm. The document collection is indexed with an inverted file. We adopt two distinct strategies of index partitioning in the distributed system, namely local index partitioning and global index partitioning. In both strategies, documents are ranked using the vector space model along with a document filtering technique for fast ranking. We evaluate and compare the impact of the two index partitioning strategies on query processing performance. Experimental results on retrieval efficiency show that, within our framework, global index partitioning outperforms local index partitioning.
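The two partitioning strategies named in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the whitespace tokenizer, and the round-robin assignment of documents and terms to nodes are all assumptions made for illustration, and ranking is left out entirely.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():  # naive tokenizer (assumption)
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def local_partition(docs, num_nodes):
    """Local partitioning: each node indexes only its own slice of documents."""
    slices = [dict() for _ in range(num_nodes)]
    for doc_id, text in docs.items():
        slices[doc_id % num_nodes][doc_id] = text  # round-robin (assumption)
    return [build_inverted_index(s) for s in slices]

def global_partition(docs, num_nodes):
    """Global partitioning: one index over all documents, split by term."""
    full = build_inverted_index(docs)
    nodes = [dict() for _ in range(num_nodes)]
    for i, term in enumerate(sorted(full)):
        nodes[i % num_nodes][term] = full[term]  # round-robin (assumption)
    return nodes

def search_local(nodes, term):
    """Every node is queried; the partial posting lists are combined."""
    hits = []
    for node in nodes:
        hits.extend(node.get(term, []))
    return sorted(hits)

def search_global(nodes, term):
    """Only the single node that owns the term holds its posting list."""
    for node in nodes:
        if term in node:
            return node[term]
    return []
```

In this sketch a single-term lookup touches every node under local partitioning but only one node under global partitioning. That difference is one plausible factor behind the performance gap the abstract reports, although the sketch itself measures nothing.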
Optimizing Result Prefetching in Web Search Engines with Segmented Indices
- In VLDB, 2001
Cited by 26 (1 self)
We study the process in which search engines with segmented indices serve queries. In particular, we investigate the number of result pages which search engines should prepare during the query processing phase. Search engine users have been observed to browse through very few pages of results for the queries they submit. This behavior suggests that prefetching many results upon processing an initial query is not efficient, since most of the prefetched results will not be requested by the user who initiated the search. However, a policy which abandons result prefetching in favor of retrieving just the first page of search results might not make optimal use of system resources either. We argue that for a certain behavior of users, engines should prefetch a constant number of result pages per query. We define a concrete query processing model for search engines with segmented indices, and analyze the cost of such prefetching policies. Based on these costs, we show how to determine the constant which optimizes the prefetching policy. Our results are mostly applicable to local index partitions of the inverted files, but are also applicable to processing of short queries in global index architectures.
University of Glasgow at TREC2004: Experiments in Web, Robust and Terabyte tracks with Terrier
- In Proceedings of TREC 2004, 2004
Cited by 17 (6 self)
With our participation in TREC2004, we test Terrier, a modular and scalable Information Retrieval framework, in three tracks. For the mixed query task of the Web track, we employ a decision mechanism for selecting appropriate retrieval approaches on a per-query basis. For the robust track, in order to cope with the poorly performing queries, we use two pre-retrieval performance predictors and a weighting function recommender mechanism. We also test a new training approach for the automatic tuning of the term frequency normalisation parameters. In the Terabyte track, we employ a distributed version of Terrier and test the effectiveness of techniques such as using the anchor text, pseudo query expansion and selecting different weighting models for each query.
Toward virtual community knowledge evolution
- Journal of Management Information Systems, 2002
Cited by 15 (4 self)
This paper puts forth a vision and a possible architecture for a community knowledge evolution system. We propose augmenting a multimedia document repository (digital library) with innovative knowledge evolution support, including computer-mediated communications, community process support, decision support, advanced hypermedia features, and conceptual knowledge structures. These tools and the techniques developed around them would enable members of a virtual community to learn from, contribute to, and collectively build upon the community's knowledge and improve many member tasks. The resulting Collaborative Knowledge Evolution Support System (CKESS) would provide an enhanced digital library infrastructure serving as an ever-evolving repository of the community's knowledge, which members would actively use in everyday tasks and regularly update.
Load balancing for term-distributed parallel retrieval
- 2006
Cited by 11 (0 self)
Large-scale web and text retrieval systems deal with amounts of data that greatly exceed the capacity of any single machine. To handle the necessary data volumes and query throughput rates, parallel systems are used, in which the document and index data are split across tightly-clustered distributed computing systems. The index data can be distributed either by document or by term. In this paper we examine methods for load balancing in term-distributed parallel architectures, and propose a suite of techniques for reducing net querying costs. In combination, the techniques we describe allow a 30% improvement in query throughput when tested on an eight-node parallel computer system.
Performance analysis of distributed architectures to index one terabyte of text
- Proc. 26th European Conference on IR Research, volume 2997 of Lecture Notes in Computer Science, 2004
Cited by 9 (2 self)
We simulate different architectures of a distributed Information Retrieval system on a very large Web collection, in order to work out the optimal setting for a particular set of resources. We analyse the effectiveness of a distributed, replicated and clustered architecture using a variable number of workstations. A collection of approximately 94 million documents and 1 terabyte of text is used to test the performance of the different architectures. We show that in a purely distributed architecture, the brokers become the bottleneck due to the high number of local answer sets to be sorted. In a replicated system, the network is the bottleneck due to the high number of query servers and the continuous data interchange with the brokers. Finally, we demonstrate that a clustered system will outperform a replicated system if a large number of query servers is used, mainly due to the reduction of the network load.
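The broker step this abstract identifies as a bottleneck, combining the local answer sets returned by the query servers, might look like the following Python sketch. The function name `broker_merge`, the `(score, doc_id)` hit format, and the assumption that scores from different servers are directly comparable are illustrative choices, not details from the paper.

```python
import heapq
from itertools import islice

def broker_merge(local_answer_sets, k):
    """Merge per-server answer sets into a global top-k list.

    Each answer set is a list of (score, doc_id) pairs that the query
    server has already sorted by descending score (assumption).
    """
    merged = heapq.merge(
        *local_answer_sets,
        key=lambda hit: hit[0],  # compare hits by score only
        reverse=True,            # inputs are in descending order
    )
    return list(islice(merged, k))  # stop once k results are collected
```

The merge is lazy, so the broker's work grows with `k` and with the number of query servers rather than with the total size of the answer sets. The per-server term is consistent with the abstract's observation that the broker suffers as the number of local answer sets rises.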
Scalable Distributed Architectures for Information Retrieval
- 1999
Cited by 8 (4 self)
Ph.D. dissertation, May 1999. Zhihong Lu (B.Sc., Tongji University; M.Sc., Institute of Computing Technology, Chinese Academy of Sciences; Ph.D., University of Massachusetts Amherst). Directed by Professor Kathryn S. McKinley.
As information explodes across the Internet and intranets, information retrieval (IR) systems must cope with the challenge of scale. How to provide scalable performance for rapidly increasing data and workloads is critical in the design of next generation information retrieval systems. This dissertation studies scalable distributed IR architectures that not only provide quick response but also maintain acceptable retrieval accuracy. Our distributed architectures exploit parallelism in information retrieval on a cluster of parallel IR servers using symmetric multiprocessors, and use partial collection replication and selection as well as collection selection to restrict the search to a small percentage of data while maintaining ...

