Results 1 - 10
of
48
Searching the Web
- ACM TRANSACTIONS ON INTERNET TECHNOLOGY
, 2001
"... We offer an overview of current Web search engine design. After introducing a generic search engine architecture, we examine each engine component in turn. We cover crawling, local Web page storage, indexing, and the use of link analysis for boosting search performance. The most common design and im ..."
Abstract
-
Cited by 108 (1 self)
- Add to MetaCart
We offer an overview of current Web search engine design. After introducing a generic search engine architecture, we examine each engine component in turn. We cover crawling, local Web page storage, indexing, and the use of link analysis for boosting search performance. The most common design and implementation techniques for each of these components are presented. For this presentation we draw from the literature and from our own experimental search engine testbed. Emphasis is on introducing the fundamental concepts and the results of several performance analyses we conducted to compare different designs.
A Local Search Mechanism for Peer-to-Peer Networks
, 2002
"... One important problem in peer-to-peer (P2P) networks is searching and retrieving the correct information. However, existing searching mechanisms in pure peer-to-peer networks are inefficient due to the decentralized nature of such networks. We propose two mechanisms for information retrieval in pure ..."
Abstract
-
Cited by 96 (6 self)
- Add to MetaCart
One important problem in peer-to-peer (P2P) networks is searching and retrieving the correct information. However, existing searching mechanisms in pure peer-to-peer networks are inefficient due to the decentralized nature of such networks. We propose two mechanisms for information retrieval in pure peer-to-peer networks. The first, the modified Breadth-First-Search (BFS) mechanism, is an extension of the current Gnuttela protocol, allows searching with keywords, and is designed to minimize the number of messages that are needed to search the network. The second, the Intelligent Search mechanism, uses the past behavior of the P2P network to further improve the scalability of the search procedure. In this algorithm, each peer autonomously decides which of its peers are most likely to answer a given query. The algorithm is entirely distributed, and therefore scales well with the size of the network. We implemented our mechanisms as middleware platforms. To show the advantages of our mechanisms we present experimental results using the middleware implementation.
ODISSEA: A Peer-to-Peer Architecture for Scalable Web Search and Information Retrieval
- In WebDB
, 2003
"... this paper appears in [15], and updated information is available at http://cis.poly.edu/westlab/odissea/ ..."
Abstract
-
Cited by 86 (3 self)
- Add to MetaCart
this paper appears in [15], and updated information is available at http://cis.poly.edu/westlab/odissea/
Optimized Index Structures for Querying RDF from the Web
, 2005
"... Storing and querying Resource Description Framework (RDF) data is one of the basic tasks within any Semantic Web application. A number of storage systems provide assistance for this task. However, current RDF database systems do not use optimized indexes, which results in a poor performance behavior ..."
Abstract
-
Cited by 52 (16 self)
- Add to MetaCart
Storing and querying Resource Description Framework (RDF) data is one of the basic tasks within any Semantic Web application. A number of storage systems provide assistance for this task. However, current RDF database systems do not use optimized indexes, which results in a poor performance behavior for querying RDF. In this paper we describe optimized index structures for RDF, show how to process and evaluate queries based on the index structure, describe a lightweight adaptable implementation in Java, and provide a performance comparison with existing RDF databases.
Improving Collection Selection with Overlap Awareness in P2P Search Engines
- In SIGIR
, 2005
"... Collection selection has been a research issue for years. Typically, in related work, precomputed statistics are employed in order to estimate the expected result quality of each collection, and subsequently the collections are ranked accordingly. Our thesis is that this simple approach is insuffici ..."
Abstract
-
Cited by 45 (19 self)
- Add to MetaCart
Collection selection has been a research issue for years. Typically, in related work, precomputed statistics are employed in order to estimate the expected result quality of each collection, and subsequently the collections are ranked accordingly. Our thesis is that this simple approach is insufficient for several applications in which the collections typically overlap. This is the case, for example, for the collections built by autonomous peers crawling the web. We argue for the extension of existing quality measures using estimators of mutual overlap among collections and present experiments in which this combination outperforms CORI, a popular approach based on quality estimation. We outline our prototype implementation of a P2P web search engine, coined MINERVA 1, that allows handling large amounts of data in a distributed and self-organizing manner. We conduct experiments which show that taking overlap into account during collection selection can drastically decrease the number of collections that have to be contacted in order to reach a satisfactory level of recall, which is a great step toward the feasibility of distributed web search.
Optimized Query Execution in Large Search Engines with Global Page Ordering
, 2003
"... Large web search engines have to answer thousands of queries per second with interactive response times. A major factor in the cost of executing a query is given by the lengths of the inverted lists for the query terms, which increase with the size of the document collection and are often in the ran ..."
Abstract
-
Cited by 45 (7 self)
- Add to MetaCart
Large web search engines have to answer thousands of queries per second with interactive response times. A major factor in the cost of executing a query is given by the lengths of the inverted lists for the query terms, which increase with the size of the document collection and are often in the range of many megabytes. To address this issue, IR and database researchers have proposed pruning techniques that compute or approximate term-based ranking functions without scanning over the full inverted lists.
Three-level caching for efficient query processing in large web search engines
- In Proc. of the 14th Int. World Wide Web Conference
, 2005
"... Large web search engines have to answer thousands of queries per second with interactive response times. Due to the sizes of the data sets involved, often in the range of multiple terabytes, a single query may require the processing of hundreds of megabytes or more of index data. To keep up with thi ..."
Abstract
-
Cited by 32 (5 self)
- Add to MetaCart
Large web search engines have to answer thousands of queries per second with interactive response times. Due to the sizes of the data sets involved, often in the range of multiple terabytes, a single query may require the processing of hundreds of megabytes or more of index data. To keep up with this immense workload, large search engines employ clusters of hundreds or thousands of machines, and a number of techniques such as caching, index compression, and index and query pruning are used to improve scalability. In particular, two-level caching techniques cache results of repeated identical queries at the frontend, while index data for frequently used query terms are cached in each node at a lower level. We propose and evaluate a three-level caching scheme that adds an intermediate level of caching for additional performance gains. This intermediate level attempts to exploit frequently occurring pairs of terms by caching intersections or projections of the corresponding inverted lists. We propose and study several offline and online algorithms for the resulting weighted caching problem, which turns out to be surprisingly rich in structure. Our experimental evaluation based on a large web crawl and real search engine query log shows significant performance gains for the best schemes, both in isolation and in combination with the other caching levels. We also observe that a careful selection of cache admission and eviction policies is crucial for best overall performance.
Efficient Single-Pass Index Construction for Text Databases
- Jour. of the American Society for Information Science and Technology
, 2003
"... Efficient construction of inverted indexes is essential to provision of search over large collections of text data. In this paper, we review the principal approaches to inversion, analyse their theoretical cost, and present experimental results. We identify the drawbacks of existing inversion approa ..."
Abstract
-
Cited by 31 (2 self)
- Add to MetaCart
Efficient construction of inverted indexes is essential to provision of search over large collections of text data. In this paper, we review the principal approaches to inversion, analyse their theoretical cost, and present experimental results. We identify the drawbacks of existing inversion approaches and propose a single-pass inversion method that, in contrast to previous approaches, does not require the complete vocabulary of the indexed collection in main memory, can operate within limited resources, and does not sacrifice speed with high temporary storage requirements. We show that the performance of the single-pass approach can be improved by constructing inverted files in segments, reducing the cost of disk accesses during inversion of large volumes of data.
Multicrawler: A pipelined architecture for crawling and indexing semantic web data
- In 5th International Semantic Web Conference
, 2006
"... Abstract. The goal of the work presented in this paper is to obtain large amounts of semistructured data from the web. Harvesting semistructured data is a prerequisite to enabling large-scale query answering over web sources. We contrast our approach to conventional web crawlers, and describe and ev ..."
Abstract
-
Cited by 29 (11 self)
- Add to MetaCart
Abstract. The goal of the work presented in this paper is to obtain large amounts of semistructured data from the web. Harvesting semistructured data is a prerequisite to enabling large-scale query answering over web sources. We contrast our approach to conventional web crawlers, and describe and evaluate a five-step pipelined architecture to crawl and index data from both the traditional and the Semantic Web. 1
Optimizing Result Prefetching in Web Search Engines with Segmented Indices
- In VLDB
, 2001
"... We study the process in which search engines with segmented indices serve queries. In particular, we investigate the number of result pages which search engines should prepare during the query processing phase. Search engine users have been observed to browse through very few pages of results for qu ..."
Abstract
-
Cited by 26 (1 self)
- Add to MetaCart
We study the process in which search engines with segmented indices serve queries. In particular, we investigate the number of result pages which search engines should prepare during the query processing phase. Search engine users have been observed to browse through very few pages of results for queries which they submit. This behavior of users suggests that prefetching many results upon processing an initial query is not efficient, since most of the prefetched results will not be requested by the user who initiated the search. However, a policy which abandons result prefetching in favor of retrieving just the first page of search results might not make optimal use of system resources as well. We argue that for a certain behavior of users, engines should prefetch a constant number of result pages per query. We define a concrete query processing model for search engines with segmented indices, and analyze the cost of such prefetching policies. Based on these costs, we show how to determine the constant which optimizes the prefetching policy. Our results are mostly applicable to local index partitions of the inverted files, but are also applicable to processing of short queries in global index architectures.

