Three-level caching for efficient query processing in large web search engines
- In Proc. of the 14th Int. World Wide Web Conference
, 2005
Cited by 32 (5 self)
Large web search engines have to answer thousands of queries per second with interactive response times. Due to the sizes of the data sets involved, often in the range of multiple terabytes, a single query may require the processing of hundreds of megabytes or more of index data. To keep up with this immense workload, large search engines employ clusters of hundreds or thousands of machines, and a number of techniques such as caching, index compression, and index and query pruning are used to improve scalability. In particular, two-level caching techniques cache results of repeated identical queries at the frontend, while index data for frequently used query terms are cached in each node at a lower level. We propose and evaluate a three-level caching scheme that adds an intermediate level of caching for additional performance gains. This intermediate level attempts to exploit frequently occurring pairs of terms by caching intersections or projections of the corresponding inverted lists. We propose and study several offline and online algorithms for the resulting weighted caching problem, which turns out to be surprisingly rich in structure. Our experimental evaluation based on a large web crawl and real search engine query log shows significant performance gains for the best schemes, both in isolation and in combination with the other caching levels. We also observe that a careful selection of cache admission and eviction policies is crucial for best overall performance.
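The three cache levels described in this abstract can be pictured with a minimal sketch. Everything here is an assumption made for illustration: the class names `LRU` and `ThreeLevelSearch` are hypothetical, plain LRU eviction stands in for the weighted admission and eviction policies the paper studies, and a Python dict plays the role of the on-disk index.

```python
from collections import OrderedDict


class LRU:
    """Tiny LRU cache; a stand-in for the paper's weighted caching policies."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)          # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)   # evict least recently used


class ThreeLevelSearch:
    def __init__(self, index, capacity=100):
        self.index = index              # term -> sorted doc IDs (the "disk")
        self.results = LRU(capacity)    # level 1: results of whole queries
        self.pairs = LRU(capacity)      # level 2: intersections of term pairs
        self.lists = LRU(capacity)      # level 3: inverted lists

    def _list(self, term):
        postings = self.lists.get(term)
        if postings is None:
            postings = self.index.get(term, [])
            self.lists.put(term, postings)
        return postings

    def _pair(self, a, b):
        docs = self.pairs.get((a, b))
        if docs is None:
            docs = sorted(set(self._list(a)) & set(self._list(b)))
            self.pairs.put((a, b), docs)
        return docs

    def query(self, terms):
        """Conjunctive query: documents containing every term."""
        terms = tuple(sorted(set(terms)))
        if not terms:
            return []
        cached = self.results.get(terms)
        if cached is not None:
            return cached
        if len(terms) == 1:
            docs = list(self._list(terms[0]))
        else:
            found = set(self._pair(terms[0], terms[1]))
            for term in terms[2:]:
                found &= set(self._list(term))
            docs = sorted(found)
        self.results.put(terms, docs)
        return docs
```

A repeated query is answered from `results`; a new query that shares a term pair with an earlier one reuses the cached intersection in `pairs` instead of re-reading both inverted lists.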
Efficient Single-Pass Index Construction for Text Databases
- Jour. of the American Society for Information Science and Technology
, 2003
Cited by 31 (2 self)
Efficient construction of inverted indexes is essential to provision of search over large collections of text data. In this paper, we review the principal approaches to inversion, analyse their theoretical cost, and present experimental results. We identify the drawbacks of existing inversion approaches and propose a single-pass inversion method that, in contrast to previous approaches, does not require the complete vocabulary of the indexed collection in main memory, can operate within limited resources, and does not sacrifice speed with high temporary storage requirements. We show that the performance of the single-pass approach can be improved by constructing inverted files in segments, reducing the cost of disk accesses during inversion of large volumes of data.
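As a rough illustration of building an index in one pass under a memory bound, the sketch below flushes a term-sorted run whenever an in-memory buffer fills up and merges the runs at the end. It is a simplification under stated assumptions, not the paper's method: `build_runs`, `merge_runs`, and the `max_postings` budget are hypothetical names, runs are kept in Python lists rather than written to disk, and tokenization is a plain whitespace split.

```python
import heapq
from collections import defaultdict


def build_runs(docs, max_postings):
    """Scan (doc_id, text) pairs once, in increasing doc_id order, and emit a
    term-sorted run each time the buffer holds at least max_postings postings."""
    runs, buffer, count = [], defaultdict(list), 0
    for doc_id, text in docs:
        for term in set(text.lower().split()):
            buffer[term].append(doc_id)
            count += 1
        if count >= max_postings:
            runs.append(sorted(buffer.items()))   # stand-in for a disk write
            buffer, count = defaultdict(list), 0
    if buffer:
        runs.append(sorted(buffer.items()))
    return runs


def merge_runs(runs):
    """Merge term-sorted runs into a single inverted index."""
    index = {}
    # heapq.merge is stable across inputs, so postings from earlier runs
    # (lower doc IDs) are appended before postings from later runs.
    for term, postings in heapq.merge(*runs, key=lambda item: item[0]):
        index.setdefault(term, []).extend(postings)
    return index
```

Each run carries only the vocabulary seen since the previous flush, so the full vocabulary of the collection never has to be held in memory at once.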
In-Place versus Re-Build versus Re-Merge: Index Maintenance Strategies for . . .
- In Proceedings of the 27th Conference on Australasian Computer Science
, 2004
Cited by 27 (2 self)
Indexes are the key technology underpinning efficient text search. A range of algorithms have been developed for fast query evaluation and for index creation, but update algorithms for high-performance indexes have not been evaluated or even fully described. In this paper, we explore the three main alternative strategies for index update: in-place update, index merging, and complete re-build. Our experiments with large volumes of web data show that re-merge is the fastest approach for large numbers of updates, but in-place update is suitable when the rate of update is low or buffer size is limited.
Efficient query processing in geographic web search engines
- In SIGMOD
, 2006
Cited by 25 (3 self)
Geographic web search engines allow users to constrain and order search results in an intuitive manner by focusing a query on a particular geographic region. Geographic search technology, also called local search, has recently received significant interest from major search engine companies. Academic research in this area has focused primarily on techniques for extracting geographic knowledge from the web. In this paper, we study the problem of efficient query processing in scalable geographic search engines. Query processing is a major bottleneck in standard web search engines, and the main reason for the thousands of machines used by the major engines. Geographic search engine query processing is different in that it requires a combination of text and spatial data processing techniques. We propose several algorithms for efficient query processing in geographic search engines, integrate them into an existing web search query processor, and evaluate them on large sets of real data and query traces.
Performance of Compressed Inverted List Caching in Search Engines
- In WWW
, 2008
Cited by 25 (11 self)
Due to the rapid growth in the size of the web, web search engines are facing enormous performance challenges. The larger engines in particular have to be able to process tens of thousands of queries per second on tens of billions of documents, making query throughput a critical issue. To satisfy this heavy workload, search engines use a variety of performance optimizations including index compression, caching, and early termination. We focus on two techniques, inverted index compression and index caching, which play a crucial role in web search engines as well as other high-performance information retrieval systems. We perform a comparison and evaluation of several inverted list compression algorithms, including new variants of existing algorithms that have not been studied before. We then evaluate different inverted list caching policies on large query traces, and finally study the possible performance benefits of combining compression and caching. The overall goal of this paper is to provide an updated discussion and evaluation of these two techniques, and to show how to select the best set of approaches and settings depending on parameters such as disk speed and main memory cache size.
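To make "inverted list compression" concrete, here is a minimal sketch of one widely used scheme, variable-byte coding of document-ID gaps. The abstract does not say which algorithms or byte layouts the paper evaluates, so treat the choice of scheme and the function names `vbyte_encode` and `vbyte_decode` as assumptions for illustration only.

```python
def vbyte_encode(doc_ids):
    """Encode a strictly increasing list of doc IDs as variable-byte gaps."""
    out, prev = bytearray(), 0
    for doc_id in doc_ids:
        gap = doc_id - prev
        prev = doc_id
        while gap >= 128:
            out.append(gap & 0x7F)   # low 7 bits; more bytes follow
            gap >>= 7
        out.append(gap | 0x80)       # high bit marks the final byte of a gap
    return bytes(out)


def vbyte_decode(data):
    """Invert vbyte_encode, returning the original doc IDs."""
    doc_ids, value, shift, prev = [], 0, 0, 0
    for byte in data:
        if byte & 0x80:                          # final byte of this gap
            value |= (byte & 0x7F) << shift
            prev += value                        # undo the gap encoding
            doc_ids.append(prev)
            value, shift = 0, 0
        else:
            value |= byte << shift
            shift += 7
    return doc_ids
```

Small gaps, which dominate the lists of frequent terms, fit in a single byte, so a cache of fixed size can hold more compressed lists than uncompressed ones, at the cost of decoding on each access.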
A Document-Centric Approach to Static Index Pruning in Text Retrieval Systems
, 2006
Cited by 21 (1 self)
We present a static index pruning method, to be used in ad-hoc document retrieval tasks, that follows a document-centric approach to decide whether a posting for a given term should remain in the index or not. The decision is made based on the term's contribution to the document's Kullback-Leibler divergence from the text collection's global language model. Our technique can be used to decrease the size of the index by over 90%, at only a minor decrease in retrieval effectiveness. It thus allows us to make the index small enough to fit entirely into the main memory of a single PC, even for large text collections containing millions of documents. This results in great efficiency gains, superior to those of earlier pruning methods, and an average response time around 20 ms on the GOV2 document collection.
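The pruning criterion can be sketched as follows. A term's contribution to the divergence between the document model and the collection model is taken here as p(t|d) · log(p(t|d) / p(t|C)) with unsmoothed maximum-likelihood estimates, and each document keeps postings only for its `keep` highest-scoring terms. The estimates, the fixed per-document cutoff, and the name `prune_postings` are assumptions for illustration; the abstract does not give the paper's exact rule.

```python
import math
from collections import Counter


def prune_postings(docs, keep):
    """docs: {doc_id: list of tokens}. Returns {term: [doc_ids]} containing,
    for each document, only its `keep` terms with the largest KL contribution."""
    collection = Counter()
    for tokens in docs.values():
        collection.update(tokens)
    total = sum(collection.values())

    index = {}
    for doc_id in sorted(docs):
        tokens = docs[doc_id]
        scores = {}
        for term, tf in Counter(tokens).items():
            p_doc = tf / len(tokens)             # p(t | d)
            p_coll = collection[term] / total    # p(t | C)
            scores[term] = p_doc * math.log(p_doc / p_coll)
        # Highest contribution first; ties broken alphabetically.
        top = sorted(scores, key=lambda t: (-scores[t], t))[:keep]
        for term in top:
            index.setdefault(term, []).append(doc_id)
    return index
```

Terms that are common everywhere score low, because p(t|d) is close to p(t|C), so their postings are dropped first, which is where most of the index size goes.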
Indexing time vs. query time: tradeoffs in dynamic information retrieval systems
- Proc. 14th ACM Intl. Conf. on Information and Knowledge Management (CIKM)
, 2005
Cited by 21 (6 self)
We examine issues in the design of fully dynamic information retrieval systems with support for instantaneous document insertions and deletions. We present one such system and discuss some of the major design decisions. These decisions affect both the indexing and the query processing efficiency of our system and thus represent genuine trade-offs between indexing and query processing performance. Two aspects of the retrieval system – fast, incremental updates and garbage collection for delayed document deletions – are discussed in detail, with a focus on the respective trade-offs. Depending on the relative number of queries and update operations, different strategies lead to optimal overall performance. Special attention is given to a particular case of dynamic search systems – desktop and file system search. As one of the main results of this paper, we demonstrate how security mechanisms necessary for multiuser support can be extended to realize efficient document deletions.
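The idea of delayed deletions with garbage collection can be illustrated with a small sketch: a deletion only records the document ID, queries filter deleted IDs out of their results, and postings are physically removed once the share of deleted documents passes a threshold. The class `DynamicIndex`, the `gc_threshold` parameter, and the trigger rule are hypothetical, the sketch assumes document IDs are never reused, and the security-based deletion mechanism mentioned in the abstract is not modeled.

```python
class DynamicIndex:
    def __init__(self, gc_threshold=0.25):
        self.postings = {}       # term -> list of doc IDs
        self.live = set()        # documents currently searchable
        self.deleted = set()     # deleted, but postings not yet removed
        self.gc_threshold = gc_threshold

    def insert(self, doc_id, tokens):
        self.live.add(doc_id)
        for term in set(tokens):
            self.postings.setdefault(term, []).append(doc_id)

    def delete(self, doc_id):
        """Cheap logical delete; physical cleanup is deferred."""
        if doc_id in self.live:
            self.live.remove(doc_id)
            self.deleted.add(doc_id)
        total = len(self.live) + len(self.deleted)
        if total and len(self.deleted) / total > self.gc_threshold:
            self.collect_garbage()

    def collect_garbage(self):
        """Rewrite every postings list without the deleted documents."""
        for term in list(self.postings):
            kept = [d for d in self.postings[term] if d not in self.deleted]
            if kept:
                self.postings[term] = kept
            else:
                del self.postings[term]
        self.deleted.clear()

    def search(self, term):
        return [d for d in self.postings.get(term, []) if d not in self.deleted]
```

A low threshold keeps postings lists short for queries at the price of frequent rewrites; a high threshold does the opposite, which is one form of the indexing-versus-query trade-off the abstract describes.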
Very Large Scale Retrieval and Web Search
, 2004
Cited by 20 (1 self)
Together, the TREC Very Large Collection (VLC) Track and its successor the Web Track have run for seven years, after an initial VLC pre-track. During that time five new test collections have been created, five different types of retrieval task have been studied, a large number of important issues have been addressed, and new methods have been tried, not only for retrieval, but also for test collection construction. Since the Web Track was a natural evolutionary step from the VLC Track, from here on we will refer to them as a single VLC/Web track. The corpora created in support of the track have been distributed to more than 120 organisations worldwide; they are clearly being used for evaluation and research purposes well beyond the confines of TREC. Not only that, but the Web Track model has been adopted for similar Japanese language evaluations within the context of NTCIR (NII-NACSIS Test Collection for IR Systems, research.nii.ac.jp/ntcir/index-en.html). Each editio
Access-Ordered Indexes
- NEW TOPOLOGICAL DESCRIPTORS. J. CHEM. INF. COMPUT. SCI. 1994
, 2004
Cited by 19 (4 self)
Search engines are an essential tool for modern life. We use them to discover new information on diverse topics and to locate a wide range of resources. The search process in all practical search engines is supported by an inverted index structure that stores all search terms and their locations within the searchable document collection. Inverted indexes are highly optimised, and significant work has been undertaken over the past fifteen years to store, retrieve, compress, and understand heuristics for these structures. In this paper, we propose a new self-organising inverted index based on past queries. We show that this access-ordered index improves query evaluation speed by 25% to 40% over a conventional, optimised approach with almost indistinguishable accuracy. We conclude that access-ordered indexes are a valuable new tool to support fast and accurate web search.
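One way to picture an index organised by past queries is sketched below: documents that earlier queries returned more often are moved to the front of every postings list, and a query then reads only a prefix of each list. This is an assumed reading of the abstract, not the paper's algorithm; `access_order`, `query_prefix`, and the `prefix` cutoff are hypothetical names.

```python
from collections import Counter


def access_order(index, past_results):
    """Reorder each postings list so frequently returned documents come first.
    past_results: iterable of result lists (doc IDs) from earlier queries."""
    hits = Counter(doc for result in past_results for doc in result)
    return {term: sorted(postings, key=lambda d: (-hits[d], d))
            for term, postings in index.items()}


def query_prefix(ordered_index, terms, prefix):
    """Approximate conjunctive query reading only `prefix` postings per term."""
    lists = [set(ordered_index.get(term, [])[:prefix]) for term in terms]
    return sorted(set.intersection(*lists)) if lists else []
```

Reading a short prefix trades some accuracy for speed: documents that past queries never surfaced sit at the tail of each list and may be missed.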
Efficient Query Evaluation on Large Textual Collections in a Peer-to-Peer Environment
, 2005
Cited by 19 (1 self)
We study the problem of evaluating ranked (top-k) queries on textual collections ranging from multiple gigabytes to terabytes in size. We focus on the case of a global index organization in a highly distributed environment, and consider a class of ranking functions that includes common variants of the Cosine and Okapi measures. The main bottleneck in such a scenario is the amount of communication required during query evaluation. We propose several efficient query evaluation schemes and evaluate their performance. Our results on real search engine query traces and over 120 million web pages show that after careful optimization such queries can be evaluated at a reasonable cost, while challenges remain for even larger collections and more general classes of ranking functions.

