Efficient document retrieval in main memory
- In Proc. 30th ACM SIGIR
, 2007
Cited by 24 (0 self)
Disk access performance is a major bottleneck in traditional information retrieval systems. Compared to system memory, disk bandwidth is poor, and seek times are worse. We circumvent this problem by considering query evaluation strategies in main memory. We show how new accumulator trimming techniques combined with inverted list skipping can produce extremely high performance retrieval systems without resorting to methods that may harm effectiveness. We evaluate our techniques using Galago, a new retrieval system designed for efficient query processing. Our system achieves a 69% improvement in query throughput over previous methods.
A Document-Centric Approach to Static Index Pruning in Text Retrieval Systems
, 2006
Cited by 21 (1 self)
We present a static index pruning method, to be used in ad-hoc document retrieval tasks, that follows a document-centric approach to decide whether a posting for a given term should remain in the index or not. The decision is made based on the term's contribution to the document's Kullback-Leibler divergence from the text collection's global language model. Our technique can be used to decrease the size of the index by over 90%, at only a minor decrease in retrieval effectiveness. It thus allows us to make the index small enough to fit entirely into the main memory of a single PC, even for large text collections containing millions of documents. This results in great efficiency gains, superior to those of earlier pruning methods, and an average response time around 20 ms on the GOV2 document collection.
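The pruning criterion described in this abstract can be illustrated with a minimal sketch. It assumes a maximum-likelihood document model, a precomputed collection model, and a fixed number of terms kept per document; the abstract does not state the selection rule, so `keep` and both function names are hypothetical.

```python
import math
from collections import Counter


def kl_contributions(doc_tokens, collection_prob):
    """Per-term contribution to KL(document || collection).

    Assumption: p(t|d) is the raw relative frequency of t in the document,
    and collection_prob maps every term to a non-zero p(t|C).
    """
    counts = Counter(doc_tokens)
    total = sum(counts.values())
    scores = {}
    for term, tf in counts.items():
        p_doc = tf / total
        p_col = collection_prob[term]
        scores[term] = p_doc * math.log(p_doc / p_col)
    return scores


def prune_document(doc_tokens, collection_prob, keep):
    """Return the terms whose postings stay in the index for this document.

    Assumption: keep the `keep` highest-contributing terms (hypothetical rule);
    ties are broken alphabetically so the result is deterministic.
    """
    scores = kl_contributions(doc_tokens, collection_prob)
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return set(ranked[:keep])
```

In this sketch a frequent but uninformative term such as "the" contributes little to the divergence, so its posting is among the first to be dropped.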
Top-k Ranked Document Search in General Text Databases
Cited by 13 (10 self)
Text search engines return a set of k documents ranked by similarity to a query. Typically, documents and queries are drawn from natural language text, which can readily be partitioned into words, allowing optimizations of data structures and algorithms for ranking. However, in many new search domains (DNA, multimedia, OCR texts, Far East languages) there is often no obvious definition of words and traditional indexing approaches are not so easily adapted, or break down entirely. We present two new algorithms for ranking documents against a query without making any assumptions on the structure of the underlying text. We build on existing theoretical techniques, which we have implemented and compared empirically with new approaches introduced in this paper. Our best approach is significantly faster than existing methods in RAM, and is even three times faster than a state-of-the-art inverted file implementation for English text when word queries are issued.
ResIn: A Combination of Results Caching and Index Pruning for High-performance Web Search Engines
Cited by 11 (3 self)
Results caching is an efficient technique for reducing the query processing load, hence it is commonly used in real search engines. This technique, however, bounds the maximum hit rate due to the large fraction of singleton queries, which is an important limitation. In this paper we propose ResIn, an architecture that uses a combination of results caching and index pruning to overcome this limitation. We argue that results caching is an inexpensive and efficient way to reduce the query processing load and show that it is cheaper to implement than a pruned index. At the same time, we show that index pruning performance is fundamentally affected by the changes in the query traffic that the results cache induces. We experiment with real query logs and a large document collection, and show that the combination of both techniques enables efficient reduction of the query processing costs and thus is practical to use in Web search engines.
Efficient top-k querying over social-tagging networks
- In SIGIR
, 2008
Cited by 10 (3 self)
Online communities have become popular for publishing and searching content, as well as for finding and connecting to other users. User-generated content includes, for example, personal blogs, bookmarks, and digital photos. These items can be annotated and rated by different users, and these social tags and derived user-specific scores can be leveraged for searching relevant content and discovering subjectively interesting items. Moreover, the relationships among users can also be taken into consideration for ranking search results, the intuition being that you trust the recommendations of your close friends more than those of your casual acquaintances. Queries for tag or keyword combinations that compute and rank the top-k results thus face a large variety of options that complicate the query processing and pose efficiency challenges. This paper addresses these issues by developing an incremental top-k algorithm with two-dimensional expansions: social expansion considers the strength of relations among users, and semantic expansion considers the relatedness of different tags. It presents a new algorithm, based on principles of threshold algorithms, by folding friends and related tags into the search space in an incremental on-demand manner. The excellent performance of the method is demonstrated by an experimental evaluation on three real-world datasets, crawled from del.icio.us, Flickr, and LibraryThing.
Dynamic index pruning for effective caching
, 2007
Cited by 7 (0 self)
Search engines make use of inverted list caching in RAM and dynamic pruning schemes to reduce query evaluation times. While only a small portion of lists are processed with dynamic pruning, current systems still store the entire inverted list in cache. In this paper we investigate caching only the pieces of the inverted lists that are actually used to answer a query during dynamic pruning. We examine an LRU cache model, and two recently proposed models. We also introduce a new dynamic pruning scheme for impact-ordered inverted lists. Using two large web collections and corresponding query logs we show that, using an LRU cache, our new pruning scheme reduces the number of disk accesses during query processing time by 7%–15% over the state-of-the-art impact-ordered baseline, without reducing answer quality. Surprisingly, however, we discover that using our new pruning scheme makes little difference to disk traffic when the more sophisticated caching schemes are employed.
Incremental Cluster-Based Retrieval using Compressed Cluster-Skipping Inverted Files
Cited by 6 (2 self)
We propose a unique cluster-based retrieval (CBR) strategy using a new cluster-skipping inverted file for improving query processing efficiency. The new inverted file incorporates cluster membership and centroid information along with the usual document information into a single structure. In our incremental-CBR strategy, during query evaluation both best(-matching) clusters and best(-matching) documents of such clusters are computed together with a single posting list access per query term. As we switch from term to term, best clusters are recomputed and can dynamically change. During query-document matching, only relevant portions of the posting lists corresponding to the best clusters are considered and the rest is skipped. The proposed approach is essentially tailored for environments where inverted files are compressed, and provides substantial efficiency improvements while yielding comparable or sometimes better effectiveness figures. Our experiments with various collections show that the incremental-CBR strategy using a compressed cluster-skipping inverted file significantly improves CPU time efficiency regardless of the query length. The new compressed inverted file imposes an acceptable storage overhead in comparison to a typical inverted file. We also show that our approach scales well with the collection size.
Top-k Aggregation Using Intersections of Ranked Inputs
Cited by 5 (3 self)
There has been considerable past work on efficiently computing top k objects by aggregating information from multiple ranked lists of these objects. An important instance of this problem is query processing in search engines: One has to combine information from several different posting lists (rankings) of web pages (objects) to obtain the top k web pages to answer user queries. Two particularly well-studied approaches to achieve efficiency in top-k aggregation include early-termination algorithms (e.g., TA and NRA) and preaggregation of some of the input lists. However, there has been little work on a rigorous treatment of combining these approaches. We generalize the TA and NRA algorithms to the case when preaggregated intersection lists are available in addition to the original lists. We show that our versions of TA and NRA continue to remain “instance optimal,” a very strong optimality notion that is a highlight of the original TA and NRA algorithms. Using an index of millions of web pages and real-world search engine queries, we empirically characterize the performance gains offered by our new algorithms. We show that the practical benefits of intersection lists can be fully realized only with an early-termination algorithm.
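For readers unfamiliar with TA, the following is a minimal sketch of the classic threshold algorithm with sum aggregation, not the generalized version with intersection lists that this paper proposes. It assumes non-negative scores, treats a document missing from a list as scoring 0 there, and models random access as a dictionary lookup; the function name and return shape are hypothetical.

```python
import heapq


def threshold_algorithm(lists, k):
    """Classic TA over several score lists, aggregating by sum.

    lists: one dict per ranked input, mapping object -> non-negative score.
    Returns (top-k [(object, total)], number of sorted-access rounds used).
    """
    # Sorted access order: descending score, ties broken by object id.
    sorted_lists = [sorted(l.items(), key=lambda kv: (-kv[1], kv[0]))
                    for l in lists]
    seen = {}
    depth = 0
    max_depth = max(len(s) for s in sorted_lists)
    while depth < max_depth:
        threshold = 0.0
        for s in sorted_lists:
            if depth < len(s):
                obj, score = s[depth]
                threshold += score  # best score any unseen object can have here
                if obj not in seen:
                    # Random access: fetch the object's score from every list.
                    seen[obj] = sum(l.get(obj, 0.0) for l in lists)
        depth += 1
        top = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        # Stop once the k-th best total is at least the threshold.
        if len(top) == k and top[-1][1] >= threshold:
            return top, depth
    return heapq.nlargest(k, seen.items(), key=lambda kv: kv[1]), depth
```

The early stop is what makes the algorithm attractive for long posting lists: in the best case only a short prefix of each list is read.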
Early Exit Optimizations for Additive Machine Learned Ranking Systems
Cited by 5 (0 self)
Some commercial web search engines rely on sophisticated machine learning systems for ranking web documents. Due to very large collection sizes and tight constraints on query response times, online efficiency of these learning systems forms a bottleneck. An important problem in such systems is to speed up the ranking process without sacrificing much from the quality of results. In this paper, we propose optimization strategies that allow short-circuiting score computations in additive learning systems. The strategies are evaluated over a state-of-the-art machine learning system and a large, real-life query log, obtained from Yahoo!. By the proposed strategies, we are able to speed up the score computations by more than four times with almost no loss in result quality.
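The general idea of short-circuiting an additive scorer can be sketched as follows. This is only an illustration under assumed conditions: the abstract does not say how exit points or thresholds are chosen, so `exit_points`, the scorer callables, and the function name are all hypothetical.

```python
def score_with_early_exit(features, scorers, exit_points):
    """Sum the scorers in order, stopping early for unpromising documents.

    scorers: list of callables, each mapping a feature dict to a partial score.
    exit_points: dict mapping a scorer index to the minimum running total
        required to continue past that scorer (hypothetical thresholds).
    Returns (running total, number of scorers actually evaluated).
    """
    total = 0.0
    for i, scorer in enumerate(scorers):
        total += scorer(features)
        limit = exit_points.get(i)
        if limit is not None and total < limit:
            # The document is unlikely to reach the top ranks; skip the rest.
            return total, i + 1
    return total, len(scorers)
```

The saving comes from the second return value: documents that exit early pay for only a fraction of the scorers.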
Dual-Sorted Inverted Lists
Cited by 4 (3 self)
Several IR tasks rely, to achieve high efficiency, on a single pervasive data structure called the inverted index. This is a mapping from the terms in a text collection to the documents where they appear, plus some supplementary data. Different orderings in the list of documents associated to a term, and different supplementary data, fit widely different IR tasks. Index designers have to choose the right order for one such task, rendering the index difficult to use for others. In this paper we introduce a general technique, based on wavelet trees, to maintain a single data structure that offers the combined functionality of two independent orderings for an inverted index, with competitive efficiency and within the space of one compressed inverted index. We show in particular that the technique allows combining an ordering by decreasing term frequency (useful for ranked document retrieval) with an ordering by increasing document identifier (useful for phrase and Boolean queries). We show that we can support not only the primitives required by the different search paradigms (e.g., in order to implement any intersection algorithm on top of our data structure), but also that the data structure offers novel ways of carrying out many operations of interest, including space-free treatment of stemming and hierarchical documents.

