Performance of Compressed Inverted List Caching in Search Engines
In WWW, 2008
Cited by 25 (11 self)
Due to the rapid growth in the size of the web, web search engines are facing enormous performance challenges. The larger engines in particular have to be able to process tens of thousands of queries per second on tens of billions of documents, making query throughput a critical issue. To satisfy this heavy workload, search engines use a variety of performance optimizations including index compression, caching, and early termination. We focus on two techniques, inverted index compression and index caching, which play a crucial role in web search engines as well as other high-performance information retrieval systems. We perform a comparison and evaluation of several inverted list compression algorithms, including new variants of existing algorithms that have not been studied before. We then evaluate different inverted list caching policies on large query traces, and finally study the possible performance benefits of combining compression and caching. The overall goal of this paper is to provide an updated discussion and evaluation of these two techniques, and to show how to select the best set of approaches and settings depending on parameters such as disk speed and main memory cache size.
Cooperative Scans: Dynamic Bandwidth Sharing in a DBMS
In Proc. of the 33rd Intl. Conf. on Very Large Databases (VLDB), 2007
Cited by 20 (2 self)
This paper analyzes the performance of concurrent (index) scan operations in both record (NSM/PAX) and column (DSM) disk storage models and shows that existing scheduling policies do not fully exploit data-sharing opportunities and therefore result in poor disk bandwidth utilization. We propose the Cooperative Scans framework that enhances performance in such scenarios by improving data-sharing between concurrent scans. It performs dynamic scheduling of queries and their data requests, taking into account the current system situation. We first present results on top of an NSM/PAX storage layout, showing that it achieves significant performance improvements over traditional policies in terms of both the number of I/Os and overall execution time, as well as latency of individual queries. We provide benchmarks with varying system parameters, data sizes and query loads to confirm the improvement occurs in a wide range of scenarios. Then we extend our proposal to a more complicated DSM scenario, discussing numerous problems related to the two-dimensional nature of disk scheduling in column stores.
Super-scalar database compression between RAM and CPU-cache
MS Thesis, Centrum voor Wiskunde en Informatica (CWI), 2005
Cited by 14 (0 self)
Data-intensive query processing tasks like data mining, scientific data analysis, and decision support can leave a database system severely I/O-bound, even when common RAID configurations are used. Traditionally, this problem has been tackled by adding more and more disks, connected through expensive interconnect networks. This brute-force approach results in systems whose price is dominated by the cost of their disk subsystems, and a lot of disk space is wasted because disks are added only to gain bandwidth. A more subtle and cost-effective solution can be found in data compression, which has the potential to alleviate the I/O bottleneck. However, traditional algorithms like Huffman coding, Arithmetic coding and Lempel-Ziv style dictionary methods are not suited for this goal due to high processing overheads.
Database Architecture Evolution: Mammals Flourished long before Dinosaurs became Extinct
Cited by 9 (2 self)
The holy grail for database architecture research is to find a solution that is Scalable & Speedy, to run on anything from small ARM processors up to globally distributed compute clusters; Stable & Secure, to service a broad user community; Small & Simple, to be comprehensible to a small team of programmers; and Self-managing, to let it run out-of-the-box without hassle. In this paper, we provide a trip report on this quest, covering past experiences, ongoing research on hardware-conscious algorithms, and novel ways towards self-management specifically focused on column store solutions.
DSM vs. NSM: CPU performance tradeoffs in block-oriented query processing
In DaMoN ’08: Proceedings of the 4th International Workshop on Data Management on New Hardware, 2008
Cited by 6 (0 self)
Comparisons between the merits of row-wise storage (NSM) and columnar storage (DSM) are typically made with respect to the persistent storage layer of database systems. In this paper, however, we focus on the CPU efficiency tradeoffs of tuple representations inside the query execution engine, while tuples flow through a processing pipeline. We analyze the performance in the context of query engines using so-called “block-oriented” processing, a recently popularized technique that can strongly improve the CPU efficiency. With this high efficiency, the performance trade-offs between NSM and DSM can have a decisive impact on the query execution performance, as we demonstrate using both microbenchmarks and TPC-H query 1. This means that NSM-based database systems can sometimes benefit from converting tuples into DSM on-the-fly, and vice versa.
Compressing Term Positions in Web Indexes
Cited by 6 (4 self)
Large search engines process thousands of queries per second on billions of pages, making query processing a major factor in their operating costs. This has led to a lot of research on how to improve query throughput, using techniques such as massive parallelism, caching, early termination, and inverted index compression. We focus on techniques for compressing term positions in web search engine indexes. Most previous work has focused on compressing docID and frequency data, or position information in other types of text collections. Compression of term positions in web pages is complicated by the fact that term occurrences tend to cluster within documents but not across document boundaries, making it harder to exploit clustering effects. Also, typical access patterns for position data are different from those for docID and frequency data. We perform a detailed study of a number of existing and new techniques for compressing position data in web indexes. We also study how to efficiently access position data for ranking functions that take proximity features into account.
Scalable Techniques for Document Identifier Assignment in Inverted Indexes
In WWW, 2010
Cited by 6 (3 self)
Web search engines are based on a full-text data structure called an inverted index. The size of the inverted index structures is a major performance bottleneck during query processing, and a large amount of research has focused on fast and effective techniques for compressing this structure. Several authors have recently proposed techniques for improving index compression by optimizing the assignment of document identifiers to the documents in the collection, leading to significant improvements in overall index size. In this paper, we propose improved techniques for document identifier assignment. Previous work includes simple and fast heuristics such as sorting by URL, as well as more involved approaches based on Travelling Salesman or graph partitioning problems that achieve good compression but do not scale to larger document collections. We propose a new framework based on performing a Travelling Salesman computation on a reduced sparse graph obtained using Locality Sensitive Hashing, which achieves improved compression while scaling to tens of millions of documents. Based on this framework, we describe a number of new algorithms, and perform a detailed evaluation on three large data sets showing improvements in index size.
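To illustrate why document identifier assignment affects index size, here is a minimal Python sketch. It is not the method of the paper above: it assumes a simple variable-byte encoding of docID gaps (d-gaps), and the function names and the two sample posting lists are hypothetical. It only shows that the same number of postings compresses better when the docIDs are close together.

```python
def varbyte_encode(n):
    """Encode a non-negative integer in 7-bit groups, low-order group first."""
    out = bytearray()
    while n >= 128:
        out.append((n & 0x7F) | 0x80)  # high bit set: more bytes follow
        n >>= 7
    out.append(n)  # final byte has the high bit clear
    return bytes(out)


def encode_postings(doc_ids):
    """Encode a sorted docID list as variable-byte coded gaps."""
    out = bytearray()
    prev = 0
    for doc_id in doc_ids:
        out += varbyte_encode(doc_id - prev)
        prev = doc_id
    return bytes(out)


def decode_postings(data):
    """Invert encode_postings: rebuild the docIDs from the coded gaps."""
    doc_ids = []
    gap = 0
    shift = 0
    prev = 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            prev += gap
            doc_ids.append(prev)
            gap = 0
            shift = 0
    return doc_ids


# Hypothetical posting lists for one term under two docID assignments.
scattered = [1000, 250000, 500000, 750000]  # similar documents far apart
clustered = [1000, 1001, 1002, 1003]        # similar documents adjacent

scattered_size = len(encode_postings(scattered))
clustered_size = len(encode_postings(clustered))
```

In this toy case the clustered list needs fewer bytes because its gaps are small. An assignment heuristic such as sorting by URL aims to produce small gaps across many posting lists at once.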
Improved Index Compression Techniques for Versioned Document Collections
Cited by 5 (2 self)
Current Information Retrieval systems use inverted index structures for efficient query processing. Due to the extremely large size of many data sets, these index structures are usually kept in compressed form, and many techniques for optimizing compressed size and query processing speed have been proposed. In this paper, we focus on versioned document collections, that is, collections where each document is modified over time, resulting in multiple versions of the document. Consecutive versions of the same document are often similar, and several researchers have explored ideas for exploiting this similarity to decrease index size. We propose new index compression techniques for versioned document collections that achieve reductions in index size over previous methods. In particular, we first propose several bitwise compression techniques that achieve a compact index structure but that are too slow for most applications. Based on the lessons learned, we then propose additional techniques that come close to the sizes of the bitwise technique while also improving on the speed of the best previous methods.
Architecture-Conscious Hashing
Cited by 4 (2 self)
Hashing is one of the fundamental techniques used to implement query processing operators such as grouping, aggregation and join. This paper studies the interaction between modern computer architecture and hash-based query processing techniques. First, we focus on extracting maximum hashing performance from super-scalar CPUs. In particular, we discuss fast hash functions and ways to efficiently handle multi-column keys, and propose the use of a recently introduced hashing scheme called Cuckoo Hashing over the commonly used bucket-chained hashing. In the second part of the paper, we focus on the CPU cache usage, by dynamically partitioning data streams such that the partial hash tables fit in the CPU cache. Conventional partitioning works as a separate preparatory phase, forcing materialization, which may require I/O if the stream does not fit in RAM. We introduce best-effort partitioning, a technique that interleaves partitioning with execution of hash-based query processing operators and avoids I/O. In the process, we show how to prevent issues in partitioning with cacheline alignment, which can strongly decrease throughput. We also demonstrate overall query processing performance when both CPU-efficient hashing and best-effort partitioning are combined.
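For readers unfamiliar with Cuckoo Hashing, the following minimal Python sketch shows the general idea: each key has one candidate slot in each of two tables, so a lookup probes at most two slots and never walks a chain. It is an illustrative sketch only, not the implementation studied in the paper above. The class name, the `max_kicks` limit, the use of Python's built-in `hash`, and the doubling policy are all assumptions made for the example.

```python
class CuckooHashTable:
    """Two-table cuckoo hash map (illustrative sketch, not tuned for speed)."""

    def __init__(self, capacity=16, max_kicks=32):
        self.capacity = capacity
        self.max_kicks = max_kicks  # assumed limit on displacement chains
        self.tables = [[None] * capacity, [None] * capacity]

    def _slot(self, which, key):
        # Assumption: derive two hash functions by salting the built-in hash.
        return hash((which, key)) % self.capacity

    def get(self, key, default=None):
        # A lookup inspects at most two slots, one per table.
        for which in (0, 1):
            entry = self.tables[which][self._slot(which, key)]
            if entry is not None and entry[0] == key:
                return entry[1]
        return default

    def put(self, key, value):
        # Overwrite in place if the key is already stored.
        for which in (0, 1):
            i = self._slot(which, key)
            entry = self.tables[which][i]
            if entry is not None and entry[0] == key:
                self.tables[which][i] = (key, value)
                return
        # Otherwise insert, evicting occupants to their alternate table.
        entry = (key, value)
        which = 0
        for _ in range(self.max_kicks):
            i = self._slot(which, entry[0])
            entry, self.tables[which][i] = self.tables[which][i], entry
            if entry is None:
                return  # the slot was free, nothing left to place
            which = 1 - which
        self._grow(entry)

    def _grow(self, pending):
        # Displacement chain too long: double both tables and reinsert.
        entries = [e for table in self.tables for e in table if e is not None]
        entries.append(pending)
        self.capacity *= 2
        self.tables = [[None] * self.capacity, [None] * self.capacity]
        for k, v in entries:
            self.put(k, v)


table = CuckooHashTable()
for k in range(100):
    table.put(k, k * k)
```

The bounded lookup cost is what distinguishes this scheme from bucket-chained hashing, where a lookup may have to follow a chain of arbitrary length.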
Efficient and flexible information retrieval using MonetDB/X100
In CIDR 2007, Third Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 7-10, 2007, Online Proceedings, 2007
Cited by 4 (0 self)
Today’s large-scale IR systems are not implemented using general-purpose database systems, as the latter tend to be significantly less efficient than custom-built IR engines. This paper demonstrates how recent developments in hardware-conscious database architecture may however satisfy IR needs. The advantage is flexibility of experimentation, as implementing a retrieval system on top of a DBMS boils down to relational query formulation, rather than system programming. We demonstrate in the context of the TeraByte TREC efficiency task that our experimental MonetDB/X100 database system provides highly competitive results regarding both precision and speed. We analyze the two innovations in MonetDB/X100 that most contributed to this successful application of DB technology in IR, namely vectorized in-cache processing and the use of two new light-weight compression schemes that work between the RAM and CPU cache memory levels.

