Cache-Conscious Automata for XML Filtering
- In ICDE, 2005
Cited by 9 (1 self)
Hardware cache behavior is an important factor in the performance of memory-resident, data-intensive systems such as XML filtering engines. A key data structure in several recent XML filters is the automaton, which is used to represent the long-running XML queries in main memory. In this paper, we study the cache performance of automaton-based XML filtering through analytical modeling and system measurement. Furthermore, we propose a cache-conscious automaton organization technique, called the hot buffer, to improve the locality of automaton state transitions. Our results show that (1) our cache performance model for XML filtering automata is highly accurate and (2) the hot buffer improves the cache performance as well as the overall performance of automaton-based XML filtering.
Database Architecture Evolution: Mammals Flourished long before Dinosaurs became Extinct
Cited by 9 (2 self)
The holy grail for database architecture research is to find a solution that is Scalable & Speedy, to run on anything from small ARM processors up to globally distributed compute clusters; Stable & Secure, to service a broad user community; Small & Simple, to be comprehensible to a small team of programmers; and Self-managing, to let it run out-of-the-box without hassle. In this paper, we provide a trip report on this quest, covering past experiences, ongoing research on hardware-conscious algorithms, and novel ways towards self-management, specifically focused on column-store solutions.
DSM vs. NSM: CPU performance tradeoffs in block-oriented query processing
- In DaMoN '08: Proceedings of the 4th International Workshop on Data Management on New Hardware, 2008
Cited by 6 (0 self)
Comparisons between the merits of row-wise storage (NSM) and columnar storage (DSM) are typically made with respect to the persistent storage layer of database systems. In this paper, however, we focus on the CPU efficiency tradeoffs of tuple representations inside the query execution engine, while tuples flow through a processing pipeline. We analyze the performance in the context of query engines using so-called "block-oriented" processing, a recently popularized technique that can strongly improve the CPU efficiency. With this high efficiency, the performance tradeoffs between NSM and DSM can have a decisive impact on the query execution performance, as we demonstrate using both microbenchmarks and TPC-H query 1. This means that NSM-based database systems can sometimes benefit from converting tuples into DSM on-the-fly, and vice versa.
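As a rough illustration of the two tuple representations mentioned in the abstract above, the following minimal sketch converts a block of tuples between a row-wise (NSM) and a columnar (DSM) form and sums one attribute in each. The sample tuples and all function names (nsm_to_dsm, dsm_to_nsm, sum_attr_nsm, sum_attr_dsm) are hypothetical and chosen only for this example; this is not the paper's engine or benchmark, and Python lists do not reproduce the memory behavior the paper measures.

```python
# Illustrative only: toy tuples and helper names are assumptions,
# not data or code from the paper.

def nsm_to_dsm(block):
    """Convert a block of row tuples (NSM) into one list per attribute (DSM)."""
    return [list(column) for column in zip(*block)]

def dsm_to_nsm(columns):
    """Convert per-attribute lists (DSM) back into a list of row tuples (NSM)."""
    return list(zip(*columns))

def sum_attr_nsm(block, attr):
    # Row-wise: every tuple in the block is visited to read one attribute.
    return sum(row[attr] for row in block)

def sum_attr_dsm(columns, attr):
    # Columnar: only the single list holding that attribute is read.
    return sum(columns[attr])

rows = [(1, 10.0, "a"), (2, 20.0, "b"), (3, 30.0, "c")]  # one block of tuples
cols = nsm_to_dsm(rows)
total_nsm = sum_attr_nsm(rows, 1)
total_dsm = sum_attr_dsm(cols, 1)
```

Both functions compute the same sum; the difference the abstract is concerned with is how much unrelated data each layout forces the CPU to touch while doing so.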
Read-Optimized Databases, In Depth
Cited by 5 (0 self)
Recently, a number of papers have been published showing the benefits of column stores over row stores. However, the research comparing the two in an "apples-to-apples" way has left a number of unresolved questions. In this paper, we first discuss the factors that can affect the relative performance of each paradigm. Then, we choose points within each of the factors to study further. Our study examines five tables with various characteristics and different query workloads in order to obtain a greater understanding and quantification of the relative performance of column stores and row stores. We then add materialized views to the analysis and see how much they can help the performance of row stores. Finally, we examine the performance of hash join operations in column stores and row stores.
RCFile: A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems
Cited by 5 (2 self)
MapReduce-based data warehouse systems play an important role in supporting big data analytics, helping typical Web service providers and social network sites (e.g., Facebook) quickly understand the dynamics of user behavior trends and user needs. In such a system, the data placement structure is a critical factor that can affect the warehouse performance in a fundamental way. Based on our observations and analysis of Facebook production systems, we have characterized four requirements for the data placement structure: (1) fast data loading, (2) fast query processing, (3) highly efficient storage space utilization, and (4) strong adaptivity to highly dynamic workload patterns. We have examined three commonly accepted data placement structures in conventional databases, namely row-stores, column-stores, and hybrid-stores, in the context of large data analysis using MapReduce. We show that they are not very suitable for big data processing in distributed systems. In this paper, we present a big data placement structure called RCFile (Record Columnar File) and its implementation in the Hadoop system. With intensive experiments, we show the effectiveness of RCFile in satisfying the four requirements. RCFile has been chosen as the default option in the Facebook data warehouse system. It has also been adopted by Hive and Pig, the two most widely used data analysis systems developed at Facebook and Yahoo!
Cache-Oblivious Databases: Limitations and Opportunities
- 2008
Cited by 4 (1 self)
Cache-oblivious techniques, proposed in the theory community, have optimal asymptotic bounds on the amount of data transferred between any two adjacent levels of an arbitrary memory hierarchy. Moreover, this optimal performance is achieved without any hardware-platform-specific tuning. These properties are highly attractive to autonomous databases, especially because the hardware architectures are becoming increasingly complex and diverse. In this paper, we present our design, implementation, and evaluation of the first cache-oblivious in-memory query processor, EaseDB. Moreover, we discuss the inherent limitations of the cache-oblivious approach as well as the opportunities given by the upcoming hardware architectures. Specifically, a cache-oblivious technique usually requires sophisticated algorithm design to achieve a comparable performance to its cache-conscious counterpart. Nevertheless, this development-time effort is compensated by the automaticity of performance achievement and the reduced ownership cost. Furthermore, this automaticity enables cache-oblivious techniques to outperform their cache-conscious counterparts in multi-threading processors.
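To make the idea of "no hardware-specific tuning" concrete, here is a minimal sketch of a classic cache-oblivious pattern: a recursive divide-and-conquer matrix transpose that never refers to a cache size or block size. It is a generic textbook-style illustration, not an algorithm taken from EaseDB, and the function names (transpose, _transpose_range) are assumptions made for this example. Python lists are not contiguous arrays, so the sketch shows the recursion structure only, not real cache behavior.

```python
# Illustrative only: a generic cache-oblivious recursion, not EaseDB code.

def _transpose_range(src, dst, r0, r1, c0, c1):
    """Write the transpose of src[r0:r1][c0:c1] into dst, splitting recursively."""
    n_rows, n_cols = r1 - r0, c1 - c0
    if n_rows <= 0 or n_cols <= 0:
        return
    if n_rows == 1 and n_cols == 1:
        dst[c0][r0] = src[r0][c0]
        return
    # Always halve the longer side; no parameter depends on cache geometry,
    # so sub-problems eventually fit in every level of the memory hierarchy.
    if n_rows >= n_cols:
        mid = r0 + n_rows // 2
        _transpose_range(src, dst, r0, mid, c0, c1)
        _transpose_range(src, dst, mid, r1, c0, c1)
    else:
        mid = c0 + n_cols // 2
        _transpose_range(src, dst, r0, r1, c0, mid)
        _transpose_range(src, dst, r0, r1, mid, c1)

def transpose(matrix):
    """Return the transpose of a rectangular list-of-lists matrix."""
    if not matrix:
        return []
    n_rows, n_cols = len(matrix), len(matrix[0])
    result = [[None] * n_rows for _ in range(n_cols)]
    _transpose_range(matrix, result, 0, n_rows, 0, n_cols)
    return result

example = [[1, 2, 3], [4, 5, 6]]
transposed = transpose(example)
```

A cache-conscious version would instead process fixed-size tiles chosen to match a particular cache, which is the tuning step the cache-oblivious approach avoids.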
Industrial Sponsors; Foreword; Objective
The aim of this one-day workshop is to bring together researchers who are interested in optimizing database performance on modern computing infrastructure by designing new data management techniques and tools.
Topics of Interest
The continued evolution of computing hardware and infrastructure imposes new challenges and bottlenecks to program performance. As a result, traditional database architectures that focus solely on I/O optimization increasingly fail to utilize hardware resources efficiently. CPUs with superscalar out-of-order execution, simultaneous multi-threading, multi-level memory hierarchies, and future storage hardware (such as flash drives) impose a great challenge to optimizing database performance. Consequently, exploiting the characteristics of modern hardware has become an important topic of database systems research. The goal is to make database systems adapt automatically to the sophisticated hardware characteristics, thus maximizing performance transparently to applications. To achieve this goal, the data management community needs interdisciplinary collaboration with computer architecture, compiler and operating systems researchers. This involves rethinking traditional
Avoiding Version Redundancy for High Performance Reads in Temporal DataBases
A major performance bottleneck for database systems is the memory hierarchy. The performance of the memory hierarchy is directly related to how the content of disk pages maps to the L2 cache lines, i.e. to the organization of data within a disk page, called the page layout. The prevalent page layout in database systems is the N-ary Storage Model (NSM). As demonstrated in this paper, using NSM for temporal data deteriorates memory hierarchy performance for query-intensive workloads. This paper proposes two cache-conscious, read-optimized page layouts for temporal data. Experiments show that the proposed page layouts are substantially faster than NSM.
Revisiting Database Storage Optimizations on Flash
The database storage hierarchy has been heavily optimized for the performance characteristics of disks. Storage managers typically employ row- or column-oriented storage layouts, or a combination, to improve the I/O performance of different query workloads with disks. The recent rise of flash memory-based solid-state drives (SSDs) significantly changes the performance characteristics of storage: these drives provide an order of magnitude lower read/access latencies, significantly higher read bandwidths, and most importantly, negligible seek overheads. In light of these differences, we analyze major storage optimizations for read-optimized databases. We examine the benefits of row- and column-oriented storage layouts on flash SSDs. Our measurements span different workload variations, including selectivity, projectivity, and concurrency, that affect query processing on flash. Further, we also investigate the costs and benefits of a set of database optimizations, including data compression, prefetching, and indexes on flash SSDs. Our analytical models back our experimental evaluation of the performance tradeoffs of these optimizations. Three of our key findings are: (1) SSDs scale up linearly with concurrent execution of database queries and outperform disks by up to a factor of two, (2) the low seek cost on SSDs makes column storage a better choice for laying out data on a variety of flash devices, and (3) while data compression is useful to further leverage the bandwidth of flash, database prefetching has less benefit for flash storage. Finally, we present a list of design implications of our findings on future database and operating systems for effectively embracing flash storage.
Revisiting Database Storage Optimizations on Flash
- March 2010
The database storage hierarchy has been heavily optimized for the performance characteristics of disks. Storage managers typically employ row- or column-oriented storage layouts, or a combination, to improve the I/O performance of different query workloads with disks. The recent rise of flash memory-based solid-state drives (SSDs) significantly changes the performance characteristics of storage: these drives provide an order of magnitude lower read/access latencies, significantly higher read bandwidths, and most importantly, negligible seek overheads. In light of these differences, we analyze major storage optimizations for read-optimized databases. We examine the benefits of row- and column-oriented storage layouts on flash SSDs. Our measurements span different workload variations, including selectivity, projectivity, and concurrency, that affect query processing on flash. Further, we also investigate the costs and benefits of a set of database optimizations, including data compression, prefetching, and indexes on flash SSDs. We back our experimental evaluation with analytical models of the performance tradeoffs of these optimizations. Three of our key findings are: (1) SSDs scale up linearly with concurrent execution of database queries and outperform disks by up to a factor of two, (2) the low seek cost on SSDs makes column storage a better choice for laying out data on a variety of flash devices, and (3) while data compression is useful to further leverage the bandwidth of flash, database prefetching has less benefit for flash storage. Finally, we present a list of design implications of our findings on future database and operating systems for effectively embracing flash storage.

