Multi-dimensional database allocation for parallel data warehouses
- Proc. 26th VLDB Conference, 2000
Cited by 17 (3 self)
Data allocation is a key performance factor for parallel database systems (PDBS). This holds especially for data warehousing environments where huge amounts of data and complex analytical queries have to be dealt with. While there are several studies on data allocation for relational PDBS, the specific requirements of data warehouses have not yet been sufficiently addressed. In this study, we consider the allocation of relational data warehouses based on a star schema and utilizing bitmap index structures. We investigate how a multi-dimensional hierarchical data fragmentation of the fact table supports queries referencing different subsets of the schema dimensions. Our analysis is based on realistic parameters derived from a decision support benchmark. The performance implications of different allocation choices are evaluated by means of a detailed simulation model.
Lachesis: Robust Database Storage Management Based on Device-specific Performance Characteristics
- International Conference on Very Large Databases, 2003
Cited by 16 (10 self)
Database systems work hard to tune I/O performance, but do not always achieve the full performance potential of modern disk systems. Their abstracted view of storage components hides useful device-specific characteristics, such as disk track boundaries and advanced built-in firmware algorithms. This paper presents a new storage manager architecture, called Lachesis, that exploits and adapts to observable device-specific characteristics in order to achieve and sustain high performance. For DSS queries, Lachesis achieves I/O efficiency nearly equivalent to sequential streaming even in the presence of competing random I/O traffic. In addition, Lachesis simplifies manual configuration and restores the optimizer's assumptions about the relative costs of different access patterns expressed in query plans. Experiments using IBM DB2 I/O traces as well as a prototype implementation show that Lachesis improves standalone DSS performance by 10% on average. More importantly, when running concurrently with an on-line transaction processing (OLTP) workload, Lachesis improves DSS performance by up to 3×, while OLTP also exhibits a 7% speedup.
Automatic Optimization of Parallel Dataflow Programs
Cited by 14 (0 self)
Large-scale parallel dataflow systems, e.g., Dryad and Map-Reduce, have attracted significant attention recently. High-level dataflow languages such as Pig Latin and Sawzall are being layered on top of these systems, to enable faster program development and more maintainable code. These languages engender greater transparency in program structure, and open up opportunities for automatic optimization. This paper proposes a set of optimization strategies for this context, drawing on and extending techniques from the database community.
Accurate Modeling of The Hybrid Hash Join Algorithm
- In Proc. 1994 ACM Sigmetrics Conference on Measurement and Modeling of Computer Systems, 1994
Cited by 12 (3 self)
The join of two relations is an important operation in database systems. It occurs frequently in relational queries, and join performance is a significant factor in overall system performance. Cost models for join algorithms are used by query optimizers to choose efficient query execution strategies. This paper presents an efficient analytical model of an important join method, the hybrid hash join algorithm, that captures several key features of the algorithm's performance, including its intra-operator parallelism, interference between disk reads and writes, caching of disk pages, and placement of data on disk(s). Validation of the model against a detailed simulation of a database system shows that the response time estimates produced by the model are quite accurate. 1 Introduction Relational database systems organize information into a collection of tables. The relational join operator is used to relate information from two or more tables. Thus, joins are a frequently occurrin...
Partitioning Key Selection for a Shared-Nothing Parallel Database System
- IBM Research Report RC, 1994
Cited by 9 (0 self)
A shared-nothing database system which tries to leverage the knowledge of partitioning attributes of relations can outperform a system where such knowledge is either not available or not used. The performance improvements are typically obtained by function shipping more database operations (joins, aggregates, etc.), thus minimizing the communication overhead. In such a system, it is critical that the correct partitioning keys are selected so that the query workload is optimized. Previous research has ignored the importance of selecting the partitioning keys and has mostly focused on the degree of declustering. In this study we show that by following a systematic methodology, especially for the partitioning key selection and associated relation grouping issues, the entire data placement strategy for a given database schema and workload can be determined in a very efficient manner. We describe different flavors of this methodology and demonstrate the performance improvements resulting fr...
Low Overhead Concurrency Control for Partitioned Main Memory Databases
Cited by 9 (1 self)
Database partitioning is a technique for improving the performance of distributed OLTP databases, since “single partition” transactions that access data on one partition do not need coordination with other partitions. For workloads that are amenable to partitioning, some argue that transactions should be executed serially on each partition without any concurrency at all. This strategy makes sense for a main memory database where there are no disk or user stalls, since the CPU can be fully utilized and the overhead of traditional concurrency control, such as two-phase locking, can be avoided. Unfortunately, many OLTP applications have some transactions which access multiple partitions. This introduces network stalls in order to coordinate distributed transactions, which will limit the performance of a database that does not allow concurrency. In this paper, we compare two low overhead concurrency control schemes that allow partitions to work on other transactions during network stalls, yet have little cost in the common case when concurrency is not needed. The first is a light-weight locking scheme, and the second is an even lighter-weight type of speculative concurrency control that avoids the overhead of tracking reads and writes, but sometimes performs work that eventually must be undone. We quantify the range of workloads over which each technique is beneficial, showing that speculative concurrency control generally outperforms locking as long as there are few aborts or few distributed transactions that involve multiple rounds of communication. On a modified TPC-C benchmark, speculative concurrency control can improve throughput relative to the other schemes by up to a factor of two.
Prefetching in Segmented Disk Cache for Multi-Disk Systems
- In Proceedings of the Fourth Workshop on I/O in Parallel and Distributed Systems, 1996
Cited by 7 (1 self)
This paper investigates the performance of a multi-disk storage system equipped with a segmented disk cache processing a workload of multiple relational scans. Prefetching is a popular method of improving the performance of scans. Many modern disks have a multisegment cache which can be used for prefetching. We observe that, exploiting declustering as a data placement method, prefetching in a segmented cache causes a load imbalance among several disks. A single disk becomes a bottleneck, degrading performance of the entire system. A variation in disk queue length is a primary factor of the imbalance. Using a precise simulation model, we investigate several approaches to achieving better balancing. Our metrics are a scan response time for the closed-end system and an ability to sustain a workload without saturating for the open-end system. We arrive at two main conclusions: (1) Prefetching in main memory is inexpensive and effective for balancing and can supplement or substitute prefetc...
Handling Heterogeneity in Shared-Disk File Systems
- In Proceedings of the 2003 ACM/IEEE Conference on Supercomputing (SC ’03), 2003
Cited by 6 (1 self)
We develop and evaluate a system for load management in shared-disk file systems built on clusters of heterogeneous computers. The system generalizes load balancing and server provisioning. It balances file metadata workload by moving file sets among cluster server nodes. It also responds to changing server resources that arise from failure and recovery and dynamically adding or removing servers. The system is adaptive and self-managing. It operates without any a-priori knowledge of workload properties or the capabilities of the servers. Rather, it continuously tunes load placement using a technique called adaptive, non-uniform (ANU) randomization. ANU randomization realizes the scalability and metadata reduction benefits of hash-based, randomized placement techniques. It also avoids hashing's drawbacks: load skew, inability to cope with heterogeneity, and lack of tunability. Simulation results show that our load-management algorithm performs comparably to a prescient algorithm.
Revisiting pipelined parallelism in multi-join query processing
- In VLDB, 2005
Cited by 6 (1 self)
Multi-join queries are the core of any integration service that integrates data from multiple distributed data sources. Due to the large number of data sources and possibly high volumes of data, the evaluation of multi-join queries faces increasing scalability concerns. State-of-the-art parallel multi-join query processing commonly assumes that the application of maximal pipelined parallelism leads to superior performance. In this paper, we instead illustrate that this assumption does not generally hold. We investigate how best to combine pipelined parallelism with alternate forms of parallelism to achieve an overall effective processing strategy. A segmented bushy processing strategy is proposed. Experimental studies are conducted on an actual software system over a cluster of high-performance PCs. The experimental results confirm that the proposed solution leads to about 50% improvement in terms of total processing time in comparison to existing state-of-the-art solutions.
Distributed Evaluation of Continuous Equi-join Queries over Large Structured Overlay Networks
- In ICDE 2006, 2005

