Results 1 - 10
of
12
SEEKing the Truth about Ad Hoc Join Costs
- VLDB Journal
, 1993
"... In this paper, we reexamine the results of prior work on methods for computing ad hoc joins. We develop a detailed cost model for predicting join algorithm performance, and we use the model to develop cost formulas for the major ad hoc join methods found in the relational database literature. We s ..."
Abstract
-
Cited by 37 (4 self)
- Add to MetaCart
In this paper, we reexamine the results of prior work on methods for computing ad hoc joins. We develop a detailed cost model for predicting join algorithm performance, and we use the model to develop cost formulas for the major ad hoc join methods found in the relational database literature. We show that various pieces of "common wisdom" about join algorithm performance fail to hold up when analyzed carefully, and we use our detailed cost model to derive optimal buffer allocation schemes for each of the join methods examined here. We show that optimizing their buffer allocations can lead to large performance improvements, e.g., as much as a 400% improvement in some cases. We also validate our cost model's predictions by measuring an actual implementation of each join algorithm considered. The results of this work should be directly useful to implementors of relational query optimizers and query processing systems. 1 Introduction The join of two sets of tuples is a fundament...
Join Algorithm Costs Revisited
- VLDB Journal
, 1996
"... A method of analysing join algorithms based upon the time required to access, transfer and perform the relevant CPU based operations on a disk page is proposed. The costs of variations of several of the standard join algorithms, including nested block, sort-merge, GRACE hash and hybrid hash, are pre ..."
Abstract
-
Cited by 27 (3 self)
- Add to MetaCart
A method of analysing join algorithms based upon the time required to access, transfer and perform the relevant CPU based operations on a disk page is proposed. The costs of variations of several of the standard join algorithms, including nested block, sort-merge, GRACE hash and hybrid hash, are presented. For a given total buffer size, the cost of these join algorithms depends on the parts of the buffer allocated for each purpose (for example, when joining two relations using the nested block join algorithm the amount of buffer space allocated for the outer and inner relations can significantly affect the cost of the join). Analysis of expected and experimental results of various join algorithms show that a combination of the optimal nested block and optimal GRACE hash join algorithms usually provide the greatest cost benefit. Algorithms to quickly determine the buffer allocation producing the minimal cost for each of these algorithms are presented. 1 Introduction In the past, the ...
Disk-Tape Joins: Synchronizing Disk and Tape Access
- In Joint International Conference on Measurement and Modeling of ComputerSystems. SIGMETRICS '95/PERFORMANCE '95
, 1995
"... Today large amounts of data are stored on tertiary storage media such as magnetic tapes and optical disks. DBMSs typically operate only on magnetic disks since they know how to maneuver disks and how to optimize accesses on them. Tertiary devices present a problem for DBMSs since these devices have ..."
Abstract
-
Cited by 23 (3 self)
- Add to MetaCart
Today large amounts of data are stored on tertiary storage media such as magnetic tapes and optical disks. DBMSs typically operate only on magnetic disks since they know how to maneuver disks and how to optimize accesses on them. Tertiary devices present a problem for DBMSs since these devices have dismountable media and have very different operational characteristics compared to magnetic disks. For instance, most tape drives offer very high capacity at low cost but are accessed sequentially, involve lengthy latencies, and deliver lower bandwidth. Typically, the scope of a DBMS's query optimizer does not include tertiary devices, and the DBMS might not even know how to control and operate upon tertiary-resident data. In a three-level hierarchy of storage devices (main memory, disk, tape), the typical solution is to elevate tape-resident data to disk devices, thus bringing such data into the DBMS' control, and then to perform the required operations on disk. This requires additional spa...
Fast Joins Using Join Indices
- The VLDB Journal
, 1998
"... Two new algorithms, "Jive-join" and "Slam-join," are proposed for computing the join of two relations using a join index. The algorithms are duals: Jive-join range-partitions input relation tuple-ids then processes each partition, while Slam-join forms ordered runs of input relation tuple-ids and th ..."
Abstract
-
Cited by 18 (2 self)
- Add to MetaCart
Two new algorithms, "Jive-join" and "Slam-join," are proposed for computing the join of two relations using a join index. The algorithms are duals: Jive-join range-partitions input relation tuple-ids then processes each partition, while Slam-join forms ordered runs of input relation tuple-ids and then merges the results. Both algorithms make a single sequential pass through each input relation, in addition to one pass through the join index and two passes through a temporary file whose size is half that of the join index. Both algorithms require only that the number of blocks in main memory is of the order of the square root of the number of blocks in the smaller relation. By storing intermediate and final join results in a vertically partitioned fashion, our algorithms need to manipulate less data in memory at a given time than other algorithms. The algorithms are resistant to data skew and adaptive to memory fluctuations. Selection conditions can be incorporated into the algorithms. ...
Parallel Query Processing Using Shared Memory Multiprocessors and Disk Arrays
, 1992
"... my research. I will also remember that it was from them that I learned how to appreciate a good beer and enjoy a good party. I would like to thank my fellow students Yongdong Wang and Chuen-tsai Sun for their valuable friendship and for all their help. I also would like to thank Guangrui Zhu and Ya ..."
Abstract
-
Cited by 16 (0 self)
- Add to MetaCart
my research. I will also remember that it was from them that I learned how to appreciate a good beer and enjoy a good party. I would like to thank my fellow students Yongdong Wang and Chuen-tsai Sun for their valuable friendship and for all their help. I also would like to thank Guangrui Zhu and Yan Wei for being two special friends and making my life more interesting. Many thanks v also go to my college friends Yuzheng Ding and Jiyang Liu. Our communications have always been an inspiring source in my life. Although my parents and my sister are an ocean away, they have offered me their constant love and encouragement throughout my study. I would like to take this opportunity to thank them for everything they have done for me. Last, but the most, I would like to thank my dear wife, Nanyan Xiong. Without her love, understanding and support throughout my Ph.D. program, this thesis would not have been possible. This thesis is dedicated to her as a small token of my deep appreciation.
Optimization of query evaluation algorithms
- ACM Trans. Database Syst
, 1979
"... A model of database storage and access is presented. The model represents many evaluation algorithms as special cases, and helps to break a complex algorithm into simple access operations. Generalized access cost equations associated with the model are developed and analyzed. Optimization of these c ..."
Abstract
-
Cited by 11 (0 self)
- Add to MetaCart
A model of database storage and access is presented. The model represents many evaluation algorithms as special cases, and helps to break a complex algorithm into simple access operations. Generalized access cost equations associated with the model are developed and analyzed. Optimization of these cost equations yields an optimal access algorithm which can be synthesized by a query subsystem whose design is based on the modular access operations.
Relational joins for data on tertiary storage
- IN PROC. INTL. CONF. ON DATA ENGINEERING
, 1997
"... Despite the steady decrease in secondary storage prices and continuous, sometimes drastic, improvements in storage density, the data storage requirements of many organizations cannot be met economically using secondary storage alone. Tertiary storage solutions, especially automated magnetic tape lib ..."
Abstract
-
Cited by 8 (1 self)
- Add to MetaCart
Despite the steady decrease in secondary storage prices and continuous, sometimes drastic, improvements in storage density, the data storage requirements of many organizations cannot be met economically using secondary storage alone. Tertiary storage solutions, especially automated magnetic tape libraries, o er a much lower storage cost at a level of functionality that satis es the needs of many application types. Database management systems (DBMS) typically do not incorporate tertiary storage devices such as magnetic tapes and optical disks as a rst-class citizen in the storage hierarchy. The typical solution in bringing tertiary-resident data under the control of a DBMS is to use operating system facilities to copy the data to secondary storage, and then to perform query optimization and execution as if the data had been in secondary storage all along. This approach fails to recognize the opportunities for saving execution time and storage space if the data were accessed on tertiary devices directly and in parallel with other I/Os. In this paper we examine ways of joining two relations stored on magnetic tapes. Our earlier work has shown that when one relation is stored on tape and the other on disk, a parallel I/O variant of Nested Block Join performs quite well, given su cient main memory space. Work presented in this paper extends by considering the case where both relations are larger than available disk space. To tackle main memory size limitations, we focus on hashing-based solutions. We modi ed Grace Hash Join to handle a range of tape relation sizes and to exploit parallelism between disk and tape I/Os. We showhow disk and main memory space a ect the performance of tertiary joins and demonstrate how parallel I/O helps these methods save execution time as well as memory and disk space.
Faster Joins, Self-Joins and Multi-Way Joins Using Join Indices
- Data and Knowledge Engineering
, 1998
"... We propose a new algorithm, called Stripe-join, for performing a join given a join index. Stripe-join is inspired by an algorithm called "Jive-join" developed by Li and Ross. Stripe-join makes a single sequential pass through each input relation, in addition to one pass through the join index and tw ..."
Abstract
-
Cited by 6 (2 self)
- Add to MetaCart
We propose a new algorithm, called Stripe-join, for performing a join given a join index. Stripe-join is inspired by an algorithm called "Jive-join" developed by Li and Ross. Stripe-join makes a single sequential pass through each input relation, in addition to one pass through the join index and two passes through a set of temporary files that contain tuple identifiers but no input tuples. Stripe-join performs this efficiently even when the input relations are much larger than main memory, as long as the number of blocks in main memory is of the order of the square root of the number of blocks in the participating relations. Stripe-join is particularly efficient for self-joins. To our knowledge, Stripe-join is the first algorithm that, given a join index and a relation significantly larger than main memory, can perform a selfjoin with just a single pass over the input relation and without storing input tuples in intermediate files. Almost all the I/O is sequential, thus minimizing the...
Towards Optimal Storage Design for Efficient Query Processing in Relational Database Systems
, 1995
"... The placement of records, and the methods used to access them, can significantly affect the performance of query processing in a database management system. By making use of information about query patterns and their frequencies, we aim to design file organisations which optimally cluster records th ..."
Abstract
-
Cited by 6 (1 self)
- Add to MetaCart
The placement of records, and the methods used to access them, can significantly affect the performance of query processing in a database management system. By making use of information about query patterns and their frequencies, we aim to design file organisations which optimally cluster records through the use of indexes. Many different record indexing techniques can be used to cluster records. Multiattribute hash indexing is the indexing technique which we use to demonstrate the effectiveness of our proposals. We describe algorithms which exploit a clustering arrangement for range queries, join queries and other relational queries, and describe the costs of these algorithms. We compare the performance of various optimisation techniques for the problem of optimally clustering records using each of these algorithms. In general, designing optimal indexes to cluster records is NP-hard. We show that by combining heuristic and combinatorial algorithms, near-optimal indexes can be constru...
Joining Massive High-Dimensional Datasets
- In Proc. ICDE
, 2003
"... We consider the problem of joining massive datasets. We propose two techniques for minimizing disk I/O cost of join operations for both spatial and sequence data. Our techniques optimize the available buffer space using a global view of the datasets. We build a boolean matrix on the pages of the giv ..."
Abstract
-
Cited by 3 (1 self)
- Add to MetaCart
We consider the problem of joining massive datasets. We propose two techniques for minimizing disk I/O cost of join operations for both spatial and sequence data. Our techniques optimize the available buffer space using a global view of the datasets. We build a boolean matrix on the pages of the given datasets using a lower bounding distance predictor. The marked entries of this matrix represent candidate page pairs to be joined. Our first technique joins the marked pages iteratively. Our second technique clusters the marked entries using rectangular dense regions that have minimal perimeter and fit into buffer. These clusters are then ordered so that the total number of common pages between consecutive clusters is maximal. The clusters are then read from disk and joined. Our experimental results on various real datasets show that our techniques are 2 to 86 times faster than the competing techniques for spatial datasets, and 13 to 133 times faster than the competing techniques for sequence datasets.

