## Optimizing equijoin queries in distributed databases where relations are hash partitioned (1991)

### Cached

### Download Links

- [www.cis.njit.edu]
- [www.cis.njit.edu]
- [web.njit.edu]
- DBLP

### Other Repositories/Bibliography

Venue: | ACM TODS |

Citations: | 12 - 0 self |

### BibTeX

@ARTICLE{Shasha91optimizingequijoin,

author = {Dennis Shasha and Tsong-li Wang},

title = {Optimizing equijoin queries in distributed databases where relations are hash partitioned},

journal = {ACM TODS},

year = {1991},

volume = {16},

pages = {279--308}

}

### Years of Citing Articles

### OpenURL

### Abstract

Consider the class of distributed database systems consisting of a set of nodes connected by a high bandwidth network. Each node consists of a processor, a random access memory, and a slower but much larger memory such as a disk. There is no shared memory among the nodes. The data are horizontally partitioned often using a hash function. Such a description characterizes many parallel or distributed database systems that have recently been proposed, both commercial and academic. We study the optimization problem that arises when the query processor must repartition the relations and intermediate results participating in a multijoin query. Using estimates of the sizes of intermediate relations, we show (1) optimum solutions for closed chain queries; (2) the NP-completeness of the optimization problem for star, tree, and general graph queries; and (3) effective heuristics for these hard cases. Our general approach and many of our results extend to other attribute partitioning schemes, for example, sort-partitioning on attributes, and to partitioned object databases.

### Citations

10888 |
Computers and Intractability: A guide to the theory of NP-completeness
- Garey, Johnson
- 1979
(Show Context)
Citation Context ...ny other relation in the tree.g So the number of repartitioning must be at least the sum of these nodecosts. We show that the special case is NP-hard by transforming the following hitting set problem =-=[14]-=- to it. Hitting Set. Given a collection T of subsets of a finite set S and a positive integer K, decide whether there is a subset S’ of S with IIS II s K such that S’ contains at least one element fro... |

531 | Shortest connection networks and some generalizations - Prim - 1957 |

244 | Extending the database relational model to capture more meaning
- Codd
(Show Context)
Citation Context ...n p #, one copy of SUP-PART on s #, and another copy of SUP-PART on p #. When repartitioning Tz to SUP-PART. p #, instead of sending its complete tuples, the second strategy could send the surrogates =-=[8]-=- of tuples in SUP-PART, along with the complete tuples in SUP. These surrogates would then be materialized at each node by consulting the local fragment of SUP-PART that is partitioned on p #. Suppose... |

177 | Join Processing in Database Systems with Large Main Memories - Shapiro - 1986 |

100 | Data placement in Bubba - Copeland, Alexander, et al. - 1988 |

81 | Gamma - A High Performance Dataflow Database Machine - DeWitt, Gerber, et al. - 1986 |

71 |
Join and semijoin algorithms for a multiprocessor database machine
- Valduriez, Gardarin
- 1984
(Show Context)
Citation Context ...systems that have recently been developed, including commercially available ones such as the Teradata machine [33], as well as research prototypes such as Bubba [91, Gamma [111, Grace [18], and Sabre =-=[34]-=-. In this paper we are concerned with optimizing equijoin queries in such systems. We restrict ourselves to queries that retrieve all fields of joined relations. 2 For our purposes, we assume that rel... |

65 |
Query Size Estimation by Adaptive Sampling
- Lipton, Naughton
- 1990
(Show Context)
Citation Context ...ed on “A practical approach would be to estimate the selectivity factors like those used in most real systems (see, e.g., [26]). Readers may also refer to Gardy and Puech [13] and Lipton and Naughton =-=[21]-=- for theoretical analyses of the sizes of intermediate relations. ACM Transactions on Database Systems, Vol 16, No. 2, June 1991.s306 . D. Shasha and T. L. Wang {R. A, S. C}, treating {R. B, S. D} as ... |

60 |
Optimization algorithms for distributed queries
- Apers, Hevner, et al.
- 1983
(Show Context)
Citation Context ... query can be processed without data transfer. They proposed a linear time “fragment and replicate” algorithm for nonlocally processable queries. Relevant work has also been discussed by Apers et al. =-=[1]-=-, Wong [36], Pramanik and Vineyard [22], Stamos and Young [30], and others. What distinguishes us from these workers is that we seek the optimal join order when processing queries, starting with parti... |

55 |
Implementing a relational database by means of specialized hardware
- Babb
- 1979
(Show Context)
Citation Context ... repartition the relations and intermediate results participating in a multijoin query. Using estimates of the sizes of intermediate relations, we show (1) optimum solutions for closed chain queries; =-=(2)-=- the NP-completeness of the optimization problem for star, tree, and general graph queries; and (3) effective heuristics for these hard cases. Our general approach and many of our results extend to ot... |

55 | Application of Hash to Data Base Machine and Its Architecture - Kitsuregawa, Tanaka, et al. - 1983 |

27 |
A new way to compute the product and join of relations
- Kim
- 1980
(Show Context)
Citation Context ...d to be repartitioned. Also, we consider the measure of processing cost to be the number of bytes moved from disks, as opposed to the number of page fetches from disks often assumed in the literature =-=[4, 17, 24]-=-, though the two measures differ by only a constant additive factor. By including factors such as page size, 1/0 speed, and bandwidth, one can convert the byte units to corresponding time unit. Next, ... |

22 |
Access path selection in a relational database system
- Selinger, Astrahan, et al.
- 1979
(Show Context)
Citation Context ... S. D} }, where R is partitioned on A and S on C. One can process the join based on “A practical approach would be to estimate the selectivity factors like those used in most real systems (see, e.g., =-=[26]-=-). Readers may also refer to Gardy and Puech [13] and Lipton and Naughton [21] for theoretical analyses of the sizes of intermediate relations. ACM Transactions on Database Systems, Vol 16, No. 2, Jun... |

18 |
On the effect of join operations on relation sizes
- Gardy, Puech
- 1989
(Show Context)
Citation Context ... One can process the join based on “A practical approach would be to estimate the selectivity factors like those used in most real systems (see, e.g., [26]). Readers may also refer to Gardy and Puech =-=[13]-=- and Lipton and Naughton [21] for theoretical analyses of the sizes of intermediate relations. ACM Transactions on Database Systems, Vol 16, No. 2, June 1991.s306 . D. Shasha and T. L. Wang {R. A, S. ... |

12 |
Tree queries: a simple class of relational queries
- Goodman, Shmueli
- 1982
(Show Context)
Citation Context ...y we assume the reader is familiar with the standard terms, attribute, tuple, join, used in relational database systems. Define a relation schema R as a finite set of attributes { Al, AZ, . . . . A.} =-=[16]-=-. Associated with each attribute AZ, 1 s i s n, is a domain, denoted dom( A ,). A relation instance (or simply reldiorz) R on schema R is a finite set of mappings { tl, t2, . , tm} from R to the set o... |

10 |
Optimization of Join Operations in Horizontally Partitioned Database Systems
- Segev
(Show Context)
Citation Context ... spanning tree algorithm performs well when chains in a query graph are small. When chains are long, then a hybrid algorithm using the chain algorithm with Kruskal’s is best. Like other relevant work =-=[15, 25]-=-, the heuristics presented in the paper relied on the knowledge of selectivity factors, which were used to find out the sizes of intermediate results. This information, however, is not known a 14 Fort... |

9 |
Optimizing join queries in distributed databases
- Pramanik, Vineyard
- 1988
(Show Context)
Citation Context ...ransfer. They proposed a linear time “fragment and replicate” algorithm for nonlocally processable queries. Relevant work has also been discussed by Apers et al. [1], Wong [36], Pramanik and Vineyard =-=[22]-=-, Stamos and Young [30], and others. What distinguishes us from these workers is that we seek the optimal join order when processing queries, starting with partitioned data and ending with partitioned... |

9 |
Fragmentation: A technique for efficient query processing
- SACCO
- 1986
(Show Context)
Citation Context ...d to be repartitioned. Also, we consider the measure of processing cost to be the number of bytes moved from disks, as opposed to the number of page fetches from disks often assumed in the literature =-=[4, 17, 24]-=-, though the two measures differ by only a constant additive factor. By including factors such as page size, 1/0 speed, and bandwidth, one can convert the byte units to corresponding time unit. Next, ... |

6 | Query optimization in object-oriented database systems - Sun, Meng, et al. - 1990 |

2 |
An optimal algorithm for processing distributed star queries
- LI, K
- 1985
(Show Context)
Citation Context ...ng (1) Consider the query q = {{ R. A,’l’ B}, {2’ B,S C}, {R. A,S C}, {T. B, U. D}} (2) q’s Jom graph (3) q’ = {{R A, T.B}, {7’ B,S C}, {T, B,U D}} is equivalent to q, (4) { It, A,S. C} is redundant, =-=(5)-=- q+= {{ R, A, T. B}, {T, B, S. C}, {R A,S C}, {T, B, U, D}, {R A, U. D}, Fig. 4, Join graph representation of a query and its closure (1) Consider the query q and Its closure (expressed in terms of th... |

1 |
Parallel algorithms for the execution of relational database operations
- DEWITT, WILKINSON
- 1983
(Show Context)
Citation Context ...is also processed. (3) Then when joining ST with R based on, say {S. F, R. E}, the clause {T. B, R. D) enforces another selection condition on tuples of ST and R (see Figure 2c). (b) Tc1 s T C3 R (c) =-=(4)-=- Consider the query graph in Figure 2a. When joining S and T on the clause {S.C, 2’. B}, S needs to be partitioned on C and T on B. Then in joining ST and R, (i) if {S. F, R. E} is used as the join cl... |

1 | A methodology for interpreting tree queries mto optimal - CHIU, Ho, et al. - 1991 |

1 |
Relational algebra machine GRACE
- KITSUREGAWA, TANAKA, et al.
- 1983
(Show Context)
Citation Context ...ibuted database systems that have recently been developed, including commercially available ones such as the Teradata machine [33], as well as research prototypes such as Bubba [91, Gamma [111, Grace =-=[18]-=-, and Sabre [34]. In this paper we are concerned with optimizing equijoin queries in such systems. We restrict ourselves to queries that retrieve all fields of joined relations. 2 For our purposes, we... |

1 | Join Processing in a Symmetric Parallel Environment - Shasha, Spirakis - 1985 |

1 | A distributed database version of INGRES - NEUHOL - 1977 |

1 |
Dynamic rematerlalization: Processing distributed queries using redundant data
- WoN
- 1983
(Show Context)
Citation Context ... be processed without data transfer. They proposed a linear time “fragment and replicate” algorithm for nonlocally processable queries. Relevant work has also been discussed by Apers et al. [1], Wong =-=[36]-=-, Pramanik and Vineyard [22], Stamos and Young [30], and others. What distinguishes us from these workers is that we seek the optimal join order when processing queries, starting with partitioned data... |

1 |
P Algorithms to process distributed queries in fast local networks
- BRILL, D, et al.
- 1987
(Show Context)
Citation Context ... introduced the notion of remote semijoin, which can significantly reduce the communication cost while incurring higher processing cost. Fragments may also be replicated at different sites. Yu et al. =-=[38]-=- distinguished queries as locally processable and nonlocally processable. If a query is locally processable, then the query can be processed without data transfer. They proposed a linear time “fragmen... |