Results 1 - 6 of 6
On Supernode Transformation with Minimized Total Running Time
Abstract

Cited by 26 (3 self)
With the objective of minimizing the total execution time of a parallel program on a distributed memory parallel computer, this paper discusses how to find an optimal supernode size and optimal supernode relative side lengths of a supernode transformation (also known as tiling). We identify three parameters of a supernode transformation: supernode size, relative side lengths, and cutting hyperplane directions. For algorithms with perfectly nested loops and uniform dependencies, for sufficiently large supernodes and numbers of processors, and for the case where multiple supernodes are mapped to a single processor, we give an order-n polynomial whose real positive roots include the optimal supernode size. For two special cases, (1) two-dimensional algorithm problems and (2) n-dimensional algorithm problems where the communication cost is dominated by the startup penalty and can therefore be approximated by a constant, we give a closed form expression for the optimal supernode s...
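The size-selection problem described above can be illustrated with a toy model (not the paper's order-n polynomial, whose derivation is not reproduced here): tile N iterations into supernodes of grain g, execute them as a pipeline on P processors where each step pays a computation cost proportional to g plus a fixed startup penalty, and search for the g that minimizes the resulting total time. All parameter names below are illustrative assumptions.

```python
import math

def total_time(g, N, P, c, ts):
    # Toy pipelined-execution model: ceil(N/g) tiles plus P-1 pipeline
    # fill steps; each step computes one tile (c*g) and pays startup ts.
    steps = math.ceil(N / g) + P - 1
    return steps * (c * g + ts)

def best_grain(N, P, c, ts):
    # Brute-force search over integer grain sizes; a closed form for
    # the continuous relaxation would be sqrt(N*ts / ((P-1)*c)).
    return min(range(1, N + 1), key=lambda g: total_time(g, N, P, c, ts))
```

The trade-off the model captures is the one the abstract describes: small supernodes pay the startup penalty too often, large supernodes serialize the pipeline.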
Optimal Grain Size Computation for Pipelined Algorithms
 In Europar'96 Parallel Processing
, 1996
Abstract

Cited by 19 (3 self)
In this paper, we present a method for overlapping communications on parallel computers for pipelined algorithms. We first introduce a general theoretical model which leads to a generic computation scheme for the optimal packet size. Then we use the OPIUM library, which provides an easy-to-use and efficient way to compute this optimal packet size in the general case, on the column LU factorization; the implementation and performance measurements are made on an Intel Paragon.
Keywords: communications overlap, pipelined algorithms, optimal packet size computation.
1 Introduction
Parallel distributed memory machines improve performance and memory capacity, but their use adds an overhead due to communications. To obtain programs that perform and scale well, this overhead must be hidden. Several solutions exist. The choice of a good data distribution is the first step that can be taken to lower the number and the size of communications. Depending on the dependences within the cod...
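A minimal sketch of this kind of optimal-packet-size computation, under an assumed s-stage pipeline cost model (the actual OPIUM model may differ): forwarding a packet of p units one stage costs a startup beta plus tau per unit, and the continuous relaxation of the total time has a closed-form minimizer.

```python
import math

def pipeline_time(p, L, s, beta, tau):
    # Time for L data units to traverse an s-stage pipeline in packets of
    # p units: the last of k = ceil(L/p) packets exits after k + s - 1 hops.
    k = math.ceil(L / p)
    hop = beta + tau * p          # startup plus per-unit cost of one hop
    return (k + s - 1) * hop

def optimal_packet(L, s, beta, tau):
    # Minimizer of the continuous relaxation (L/p + s - 1)(beta + tau*p):
    # setting the derivative to zero gives p* = sqrt(L*beta / ((s-1)*tau)).
    return math.sqrt(L * beta / ((s - 1) * tau))
```

The square-root shape is the familiar balance between startup cost (favoring few large packets) and pipeline fill time (favoring many small ones).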
On Optimal Size and Shape of Supernode Transformations
 Int'l Conf. on Parallel Proc
, 1996
On Supernode Partitioning Hyperplanes for Two Dimensional Algorithms
 Proceedings IASTED
, 1997
Abstract

Cited by 1 (1 self)
Supernode transformation has been proposed to reduce the communication startup cost by grouping a number of iterations of a loop into a supernode, which is assigned to a processor as a single unit. A supernode transformation is specified by n families of hyperplanes which slice the iteration space into parallelepiped supernodes, the grain size of a supernode, and the relative side lengths of the parallelepiped supernode. The total running time is affected by these three factors. In this paper, we discuss how to find supernode partitioning hyperplanes assuming a given grain size. We prove that for two-dimensional algorithms, the supernode partitioning with the two extreme dependence vectors as its hyperplane directions is optimal with respect to the total execution time.
Keywords: supernode partitioning, tiling, parallelizing compilers, distributed memory multicomputer, minimizing running time.
1 INTRODUCTION
A problem in distributed memory parallel systems is the communication startu...
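As a simple illustration of how supernodes execute under uniform dependences (here the easiest case, rectangular tiles with dependence vectors (1,0) and (0,1), not the paper's extreme-vector hyperplanes), tiles on the same anti-diagonal are independent and can be scheduled as a wavefront:

```python
def wavefront_tiles(N, B):
    # Partition an N x N iteration space into B x B tiles and list the
    # tiles executable concurrently at each step.  With dependences (1,0)
    # and (0,1), tile (ti, tj) only needs its left and upper neighbours,
    # so all tiles on the anti-diagonal ti + tj = step are independent.
    T = (N + B - 1) // B                      # tiles per dimension
    return [[(ti, step - ti) for ti in range(T) if 0 <= step - ti < T]
            for step in range(2 * T - 1)]
```

The grain size B controls the trade-off the listing's papers analyze: larger tiles mean fewer communication startups but a longer wavefront fill.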
Time Optimal Supernode Shape For Algorithms With n Extreme Dependence Directions
, 1998
Abstract

Cited by 1 (1 self)
With the objective of minimizing the total execution time of a parallel program on a distributed memory parallel computer, this paper discusses how to find an optimal supernode shape of a supernode transformation (also known as tiling). We assume the communication cost to be dominated by the startup penalty, and therefore approximate it by a constant. We identify three parameters of a supernode transformation: supernode size, relative side lengths, and cutting hyperplane directions. For algorithms with perfectly nested loops and uniform dependencies, we give a closed form for an optimal linear schedule vector and a necessary and sufficient condition for optimal relative side lengths, and for dependence cones with n extreme directions, we prove that the total running time is minimized when the cutting hyperplane directions correspond to the n extreme directions of the dependence cone. The results are derived in continuous space and should for that reason be considered approximate. ...
Communication-aware Supernode Shape
Abstract
Abstract — In this paper we revisit the supernode-shape selection problem, which has been widely discussed in the literature. In general, the selection of the supernode transformation greatly affects the parallel execution time of the transformed algorithm. Since the minimization of the overall parallel execution time via an appropriate supernode transformation is very difficult to accomplish, researchers have focused on scheduling-aware supernode transformations that maximize parallelism during the execution. In this paper we argue that the communication volume of the transformed algorithm is an important criterion, and its minimization should be given high priority. For this reason we define the metric of the per-process communication volume and propose a method to minimize this metric by selecting a communication-aware supernode shape. Our approach is equivalent to defining a proper Cartesian process grid with MPI_Cart_create, which means that it can be incorporated in applications in a straightforward manner. Our experimental results illustrate that by selecting the tile shape with the proposed method, the total parallel execution time is significantly reduced due to the minimization of the communication volume, despite the fact that a few more parallel execution steps are required.
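A sketch of the grid-selection idea under an assumed block decomposition (function names and the volume model are illustrative assumptions, not the paper's exact metric): enumerate the factorizations of the process count P, score each by the per-process halo-exchange volume, and keep the minimum; the winning dims tuple is what one would then pass to MPI_Cart_create.

```python
def grids(P, n):
    # All n-dimensional factorizations of the process count P.
    if n == 1:
        yield (P,)
        return
    for d in range(1, P + 1):
        if P % d == 0:
            for rest in grids(P // d, n - 1):
                yield (d,) + rest

def comm_volume(dims, extents):
    # Per-process halo volume for a block decomposition: the local block
    # has sides extents[i] / dims[i]; its boundary communication is
    # proportional to the sum of its face areas.
    block = [e / d for e, d in zip(extents, dims)]
    vol = 1.0
    for b in block:
        vol *= b
    return sum(vol / b for b in block)   # vol / b = area of one face

def best_grid(P, extents):
    # Grid minimizing per-process communication volume; the result is
    # the dims argument one would hand to MPI_Cart_create.
    return min(grids(P, len(extents)), key=lambda d: comm_volume(d, extents))
```

For a square iteration space the square grid wins, while a skewed space pulls the grid toward its long dimension, which is the shape-adaptation effect the abstract describes.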