## (De)Composition Rules for Parallel Scan and Reduction (1998)

Venue: | In Proc. 3rd Int. Working Conf. on Massively Parallel Programming Models (MPPM'97) |

Citations: | 9 - 1 self |

### BibTeX

@INPROCEEDINGS{Gorlatch98(de)compositionrules,

author = {Sergei Gorlatch and Christian Lengauer},

title = {(De)Composition Rules for Parallel Scan and Reduction},

booktitle = {In Proc. 3rd Int. Working Conf. on Massively Parallel Programming Models (MPPM'97)},

year = {1998},

pages = {23--32},

publisher = {IEEE Computer Society Press}

}

### Abstract

We study the use of well-defined building blocks for SPMD programming of machines with distributed memory. Our general framework is based on homomorphisms, functions that capture the idea of data parallelism and have a close correspondence with collective operations of the MPI standard, e.g., scan and reduction. We prove two composition rules: under certain conditions, a composition of a scan and a reduction can be transformed into one reduction, and a composition of two scans into one scan. As an example of decomposition, we transform a segmented reduction into a composition of partial reduction and all-gather. The performance gain and overhead of the proposed composition and decomposition rules are assessed analytically for the hypercube and compared with the estimates for some other parallel models.
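The two composition rules can be illustrated with a minimal Python sketch. It rests on assumptions the abstract does not spell out: the scan uses multiplication and the reduction uses addition, so that the scan operator distributes over the reduction operator (one plausible reading of "certain conditions"), and the combined operator `pair_op` on pairs is an illustrative construction, not necessarily the paper's exact definition. All function names here are hypothetical.

```python
from functools import reduce
from itertools import accumulate
import operator


def scan(op, xs):
    """Inclusive prefix scan: [x1, x1 op x2, ...]."""
    return list(accumulate(xs, op))


def pair_op(a, b):
    """Combine (sum of prefix products, total product) for two adjacent segments.

    Assumed operator for this sketch; it is associative because
    multiplication distributes over addition.
    """
    s1, p1 = a
    s2, p2 = b
    return (s1 + p1 * s2, p1 * p2)


def scan_then_reduce(xs):
    """Unfused: multiplicative scan followed by additive reduction."""
    return reduce(operator.add, scan(operator.mul, xs))


def fused_reduce(xs):
    """Fused: a single reduction over pairs, then take the first component."""
    return reduce(pair_op, [(x, x) for x in xs])[0]


def scan_then_scan(xs):
    """Unfused: multiplicative scan followed by additive scan."""
    return scan(operator.add, scan(operator.mul, xs))


def fused_scan(xs):
    """Fused: a single scan over pairs, then take first components."""
    return [s for s, _ in scan(pair_op, [(x, x) for x in xs])]


xs = [2, 3, 4, 5]
print(scan_then_reduce(xs), fused_reduce(xs))
print(scan_then_scan(xs), fused_scan(xs))
```

In this sketch each fused version replaces two collective-style passes with one, at the price of a more expensive operator working on pairs; that trade-off is the kind of gain versus overhead the abstract says is assessed analytically.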

### Citations

1363 | The Essence of Functional Programming
- Wadler
- 1992
Citation Context ...s like map and reduce to quite complex patterns of parallelism like general divide-and-conquer [9] or more special-purpose skeletons. We discuss one particular class of skeletons, called homomorphisms [1], which lie in the middle of this range. Due to their simplicity, homomorphism skeletons are well suited for the study of transformations in the quest for better performance. 1 The usefulness of homom... |

1148 |
Using MPI: Portable Parallel Programming with the Message-Passing Interface
- Gropp, Lusk, et al.
- 1999
Citation Context ...x homomorphism be decomposed into simpler homomorphisms, with the result of improved performance? We present one such rule for segmented reduction, which is available as MPIAllred in the MPI standard [11]. • Performance: How portable are skeleton implementations? Whereas the proposed (de)composition rules are implementation-independent, their impact on the target performance depends on the particul... |

224 |
The cube-connected cycles: a versatile network for parallel computation
- Preparata, Vuillemin
- 1979
Citation Context ...$\oplus, \otimes))$ (19) $\oplus \Updownarrow \otimes = \bigcirc_{d=k}^{1} \, (\mathit{swap}_d\,(\oplus, \otimes))$ (20) Note that the CDH and IDH can be viewed as formal descriptions of the classes of so-called ascending and descending algorithms [16]. 4.2. Segmented Reduction and Scan Theorem 3 provides a parallel implementation schema for all DHs. In a particular case, we just need to customize operators $\oplus$ and $\otimes$; for instance, the hyperc... |

84 | Universal computing
- McColl
- 1996
Citation Context ...ine used. We provide parametrized analytical estimates for the proposed rules in the hypercube model, and demonstrate that the equilibrium of benefits and overhead changes for other models (e.g., BSP [14]) and implementations. All these issues are studied within a common transformational functional framework. We aim at SPMD programs on distributed memory, with collective operations of the MPI standard... |

71 |
Foundations of Parallel Programming
- Skillicorn
- 1994
Citation Context ...uce function inits, which yields all initial segments of a list: $\mathit{inits}\,[x_1, x_2, \ldots, x_n] = [\,[x_1], [x_1, x_2], \ldots, [x_1, x_2, \ldots, x_n]\,]$ and a few standard BMF transformations [17]: $\mathit{map}\,(f \circ g) = \mathit{map}\,f \circ \mathit{map}\,g$ (12) $\mathit{scan}\,(\oplus) = \mathit{map}\,(\mathit{red}\,(\oplus)) \circ \mathit{inits}$ (13) $\mathit{inits} \circ \mathit{map}\,f = \mathit{map}\,(\mathit{map}\,f) \circ \mathit{inits}$ (14) $\mathit{inits} \circ \mathit{scan}\,(\oplus) = \mathit{map}\,(\mathit{scan}\,(\oplus)) \circ \mathit{inits}$ (15) The composit... |

60 | Powerlist: A structure for parallel recursion
- Misra
- 1994
Citation Context ... implementation is derived in [10]. In this section, we first introduce briefly some necessary notation and then consider the use of DH in the design of parallel programs. DH is defined on powerlists [15] of length $2^k$, $k = 0, 1, \ldots$, with balanced concatenation. The definition makes use of function zip, which combines elements of two lists of equal length with operator $\odot$: $\mathit{zip}(\odot)\,([x_1, \ldots$ ... |

58 |
Algebraic identities for program calculation
- Bird
- 1989
Citation Context ...tention. An overview of some current work can be found in [5]. Transformation work related to ours has been carried out in BMF: a version of Theorem 1 in the sequential setting was proved and used in [2, 18]. An analog of our operator $\mathit{red}\,(\langle \oplus, \ldots \rangle)$, called recur-reduce, has been used to parallelize linear recurrences [3] and later to tabulate parallel implementations of linearly recursive programs [19]. ... |

55 |
Algorithmic skeletons: a structured approach to the management of parallel computation
- Cole
- 1988
Citation Context ...ce-guided transformations of a program composed of such primitives. The approach has its roots in the world of functional programming where programming paradigms are captured as algorithmic skeletons [4]. An algorithmic skeleton is a higher-order function (a functional schema) which takes as parameters so-called customizing functions needed in a specific application. If the parallel nature of the pro... |

27 | Systematic efficient parallelization of scan and other list homomorphisms
- Gorlatch
- 1996
Citation Context ... adapted easily to other topologies, e.g., meshes and multistage networks. 4.1. Distributable Homomorphisms (DH) A specialized subclass of homomorphisms, the class of distributable homomorphisms (DH) [7], is particularly suited for the hypercube. A generic, architecture-independent implementation is derived in [10]. In this section, we first introduce briefly some necessary notation and then consider ... |

18 | Systematic extraction and implementation of divide-and-conquer parallelism
- Gorlatch
- 1996
Citation Context ...can be turned into homomorphisms when tupled with auxiliary functions. Basically, this increases parallelism at the price of extra computations; a method for finding auxiliary functions is proposed in [6, 8]. More complex problems may require a composition or nest of several (almost-)homomorphisms. • Implementation: How can the homomorphism skeleton be implemented efficiently on parallel computers? For... |

14 | Virtual data structures
- Swierstra, Moor
- 1992
Citation Context ...tention. An overview of some current work can be found in [5]. Transformation work related to ours has been carried out in BMF: a version of Theorem 1 in the sequential setting was proved and used in [2, 18]. An analog of our operator $\mathit{red}\,(\langle \oplus, \ldots \rangle)$, called recur-reduce, has been used to parallelize linear recurrences [3] and later to tabulate parallel implementations of linearly recursive programs [19]. ... |

13 |
Calculating recurrences using the Bird-Meertens Formalism
- Cai, Skillicorn
- 1994
Citation Context ... in BMF: a version of Theorem 1 in the sequential setting was proved and used in [2, 18]. An analog of our operator $\mathit{red}\,(\langle \oplus, \ldots \rangle)$, called recur-reduce, has been used to parallelize linear recurrences [3] and later to tabulate parallel implementations of linearly recursive programs [19]. Our transformation rule (11) differs in that it aims directly at the scan-reduce composition, and is arguably more ... |

9 | Parallelizing functional programs by generalization
- Geser, Gorlatch
- 1997
Citation Context ...can be turned into homomorphisms when tupled with auxiliary functions. Basically, this increases parallelism at the price of extra computations; a method for finding auxiliary functions is proposed in [6, 8]. More complex problems may require a composition or nest of several (almost-)homomorphisms. • Implementation: How can the homomorphism skeleton be implemented efficiently on parallel computers? For... |

9 | Optimizing Compositions of Scans and Reductions in Parallel Program Derivation
- Gorlatch
- 1997
Citation Context ...on and optimization techniques. The skeletons considered in the literature range from very simple functions like map and reduce to quite complex patterns of parallelism like general divide-and-conquer [9] or more special-purpose skeletons. We discuss one particular class of skeletons, called homomorphisms [1], which lie in the middle of this range. Due to their simplicity, homomorphism skeletons are w... |

5 | Formal Derivation of Divide-and-Conquer Programs: A Case Study in the Multidimensional FFT's - Gorlatch, Bischof - 1997 |

4 |
Introduction to Parallel Computing
- Kumar, et al.
- 1994
Citation Context ... to studying the implementation of these standard primitives on particular parallel architectures. We choose here one particular network topology: the hypercube with cut-through routing. As argued in [13], for a large class of problems, hypercube algorithms are asymptotically as fast as the optimal PRAM algorithms, and they can be adapted easily to other topologies, e.g., meshes and multistage network... |

2 | Parallel implementations of combinations of broadcast, reduction and scan
- Wedler, Lengauer
- 1997
Citation Context ...2, 18]. An analog of our operator $\mathit{red}\,(\langle \oplus, \ldots \rangle)$, called recur-reduce, has been used to parallelize linear recurrences [3] and later to tabulate parallel implementations of linearly recursive programs [19]. Our transformation rule (11) differs in that it aims directly at the scan-reduce composition, and is arguably more convenient for use in algorithm design and MPI programming in practice. To the best... |

1 |
The BSP Tutorial at EuroPar '97
- Hill, Skillicorn
- 1997
Citation Context ...ations of scan and reduction may exemplify completely different communication patterns and, thus, must be assessed separately. Let us assess the scan-scan transformation in the BSP model. As shown in [12], the optimal, "transpose scan" algorithm has BSP costs: $2 \cdot (m \cdot g + l) + m$ where l is the cost of the barrier synchronization, and g is the single-word delivery cost, both normalized by th... |
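The BMF identities (12) to (15) quoted in the citation context for Foundations of Parallel Programming can be checked on small inputs. The following Python sketch is only an illustration: the helper names (`inits`, `red`, `scan`, `fmap`, `compose`, `check_identities`) are hypothetical, lists stand in for the paper's data structures, and the test functions and operator are arbitrary choices rather than anything taken from the paper.

```python
from functools import reduce
from itertools import accumulate
import operator


def inits(xs):
    """All non-empty initial segments of xs, shortest first."""
    return [xs[:i] for i in range(1, len(xs) + 1)]


def red(op):
    """Reduction of a non-empty list with a binary operator."""
    return lambda xs: reduce(op, xs)


def scan(op):
    """Inclusive prefix scan with a binary operator."""
    return lambda xs: list(accumulate(xs, op))


def fmap(f):
    """Element-wise application of f (the 'map' of the identities)."""
    return lambda xs: [f(x) for x in xs]


def compose(f, g):
    """Function composition: apply g first, then f."""
    return lambda x: f(g(x))


def check_identities(xs, f, g, op):
    """Return one boolean per identity (12)-(15), evaluated on xs."""
    return [
        fmap(compose(f, g))(xs) == compose(fmap(f), fmap(g))(xs),            # (12)
        scan(op)(xs) == compose(fmap(red(op)), inits)(xs),                   # (13)
        compose(inits, fmap(f))(xs) == compose(fmap(fmap(f)), inits)(xs),    # (14)
        compose(inits, scan(op))(xs) == compose(fmap(scan(op)), inits)(xs),  # (15)
    ]


results = check_identities([1, 2, 3, 4], lambda x: x + 1, lambda x: 2 * x, operator.add)
print(results)
```

A check on one input is not a proof, but it makes the shape of each identity concrete; for instance, (13) says that a scan yields the reduction of every initial segment.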