## Runtime Support for Multi-Tier Programming of Block-Structured Applications on SMP Clusters (1997)

Venue: | International Scientific Computing in Object-Oriented Parallel Environments Conference (ISCOPE ’97) |

Citations: | 16 - 4 self |

### BibTeX

@INPROCEEDINGS{Fink97runtimesupport,

author = {Stephen J. Fink and Scott B. Baden},

title = {Runtime Support for Multi-Tier Programming of Block-Structured Applications on SMP Clusters},

booktitle = {International Scientific Computing in Object-Oriented Parallel Environments Conference (ISCOPE ’97)},

year = {1997},

pages = {pages},

publisher = {Springer-Verlag}

}

### Abstract

We present a small set of programming abstractions to simplify efficient implementations for block-structured scientific calculations on SMP clusters. We have implemented these abstractions in KeLP 2.0, a C++ class library. KeLP 2.0 provides hierarchical SPMD control flow to manage two levels of parallelism and locality. Additionally, to tolerate high inter-node communication costs, KeLP 2.0 combines inspector/executor communication analysis with overlap of communication and computation. We illustrate how these programming abstractions hide the low-level details of thread management, scheduling, synchronization, and message-passing, but allow the programmer to express efficient algorithms with intuitive geometric primitives.

1 Introduction

Multi-tier parallel computers, such as clusters of symmetric multiprocessors (SMPs), have emerged as important platforms for high-performance computing [1]. A multi-tier computer, with several levels of locality and parallelism, presents a more c...
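As a rough illustration of the "two levels of parallelism and locality" mentioned in the abstract, the sketch below BLOCK-partitions a one-dimensional index range first across nodes and then across the threads of each node. It is not KeLP code and does not use the KeLP API: `Range`, `blockPiece`, and `twoTierPartition` are hypothetical names introduced only for this example, and a real block-structured application would partition multi-dimensional regions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Half-open index range [lo, hi). Hypothetical stand-in for a 1-D region;
// this is NOT a KeLP type.
struct Range {
    std::size_t lo;
    std::size_t hi;
    std::size_t size() const { return hi - lo; }
};

// BLOCK-partition r into `parts` contiguous pieces and return piece `index`.
// The first (size % parts) pieces each receive one extra element.
Range blockPiece(Range r, std::size_t parts, std::size_t index) {
    assert(parts > 0 && index < parts);
    const std::size_t base = r.size() / parts;
    const std::size_t extra = r.size() % parts;
    const std::size_t lo = r.lo + index * base + (index < extra ? index : extra);
    const std::size_t len = base + (index < extra ? 1 : 0);
    return Range{lo, lo + len};
}

// Two-tier partition: first across nodes, then across threads within each node.
// plan[n][t] is the range assigned to thread t of node n.
std::vector<std::vector<Range>> twoTierPartition(Range global,
                                                 std::size_t nodes,
                                                 std::size_t threadsPerNode) {
    std::vector<std::vector<Range>> plan(nodes);
    for (std::size_t n = 0; n < nodes; ++n) {
        const Range nodeBlock = blockPiece(global, nodes, n);
        for (std::size_t t = 0; t < threadsPerNode; ++t) {
            plan[n].push_back(blockPiece(nodeBlock, threadsPerNode, t));
        }
    }
    return plan;
}
```

For example, `twoTierPartition(Range{0, 10}, 2, 2)` gives node 0 the block [0, 5) and node 1 the block [5, 10), and each node then splits its block between its two threads. Reusing the same partitioning function at both levels is a simplification chosen for this sketch; the two tiers could equally use different decompositions.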

### Citations

970 | High Performance Fortran Language Specification, Version 2.0
- High Performance Fortran Forum
- 1997
Citation Context ...er to use multi-tier platforms efficiently, the programmer or compiler must orchestrate parallelism and locality to match the hardware capabilities. On single-tier parallel computers, MPI [2] and HPF [3] have emerged as standard approaches to portable parallel programming. However, the proper programming model for multi-tier parallel computers remains an unresolved issue. At present, the programmer f... |

798 | A High-Performance, Portable Implementation of the MPI Message Passing Interface Standard
- Gropp, Lusk, et al.
- 1996
Citation Context ...ital AlphaServer 2100's running Digital UNIX 4.0. Each SMP has four Alpha 21064A processors, and each processor has a 4MB direct-mapped L2 cache. For inter-node communication, we rely on MPICH 1.0.12 [8] over an OC-3 ATM switch. Using a simple ring test we observe a message start time of 745 s and a peak bandwidth of 12 MB/sec. Unfortunately, we encountered severe problems with Digital UNIX 4.0 sched... |

304 | Designing and Building Parallel Programs
- Foster
- 1994
Citation Context ...relaxation to solve Poisson's equation over a cube. The usual SPMD implementation employs a BLOCK data decomposition and carries additional ghost cells to buffer off-processor data (see, for example, [8]). Each relaxation consists of two steps: (1) communicate with nearest neighbors to exchange ghost cell values, and (2) independently relax on the local portion of the global mesh. Fig. 1 shows the sk... |

292 | The Fortran D Language Specification
- Fox, Hiranandani, et al.
- 1990
Citation Context ...urs through messages. The remaining rows report results from multi-tier KeLP implementations, with and without overlap of communication. For multi-tier codes, we first partition the work with a BLOCK [12] decomposition between nodes, and then with a second-level partitioning within each node. The results show that the best multi-tier implementation outperforms the message-passing code by 186% on redbl... |

267 | The NAS Parallel Benchmarks 2.0
- Bailey, Harris, et al.
- 1995
Citation Context ...ail elsewhere [10]. The third application is the NAS-FT benchmark, which solves a 3D diffusion equation using Fast Fourier Transform. We obtained MPI code for FT from the NAS Parallel Benchmarks v2.1 [11]. The multi-tier KeLP versions add a second level of parallelism to node-level kernels with domain decomposition. To overlap communication and computation, we pipeline the FFTs across iterations with ... |

78 | SUMMA: Scalable Universal Matrix Multiplication Algorithm. LAPACK Working Note 99, technical report
- Geijn, Watts
- 1995
Citation Context ...s described in Section 3. The MPI version of this code uses BLOCK data decomposition on all three axes. The second application, SUMMA, implements dense matrix multiplication using the SUMMA algorithm [9]. The MPI code for SUMMA is listed in [9] and was made publicly available by the authors. The serial matrix multiply kernel calls vendor-provided BLAS. The multi-tier KeLP code for SUMMA uses a two-le... |

56 | An integrated runtime and compile-time approach for parallelizing structured and block structured applications
- Agrawal, Sussman, et al.
- 1995
Citation Context ...ck data decomposition. Alternatively, a FloorPlan can represent distribution of work among processors of a single SMP. The MotionPlan implements a first-class, user-level block communication schedule [7]. The programmer builds and manipulates MotionPlans using geometric Region calculus operations. 2.2 Hierarchical Control Flow The multi-tier KeLP abstractions support three levels of control: a collec... |

55 | SIMPLE: A methodology for programming high performance algorithms on clusters of symmetric multiprocessors (SMPs)
- Bader, JáJá
- 1999
Citation Context ...ents a more complex, explicitly parallel model, but allows the programmer to express a wider class of algorithms and exert more control over the implementation. Bader and JáJá have developed SIMPLE [17], a set of collective communication operations for SMP clusters. SIMPLE provides more general, lower-level primitives than KeLP 2.0, and does not help with data decomposition or overlap of communicati... |

44 | Modeling parallel computers as memory hierarchies
- Alpern, Carter, et al.
- 1993
Citation Context ...rresponding to collective(Y) and node(X) levels. Control flow in multi-tier KeLP programming model extends Snyder's XYZ program levels to multi-tier machines. Alpern, Carter, and Ferrante's PMH model [14] provides an elegant framework for multi-tier parallel architectures. The Cedar Fortran language [15] was perhaps the first language to incorporate two levels of parallelism in order to match a hierar... |

38 | Flexible communication mechanisms for dynamic structured applications
- Fink, Kohn, et al.
- 1996
Citation Context ...97 We present a small set of programming abstractions to simplify implementation of efficient algorithms for block-structured scientific calculations on SMP clusters. This paper extends previous work [5] with two contributions specifically targeted for multi-tier architectures: hierarchical SPMD control flow, and overlap of communication and computation. We show how high level abstractions hide tedio... |

21 | A Parallel Software Infrastructure for Dynamic Block-Irregular Scientific Calculations
- Kohn
- 1995
Citation Context ...ge-passing costs. 2 Programming Abstractions 2.1 Structural Abstraction The KeLP programming abstractions extend structural abstraction, a programming model introduced in the LPARX programming system [6]. Under structural abstraction, first-class meta-data objects represent the geometric structure of a calculation. Previous work describes KeLP abstractions to manage irregular block data decomposition... |

20 | Perspective on supercomputing: Three decades of change
- Woodward
- 1996
Citation Context ... intuitive geometric primitives. 1 Introduction Multi-tier parallel computers, such as clusters of symmetric multiprocessors (SMPs), have emerged as important platforms for high-performance computing [1]. A multi-tier computer, with several levels of locality and parallelism, presents a more complex non-uniform memory hierarchy than a single-tier multicomputer with uniprocessor nodes. In order to use... |

16 | Foundations of practical parallel programming languages
- Snyder
- 1993
Citation Context ...6]. KeLP's communication model combines structural abstraction with inspector/executor communication analysis as introduced in Multiblock PARTI [7]. In the Phase Abstractions programming model, Snyder [13] advocated separation of programs into levels corresponding to collective(Y) and node(X) levels. Control flow in multi-tier KeLP programming model extends Snyder's XYZ program levels to multi-tier mac... |

15 | A taxonomy of programming models for symmetric multiprocessors and SMP clusters, Proc. of the conference on Programming Models for Massively Parallel Computers
- Gropp, Lusk
- 1995
Citation Context ... present, the programmer faces myriad options regarding the coordination of heavyweight processes, lightweight threads, shared memory, message-passing, synchronization, scheduling, and load balancing [4]. This daunting array of low-level programming detail hinders efficient implementations for multi-tier platforms. ? Stephen Fink was supported by the DOE Computational Science Graduate Fellowship Progr... |

14 | Hierarchical Programming for Block-Structured Scientific Calculations
- Fink
- 1998
Citation Context ...plication with a simple domain decomposition. To overlap communication and computation, we developed a multi-tier pipelined version of the SUMMA algorithm, which will be described in detail elsewhere [10]. The third application is the NAS-FT benchmark, which solves a 3D diffusion equation using Fast Fourier Transform. We obtained MPI code for FT from the NAS Parallel Benchmarks v2.1 [11]. The multi-ti... |

13 | An efficient parallel algorithm for the 3-D FFT NAS parallel benchmark
- Agarwal, Gustavson, et al.
- 1994
Citation Context ...ons add a second level of parallelism to node-level kernels with domain decomposition. To overlap communication and computation, we pipeline the FFTs across iterations with the algorithm described in [12]. Fig. 3 reports performance of these codes, scaling the problem size with the number of nodes. The results show that on eight SMP nodes, the multi-tier KeLP code without overlap outperforms the MPI c... |

13 | A general programming model for developing scalable ocean circulation applications
- Sawdey, O’Keefe, et al.
- 1996
Citation Context ...perhaps the first language to incorporate two levels of parallelism in order to match a hierarchical parallel architecture. Some more recent systems explicitly target SMP clusters. Sawdey and O'Keefe [16] have applied the Fortran-P programming model to grid-based applications on SMP clusters. In Fortran-P, a compiler translates serial grid-based code to explicitly threaded parallel code. KeLP 2.0 pres... |

11 | Cedar Fortran and Its Compiler
- Eigenmann, Hoeflinger, et al.
- 1990
Citation Context ...tends Snyder's XYZ program levels to multi-tier machines. Alpern, Carter, and Ferrante's PMH model [14] provides an elegant framework for multi-tier parallel architectures. The Cedar Fortran language [15] was perhaps the first language to incorporate two levels of parallelism in order to match a hierarchical parallel architecture. Some more recent systems explicitly target SMP clusters. Sawdey and O'K... |

11 | Minimizing overhead in parallel algorithms through overlapping communication/computation - Somani, Sansano - 1997 |

2 | pSather: layered extensions to an object-oriented language for efficient parallel computation - Murer, Feldman, et al. - 1993 |