Results 1–10 of 44
Efficient Runtime Support for Irregular Block-Structured Applications
, 1998
Abstract

Cited by 44 (17 self)
Parallel implementations of scientific applications often rely on elaborate dynamic data structures with complicated communication patterns. We describe a set of intuitive geometric programming abstractions that simplify coordination of irregular block-structured scientific calculations without sacrificing performance. We have implemented these abstractions in KeLP, a C++ runtime library. KeLP's abstractions enable the programmer to express complicated communication patterns for dynamic applications, and to tune communication activity with a high-level, abstract interface. We show that KeLP's flexible communication model effectively manages elaborate data motion patterns arising in structured adaptive mesh refinement, and achieves performance comparable to hand-coded message passing on several structured numerical kernels. To appear in J. Parallel and Distributed Computing.
1 Introduction. Many scientific numerical methods employ structured irregular representations to improve accura...
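The geometric abstractions the abstract describes can be illustrated with a small sketch. This is a hypothetical miniature, not KeLP's actual API: a rectangular index `Region` whose intersection with a neighbor's ghost halo yields exactly the section that must be communicated. All class and method names here are invented for illustration.

```python
# Toy KeLP-style region calculus (names invented): rectangular index regions,
# with intersection used to derive a block-structured data-motion pattern.
from dataclasses import dataclass

@dataclass(frozen=True)
class Region:
    """A rectangular index region [lo, hi) in each dimension."""
    lo: tuple
    hi: tuple

    def intersect(self, other):
        lo = tuple(max(a, b) for a, b in zip(self.lo, other.lo))
        hi = tuple(min(a, b) for a, b in zip(self.hi, other.hi))
        if any(l >= h for l, h in zip(lo, hi)):
            return None  # empty intersection: no communication needed
        return Region(lo, hi)

    def grow(self, width):
        """Expand the region by `width` cells on every face (a ghost halo)."""
        return Region(tuple(l - width for l in self.lo),
                      tuple(h + width for h in self.hi))

# Two adjacent 8x8 blocks; the overlap of block b's 1-cell halo with block a
# is exactly the strip a must send to b.
a = Region((0, 0), (8, 8))
b = Region((8, 0), (16, 8))
send = b.grow(1).intersect(a)
print(send)  # Region(lo=(7, 0), hi=(8, 8)) -- a 1x8 strip
```

The appeal of this style, as the abstract argues, is that the communication schedule falls out of geometric queries rather than hand-written message-passing code.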
A Three-Dimensional Approach to Parallel Matrix Multiplication
 IBM Journal of Research and Development
, 1995
Abstract

Cited by 39 (0 self)
A three-dimensional (3D) matrix multiplication algorithm for massively parallel processing systems is presented. The P processors are configured as a "virtual" processing cube with dimensions p1, p2, and p3 proportional to the matrices' dimensions M, N, and K. Each processor performs a single local matrix multiplication of size M/p1 × N/p2 × K/p3. Before the local computation can be carried out, each subcube must receive a single submatrix of A and B. After the single matrix multiplication has completed, K/p3 submatrices of this product must be sent to their respective destination processors and then summed together with the resulting matrix C. The 3D parallel matrix multiplication approach has a factor of P^(1/6) less communication than the 2D parallel algorithms. This algorithm has been implemented on IBM POWERparallel SP2 systems (up to 216 nodes) and has yielded close to the peak performance of the machine. The algorithm has been combined with Winog...
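The P^(1/6) communication advantage quoted above follows from simple per-processor volume counts. The sketch below checks the asymptotic ratio with illustrative constants (the constants are not from the paper):

```python
# Back-of-the-envelope cost model: for an n x n multiply on p processors,
# 2D (Cannon-style) algorithms move ~2 n^2 / sqrt(p) words per processor,
# while the 3D algorithm moves ~2 n^2 / p^(2/3); the ratio is p^(1/6).

def words_2d(n, p):
    """Cannon: sqrt(p) steps, each shifting an (n^2 / p)-word block of A and B."""
    return 2 * (n * n / p) * p ** 0.5          # = 2 n^2 / sqrt(p)

def words_3d(n, p):
    """3D: each processor receives one A and one B block of n^2 / p^(2/3) words."""
    return 2 * n * n / p ** (2 / 3)

n, p = 4096, 4096
ratio = words_2d(n, p) / words_3d(n, p)
print(round(ratio, 6), round(p ** (1 / 6), 6))  # ratio tracks P^(1/6) (here 4.0)
```

For p = 4096 = 2^12 the predicted factor is 2^2 = 4, which the model reproduces exactly.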
PLAPACK: Parallel Linear Algebra Package
, 1997
Abstract

Cited by 34 (9 self)
The PLAPACK project represents an effort to provide an infrastructure for implementing application-friendly, high-performance linear algebra algorithms. The package uses a more application-centric data distribution, which we call Physically Based Matrix Distribution, as well as an object-based (MPI-like) style of programming. It is this style of programming that allows for highly compact codes, written in C but usable from FORTRAN, that more closely reflect the underlying blocked algorithms. We show that this can be attained without sacrificing high performance.
1 Introduction. Parallel implementation of most dense linear algebra operations is a relatively well-understood process. Nonetheless, availability of general-purpose, high-performance parallel dense linear algebra libraries is severely hampered by the fact that translating the sequential algorithms, which typically can be described without filling up more than half a chalkboard, to a parallel code requires careful manipulation ...
Disk Resident Arrays: An Array-Oriented I/O Library for Out-of-Core Computations
, 1996
Abstract

Cited by 31 (10 self)
In out-of-core computations, disk storage is treated as another level in the memory hierarchy, below cache, local memory, and (in a parallel computer) remote memories. However, the tools used to manage this storage are typically quite different from those used to manage access to local and remote memory. This disparity complicates implementation of out-of-core algorithms and hinders portability. We describe a programming model that addresses this problem. This model allows parallel programs to use essentially the same mechanisms to manage the movement of data between any two adjacent levels in a hierarchical memory system. We take as our starting point the Global Arrays shared-memory model and library, which support a variety of operations on distributed arrays, including transfer between local and remote memories. We show how this model can be extended to support explicit transfer between global memory and secondary storage, and we define a Disk Resident Arrays library that supports s...
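The core idea, moving explicit array sections between adjacent memory levels with the same get/put mechanism, can be sketched in miniature. This is a toy analogue, not the Disk Resident Arrays API; all names are invented:

```python
# Toy disk-resident array (names invented): a 1-D float64 array lives in a
# file, and in-core code moves explicit sections to/from disk with put/get,
# mirroring how Global Arrays moves sections between remote and local memory.
import array
import os
import tempfile

class DiskArray:
    ITEM = 8  # bytes per float64 ('d' typecode)

    def __init__(self, path, length):
        self.path, self.length = path, length
        with open(path, "wb") as f:               # allocate, zero-filled
            f.write(b"\0" * length * self.ITEM)

    def put(self, offset, values):
        """Write an in-core section into the disk-resident array."""
        with open(self.path, "r+b") as f:
            f.seek(offset * self.ITEM)
            f.write(array.array("d", values).tobytes())

    def get(self, offset, count):
        """Read a section of the disk-resident array into core."""
        with open(self.path, "rb") as f:
            f.seek(offset * self.ITEM)
            buf = array.array("d")
            buf.frombytes(f.read(count * self.ITEM))
        return list(buf)

path = os.path.join(tempfile.mkdtemp(), "dra.bin")
d = DiskArray(path, 1024)
d.put(100, [1.0, 2.0, 3.0])
print(d.get(99, 5))  # [0.0, 1.0, 2.0, 3.0, 0.0]
```

A real library would add strided multi-dimensional sections and asynchronous, collective transfers; the point here is only the uniform section-move interface across memory levels.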
Communication Overlap in Multi-Tier Parallel Algorithms
 Proceedings of Supercomputing
, 1998
Abstract

Cited by 27 (10 self)
Hierarchically organized multicomputers such as SMP clusters offer new opportunities and new challenges for high-performance computation, but realizing their full potential remains a formidable task. We present a hierarchical model of communication targeted to block-structured, bulk-synchronous applications running on dedicated clusters of symmetric multiprocessors. Our model supports node-level rather than processor-level communication as the fundamental operation, and is optimized for aggregate patterns of regular section moves rather than point-to-point messages. These two capabilities work synergistically. They provide flexibility in overlapping communication and overcome deficiencies in the underlying communication layer on systems where inter-node communication bandwidth is at a premium. We have implemented our communication model in the KeLP 2.0 runtime library. We present empirical results for five applications running on a cluster of Digital AlphaServer 2100's. Four of the applicatio...
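The overlap pattern the abstract refers to, updating cells that need no remote data while the ghost exchange is in flight, can be shown in a minimal threaded sketch. The "communication" here is simulated with a sleep, and all names are invented for illustration:

```python
# Minimal overlap sketch (names invented): a worker thread performs a
# simulated node-level ghost exchange while the main thread updates interior
# cells that need no remote data; both join before the boundary update.
import threading
import time

def exchange_ghosts(ghost):
    time.sleep(0.05)                      # stand-in for node-level communication
    ghost["left"], ghost["right"] = 1.0, 1.0

interior = [0.0] * 8
ghost = {"left": None, "right": None}

comm = threading.Thread(target=exchange_ghosts, args=(ghost,))
comm.start()                              # start communication first ...
interior = [x + 1.0 for x in interior]    # ... and overlap the interior update
comm.join()                               # wait before touching the boundary
boundary = (ghost["left"] + interior[0], interior[-1] + ghost["right"])
print(boundary)  # (2.0, 2.0)
```

The win comes when the interior update takes about as long as the exchange, so the communication time is hidden rather than added.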
A Programming Methodology for Dual-Tier Multicomputers
 IEEE Transactions on Software Engineering
, 1999
Abstract

Cited by 25 (7 self)
Hierarchically organized ensembles of shared memory multiprocessors possess a richer and more complex model of locality than previous-generation multicomputers with single-processor nodes. These dual-tier computers introduce many new degrees of freedom into the programmer's performance model. We present a methodology for implementing block-structured numerical applications on dual-tier computers, and a runtime infrastructure, called KeLP2, that implements the methodology. KeLP2 supports two levels of locality and parallelism via hierarchical SPMD control flow, runtime geometric metadata, and asynchronous collective communication. It effectively overlaps communication in cases where non-blocking point-to-point message passing can fail to tolerate communication latency, either due to an incomplete implementation or because the point-to-point model is inappropriate. KeLP's abstractions hide considerable detail without sacrificing performance, and dual-tier applications written in KeLP...
Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms
Abstract

Cited by 23 (16 self)
One can use extra memory to parallelize matrix multiplication by storing p^(1/3) redundant copies of the input matrices on p processors in order to do asymptotically less communication than Cannon’s algorithm [2], and be faster in practice [1]. We call this algorithm “3D” because it arranges the p processors in a 3D array, and Cannon’s algorithm “2D” because it stores a single copy of the matrices on a 2D array of processors. We generalize these 2D and 3D algorithms by introducing a new class of “2.5D algorithms”. For matrix multiplication, we can take advantage of any amount of extra memory to store c copies of the data, for any c ∈ {1, 2, ..., ⌊p^(1/3)⌋}, to reduce the bandwidth cost of Cannon’s algorithm by a factor of c^(1/2) and the latency cost by a factor of c^(3/2). We also show that these costs reach the lower bounds [13, 3], modulo polylog(p) factors. We similarly generalize LU decomposition to 2.5D and 3D, including communication-avoiding pivoting, a stable alternative to partial pivoting [7]. We prove a novel lower bound on the latency cost of 2.5D and 3D LU factorization, showing that while c copies of the data can also reduce the bandwidth by a factor of c^(1/2), the latency must increase by a factor of c^(1/2), so that the 2D LU algorithm (c = 1) in fact minimizes latency. Preliminary results of 2.5D matrix multiplication on a Cray XT4 machine also demonstrate a performance gain of up to 3X with respect to Cannon’s algorithm. Careful choice of c also yields up to a 2.4X speedup over 3D matrix multiplication, due to a better balance between communication costs.
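The stated cost scalings for 2.5D matrix multiplication can be checked numerically with a sketch of the cost model (constants dropped, so this illustrates only the c-dependence, not absolute costs):

```python
# 2.5D matrix-multiplication cost model, constants dropped: with c replicated
# copies on p processors, bandwidth ~ n^2 / (c p)^(1/2) and latency
# ~ (p / c^3)^(1/2), recovering 2D at c = 1 and 3D at c = p^(1/3).

def bandwidth_25d(n, p, c):
    return n * n / (c * p) ** 0.5

def latency_25d(p, c):
    return (p / c ** 3) ** 0.5

n, p = 8192, 4096
base_bw, base_lat = bandwidth_25d(n, p, 1), latency_25d(p, 1)
for c in (1, 4, 16):  # any c up to floor(p^(1/3)) = 16
    print(c,
          base_bw / bandwidth_25d(n, p, c),    # improves by c^(1/2)
          base_lat / latency_25d(p, c))        # improves by c^(3/2)
```

At c = 4 the bandwidth improves 2x and the latency 8x over Cannon, matching the c^(1/2) and c^(3/2) factors quoted in the abstract (for LU, the abstract notes the latency instead grows by c^(1/2)).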
A framework for adaptive algorithm selection in STAPL
 In Proc. ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming (PPoPP), pp. 277–288
, 2005
Abstract

Cited by 21 (5 self)
Writing portable programs that perform well on multiple platforms or for varying input sizes and types can be very difficult because performance is often sensitive to the system architecture, the runtime environment, and input data characteristics. This is even more challenging on parallel and distributed systems due to the wide variety of system architectures. One way to address this problem is to adaptively select the best parallel algorithm for the current input data and system from a set of functionally equivalent algorithmic options. Toward this goal, we have developed a general framework for adaptive algorithm selection for use in the Standard Template Adaptive Parallel Library (STAPL). Our framework uses machine learning techniques to analyze data collected by STAPL installation benchmarks and to determine tests that will select among algorithmic options at runtime. We apply a prototype implementation of our framework to two important parallel operations, sorting and matrix multiplication, on multiple platforms and show that the framework determines runtime tests that correctly select the best performing algorithm from among several competing algorithmic options in 86–100% of the cases studied, depending on the operation and the system.
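The install-time-benchmark / runtime-test pipeline can be sketched in miniature. This stand-in uses a naive nearest-size lookup rather than STAPL's actual machine-learning models, and all names are invented:

```python
# Toy adaptive algorithm selection (names invented): benchmark each candidate
# once on representative input sizes, then dispatch at runtime by looking up
# the winner benchmarked at the nearest size.
import bisect
import random
import time

def insertion_sort(xs):
    xs = list(xs)
    for i in range(1, len(xs)):
        v, j = xs[i], i - 1
        while j >= 0 and xs[j] > v:
            xs[j + 1] = xs[j]
            j -= 1
        xs[j + 1] = v
    return xs

CANDIDATES = {"insertion": insertion_sort, "timsort": sorted}

def timed(fn, data):
    t0 = time.perf_counter()
    fn(data)
    return time.perf_counter() - t0

def benchmark(sizes, trials=3):
    """Install-time step: record the fastest candidate at each sample size."""
    table = {}
    for n in sizes:
        data = [random.random() for _ in range(n)]
        table[n] = min(CANDIDATES, key=lambda name: min(
            timed(CANDIDATES[name], data) for _ in range(trials)))
    return table

def make_selector(table):
    """Runtime step: run the winner benchmarked at the nearest sample size."""
    sizes = sorted(table)
    def select(xs):
        i = min(bisect.bisect_left(sizes, len(xs)), len(sizes) - 1)
        return CANDIDATES[table[sizes[i]]](xs)
    return select

sort = make_selector(benchmark([16, 512]))
print(sort([3, 1, 2]))  # dispatches by input size; result is always sorted
```

Whichever candidate wins the benchmark, the selector returns a correctly sorted result; only the running time depends on the choice, which is what makes this kind of adaptivity safe to apply automatically.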
Runtime Support for Multi-Tier Programming of Block-Structured Applications on SMP Clusters
 International Scientific Computing in Object-Oriented Parallel Environments Conference (ISCOPE ’97)
, 1997
Abstract

Cited by 16 (4 self)
We present a small set of programming abstractions to simplify efficient implementations for block-structured scientific calculations on SMP clusters. We have implemented these abstractions in KeLP 2.0, a C++ class library. KeLP 2.0 provides hierarchical SPMD control flow to manage two levels of parallelism and locality. Additionally, to tolerate slow inter-node communication costs, KeLP 2.0 combines inspector/executor communication analysis with overlap of communication and computation. We illustrate how these programming abstractions hide the low-level details of thread management, scheduling, synchronization, and message passing, but allow the programmer to express efficient algorithms with intuitive geometric primitives.
1 Introduction. Multi-tier parallel computers, such as clusters of symmetric multiprocessors (SMPs), have emerged as important platforms for high-performance computing [1]. A multi-tier computer, with several levels of locality and parallelism, presents a more c...