Results 1  10
of
27
PUMMA: Parallel Universal Matrix Multiplication Algorithms on Distributed Memory Concurrent Computers
, 1993
"... 05, NASA Ames Research Center, Moffet Field, CA 94035 134. William C. Skamarock, 3973 Escuela Court, Boulder, CO 80301 135. Richard Smith, Los Alamos National Laboratory, Group T3, Mail Stop B2316, Los Alamos, NM 87545 136. Peter Smolarkiewicz, National Center for Atmospheric Research, MMM Group, ..."
Abstract

Cited by 59 (10 self)
 Add to MetaCart
05, NASA Ames Research Center, Moffet Field, CA 94035 134. William C. Skamarock, 3973 Escuela Court, Boulder, CO 80301 135. Richard Smith, Los Alamos National Laboratory, Group T3, Mail Stop B2316, Los Alamos, NM 87545 136. Peter Smolarkiewicz, National Center for Atmospheric Research, MMM Group, P. O. Box 3000, Boulder, CO 80307 137. Jurgen Steppeler, DWD, Frankfurterstr 135, 6050 Offenbach, WEST GERMANY 138. Rick Stevens, Mathematics and Computer Science Division, Argonne National Laboratory, 9700 South Cass Avenue, Argonne, IL 60439 139. Paul N. Swarztrauber, National Center for Atmospheric Research, P. O. Box 3000, Boulder, CO 80307 140. Wei Pai Tang, Department of Computer Science, University of Waterloo, Waterloo, Ontario, Canada N2L 3G1 141. Harold Trease, Los Alamos National Laboratory, Mail Stop B257, Los Alamos, NM 87545 142. Robert G. Voigt, ICASE, MS 132C, NASA Langley Research Center, Hampton, VA 23665 143. Mary F. Wheeler, Rice University, Department of Mathematical Sc
ScaLAPACK: A Linear Algebra Library for MessagePassing Computers
 In SIAM Conference on Parallel Processing
, 1997
"... This article outlines the content and performance of some of the ScaLAPACK software. ScaLAPACK is a collection of mathematical software for linear algebra computations on distributedmemory computers. The importance of developing standards for computational and messagepassing interfaces is discusse ..."
Abstract

Cited by 27 (3 self)
 Add to MetaCart
This article outlines the content and performance of some of the ScaLAPACK software. ScaLAPACK is a collection of mathematical software for linear algebra computations on distributedmemory computers. The importance of developing standards for computational and messagepassing interfaces is discussed. We present the different components and building blocks of ScaLAPACK and provide initial performance results for selected PBLAS routines and a subset of ScaLAPACK driver routines.
The Design and Evolution of Zipcode
 Parallel Computing
, 1994
"... Zipcode is a messagepassing and processmanagement system that was designed for multicomputers and homogeneous networks of computers in order to support libraries and largescale multicomputer software. The system has evolved significantly over the last five years, based on our experiences and iden ..."
Abstract

Cited by 21 (9 self)
 Add to MetaCart
Zipcode is a messagepassing and processmanagement system that was designed for multicomputers and homogeneous networks of computers in order to support libraries and largescale multicomputer software. The system has evolved significantly over the last five years, based on our experiences and identified needs. Features of Zipcode that were originally unique to it, were its simultaneous support of static process groups, communication contexts, and virtual topologies, forming the "mailer" data structure. Pointtopoint and collective operations reference the underlying group, and use contexts to avoid mixing up messages. Recently, we have added "gathersend" and "receivescatter" semantics, based on persistent Zipcode "invoices," both as a means to simplify message passing, and as a means to reveal more potential runtime optimizations. Key features in Zipcode appear in the forthcoming MPI standard. Keywords: Static Process Groups, Contexts, Virtual Topologies, PointtoPoint Communica...
Comparison of Scalable Parallel Matrix Multiplication Libraries
 in Proceedings of the Scalable Parallel Libraries Conference, Starksville, MS
, 1993
"... This paper compares two general library routines for performing parallel distributed matrix multiplication. The PUMMA algorithm utilizes block scattered data layout, whereas BiMMeR utilizes virtual 2D torus wrap. The algorithmic differences resulting from these different layouts are discussed as we ..."
Abstract

Cited by 17 (2 self)
 Add to MetaCart
This paper compares two general library routines for performing parallel distributed matrix multiplication. The PUMMA algorithm utilizes block scattered data layout, whereas BiMMeR utilizes virtual 2D torus wrap. The algorithmic differences resulting from these different layouts are discussed as well as the general issues associated with different data layouts for library routines. Results on the Intel Delta for the two matrix multiplication algorithms are presented. 1. Introduction Matrix multiplication is a standard algorithm that is an important computational kernel in many applications including eigensolvers [3] and LU factorization [15]. Utilizing matrix multiplication is one of the principal ways of achieving high efficiency block algorithms in packages such as LAPACK [2]. The BLAS 3 routines were added to achieve this block performance on computers, and optimized versions are available on most serial machines [10]. For matrix multiplication, the BLAS 3 routine XGEMM is availa...
Matrix Multiplication On The Intel Touchstone Delta
, 1993
"... . Matrix multiplication is a key primitive in block matrix algorithms such as those found in LAPACK. We present results from our study of matrix multiplication algorithms on the Intel Touchstone Delta, a distributed memory messagepassing architecture with a twodimensional mesh topology. We obtain ..."
Abstract

Cited by 17 (2 self)
 Add to MetaCart
. Matrix multiplication is a key primitive in block matrix algorithms such as those found in LAPACK. We present results from our study of matrix multiplication algorithms on the Intel Touchstone Delta, a distributed memory messagepassing architecture with a twodimensional mesh topology. We obtain an implementation that uses communications primitives highly suited to the Delta and exploits the single node assemblycoded matrix multiplication. Our algorithm is completely general, able to deal with arbitrary mesh aspect ratios and matrix dimensions, and has achieved parallel efficiency of 86% with overall peak performance in excess of 8 Gflops on 256 nodes for an 8800 \Theta 8800 matrix. We describe our algorithm design and implementation, and present performance results that demonstrate scalability and robust behavior over varying mesh topologies. 1. Introduction Multiplication of two matrices is one of the most basic operations of scientific computing. Versions for serial computers h...
The Multicomputer Toolbox  FirstGeneration Scalable Libraries
, 1993
"... "Firstgeneration" scalable parallel libraries have been achieved, and are maturing, within the Multicomputer Toolbox. The Toolbox includes sparse, dense, iterative linear algebra, a stiff ODE/DAE solver, and an open software technology for additional numerical algorithms, plus an interarchitecture ..."
Abstract

Cited by 11 (8 self)
 Add to MetaCart
"Firstgeneration" scalable parallel libraries have been achieved, and are maturing, within the Multicomputer Toolbox. The Toolbox includes sparse, dense, iterative linear algebra, a stiff ODE/DAE solver, and an open software technology for additional numerical algorithms, plus an interarchitecture Makefile mechanism for building applications. We have devised Cbased strategies for useful classes of distributed data structures, including distributed matrices and vectors. The underlying Zipcodemessage passing system has enabled processgrid abstractions of multicomputers, communication contexts, and process groups, all characteristics needed for building scalable libraries, and scalable application software. We describe the datadistributionindependent approach to building scalable libraries, which is needed so that applications do not unnecessarily have to redistribute data at high expense. We discuss the strategy used for implementing datadistribution mappings. We also describe hig...
A PolyAlgorithm for Parallel Dense Matrix Multiplication on TwoDimensional Process Grid Topologies
, 1995
"... In this paper, we present several new and generalized parallel dense matrix multiplication algorithms of the form C = αAB + βC on twodimensional process grid topologies. These algorithms can deal with rectangular matrices distributed on rectangular grids. We classify these algorithms coh ..."
Abstract

Cited by 10 (0 self)
 Add to MetaCart
In this paper, we present several new and generalized parallel dense matrix multiplication algorithms of the form C = αAB + βC on twodimensional process grid topologies. These algorithms can deal with rectangular matrices distributed on rectangular grids. We classify these algorithms coherently into three categories according to the communication primitives used and thus we offer a taxonomy for this family of related algorithms. All these algorithms are represented in the data distribution independent approach and thus do not require a specific data distribution for correctness. The algorithmic compatibility condition result shown here ensures the correctness of the matrix multiplication. We define and extend the data distribution functions and introduce permutation compatibility and algorithmic compatibility. We also discuss a permutation compatible data distribution (modified virtual 2D data distribution). We conclude that no single algorithm always achieves the best performance...
The Multicomputer Toolbox: Current and Future Directions
 Proceedings of the Scalable Parallel Libraries Conference. IEEE Computer
, 1993
"... The Multicomputer Toolbox is a set of "firstgeneration " scalable parallel libraries. The Toolbox includes sparse, dense, direct and iterative linear algebra, a stiff ODE/DAE solver, and an open software technology for additional numerical algorithms. The Toolbox has an objectoriented design; Cbas ..."
Abstract

Cited by 6 (1 self)
 Add to MetaCart
The Multicomputer Toolbox is a set of "firstgeneration " scalable parallel libraries. The Toolbox includes sparse, dense, direct and iterative linear algebra, a stiff ODE/DAE solver, and an open software technology for additional numerical algorithms. The Toolbox has an objectoriented design; Cbased strategies for classes of distributed data structures (including distributed matrices and vectors) as well as uniform calling interfaces are defined. At a high level in the Toolbox, datadistributionindependence (DDI) support is provided. DDI is needed to build scalable libraries, so that applications do not have to redistribute data before calling libraries. Datadistributionindependent mapping functions implement this capability. Datadistributionindependent algorithms are sometimes more efficient than fixeddata distribution counterparts, because redistribution of data can be avoided. Underlying the system is a "performance and portability layer," which includes interfaces to sequent...
Dense and Iterative Concurrent Linear Algebra in the Multicomputer Toolbox
 in Proceedings of the Scalable Parallel Libraries Conference (SPLC '93
, 1993
"... The Multicomputer Toolbox includes sparse, dense, and iterative scalable linear algebra libraries. Dense direct, and iterative linear algebra libraries are covered in this paper, as well as the distributed data structures used to implement these algorithms; concurrent BLAS are covered elsewhere. We ..."
Abstract

Cited by 5 (0 self)
 Add to MetaCart
The Multicomputer Toolbox includes sparse, dense, and iterative scalable linear algebra libraries. Dense direct, and iterative linear algebra libraries are covered in this paper, as well as the distributed data structures used to implement these algorithms; concurrent BLAS are covered elsewhere. We discuss uniform calling interfaces and functionality for linear algebra libraries. We include a detailed explanation of how the level3 dense LU factorization works, including features that support data distribution independence with a blocked algorithm. We illustrate the data motion for this algorithm, and for a representative iterative algorithm, PCGS. We conclude that data distribution independent libraries are feasible and highly desirable. Much work remains to be done in performance tuning of these algorithms, though good portability and applicationrelevance have already been achieved.