SUMMA: Scalable Universal Matrix Multiplication Algorithm, 1997
Cited by 66 (4 self)
In this paper, we give a straightforward, highly efficient, scalable implementation of common matrix multiplication operations. The algorithms are much simpler than previously published methods, yield better performance, and require less work space. MPI implementations are given, as are performance results on the Intel Paragon system.
PUMMA: Parallel Universal Matrix Multiplication Algorithms on Distributed Memory Concurrent Computers, 1993
Cited by 60 (12 self)
The Design and Implementation of the ScaLAPACK LU, QR, and Cholesky Factorization Routines, 1994
Cited by 24 (11 self)
This paper discusses the core factorization routines included in the ScaLAPACK library. These routines allow the factorization and solution of a dense system of linear equations via LU, QR, and Cholesky. They are implemented using a block cyclic data distribution, and are built using de facto standard kernels for matrix and vector operations (BLAS and its parallel counterpart PBLAS) and message passing communication (BLACS). In implementing the ScaLAPACK routines, a major objective was to parallelize the corresponding sequential LAPACK using the BLAS, BLACS, and PBLAS as building blocks, leading to straightforward parallel implementations without a significant loss in performance. We present the details of the implementation of the ScaLAPACK factorization routines, as well as performance and scalability results on the Intel iPSC/860, Intel Touchstone Delta, and Intel Paragon systems.
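The block cyclic data distribution mentioned in this abstract can be illustrated with a small index-mapping sketch. It is not taken from the paper: the function names `owner_and_local` and `owner_2d` are hypothetical, and the sketch assumes the distribution starts at process 0 and uses zero-based indices.

```python
def owner_and_local(g, nb, nprocs):
    """Map a global index g to (owning process, local index) under a
    one-dimensional block-cyclic distribution.

    Assumptions (not from the source): zero-based indices, block size nb,
    and the first block assigned to process 0.
    """
    block = g // nb                      # which block the index falls in
    proc = block % nprocs                # blocks are dealt out round-robin
    local = (block // nprocs) * nb + g % nb  # position inside the owner's storage
    return proc, local


def owner_2d(i, j, mb, nb, prow, pcol):
    """Apply the 1-D mapping independently to rows and columns of a
    prow x pcol process grid with mb x nb blocks."""
    pr, li = owner_and_local(i, mb, prow)
    pc, lj = owner_and_local(j, nb, pcol)
    return (pr, pc), (li, lj)
```

For example, with block size 2 on 3 processes, global indices 0-1 go to process 0, 2-3 to process 1, 4-5 to process 2, and 6-7 wrap around to process 0 again.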
Comparison of Scalable Parallel Matrix Multiplication Libraries, in Proceedings of the Scalable Parallel Libraries Conference, Starkville, MS, 1993
Cited by 17 (2 self)
This paper compares two general library routines for performing parallel distributed matrix multiplication. The PUMMA algorithm utilizes block scattered data layout, whereas BiMMeR utilizes virtual 2D torus wrap. The algorithmic differences resulting from these different layouts are discussed as well as the general issues associated with different data layouts for library routines. Results on the Intel Delta for the two matrix multiplication algorithms are presented. 1. Introduction Matrix multiplication is a standard algorithm that is an important computational kernel in many applications including eigensolvers [3] and LU factorization [15]. Utilizing matrix multiplication is one of the principal ways of achieving high efficiency block algorithms in packages such as LAPACK [2]. The BLAS 3 routines were added to achieve this block performance on computers, and optimized versions are available on most serial machines [10]. For matrix multiplication, the BLAS 3 routine XGEMM is availa...
Matrix Multiplication on the Intel Touchstone Delta, 1993
Cited by 17 (2 self)
Matrix multiplication is a key primitive in block matrix algorithms such as those found in LAPACK. We present results from our study of matrix multiplication algorithms on the Intel Touchstone Delta, a distributed memory message-passing architecture with a two-dimensional mesh topology. We obtain an implementation that uses communications primitives highly suited to the Delta and exploits the single node assembly-coded matrix multiplication. Our algorithm is completely general, able to deal with arbitrary mesh aspect ratios and matrix dimensions, and has achieved parallel efficiency of 86% with overall peak performance in excess of 8 Gflops on 256 nodes for an 8800 × 8800 matrix. We describe our algorithm design and implementation, and present performance results that demonstrate scalability and robust behavior over varying mesh topologies. 1. Introduction Multiplication of two matrices is one of the most basic operations of scientific computing. Versions for serial computers h...
The PRISM Project: Infrastructure and Algorithms for Parallel Eigensolvers, 1994
Cited by 12 (6 self)
The goal of the PRISM project is the development of infrastructure and algorithms for the parallel solution of eigenvalue problems. We are currently investigating a complete eigensolver based on the Invariant Subspace Decomposition Algorithm for dense symmetric matrices (SYISDA). After briefly reviewing SYISDA, we discuss the algorithmic highlights of a distributed-memory implementation of this approach. These include a fast matrix-matrix multiplication algorithm, a new approach to parallel band reduction and tridiagonalization, and a harness for coordinating the divide-and-conquer parallelism in the problem. We also present performance results of these kernels as well as the overall SYISDA implementation on the Intel Touchstone Delta prototype. 1. Introduction Computation of eigenvalues and eigenvectors is an essential kernel in many applications, and several promising parallel algorithms have been investigated [29, 24, 3, 27, 21]. The work presented in this paper is part of the PRI...
A High Performance Parallel Strassen Implementation, Parallel Processing Letters, Vol 6, 1995
Cited by 5 (0 self)
In this paper, we give what we believe to be the first high performance parallel implementation of Strassen's algorithm for matrix multiplication. We show how, under restricted conditions, this algorithm can be implemented plug compatible with standard parallel matrix multiplication algorithms. Results obtained on a large Intel Paragon system show a 10-20% reduction in execution time compared to what we believe to be the fastest standard parallel matrix multiplication implementation available at this time. 1 Introduction In Strassen's algorithm, the total time complexity of the matrix multiplication is reduced by replacing it with smaller matrix multiplications together with a number of matrix additions, thereby reducing the operation count. A net reduction in execution time is attained only if the reduction in multiplications offsets the increase in additions. This requires the matrices to be relatively large before a net gain is observed. The advantage of using parallel architectures...
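The trade-off described in this abstract, fewer multiplications paid for with extra additions, can be made concrete with a serial sketch. This is not the paper's parallel implementation: it is the textbook seven-product Strassen recursion in plain Python, with hypothetical helper names and an assumed `cutoff` parameter below which the standard product is used. The two `flops_*` functions count scalar operations for the standard product and for a single level of Strassen, under the assumption that each half-size product is done the standard way.

```python
def matmul(A, B):
    """Standard triple-loop product of two square lists-of-lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def sub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def split(A):
    """Return the four quadrants A11, A12, A21, A22 of an even-sized matrix."""
    h = len(A) // 2
    return ([r[:h] for r in A[:h]], [r[h:] for r in A[:h]],
            [r[:h] for r in A[h:]], [r[h:] for r in A[h:]])

def strassen(A, B, cutoff=2):
    """Strassen recursion; falls back to matmul for small or odd sizes."""
    n = len(A)
    if n <= cutoff or n % 2:
        return matmul(A, B)
    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)
    # Seven half-size products instead of eight.
    M1 = strassen(add(A11, A22), add(B11, B22), cutoff)
    M2 = strassen(add(A21, A22), B11, cutoff)
    M3 = strassen(A11, sub(B12, B22), cutoff)
    M4 = strassen(A22, sub(B21, B11), cutoff)
    M5 = strassen(add(A11, A12), B22, cutoff)
    M6 = strassen(sub(A21, A11), add(B11, B12), cutoff)
    M7 = strassen(sub(A12, A22), add(B21, B22), cutoff)
    C11 = add(sub(add(M1, M4), M5), M7)
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(add(sub(M1, M2), M3), M6)
    # Reassemble the quadrants into one matrix.
    return ([l + r for l, r in zip(C11, C12)] +
            [l + r for l, r in zip(C21, C22)])

def flops_standard(n):
    """n^3 multiplications plus n^2 (n - 1) additions."""
    return 2 * n**3 - n**2

def flops_one_level(n):
    """One Strassen level: 7 standard half-size products, 18 half-size additions."""
    h = n // 2
    return 7 * flops_standard(h) + 18 * h * h
```

Under this simple operation count, one Strassen level costs more than the standard product at n = 8 and less at n = 16. The break-even size on real hardware is a separate question, since it also depends on memory traffic and, in the parallel case, communication.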
The Impact of HPF Data Layout on the Design of Efficient and Maintainable Parallel Linear Algebra Libraries
Cited by 4 (3 self)
In this document, we are concerned with the effects of data layouts for nonsquare processor meshes on the implementation of common dense linear algebra kernels such as matrix-matrix multiplication, LU factorizations, or eigenvalue solvers. In particular, we address ease of programming and tunability of the resulting software. We introduce a generalization of the torus wrap data layout that results in a decoupling of "local" and "global" data layout view. As a result, it allows for intuitive programming of linear algebra algorithms and for tuning of the algorithm for a particular mesh aspect ratio or machine characteristics. This layout is as simple as the proposed HPF layout but, in our opinion, enhances ease of programming as well as ease of performance tuning. We emphasize that we do not advocate that all users need be concerned with these issues. We do, however, believe that for the foreseeable future "assembler coding" (as message-passing code is likely to be viewed from an HPF pro...
CRPC Research into Linear Algebra Software for High Performance Computers, 1994
Cited by 4 (2 self)
In this paper we look at a number of approaches being investigated in the Center for Research on Parallel Computation (CRPC) to develop linear algebra software for high-performance computers. These approaches are exemplified by the LAPACK, templates, and ARPACK projects. LAPACK is a software library for performing dense and banded linear algebra computations, and was designed to run efficiently on high performance computers. We focus on the design of the distributed memory version of LAPACK, and on an object-oriented interface to LAPACK. The templates project aims at making the task of developing sparse linear algebra software simpler and easier. Reusable software templates are provided that the user can then customize to modify and optimize a particular algorithm, and hence build more complex applications. ARPACK is a software package for solving large scale eigenvalue problems, and is based on an implicitly restarted variant of the Arnoldi scheme. The paper focuses on issues impact...
Experiences of Parallelising Finite-element Problems in a Functional Style, 1995
Cited by 1 (0 self)
In this paper we demonstrate: (a) the relative simplicity of the functional approach for parallelizing a complex program compared with the conventional procedural approach; (b) the suitability of functional languages for prototyping parallel algorithms to improve an implementation; and (c) the considerable assistance provided by the simulator