Results 1–10 of 41
SUMMA: Scalable universal matrix multiplication algorithm
, 1997
Cited by 65 (4 self)
In this paper, we give a straightforward, highly efficient, scalable implementation of common matrix multiplication operations. The algorithms are much simpler than previously published methods, yield better performance, and require less work space. MPI implementations are given, as are performance results on the Intel Paragon system.
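The computation SUMMA parallelizes can be sketched serially as a sum of rank-nb panel updates (a minimal sketch of the algorithm's structure, not the paper's MPI code; the panel width `nb` is an illustrative parameter):

```python
import numpy as np

def summa_serial(A, B, nb=2):
    """Serial sketch of SUMMA's computation: C is accumulated as a sum
    of rank-nb updates. In the parallel algorithm, each step broadcasts
    a column panel of A along process rows and a row panel of B along
    process columns before every process does its local update."""
    m, k = A.shape
    _, n = B.shape
    C = np.zeros((m, n))
    for s in range(0, k, nb):
        Apanel = A[:, s:s+nb]   # column panel (broadcast row-wise)
        Bpanel = B[s:s+nb, :]   # row panel (broadcast column-wise)
        C += Apanel @ Bpanel    # local rank-nb update
    return C
```

Because each step touches only two narrow panels, the work space per process stays small, which is one of the advantages the abstract claims.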
A Three-Dimensional Approach to Parallel Matrix Multiplication
 IBM Journal of Research and Development
, 1995
Cited by 39 (0 self)
A three-dimensional (3D) matrix multiplication algorithm for massively parallel processing systems is presented. The P processors are configured as a "virtual" processing cube with dimensions p1, p2, and p3 proportional to the matrices' dimensions M, N, and K. Each processor performs a single local matrix multiplication of size M/p1 × N/p2 × K/p3. Before the local computation can be carried out, each subcube must receive a single submatrix of A and B. After the single matrix multiplication has completed, K/p3 submatrices of this product must be sent to their respective destination processors and then summed together with the resulting matrix C. The 3D parallel matrix multiplication approach has a factor P^(1/6) less communication than the 2D parallel algorithms. This algorithm has been implemented on IBM POWERparallel SP2 systems (up to 216 nodes) and has yielded close to the peak performance of the machine. The algorithm has been combined with Winog...
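The block structure described above can be simulated serially (a hedged sketch: the three loops stand in for the p1 × p2 × p3 processor cube, and the innermost loop plays the role of the reduction that sums the K/p3 partial products into C):

```python
import numpy as np

def mm_3d_serial(A, B, p1=2, p2=2, p3=2):
    """Serial sketch of the 3D algorithm: "processor" (i, j, l) multiplies
    an M/p1 x K/p3 block of A by a K/p3 x N/p2 block of B; the p3 partial
    products for each C block are then summed (the reduction phase).
    Assumes the dimensions divide evenly, for simplicity."""
    M, K = A.shape
    _, N = B.shape
    mb, nb, kb = M // p1, N // p2, K // p3
    C = np.zeros((M, N))
    for i in range(p1):
        for j in range(p2):
            for l in range(p3):  # partials reduced over the 3rd dimension
                C[i*mb:(i+1)*mb, j*nb:(j+1)*nb] += (
                    A[i*mb:(i+1)*mb, l*kb:(l+1)*kb]
                    @ B[l*kb:(l+1)*kb, j*nb:(j+1)*nb])
    return C
```

Spreading the K dimension over p3 subcubes is exactly what buys the factor-P^(1/6) communication saving over 2D layouts.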
The spectral decomposition of nonsymmetric matrices on distributed memory parallel computers
 SIAM J. Sci. Comput
, 1997
Cited by 31 (11 self)
The implementation and performance of a class of divide-and-conquer algorithms for computing the spectral decomposition of nonsymmetric matrices on distributed-memory parallel computers are studied in this paper. After presenting a general framework, we focus on a spectral divide-and-conquer (SDC) algorithm with Newton iteration. Although the algorithm requires several times as many floating-point operations as the best serial QR algorithm, it can be simply constructed from a small set of highly parallelizable matrix building blocks within the Level 3 basic linear algebra subroutines (BLAS). Efficient implementations of these building blocks are available on a wide range of machines. In some ill-conditioned cases, the algorithm may lose numerical stability, but this can easily be detected and compensated for. The algorithm reached 31% efficiency with respect to the underlying PUMMA matrix multiplication and 82% efficiency with respect to the underlying ScaLAPACK matrix inversion on a 256-processor Intel Touchstone Delta system, and 41% efficiency with respect to the matrix multiplication in CMSSL on a 32-node Thinking Machines CM-5 with vector units. Our performance model predicts the performance reasonably accurately. To take advantage of the geometric nature of SDC algorithms, we have designed a graphical user interface to let the user choose the spectral decomposition according to specified regions in the complex plane.
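The Newton iteration at the heart of such SDC algorithms is the matrix sign iteration, built from exactly the kind of Level 3 building blocks (matrix inversion and addition) the abstract mentions. A minimal sketch, omitting the scaling and deflation details a production implementation would need:

```python
import numpy as np

def matrix_sign(A, max_iters=100, tol=1e-12):
    """Newton iteration X <- (X + X^-1) / 2 for the matrix sign function.
    Its limit has eigenvalue +1 (resp. -1) where A has an eigenvalue with
    positive (resp. negative) real part; (I + sign(A)) / 2 then acts as a
    projector that splits the spectrum across the imaginary axis."""
    X = A.astype(float).copy()
    for _ in range(max_iters):
        Xn = 0.5 * (X + np.linalg.inv(X))     # Level 3 kernels only
        if np.linalg.norm(Xn - X, 1) <= tol * np.linalg.norm(Xn, 1):
            return Xn
        X = Xn
    return X
```

Both the inversion and the update map directly onto ScaLAPACK/PBLAS building blocks, which is why the approach parallelizes so cleanly despite its extra flops.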
Optimal Multicast Communication in Wormhole-Routed Torus Networks
 IEEE Transactions on Parallel and Distributed Systems
, 1994
Cited by 25 (0 self)
This paper presents efficient algorithms that implement one-to-many, or multicast, communication in wormhole-routed torus networks. By exploiting the properties of the switching technology and the use of virtual channels, a minimum-time multicast algorithm is presented for n-dimensional torus networks that use deterministic, dimension-ordered routing of unicast messages. The algorithm can deliver a multicast message to m − 1 destinations in ⌈log₂ m⌉ message-passing steps, while avoiding contention among the constituent unicast messages. Performance results of a simulation study on torus networks are also given.
1 Introduction
The recent trend in supercomputer design has been towards scalable parallel computers, which are designed to offer corresponding gains in performance as the number of processors is increased. Many such systems, known as massively parallel computers (MPCs), are characterized by the distribution of memory among an ensemble of processing nodes; nodes commu...
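The ⌈log₂ m⌉ bound comes from a recursive-doubling pattern: in each step, every node that already holds the message forwards one unicast copy to a new node, so the holder count doubles. A toy simulation of the step count (contention avoidance and dimension-ordered routing are of course not modeled here):

```python
def binomial_multicast_steps(m):
    """Simulate a binomial-tree multicast among m nodes (one source plus
    m - 1 destinations). Each step, every current holder sends one
    unicast message to a node that lacks the message, doubling the set
    of holders, so all m nodes are covered in ceil(log2 m) steps."""
    holders = {0}          # node 0 is the source
    steps = 0
    while len(holders) < m:
        pending = [n for n in range(m) if n not in holders]
        # pair each holder with one new destination this step
        for _, dest in zip(sorted(holders), pending):
            holders.add(dest)
        steps += 1
    return steps
```

The hard part of the paper is not this counting argument but choosing destinations so the constituent unicasts never contend for a channel under dimension-ordered routing.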
The Design and Implementation of the ScaLAPACK LU, QR, and Cholesky Factorization Routines
, 1994
Cited by 23 (10 self)
This paper discusses the core factorization routines included in the ScaLAPACK library. These routines allow the factorization and solution of a dense system of linear equations via LU, QR, and Cholesky. They are implemented using a block-cyclic data distribution, and are built using de facto standard kernels for matrix and vector operations (BLAS and its parallel counterpart PBLAS) and message-passing communication (BLACS). In implementing the ScaLAPACK routines, a major objective was to parallelize the corresponding sequential LAPACK using the BLAS, BLACS, and PBLAS as building blocks, leading to straightforward parallel implementations without a significant loss in performance. We present the details of the implementation of the ScaLAPACK factorization routines, as well as performance and scalability results on the Intel iPSC/860, Intel Touchstone Delta, and Intel Paragon systems.
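The block-cyclic distribution these routines rely on can be illustrated per dimension (a sketch assuming ScaLAPACK's usual convention: blocks of nb consecutive indices are dealt round-robin across p processes, and the same mapping is applied independently to rows and columns to form the 2D layout):

```python
def block_cyclic(i, nb, p):
    """Map a global index i to (process coordinate, local index) under a
    1D block-cyclic distribution with block size nb over p processes."""
    block = i // nb                        # which block i falls in
    proc = block % p                       # blocks dealt round-robin
    local = (block // p) * nb + i % nb     # offset in proc's local storage
    return proc, local
```

The cyclic wrap keeps every process busy as a factorization sweeps across the matrix, while the blocks keep the local work in efficient BLAS 3 calls.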
Matrix Multiplication On The OTIS-Mesh Optoelectronic Computer
 In Proceedings of the Sixth International Conference on Massively Parallel Processing using Optical Interconnections (MPPOI'99)
, 2001
Cited by 22 (2 self)
We develop algorithms to multiply two vectors, a vector and a matrix, and two matrices on an OTIS-Mesh optoelectronic computer. Two mappings of a matrix onto an OTIS-Mesh, group row and group submesh [25], are considered and their relative merits compared. We show that our algorithms to multiply a column vector and a row vector use an optimal number of data moves for both the group row and group submesh mappings; our algorithm to multiply a row vector and a column vector is optimal for the group row mapping; and our algorithm to multiply a matrix by a column vector is optimal for the group row mapping.
A framework for adaptive algorithm selection in STAPL
 In Proc. ACM SIGPLAN Symp. Principles and Practice of Parallel Programming (PPoPP), pp. 277–288
, 2005
Cited by 21 (5 self)
Writing portable programs that perform well on multiple platforms or for varying input sizes and types can be very difficult because performance is often sensitive to the system architecture, the runtime environment, and input data characteristics. This is even more challenging on parallel and distributed systems due to the wide variety of system architectures. One way to address this problem is to adaptively select the best parallel algorithm for the current input data and system from a set of functionally equivalent algorithmic options. Toward this goal, we have developed a general framework for adaptive algorithm selection for use in the Standard Template Adaptive Parallel Library (STAPL). Our framework uses machine learning techniques to analyze data collected by STAPL installation benchmarks and to determine tests that will select among algorithmic options at runtime. We apply a prototype implementation of our framework to two important parallel operations, sorting and matrix multiplication, on multiple platforms and show that the framework determines runtime tests that correctly select the best performing algorithm from among several competing algorithmic options in 86–100% of the cases studied, depending on the operation and the system.
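The idea of a runtime test choosing among functionally equivalent options can be sketched as follows. The names and the size threshold are illustrative assumptions, not STAPL's API; in the paper, the test itself is derived by machine learning from installation benchmarks rather than fixed by hand:

```python
def insertion_sort(a):
    """Quadratic sort that wins on very small inputs (low overhead)."""
    a = list(a)
    for i in range(1, len(a)):
        key, j = a[i], i - 1
        while j >= 0 and a[j] > key:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = key
    return a

def adaptive_sort(data, threshold=32):
    """Runtime selection among functionally equivalent algorithmic
    options, keyed on an input-data characteristic (here just the size;
    the threshold stands in for a value a benchmark run would tune)."""
    if len(data) < threshold:
        return insertion_sort(data)
    return sorted(data)   # library sort for larger inputs
```

Both branches compute the same result; only the performance profile differs, which is exactly the property that makes runtime selection safe.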
Comparison of Scalable Parallel Matrix Multiplication Libraries
 in Proceedings of the Scalable Parallel Libraries Conference, Starkville, MS
, 1993
Cited by 17 (2 self)
This paper compares two general library routines for performing parallel distributed matrix multiplication. The PUMMA algorithm utilizes a block-scattered data layout, whereas BiMMeR utilizes virtual 2D torus wrap. The algorithmic differences resulting from these different layouts are discussed, as well as the general issues associated with different data layouts for library routines. Results on the Intel Delta for the two matrix multiplication algorithms are presented.
1. Introduction
Matrix multiplication is a standard algorithm that is an important computational kernel in many applications, including eigensolvers [3] and LU factorization [15]. Utilizing matrix multiplication is one of the principal ways of achieving high-efficiency block algorithms in packages such as LAPACK [2]. The BLAS 3 routines were added to achieve this block performance on computers, and optimized versions are available on most serial machines [10]. For matrix multiplication, the BLAS 3 routine XGEMM is availa...
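The BLAS 3 kernel named at the end, xGEMM, computes the general update C ← αAB + βC. In NumPy terms, its semantics are (a sketch of what the routine computes, not of its optimized blocked implementation):

```python
import numpy as np

def gemm(alpha, A, B, beta, C):
    """Semantics of the BLAS 3 xGEMM routine: C <- alpha*A*B + beta*C.
    Block algorithms (LU factorization, eigensolvers) cast most of their
    work as updates of this form so the bulk of the flops run inside a
    single highly tuned kernel."""
    return alpha * (A @ B) + beta * C
```

Because both PUMMA and BiMMeR ultimately feed their local work to this one kernel, the comparison in the paper comes down to how each data layout shapes the communication around it.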