## A Three-Dimensional Approach to Parallel Matrix Multiplication (1995)

Venue: IBM Journal of Research and Development

Citations: 39 (0 self)

### BibTeX

@ARTICLE{Agarwal95athree-dimensional,

author = {R.C. Agarwal and S. M. Balle and F. G. Gustavson and M. Joshi and P. Palkar},

title = {A Three-Dimensional Approach to Parallel Matrix Multiplication},

journal = {IBM Journal of Research and Development},

year = {1995},

volume = {39},

pages = {39--5}

}

### Abstract

A three-dimensional (3D) matrix multiplication algorithm for massively parallel processing systems is presented. The $P$ processors are configured as a "virtual" processing cube with dimensions $p_1$, $p_2$, and $p_3$ proportional to the matrices' dimensions: $M$, $N$, and $K$. Each processor performs a single local matrix multiplication of size $M/p_1 \times N/p_2 \times K/p_3$. Before the local computation can be carried out, each subcube must receive a single submatrix of $A$ and $B$. After the single matrix multiplication has completed, $K/p_3$ submatrices of this product must be sent to their respective destination processors and then summed together with the resulting matrix $C$. The 3D parallel matrix multiplication approach has a factor $P^{1/6}$ less communication than the 2D parallel algorithms. This algorithm has been implemented on IBM POWERparallel™ SP2™ systems (up to 216 nodes) and has yielded close to the peak performance of the machine. The algorithm has been combined with Winog...
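The decomposition described in the abstract can be illustrated with a small serial simulation. This is only a sketch under stated assumptions, not the authors' implementation: the function names (`split`, `local_gemm`, `matmul_3d`) are hypothetical, the "virtual" processors are visited one after another in plain loops, and the data distribution and collective communication of the real algorithm are not modelled. It shows only that each cell (i, j, k) of a $p_1 \times p_2 \times p_3$ cube performs one local block multiplication and that the partial products along the third direction are summed into $C$.

```python
def split(n, p):
    """Return p contiguous (start, stop) index ranges covering range(n)."""
    base, extra = divmod(n, p)
    bounds, start = [], 0
    for r in range(p):
        stop = start + base + (1 if r < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def local_gemm(A, B, rows, cols, inner):
    """One local product: the A[rows, inner] block times the B[inner, cols] block."""
    (r0, r1), (c0, c1), (k0, k1) = rows, cols, inner
    return [[sum(A[i][k] * B[k][j] for k in range(k0, k1))
             for j in range(c0, c1)]
            for i in range(r0, r1)]


def matmul_3d(A, B, C, p1, p2, p3):
    """Compute C + A*B by looping over a virtual p1 x p2 x p3 processor cube.

    A is M x K, B is K x N, C is M x N (lists of lists). Each loop iteration
    stands in for one virtual processor; nothing runs in parallel here.
    """
    M, K, N = len(A), len(B), len(B[0])
    out = [row[:] for row in C]              # leave the caller's C untouched
    for rows in split(M, p1):                # direction 1 partitions M
        for cols in split(N, p2):            # direction 2 partitions N
            for inner in split(K, p3):       # direction 3 partitions K
                partial = local_gemm(A, B, rows, cols, inner)
                # Sum the partial product into the owning block of C.
                for di, i in enumerate(range(*rows)):
                    for dj, j in enumerate(range(*cols)):
                        out[i][j] += partial[di][dj]
    return out
```

Calling `matmul_3d(A, B, C, 1, 1, 1)` degenerates to an ordinary serial product, which makes a convenient correctness check for other cube shapes. As a worked instance of the communication claim, for the largest configuration mentioned ($P = 216 = 6^3$) the stated factor is $P^{1/6} = \sqrt{6} \approx 2.45$.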

### Citations

373 | Gaussian elimination is not optimal
- Strassen
- 1969
Citation Context ...1). 2.1 Combining Strassen's algorithm with the 3D P GEMM algorithm A straightforward variation of the 3D algorithm allows the use of an $O(n^{2.81})$ matrix multiplication algorithm devised by Strassen [22]. Our approach is to use Winograd's variant of Strassen's algorithm to perform the local computation instead of using GEMM. In Step 4, we replace the single call to GEMM with a call to GEMMS [17]. 3 ...

95 | Communication complexity of PRAMs
- Aggarwal, Chandra, et al.
- 1990
Citation Context ...nd van der Vorst [7], by Choi, Dongarra, and Walker [6], by Huss-Lederman, Jacobson, and Tsao [15], by Agarwal, Gustavson, and Zubair [2], by van de Geijn and Watts [23]. Aggarwal, Chandra, and Snir [3] show that a 3D-type algorithm is optimal for an LPRAM. Johnsson and Ho [20], and Ho, Johnsson, and Edelman [14] discuss 3D and other algorithms for Boolean cubes and hypercubes. Gupta and Kumar [13] ...

66 | Parallel numerical linear algebra, in: Acta Numerica
- Demmel, Heath, et al.
- 1993
Citation Context ... performance than the 2D ScaLAPACK PDGEMM algorithm [18]. The literature describing matrix multiplication algorithms is very extensive. Some descriptions are given by Demmel, Heath, and van der Vorst [7], by Choi, Dongarra, and Walker [6], by Huss-Lederman, Jacobson, and Tsao [15], by Agarwal, Gustavson, and Zubair [2], by van de Geijn and Watts [23]. Aggarwal, Chandra, and Snir [3] show that a 3D t...

65 | SUMMA: Scalable Universal Matrix Multiplication Algorithm
- van de Geijn, Watts
- 1995
Citation Context ...ions are given by Demmel, Heath, and van der Vorst [7], by Choi, Dongarra, and Walker [6], by Huss-Lederman, Jacobson, and Tsao [15], by Agarwal, Gustavson, and Zubair [2], by van de Geijn and Watts [23]. Aggarwal, Chandra, and Snir [3] show that a 3D-type algorithm is optimal for an LPRAM. Johnsson and Ho [20], and Ho, Johnsson, and Edelman [14] discuss 3D and other algorithms for Boolean cubes and ...

59 | PUMMA: Parallel Universal Matrix Multiplication Algorithms on Distributed Memory Concurrent Computers. Concurrency: Practice and Experience
- Choi, Dongarra, et al.
- 1994
Citation Context ...PDGEMM algorithm [18]. The literature describing matrix multiplication algorithms is very extensive. Some descriptions are given by Demmel, Heath, and van der Vorst [7], by Choi, Dongarra, and Walker [6], by Huss-Lederman, Jacobson, and Tsao [15], by Agarwal, Gustavson, and Zubair [2], by van de Geijn and Watts [23]. Aggarwal, Chandra, and Snir [3] show that a 3D-type algorithm is optimal for an LPR...

43 | A High Performance Matrix Multiplication Algorithm on a Distributed-Memory Parallel Computer, Using Overlapped Communication
- Agarwal, Gustavson, et al.
- 1994
Citation Context ...is very extensive. Some descriptions are given by Demmel, Heath, and van der Vorst [7], by Choi, Dongarra, and Walker [6], by Huss-Lederman, Jacobson, and Tsao [15], by Agarwal, Gustavson, and Zubair [2], by van de Geijn and Watts [23]. Aggarwal, Chandra, and Snir [3] show that a 3D-type algorithm is optimal for an LPRAM. Johnsson and Ho [20], and Ho, Johnsson, and Edelman [14] discuss 3D and other ...

32 | Extra high speed matrix multiplication on the Cray-2
- Bailey
- 1988
Citation Context ...iency [17]. It is also possible to use Strassen's algorithm on the global matrices down to a level where the matrices fit into the local memory of the node, as described by Agarwal et al. [1]. Bailey [4], Grayson, Shah, and van de Geijn [11], Balle [5], and Douglas et al. [8] describe 2D implementations of Strassen's method. In Section 2, we outline the 3D algorithm and its Strassen variation. Sectio...

31 | GEMMW: A portable level 3 BLAS Winograd variant of Strassen's matrix-matrix multiply algorithm
- Douglas, Heroux, et al.
- 1994
Citation Context ... matrices down to a level where the matrices fit into the local memory of the node, as described by Agarwal et al. [1]. Bailey [4], Grayson, Shah, and van de Geijn [11], Balle [5], and Douglas et al. [8] describe 2D implementations of Strassen's method. In Section 2, we outline the 3D algorithm and its Strassen variation. Section 3 also demonstrates that the 3D approach yields very high performance o...

21 | Scalability of parallel algorithms for matrix multiplication
- Gupta, Kumar
- 1993
Citation Context ...r [3] show that a 3D-type algorithm is optimal for an LPRAM. Johnsson and Ho [20], and Ho, Johnsson, and Edelman [14] discuss 3D and other algorithms for Boolean cubes and hypercubes. Gupta and Kumar [13] discuss the scalability of many parallel matrix multiplication algorithms including 2D as well as 3D versions. Like other authors, they demonstrate that the communication ratio of 3D over 2D is $P^{1/6}$...

17 | Matrix multiplication on hypercubes using full bandwidth and constant storage
- Ho, Johnsson
- 1991
Citation Context ... Gustavson, and Zubair [2], by van de Geijn and Watts [23]. Aggarwal, Chandra, and Snir [3] show that a 3D-type algorithm is optimal for an LPRAM. Johnsson and Ho [20], and Ho, Johnsson, and Edelman [14] discuss 3D and other algorithms for Boolean cubes and hypercubes. Gupta and Kumar [13] discuss the scalability of many parallel matrix multiplication algorithms including 2D as well as 3D versions. L...

17 | Comparison of Scalable Parallel Matrix Multiplication Libraries
- Huss-Lederman, Jacobson, et al.
- 1993
Citation Context ...ribing matrix multiplication algorithms is very extensive. Some descriptions are given by Demmel, Heath, and van der Vorst [7], by Choi, Dongarra, and Walker [6], by Huss-Lederman, Jacobson, and Tsao [15], by Agarwal, Gustavson, and Zubair [2], by van de Geijn and Watts [23]. Aggarwal, Chandra, and Snir [3] show that a 3D-type algorithm is optimal for an LPRAM. Johnsson and Ho [20], and Ho, Johnsson,...

14 | A high performance parallel Strassen implementation
- Grayson, Shah, et al.
- 1995
Citation Context ...e Strassen's algorithm on the global matrices down to a level where the matrices fit into the local memory of the node, as described by Agarwal et al. [1]. Bailey [4], Grayson, Shah, and van de Geijn [11], Balle [5], and Douglas et al. [8] describe 2D implementations of Strassen's method. In Section 2, we outline the 3D algorithm and its Strassen variation. Section 3 also demonstrates that the 3D appr...

10 | Engineering and Scientific Subroutine Library, Guide and Reference. 1st Ed. (Program Number
- IBM
- 1986
Citation Context ...EMM 1 algorithm based on a three-dimensional approach is presented. For the parallel case, the algorithm is a natural generalization of the serial GEMM routine. 1 The symbol stands for S, D, C, and Z [17, 18], i.e., Single, Double, Complex single, and complex double (Z) precision. GEMM computes $C = \beta C + \alpha\,\mathrm{op}(A)\,\mathrm{op}(B)$ where $\alpha$, $\beta$ are scalars, $A$, $B$, and $C$ are matrices, and $\mathrm{op}(X)$ stands for $X$, $X^T$, or ...

10 | Algorithms for multiplying matrices of arbitrary shapes using shared memory primitives on a Boolean cube
- Johnsson, Ho
- 1987
Citation Context ...Jacobson, and Tsao [15], by Agarwal, Gustavson, and Zubair [2], by van de Geijn and Watts [23]. Aggarwal, Chandra, and Snir [3] show that a 3D-type algorithm is optimal for an LPRAM. Johnsson and Ho [20], and Ho, Johnsson, and Edelman [14] discuss 3D and other algorithms for Boolean cubes and hypercubes. Gupta and Kumar [13] discuss the scalability of many parallel matrix multiplication algorithms in...

7 | Engineering and Scientific Subroutine Library for AIX Guide and Reference, 3.3 ed: IBM
- IBM
- 2001
Citation Context ...EMM 1 algorithm based on a three-dimensional approach is presented. For the parallel case, the algorithm is a natural generalization of the serial GEMM routine. 1 The symbol stands for S, D, C, and Z [17, 18], i.e., Single, Double, Complex single, and complex double (Z) precision. GEMM computes $C = \beta C + \alpha\,\mathrm{op}(A)\,\mathrm{op}(B)$ where $\alpha$, $\beta$ are scalars, $A$, $B$, and $C$ are matrices, and $\mathrm{op}(X)$ stands for $X$, $X^T$, or ...

3 | Parallel Environment: Parallel Programming Subroutine Reference
- AIX
- 1994
Citation Context ... primitives: all-gather and all-to-all [9, 12]. For the performance studies presented in Section 3, we used the equivalent MPL (Message Passing Library) primitives mp_concat and mp_index, respectively [16]. In the following, we define $P$ to be the total number of processors, and $p_1$, $p_2$, and $p_3$ to be the number of processors in the $d_1$, $d_2$, and $d_3$ direction, respectively, thereby having $P = p_1$...

2 | Distributed-memory matrix computations
- Balle
- 1995
Citation Context ... algorithm on the global matrices down to a level where the matrices fit into the local memory of the node, as described by Agarwal et al. [1]. Bailey [4], Grayson, Shah, and van de Geijn [11], Balle [5], and Douglas et al. [8] describe 2D implementations of Strassen's method. In Section 2, we outline the 3D algorithm and its Strassen variation. Section 3 also demonstrates that the 3D approach yields...

1 | Using MPI: Portable Parallel Programming with the Message Passing Interface
- Gropp, Lusk, et al.
- 1994
Citation Context ...y, perform the matrix additions of Equation 2. The communication part of the algorithm is done by simultaneously making calls to the MPI collective communication primitives: all-gather and all-to-all [9, 12]. For the performance studies presented in Section 3, we used the equivalent MPL (Message Passing Library) primitives mp_concat and mp_index, respectively [16]. In the following, we define $P$ to be the ...

1 | Scalable parallel computing
- IBM
- 1995
Citation Context ...GEMMS [17]. 3 Performance Results Performance results for the parallel 3D matrix multiplication implementation are presented. These experiments were carried out on IBM POWERparallel™ SP2™ systems [19]. MPL message-passing subroutines are used as communication primitives [16]. Figures 5 and 6 show performance for the 3D parallel matrix multiplication of the matrix $C = C + AB$ for PDGEMM and PZGEMM o...

1 | On the matrix multiplication algorithms using message passing interface (MPI). Unpublished note
- Lemmerling, Vanhamme, et al.
- 1995
Citation Context ...essage passing computers, our algorithm has the least amount of communication of all the 3D algorithms cited. It reduces the amount of communication by a factor of 5/3 [13]. Lemmerling, Vanhamme, and Ho [21] describe several 1D, 2D, and some new 3D parallel algorithms. To the best of our knowledge, prior work has not addressed the problem of minimizing communication for matrices of arbitrary shape. In th...