## A Poly-Algorithm for Parallel Dense Matrix Multiplication on Two-Dimensional Process Grid Topologies (1995)


Citations: 10 (0 self)

### BibTeX

@MISC{Li95apoly-algorithm,
    author = {Jin Li and Anthony Skjellum and Robert D. Falgout},
    title = {A Poly-Algorithm for Parallel Dense Matrix Multiplication on Two-Dimensional Process Grid Topologies},
    year = {1995}
}


### Abstract

In this paper, we present several new and generalized parallel dense matrix multiplication algorithms of the form C = αAB + βC on two-dimensional process grid topologies. These algorithms can deal with rectangular matrices distributed on rectangular grids. We classify these algorithms coherently into three categories according to the communication primitives used and thus we offer a taxonomy for this family of related algorithms. All these algorithms are represented in the data distribution independent approach and thus do not require a specific data distribution for correctness. The algorithmic compatibility condition result shown here ensures the correctness of the matrix multiplication. We define and extend the data distribution functions and introduce permutation compatibility and algorithmic compatibility. We also discuss a permutation compatible data distribution (modified virtual 2D data distribution). We conclude that no single algorithm always achieves the best performance...
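To make the operation in the abstract concrete, the following is a minimal single-process sketch of C = αAB + βC for rectangular matrices on a rectangular P × Q logical process grid. It is an illustration only and is not one of the paper's algorithms: it assumes a simple block-linear distribution and a broadcast-style rank-1 update per inner index, and the names `split` and `grid_gemm` are hypothetical. The "broadcasts" are simulated with ordinary list reads rather than real message passing.

```python
def split(n, parts):
    """Block-linear split of n indices into `parts` contiguous (start, stop) ranges."""
    base, extra = divmod(n, parts)
    bounds, start = [], 0
    for i in range(parts):
        stop = start + base + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def grid_gemm(alpha, A, B, beta, C, P, Q):
    """Simulate C = alpha*A*B + beta*C on a P x Q logical process grid.

    Assumed layout (hypothetical, for illustration): process (p, q) owns the
    block of C whose rows fall in row block p and whose columns fall in
    column block q. A is M x K, B is K x N, C is M x N, as lists of lists.
    """
    M, K, N = len(A), len(B), len(B[0])
    rows, cols = split(M, P), split(N, Q)

    # Each simulated process starts with its local block of C scaled by beta.
    local = {
        (p, q): [[beta * C[i][j] for j in range(*cols[q])] for i in range(*rows[p])]
        for p in range(P)
        for q in range(Q)
    }

    for k in range(K):
        for p in range(P):
            # Piece of column k of A that would be broadcast along process row p.
            a_piece = [A[i][k] for i in range(*rows[p])]
            for q in range(Q):
                # Piece of row k of B that would be broadcast along process column q.
                b_piece = [B[k][j] for j in range(*cols[q])]
                block = local[(p, q)]
                # Local rank-1 update on the block owned by process (p, q).
                for i, a in enumerate(a_piece):
                    for j, b in enumerate(b_piece):
                        block[i][j] += alpha * a * b

    # Gather the local blocks back into one global matrix for inspection.
    out = [[0] * N for _ in range(M)]
    for (p, q), block in local.items():
        r0, c0 = rows[p][0], cols[q][0]
        for i, row in enumerate(block):
            for j, value in enumerate(row):
                out[r0 + i][c0 + j] = value
    return out
```

Calling `grid_gemm(2, A, B, 3, C, P=2, Q=3)` with a 3 × 2 matrix `A`, a 2 × 4 matrix `B`, and a 3 × 4 matrix `C` returns the same result as the sequential computation, whatever grid shape is chosen. A real implementation would replace the inner loops with a local xGEMM call and the list reads with collective communication.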

### Citations

1138 | Using MPI: Portable Parallel Programming with the Message-Passing Interface - Lusk, Skjellum - 1994 |

780 | A set of Level 3 Basic Linear Algebra Subprograms - Dongarra, Croz, et al. - 1990 |
Citation Context ...istribution of matrix A and the row distribution of matrix B satisfy the permutation compatibility requirement, we can utilize the optimal sequential assembly-coded version of xGEMM routines of BLAS (Dongarra et al. 1990) as local multiplication engine. Otherwise, we cannot use xGEMM routines. This can be resolved grossly by data remapping, but our point is not to resort to remapping within the classification scheme;...

505 | Introduction to Parallel Computing: Design and Analysis of Algorithms - Kumar, Grama, et al. - 1994 |
Citation Context ... memory and an interconnection network that links processors through message interchange. Hence, the MIMD architectural model, which stands for multiple instruction streams and multiple data streams (Kumar et al. 1994), together with message-passing, forms an ideal parallel programming model suitable for multicomputers. This combined hardware, software architecture is capable of executing different programs concur...

392 | Gaussian elimination is not optimal - Strassen - 1969 |
Citation Context ...rate their performance characteristics and the definite need for poly-algorithms. One of the algorithms with potential advantage for large-size dense matrix multiplication is the Strassen's algorithm [24]. There are two levels of application of Strassen's approach. One (low level) is to use the Strassen's approach to replace DGEMM as local multiplication engine [10]. The other (high-level) is to devel...

213 | Advanced C++ Programming Styles and Idioms - Coplien - 1992 |
Citation Context ... clearly defined data types and operations. Since each object encapsulates the data structure as well as other information necessary to fully specify a decomposition, the traditional "fat interface" (Coplien 1992), that is, a long argument list with OR'd functionality, of traditional Fortran implementations is avoided. Hence, the interface for applications is elegant and robust and subject to even greater run...

142 | Designing and Building Parallel Programs: Concepts and Tools for Parallel Software Engineering - Foster - 1995 |
Citation Context ...llel computer systems effectively. One of the models of parallel computers is the multicomputer model. A multicomputer is a concurrent distributed-memory system with message-passing among processors (Foster 1995). A multicomputer consists of a number of von Neumann processors that have their own memory and an interconnection network that links processors through message interchange. Hence, the MIMD architect...

137 | A cellular computer to implement the Kalman filter algorithm - Cannon - 1969 |
Citation Context ...ication algorithms in details. We classify the parallel matrix multiplication algorithms into three categories according to the communication primitives used. The first category is Cannon's approach (Cannon 1969), which shifts two of the matrices A, B, and C, while holding one stationary. We discuss three versions of Cannon's algorithm: the C-stationary, B-stationary, and A-stationary versions. The second cat...

97 | Parallel matrix and graph algorithms - Dekel, Nassimi, et al. - 1981 |

78 | SUMMA: Scalable Universal Matrix Multiplication Algorithm. LAPACK Working Note 99, technical report - van de Geijn, Watts - 1995 |
Citation Context ...d in [2] uses a different approach for parallel matrix multiplication. This algorithm only uses the broadcast communication primitive and tries to overlap the communication and the computation; SUMMA [25] evidently adopts the same idea with a slight difference in implementation. We present several algorithms that apply to the general case of rectangular matrix multiplication on rectangular process gri...

64 | PUMMA: Parallel universal matrix multiplication algorithms on distributed memory concurrent computers. Concurrency: Practice and Experience - Choi, Walker, et al. - 1994 |

46 | A three-dimensional approach to parallel matrix multiplication - Agarwal, Balle, et al. - 1995 |
Citation Context ...cation on rectangular process grids deserves further research. Another approach for parallel dense matrix multiplication is to use a three-dimensional process topology (Dekel, Nassimi, and Sahni 1981; Agarwal et al. 1995). Agarwal et al. (1995) claimed that the advantage of the 3D approach is that it moves less data than the known 2D approach does and thus the 3D approach can achieve better performance. However, the ...

44 | A High Performance Matrix Multiplication Algorithm on a Distributed-Memory Parallel - Agarwal, Gustavson, et al. - 1994 |

36 | An Initial Implementation of MPI - Doss, Gropp, et al. - 1993 |
Citation Context ...Maui High Performance Computing Center 1995b). The program was compiled using the mpcc compiler with O3 optimization option. We used the MPICH 1.0.11 version of the Portable MPI Model Implementation (Doss et al. 1993) for message passing and the DGEMM subroutine of ESSL library for the local matrix multiplication. The parallel matrix multiplication was of the form C = αAB + βC, where α and β were randomly ...

32 | GEMMW: A portable level 3 BLAS Winograd variant of Strassen’s matrix–matrix multiply algorithm - Douglas - 1994 |
Citation Context ... the Strassen algorithm (Strassen 1969). There are two levels of application of Strassen's approach. One (low level) is to use the Strassen approach to replace DGEMM as a local multiplication engine (Douglas et al. 1994). The other (high-level) is to develop a fully parallelized Strassen algorithm. The general parallel Strassen algorithm that can deal with general cases of rectangular matrix multiplication on rectan...

29 | Solving problems on concurrent processors, Volume 1 - Fox, Johnson, et al. - 1988 |
Citation Context ...[Figure 2.1. The Cannon Algorithm (panels: update and shift)] Another well-known parallel algorithm for multiplying square dense matrices on square process grids is Fox's algorithm (Fox et al. 1988). Fox's algorithm avoids the initial alignment of Cannon's algorithm by broadcasting sub-matrix A and circularly shifting matrix B. In each communication step, the selected sub-matrices of A are broa...

28 | The Multicomputer Toolbox Approach to Concurrent BLAS and LACS - Falgout, Skjellum, et al. - 1992 |
Citation Context ...mber of different data distributions will be supported without modification of source code. However, the underlying data distribution of matrices must satisfy the algorithmic compatibility condition (Falgout et al. 1992; 1993) to ensure the correctness of parallel matrix multiplication. This property defines restrictions on global and local data layouts of matrices involved in the multiplication process. Second, we ...

23 | Multiplication of Matrices of Arbitrary Shapes on a Data Parallel Computer - Mathur, Johnsson - 1994 |
Citation Context ...ix A-stationary version, the matrices B and C will be circularly shifted. Similarly, for the matrix B-stationary version, the matrices A and C will be circularly shifted. Mathur-Johnsson's algorithm (Mathur and Johnsson 1994), a systolic algorithm of the matrix C-stationary version, extends Cannon's algorithm to deal with arbitrary matrices and grids. When the shape of process grids is not square, the ... [Footnote 1: For these two ver...]

21 | Writing Libraries in MPI - Skjellum, Doss, et al. |
Citation Context ...e hierarchical structure of the objects. subroutine DGEMM is used as the local multiplication engine. Further consideration of software design merits separate discussion, which we offer separately in [22, 23] and elsewhere. 2 Permutation Compatible Data Distributions In this section, we present the definition of data distribution. We also discuss two instances of data distributions (i.e., (block) linear d...

20 | NAPSS - a numerical analysis problem solving system - Rice, Rosen - 1966 |
Citation Context ...st have deep knowledge of underlying systems. The better approach to overcome this dilemma is to use the "poly-algorithmic approach." The poly-algorithmic approach, a concept introduced by John Rice (Rice and Rosen 1966), refers to the use of two or more algorithms to solve the same problem with a high level decision-making process determining which of a set of algorithms performs best in a given situation. The poly...

19 | The Multicomputer Toolbox: Scalable Parallel Libraries for Large-Scale Concurrent Applications - Skjellum, Baldwin - 1991 |
Citation Context ...eeks to avoid such redistributions without further justification. 1.2 Design Methodology The design of parallel dense matrix multiplication algorithms is based on the Multicomputer Toolbox approach [21]. Two key ideas underlying scalable programming are logical process grids and data distribution independence [26, 27]. A logical process grid, denoted here by G_{P×Q}, is a collection of processe...

19 | Solving linear systems on vector and shared memory computers - Dongarra, Duff, et al. - 1991 |

17 | Comparison of Scalable Parallel Matrix Multiplication Libraries - Huss-Lederman, Jacobson, et al. - 1993 |

15 | Data Redistribution and Concurrency - van de Velde - 1990 |
Citation Context ...s; MM 3 and MM 4 are completely new algorithms. The third category is the Broadcast-Broadcast approach and a new algorithm, BB, is detailed. All these algorithms use the data distribution independent [26, 27] approach so that the data distributions of matrices are flexible. However, the data distributions of matrices A, B, and C must satisfy the algorithmic compatibility [11, 12] requirement to ensure the...

14 | Document for a standard message-passing interface - Message Passing Interface Forum - 1993 |
Citation Context ... the interface for our parallel dense matrix multiplication library. The communication primitives, (i.e., broadcast, shift, align, and slide), are implemented using MPI message passing interface (MPI Forum 1994; Gropp, Lusk, and Skjellum 1994) point-to-point communication functions and collective communication functions. The BLAS Basic Linear Algebra Subprograms (Dongarra et al. 1990) subroutine DGEMM is...

13 | "The Data-Distribution-Independent Approach to Scalable Parallel Libraries," master's thesis - Bangalore - 1995 |
Citation Context ...s. This can be resolved grossly by data remapping, but our point is not to resort to remapping within the classification scheme; we wish to avoid temporaries and excess communication as mentioned in (Bangalore 1995). The Virtual Two-Dimensional Grid For a non-square grid G_{P×Q}, we can view it as a α × α square virtual grid (Huss-Lederman, Jacobson, and Tsao 1993; Huss-Lederman et al. 1994), where ...

13 | Concurrent Dynamic Simulation: Multicomputer Algorithms Research Applied to Ordinary Differential-Algebraic - Skjellum - 1990 |
Citation Context ...The processes are named (p, q) on this logical grid, where p (resp., q) ranges from 0, ..., P − 1 (resp., 0, ..., Q − 1). Such logical grids can be readily mapped to physical node topologies (Skjellum 1990). Data-distribution-independence (van de Velde and Lorenz 1989; van de Velde 1990) is another important design consideration for building scalable parallel libraries. The canonical parallel applicati...

4 | Driving issues in scalable libraries: Poly-algorithms, data distribution independence, redistribution, local storage schemes - Skjellum, Bangalore - 1995 |
Citation Context ...e hierarchical structure of the objects. subroutine DGEMM is used as the local multiplication engine. Further consideration of software design merits separate discussion, which we offer separately in [22, 23] and elsewhere. 2 Permutation Compatible Data Distributions In this section, we present the definition of data distribution. We also discuss two instances of data distributions (i.e., (block) linear d...

4 | Adaptive data distribution for concurrent continuation - van de Velde, Lorenz - 1989 |
Citation Context ...s; MM 3 and MM 4 are completely new algorithms. The third category is the Broadcast-Broadcast approach and a new algorithm, BB, is detailed. All these algorithms use the data distribution independent [26, 27] approach so that the data distributions of matrices are flexible. However, the data distributions of matrices A, B, and C must satisfy the algorithmic compatibility [11, 12] requirement to ensure the...

3 | Concurrent scientific computing - van de Velde - 1994 |

2 | Data redistribution and concurrency. Parallel Computing 16 - van de Velde - 1990 |

1 | SP parallel programming workshop: IBM SP hardware/software overview. http://www.mhpcc.edu/training/workshop/html/ibmhwsw/ibmhwsw.html - Maui High Performance Computing Center - 1995a |

1 | SP parallel programming workshop: LoadLeveler. http://www.mhpcc.edu/training/workshop/html/loadleveler/LoadLeveler.html - Maui High Performance Computing Center - 1995b |