## GEMM-Based Level 3 BLAS: High-Performance Model Implementations and Performance Evaluation Benchmark (1998)


Venue: ACM Transactions on Mathematical Software

Citations: 86 (8 self)

### BibTeX

@ARTICLE{Kågström98gemm-basedlevel,

author = {Bo Kågström and Per Ling and Charles Van Loan},

title = {GEMM-Based Level 3 BLAS: High-Performance Model Implementations and Performance Evaluation Benchmark},

journal = {ACM TRANSACTIONS ON MATHEMATICAL SOFTWARE},

year = {1998},

volume = {24},

number = {3},

pages = {268--302}

}


### Abstract

The level 3 Basic Linear Algebra Subprograms (BLAS) are designed to perform various matrix multiply and triangular system solving computations. Due to the complex hardware organization of advanced computer architectures, the development of optimal level 3 BLAS code is costly and time-consuming. However, it is possible to develop a portable and high-performance level 3 BLAS library mainly relying on a highly optimized GEMM, the routine for the general matrix multiply and add operation. With suitable partitioning, all the other level 3 BLAS can be defined in terms of GEMM and a small amount of level 1 and level 2 computations. Our contribution is twofold. First, the model implementations in Fortran 77 of the GEMM-based level 3 BLAS are structured to effectively reduce data traffic in a memory hierarchy. Second, the GEMM-based level 3 BLAS performance evaluation benchmark is a tool for evaluating and comparing different implementations of the level 3 BLAS with the GEMM-based model implementations.
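The central idea of the abstract, defining another level 3 operation through GEMM plus a small amount of lower-level work, can be illustrated with a minimal Python sketch. This is not the paper's Fortran 77 code. The function names `gemm_nt` and `syrk_lower`, the block size `nb`, and the list-of-lists matrix storage are assumptions chosen for illustration, and the partitioning shown is only one plausible scheme. The sketch computes the lower triangle of the symmetric rank-k update C := alpha*A*A^T + beta*C. Rectangular blocks below the diagonal go through the GEMM-like kernel, and the small triangular diagonal blocks are handled with dot products.

```python
def gemm_nt(alpha, A, B, beta, C, rows, cols):
    """GEMM-like kernel (hypothetical helper).

    Updates the block C[rows, cols] in place:
        C[i][j] := alpha * sum_p A[i][p] * B[j][p] + beta * C[i][j]
    which is alpha * A[rows, :] * B[cols, :]^T + beta * C[rows, cols].
    """
    k = len(A[0])
    for i in rows:
        for j in cols:
            s = 0.0
            for p in range(k):
                s += A[i][p] * B[j][p]
            C[i][j] = alpha * s + beta * C[i][j]


def syrk_lower(alpha, A, beta, C, nb=2):
    """Lower-triangular SYRK, C := alpha*A*A^T + beta*C, built on gemm_nt.

    Only entries C[i][j] with j <= i are touched; the strict upper
    triangle of C is left unchanged. nb is an illustrative block size.
    """
    n = len(A)
    k = len(A[0])
    for i0 in range(0, n, nb):
        i1 = min(i0 + nb, n)
        # Rectangular block to the left of the diagonal block: one GEMM call.
        gemm_nt(alpha, A, A, beta, C, range(i0, i1), range(0, i0))
        # Small triangular diagonal block: dot products (level 1 style work).
        for i in range(i0, i1):
            for j in range(i0, i + 1):
                s = 0.0
                for p in range(k):
                    s += A[i][p] * A[j][p]
                C[i][j] = alpha * s + beta * C[i][j]
```

In this sketch most of the arithmetic for large `n` lands in `gemm_nt`, so swapping in a tuned GEMM kernel would speed up the whole routine, which is the portability argument the abstract makes.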

### Citations

742 | A set of level 3 basic linear algebra subprograms
- Dongarra, Du Croz, et al.
- 1990
Citation Context: ...o perform matrix-matrix (level 3) operations in their inner loops (e.g., see [12]). Typically, these matrix-matrix operations are expressed as calls to level 3 Basic Linear Algebra Subprograms (BLAS) [9, 10], which together with level 1 BLAS [22] and level 2 BLAS [7] are de facto standards for basic matrix and vector operations. The level 3 BLAS have been successfully used as building blocks for several ...

536 | Basic linear algebra subprograms for Fortran usage
- Lawson, Hanson, et al.
- 1979
Citation Context: ...ns in their inner loops (e.g., see [12]). Typically, these matrix-matrix operations are expressed as calls to level 3 Basic Linear Algebra Subprograms (BLAS) [9, 10], which together with level 1 BLAS [22] and level 2 BLAS [7] are de facto standards for basic matrix and vector operations. The level 3 BLAS have been successfully used as building blocks for several applications, including the software li...

446 | An extended set of Fortran basic linear algebra subprograms
- Dongarra, Du Croz, et al.
- 1988
Citation Context: ...s (e.g., see [12]). Typically, these matrix-matrix operations are expressed as calls to level 3 Basic Linear Algebra Subprograms (BLAS) [9, 10], which together with level 1 BLAS [22] and level 2 BLAS [7] are de facto standards for basic matrix and vector operations. The level 3 BLAS have been successfully used as building blocks for several applications, including the software library LAPACK [3]. Wit...

422 | LAPACK User's Guide
- Anderson, Bai, et al.
- 1992
Citation Context: ...BLAS [7] are de facto standards for basic matrix and vector operations. The level 3 BLAS have been successfully used as building blocks for several applications, including the software library LAPACK [3]. With a highly optimized level 3 BLAS, most of the LAPACK codes will "automatically" perform well. However, due to the complex hardware organization of advanced computer architectures it can be very c...

375 | Gaussian elimination is not optimal
- Strassen
- 1969
Citation Context: ... by using parallel versions of the underlying routines. It is also possible to create a level 3 BLAS library based on fast algorithms for the GEMM operation, e.g., Strassen's or Winograd's algorithms [25, 26, 15, 11]. Our contribution is two-fold. First, the model implementations in Fortran 77 of the GEMM-based level 3 BLAS, which are structured to effectively reduce data traffic in a memory hierarchy. Second, the G...

99 | Algorithm 679: a set of level 3 basic linear algebra subprograms
- Dongarra, Du Croz, et al.
- 1990

69 | Impact of hierarchical memory systems on linear algebra algorithm design
- Gallivan, Jalby, et al.
- 1988
Citation Context: ...is increasingly dominated by the cost of computation. This fact has led to a technique to reorganize standard algorithms to perform matrix-matrix (level 3) operations in their inner loops (e.g., see [12]). Typically, these matrix-matrix operations are expressed as calls to level 3 Basic Linear Algebra Subprograms (BLAS) [9, 10], which together with level 1 BLAS [22] and level 2 BLAS [7] are de facto ...

52 | Exploiting fast matrix multiplication within the Level 3
- Higham
- 1990
Citation Context: ... by using parallel versions of the underlying routines. It is also possible to create a level 3 BLAS library based on fast algorithms for the GEMM operation, e.g., Strassen's or Winograd's algorithms [25, 26, 15, 11]. Our contribution is two-fold. First, the model implementations in Fortran 77 of the GEMM-based level 3 BLAS, which are structured to effectively reduce data traffic in a memory hierarchy. Second, the G...

48 | Exploiting functional parallelism of POWER2 to design high-performance numerical algorithms
- Agarwal, Gustavson, et al.
- 1994
Citation Context: ...people have put a lot of effort into developing fast level 3 BLAS since the specification was published in 1990 [9, 10]. Some vendors provide highly optimized BLAS for their machines, see for example [2, 1, 16, 4, 24], while others provide optimized versions of some or none of the routines. Vendor-independent groups have also developed tuned level 3 kernels for different machines, for example [23, 17, 13, 6, 14], wher...

31 | GEMMW: A portable level 3 BLAS Winograd variant of Strassen's matrix-matrix multiply algorithm
- Douglas, Heroux, et al.
- 1994
Citation Context: ... by using parallel versions of the underlying routines. It is also possible to create a level 3 BLAS library based on fast algorithms for the GEMM operation, e.g., Strassen's or Winograd's algorithms [25, 26, 15, 11]. Our contribution is two-fold. First, the model implementations in Fortran 77 of the GEMM-based level 3 BLAS, which are structured to effectively reduce data traffic in a memory hierarchy. Second, the G...

20 | Improving performance of linear algebra algorithms for dense matrices using algorithmic prefetch
- Agarwal, Gustavson, et al.
- 1994
Citation Context: ...people have put a lot of effort into developing fast level 3 BLAS since the specification was published in 1990 [9, 10]. Some vendors provide highly optimized BLAS for their machines, see for example [2, 1, 16, 4, 24], while others provide optimized versions of some or none of the routines. Vendor-independent groups have also developed tuned level 3 kernels for different machines, for example [23, 17, 13, 6, 14], wher...

15 | Portable High Performance GEMM-based Level 3
- Kågström, Ling, et al.
- 1993
Citation Context: ... the GEMM-based level 3 BLAS benchmark, which is a tool for performance evaluation of different level 3 BLAS implementations. Some early results from the model implementations have been published in [8, 9]. In this contribution (talk) we will discuss design principles for the model implementations and present new performance results for different architectures (vector as well as RISC-based), including...

11 | Distribution of Mathematical Software by Electronic Mail
- Dongarra, Grosse
- 1987
Citation Context: ...d installation The source code for the model implementations and for the benchmark program comes in single, double, complex, and double complex precision data types and is freely available via netlib [3]. All routines are written in Fortran 77 for portability. No changes should be necessary to run the programs correctly on different target machines. Send an e-mail with the following message send ind...

10 | High performance GEMM-based level-3 BLAS: sample routines for double precision real data
- Kågström, Ling, et al.
- 1991
Citation Context: ...a, C, ldc ) TRSM ( side, uplo, trans, diag, m, n, alpha, A, lda, C, ldc ) 3 GEMM-Based Level 3 BLAS Concept We have shown that "one can live with" just one highly optimized level 3 BLAS routine: GEMM [21, 17]. This subprogram oversees a general matrix multiply of the form $C \leftarrow \alpha\,\mathrm{op}(A)\,\mathrm{op}(B) + \beta C$, where $\mathrm{op}(X)$ denotes $X$ or $X^T$. The structured matrix multiplication problems handled by the other level 3 BLAS can be ...

10 | Engineering and Scientific Subroutine Library, Guide and Reference. 1st Ed. (Program Number
- IBM
- 1986
Citation Context: ... people have put a lot of effort into developing fast level 3 BLAS since the specification was published in 1990 [4, 5]. Some vendors provide highly optimized BLAS for their machines, see for example [7, 2], while others provide optimized versions of some or none of the routines. Independent groups have also developed tuned level 3 kernels for different machines, for example [14, 6]. Today different imp...

9 | A Parallel Block Implementation of Level 3
- Dayde, Duff, et al.
- 1994

8 | GEMM-Based Level 3 BLAS
- Kågström, Ling, et al.
- 1998

6 | GEMM-based level 3 BLAS: Algorithms for the model implementations
- Kågström, Ling, et al.
- 1999
Citation Context: ...nd are structured to effectively reduce data traffic in a memory hierarchy. A detailed description of the algorithms used in our model implementations for the different level 3 operations is presented in [19]. These descriptions include block partitionings and associated GEMM-based templates for different options of the operations. Since these descriptions are very space-demanding we only give a brief descr...

6 | Some Remarks on Fast Multiplication of Polynomials
- Winograd
- 1973

5 | A Set of High Performance Level-3 BLAS Structured and Tuned for the
- Ling
- 1990
Citation Context: ...ines, see for example [7, 2], while others provide optimized versions of some or none of the routines. Independent groups have also developed tuned level 3 kernels for different machines, for example [14, 6]. Today different implementations with different performance characteristics coexist and it is becoming more important to evaluate different implementations thoroughly. The GEMM-based benchmark measu...

2 | Engineering and Scientific Subroutine Library, Guide and Reference, Release 3. Fourth Edition (Program Number 5668-863
- IBM
- 1988
Citation Context: ...people have put a lot of effort into developing fast level 3 BLAS since the specification was published in 1990 [9, 10]. Some vendors provide highly optimized BLAS for their machines, see for example [2, 1, 16, 4, 24], while others provide optimized versions of some or none of the routines. Vendor-independent groups have also developed tuned level 3 kernels for different machines, for example [23, 17, 13, 6, 14], wher...

2 | Portable High Performance GEMM-Based Level 3
- Kågström, Ling, et al.
- 1993
Citation Context: ...tine) Perf (user-specified level 3 routine): For a vendor-supplied level 3 BLAS library we would expect to have all GEMM-ratios less than 1. However, this is not always the case (e.g., see results in [17, 18] and Section 7). A value greater than one implies that the GEMM-based implementation is faster than the user-specified implementation for the given problem configuration. The collected "mean value" stat...

1 | Paragon Basic Math Library Performance Report
- Corporation
- 1993

1 | Design Issues and the Performance of Level 1 and Level 2 Kernels on Intel i860-based Platforms
- Dackland
- 1995
Citation Context: ...PARAGON are displayed in Table 11. The underlying routines of the GEMM-based library are from lkmath, except for DGEMV for which we use an optimized assembler version (denoted KD-DGEMV in the tables) [5]. This routine stores parts of A in cache memory and thereby makes it possible to attain level 3 performance of consecutive GEMV operations where the A-block is kept fixed but x is varied (see Section 4.4....

1 | di Brozolo
- Dongarra, Mayes, et al.
- 1991
Citation Context: ...ized DGEMM routine for the machine (developed by Bernhard Przywara and denoted BP-DGEMM in the tables) which enabled us to try out the GEMM-based model implementations. BP-DGEMM builds on the work in [8]. The remaining underlying routines of the GEMM-based library are from the original level 1 and 2 BLAS model implementations. In Table 13 we show the GEMM-ratios from a single node of Parsytec GC/Powe...

1 | Optimization of Level 3 BLAS for SIEMENS VP Systems
- Grasemann
- 1989
Citation Context: ...example [2, 1, 16, 4, 24], while others provide optimized versions of some or none of the routines. Vendor-independent groups have also developed tuned level 3 kernels for different machines, for example [23, 17, 13, 6, 14], where some are based on the GEMM-based concept [17, 6, 14]. Today different implementations with different performance characteristics coexist and it is becoming more important to evaluate different impl...

1 | Performance Level 3 BLAS. A KSR Implementation. Working note
- High
- 1994
Citation Context: ...example [2, 1, 16, 4, 24], while others provide optimized versions of some or none of the routines. Vendor-independent groups have also developed tuned level 3 kernels for different machines, for example [23, 17, 13, 6, 14], where some are based on the GEMM-based concept [17, 6, 14]. Today different implementations with different performance characteristics coexist and it is becoming more important to evaluate different impl...

1 | ALGORITHM XYZ. GEMM-Based Level 3 BLAS: High-Performance Model Implementations and Performance Evaluation Benchmark
- Kågström, Ling, et al.
- 1995
Citation Context: ...nciples behind and the model implementations, and the performance evaluation benchmark. Moreover, we report results from extensive testings on several high-performance platforms. In a companion paper [20], we describe the installation and tuning of the GEMM-based model implementations, and the use and installation of the performance evaluation benchmark. Before we go into any further details we outline...

1 | A Set of High Performance
- Ling
- 1993
Citation Context: ...ge format for the matrix multiply operation. T1 does not need to fit in the cache. With this approach for SYR2K all handling of the memory hierarchy becomes local to GEMM. This approach was first used in [23]. Notice also that no level 2 BLAS is used in this implementation. The characteristics of the GEMM-based SYR2K are summarized in Table 6. Table 6: Characteristics of the SYR2K implementation. 5.4 Tr...

1 | ALGORITHM XYZ. GEMM-Based Level-3 BLAS: A Portable, High-Performance Model Implementation and Benchmark
- Kågström, Ling, et al.
- 1995
Citation Context: ...Department of Computer Science, Cornell University, Ithaca, New York 14853-7501 We are in the process of finishing a two-part paper that will be submitted to ACM Transactions on Mathematical Software [12, 11]. The papers present portable and high-performance model implementations of the GEMM-based level 3 BLAS in Fortran 77 and the GEMM-based level 3 BLAS benchmark, which is a tool for performance eval...

1 | GEMM-Based Level-3 BLAS: A Portable, High-Performance Model Implementation and Benchmark
- Kågström, Ling, et al.
- 1995
Citation Context: ...Department of Computer Science, Cornell University, Ithaca, New York 14853-7501 We are in the process of finishing a two-part paper that will be submitted to ACM Transactions on Mathematical Software [12, 11]. The papers present portable and high-performance model implementations of the GEMM-based level 3 BLAS in Fortran 77 and the GEMM-based level 3 BLAS benchmark, which is a tool for performance eval...