Results 1–10 of 42
GEMM-Based Level 3 BLAS: High-Performance Model Implementations and Performance Evaluation Benchmark
ACM Transactions on Mathematical Software, 1998
Abstract

Cited by 89 (8 self)
The level 3 Basic Linear Algebra Subprograms (BLAS) are designed to perform various matrix multiply and triangular system solving computations. Due to the complex hardware organization of advanced computer architectures, the development of optimal level 3 BLAS code is costly and time consuming. However, it is possible to develop a portable and high-performance level 3 BLAS library relying mainly on a highly optimized GEMM, the routine for the general matrix multiply and add operation. With suitable partitioning, all the other level 3 BLAS can be defined in terms of GEMM and a small amount of level 1 and level 2 computations. Our contribution is twofold. First, the model implementations in Fortran 77 of the GEMM-based level 3 BLAS are structured to reduce data traffic in a memory hierarchy effectively. Second, the GEMM-based level 3 BLAS performance evaluation benchmark is a tool for evaluating and comparing different implementations of the level 3 BLAS with the GEMM-based model implementations.
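The core idea above — expressing the other level 3 BLAS in terms of GEMM plus a small amount of lower-level work — can be illustrated with a blocked triangular matrix multiply. The sketch below is not the paper's Fortran 77 model implementation; it is a minimal NumPy illustration, and the function name and block size are assumptions for the example:

```python
import numpy as np

def trmm_via_gemm(L, B, nb=64):
    """Compute B := L @ B for a lower-triangular L, expressing the bulk of
    the work as general matrix multiplies (GEMM) on off-diagonal blocks.
    Illustrative sketch; nb is an arbitrary block size."""
    n = L.shape[0]
    B = B.copy()
    # Walk block rows bottom-up so each block row is overwritten only
    # after the rows below it no longer need its old value.
    for i in range(((n - 1) // nb) * nb, -1, -nb):
        i1 = min(i + nb, n)
        # Diagonal block: a small triangular multiply (nb x nb work).
        B[i:i1] = np.tril(L[i:i1, i:i1]) @ B[i:i1]
        # Off-diagonal blocks: one GEMM accumulates the rest of the row.
        if i > 0:
            B[i:i1] += L[i:i1, :i] @ B[:i]
    return B
```

Almost all arithmetic lands in the GEMM call, so only that kernel needs machine-specific tuning.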
Auto-Blocking Matrix-Multiplication or Tracking BLAS3 Performance from Source Code
In Proceedings of the Sixth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 1997
Abstract

Cited by 76 (6 self)
An elementary, machine-independent, recursive algorithm for matrix multiplication C += A*B provides implicit blocking at every level of the memory hierarchy and tests out faster than classically optimal code, tracking hand-coded BLAS3 routines. "Proof of concept" is demonstrated by racing the in-place algorithm against the manufacturer's hand-tuned BLAS3 routines; it can win. The recursive code bifurcates naturally at the top level into independent block-oriented processes, each of which writes to a disjoint and contiguous region of memory. Experience has shown that the indexing vastly improves the patterns of memory access at all levels of the memory hierarchy, independently of the sizes of caches or pages and without ad hoc programming. It also exposed a weakness in SGI's C compilers, which merrily unroll loops for the superscalar R8000 processor but do not analogously unfold the base cases of the most elementary recursions. Such deficiencies might deter programmers from using this rich class of recursive algorithms.
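The recursion described above can be sketched in a few lines. This is not the authors' code; it is a minimal NumPy illustration of the splitting rule (halve the largest of the three dimensions), and the base-case size is an arbitrary tuning knob:

```python
import numpy as np

def rec_matmul(C, A, B, base=32):
    """Recursive C += A @ B, splitting the largest dimension in half.
    Blocking at every level of the memory hierarchy falls out implicitly;
    no cache or page sizes appear anywhere in the code."""
    m, k = A.shape
    _, n = B.shape
    if max(m, n, k) <= base:
        C += A @ B          # base case: a small dense multiply
        return
    if m >= n and m >= k:   # split the rows of A and C
        h = m // 2
        rec_matmul(C[:h], A[:h], B, base)
        rec_matmul(C[h:], A[h:], B, base)
    elif n >= k:            # split the columns of B and C
        h = n // 2
        rec_matmul(C[:, :h], A, B[:, :h], base)
        rec_matmul(C[:, h:], A, B[:, h:], base)
    else:                   # split the shared dimension k
        h = k // 2
        rec_matmul(C, A[:, :h], B[:h], base)
        rec_matmul(C, A[:, h:], B[h:], base)
```

The top-level split over rows or columns of C yields the independent, disjointly-writing subproblems the abstract mentions; the k-split reuses the same region of C.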
Using Strassen's Algorithm to Accelerate the Solution of Linear Systems
J. Supercomputing, 1991
Abstract

Cited by 39 (1 self)
Strassen's algorithm for fast matrix-matrix multiplication has been implemented for matrices of arbitrary shapes on the CRAY-2 and CRAY Y-MP supercomputers. Several techniques have been used to reduce the scratch space requirement for this algorithm while simultaneously preserving a high level of performance. When the resulting Strassen-based matrix multiply routine is combined with some routines from the new LAPACK library, LU decomposition can be performed at rates significantly higher than those achieved by conventional ...
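For reference, the classical Strassen recursion that the paper builds on can be sketched as follows. This illustration ignores the paper's central concerns (arbitrary shapes and scratch-space reduction) and simply shows the seven-multiplication step for even-sized square matrices, with an arbitrary cutoff:

```python
import numpy as np

def strassen(A, B, cutoff=64):
    """Strassen's recursion: 7 half-size multiplications instead of 8.
    Falls back to an ordinary product below the cutoff or for odd sizes."""
    n = A.shape[0]
    if n <= cutoff or n % 2:
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C
```

The seven temporaries M1..M7 are exactly the scratch space the paper works to reduce.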
Stability of Block Algorithms with Fast Level 3 BLAS
ACM Trans. Math. Soft., 1992
Abstract

Cited by 37 (15 self)
Block algorithms are becoming increasingly popular in matrix computations. Since their basic unit of data is a submatrix rather than a scalar, they have a higher level of granularity than point algorithms, and this makes them well-suited to high-performance computers. The numerical stability of the block algorithms in the new linear algebra program library LAPACK is investigated here. It is shown that these algorithms have backward error analyses in which the backward error bounds are commensurate with the error bounds for the underlying level 3 BLAS (BLAS3). One implication is that the block algorithms are as stable as the corresponding point algorithms when conventional BLAS3 are used. A second implication is that the use of BLAS3 based on fast matrix multiplication techniques affects the stability only insofar as it increases the constant terms in the normwise backward error bounds. For linear equation solvers employing LU factorization, it is shown that fixed precision iterative re...
GEMMW: A Portable Level 3 BLAS Winograd Variant of Strassen's Matrix-Matrix Multiply Algorithm
1994
Abstract

Cited by 33 (1 self)
Matrix-matrix multiplication is normally computed using one of the BLAS or a reinvention of part of the BLAS. Unfortunately, the BLAS were designed with small matrices in mind. When huge, well-conditioned matrices are multiplied together, the BLAS perform like the blahs, even on vector machines. For matrices whose coefficients are well conditioned, Winograd's variant of Strassen's algorithm offers some relief, but it is rarely available in a quality form on most computers. We reconsider this method and offer a highly portable solution based on the Level 3 BLAS interface.
Key words: Level 3 BLAS, matrix multiplication, Winograd's variant of Strassen's algorithm, multilevel algorithms
AMS(MOS) subject classification: Numerical Analysis, Numerical Linear Algebra
1. Preliminaries. Matrix-matrix multiplication is a very basic computer operation. A very clear description of how to do it can be found in many textbooks, e.g., [1]. Suppose we want to multiply two matrices A : M × ...
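Winograd's variant, on which GEMMW is based, uses the same seven block multiplications as Strassen's algorithm but only 15 block additions per level instead of 18. A minimal sketch of one recursion level (not the GEMMW code, and restricted to even-sized square matrices with an arbitrary cutoff) looks like:

```python
import numpy as np

def winograd_strassen(A, B, cutoff=64):
    """Winograd's variant of Strassen's algorithm: 7 multiplications and
    15 additions per level. Falls back to an ordinary product below the
    cutoff or for odd sizes."""
    n = A.shape[0]
    if n <= cutoff or n % 2:
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # 8 additions forming the operands...
    S1 = A21 + A22; S2 = S1 - A11; S3 = A11 - A21; S4 = A12 - S2
    S5 = B12 - B11; S6 = B22 - S5; S7 = B22 - B12; S8 = S6 - B21
    # ...7 recursive multiplications...
    P1 = winograd_strassen(A11, B11, cutoff)
    P2 = winograd_strassen(A12, B21, cutoff)
    P3 = winograd_strassen(S4, B22, cutoff)
    P4 = winograd_strassen(A22, S8, cutoff)
    P5 = winograd_strassen(S1, S5, cutoff)
    P6 = winograd_strassen(S2, S6, cutoff)
    P7 = winograd_strassen(S3, S7, cutoff)
    # ...and 7 additions assembling C, for 15 additions in total.
    T1 = P1 + P6; T2 = T1 + P7
    C = np.empty_like(A)
    C[:h, :h] = P1 + P2
    C[:h, h:] = T1 + P5 + P3
    C[h:, :h] = T2 - P4
    C[h:, h:] = T2 + P5
    return C
```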
Implementation of Strassen's Algorithm for Matrix Multiplication
In Proceedings of Supercomputing '96, 1996
Abstract

Cited by 31 (0 self)
In this paper we report on the development of an efficient and portable implementation of Strassen's matrix multiplication algorithm. Our implementation is designed to be used in place of DGEMM, the Level 3 BLAS matrix multiplication routine. Efficient performance is obtained for all matrix sizes and shapes, and the additional memory needed for temporary variables has been minimized. Replacing DGEMM with our routine should provide a significant performance gain for large matrices while providing the same performance for small matrices. We measure the performance of our code on the IBM RS/6000, the CRAY Y-MP C90, and a single processor of the CRAY T3D, and offer comparisons to other codes. Our performance data reconfirm that Strassen's algorithm is practical for matrices of realistic size. The usefulness of our implementation is demonstrated by replacing DGEMM with our routine in a large application code.
A Tensor Product Formulation of Strassen's Matrix Multiplication Algorithm
Appl. Math. Letters, 1990
Abstract

Cited by 27 (13 self)
In this paper, we present a program generation strategy for Strassen's matrix multiplication algorithm using a programming methodology based on tensor product formulas. In this methodology, block recursive programs such as the fast Fourier transform and Strassen's matrix multiplication algorithm are expressed as algebraic formulas involving tensor products and other matrix operations. Such formulas can be systematically translated to high-performance parallel/vector codes for various architectures. In this paper, we present a non-recursive implementation of Strassen's algorithm for shared memory vector processors such as the Cray Y-MP. A previous implementation of Strassen's algorithm synthesized from tensor product formulas required working storage of size O(7^n) for multiplying 2^n × 2^n matrices. We present a modified formulation in which the working storage requirement is reduced to O(4^n). The modified formulation exhibits sufficient parallelism for efficient implem...
Fast Linear Algebra is Stable
In preparation, 2006
Abstract

Cited by 25 (15 self)
In [23] we showed that a large class of fast recursive matrix multiplication algorithms is stable in a normwise sense, and that in fact if multiplication of n-by-n matrices can be done by any algorithm in O(n^(ω+η)) operations for any η > 0, then it can be done stably in O(n^(ω+η)) operations for any η > 0. Here we extend this result to show that essentially all standard linear algebra operations, including LU decomposition, QR decomposition, linear equation solving, matrix inversion, solving least squares problems, (generalized) eigenvalue problems, and the singular value decomposition, can also be done stably (in a normwise sense) in O(n^(ω+η)) operations.
Stability of Methods for Matrix Inversion
1992
Abstract

Cited by 24 (11 self)
Inversion of a triangular matrix can be accomplished in several ways. The standard methods are characterised by the loop ordering, by whether matrix-vector multiplication, solution of a triangular system, or a rank-1 update is done inside the outer loop, and by whether the method is blocked or unblocked. The numerical stability properties of these methods are investigated. It is shown that unblocked methods satisfy pleasing bounds on the left or right residual. However, for one of the block methods it is necessary to convert a matrix multiplication into the solution of a multiple right-hand side triangular system in order to have an acceptable residual bound. The inversion of a full matrix given a factorization PA = LU is also considered, including the special cases of symmetric indefinite and symmetric positive definite matrices. Three popular methods are shown to possess satisfactory residual bounds, subject to a certain requirement on the implementation, and an attractive new method is described. This work was motivated by the question of what inversion methods should be used in LAPACK.
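One of the unblocked orderings the paper analyses — computing each column of X = U^(-1) by solving the triangular system U x_j = e_j — can be sketched as follows. The function name and structure are illustrative, not taken from the paper or from LAPACK:

```python
import numpy as np

def inv_upper_triangular(U):
    """Unblocked, column-oriented inversion of an upper-triangular matrix:
    column j of X = inv(U) solves U x_j = e_j by back substitution."""
    n = U.shape[0]
    X = np.zeros_like(U)
    for j in range(n):
        X[j, j] = 1.0 / U[j, j]
        # Back-substitute upward for the entries above the diagonal.
        for i in range(j - 1, -1, -1):
            X[i, j] = -U[i, i + 1:j + 1] @ X[i + 1:j + 1, j] / U[i, i]
    return X
```

Because each column is computed independently as a triangular solve, this ordering enjoys a small right residual: U X − I is small column by column.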
Communication and Matrix Computations on Large Message Passing Systems
1990
Abstract

Cited by 14 (0 self)
This paper is concerned with the consequences for matrix computations of having a rather large number of general-purpose processors, say ten or twenty thousand, connected in a network in such a way that a processor can communicate only with its immediate neighbors. Certain communication tasks associated with most matrix algorithms are defined, and formulas are developed for the time required to perform them under several communication regimes. The results are compared with the times for a nominal n ...