Results 1  10
of
51
Optimizing the performance of sparse matrixvector multiplication
, 2000
"... Copyright 2000 by EunJin Im ..."
Applying recursion to serial and parallel QR factorization leads to better performance
"... this paper may be copied or distributed royalty free without further permission by computerbased and other informationservice systems. Permission to republish any other portion of this paper must be obtained from the Editor. ..."
Abstract

Cited by 51 (4 self)
 Add to MetaCart
this paper may be copied or distributed royalty free without further permission by computerbased and other informationservice systems. Permission to republish any other portion of this paper must be obtained from the Editor.
Stability of Block Algorithms with Fast Level 3 BLAS
 ACM Trans. Math. Soft
, 1992
"... . Block algorithms are becoming increasingly popular in matrix computations. Since their basic unit of data is a submatrix rather than a scalar they have a higher level of granularity than point algorithms, and this makes them wellsuited to highperformance computers. The numerical stability of the ..."
Abstract

Cited by 37 (15 self)
 Add to MetaCart
. Block algorithms are becoming increasingly popular in matrix computations. Since their basic unit of data is a submatrix rather than a scalar they have a higher level of granularity than point algorithms, and this makes them wellsuited to highperformance computers. The numerical stability of the block algorithms in the new linear algebra program library LAPACK is investigated here. It is shown that these algorithms have backward error analyses in which the backward error bounds are commensurate with the error bounds for the underlying level 3 BLAS (BLAS3). One implication is that the block algorithms are as stable as the corresponding point algorithms when conventional BLAS3 are used. A second implication is that the use of BLAS3 based on fast matrix multiplication techniques affects the stability only insofar as it increases the constant terms in the normwise backward error bounds. For linear equation solvers employing LU factorization it is shown that fixed precision iterative re...
The Design of a Parallel Dense Linear Algebra Software Library: Reduction to Hessenberg, Tridiagonal, and Bidiagonal Form
, 1995
"... This paper discusses issues in the design of ScaLAPACK, a software library for performing dense linear algebra computations on distributed memory concurrent computers. These issues are illustrated using the ScaLAPACK routines for reducing matrices to Hessenberg, tridiagonal, and bidiagonal forms. ..."
Abstract

Cited by 34 (5 self)
 Add to MetaCart
This paper discusses issues in the design of ScaLAPACK, a software library for performing dense linear algebra computations on distributed memory concurrent computers. These issues are illustrated using the ScaLAPACK routines for reducing matrices to Hessenberg, tridiagonal, and bidiagonal forms. These routines are important in the solution of eigenproblems. The paper focuses on how building blocks are used to create higherlevel library routines. Results are presented that demonstrate the scalability of the reduction routines. The most commonlyused building blocks used in ScaLAPACK are the sequential BLAS, the Parallel BLAS (PBLAS) and the Basic Linear Algebra Communication Subprograms (BLACS). Each of the matrix reduction algorithms consists of a series of steps in each of which one block column (or panel), and/or block row, of the matrix is reduced, followed by an update of the portion of the matrix that has not been factorized so far. This latter phase is performed usin...
Fast polar decomposition of an arbitrary matrix
 SIAM J. Sci. Stat. Comput
, 1990
"... Abstract. The polar decomposition of an m x n matrix A of full rank, where rn n, can be computed usingaquadraticallyconvergentalgorithmofHigham SIAMJ. Sci. Statist. Comput.,7 (1986), pp. 11601174]. The algorithm is based on a Newton iteration involving a matrix inverse. It is shown how, with the us ..."
Abstract

Cited by 32 (9 self)
 Add to MetaCart
Abstract. The polar decomposition of an m x n matrix A of full rank, where rn n, can be computed usingaquadraticallyconvergentalgorithmofHigham SIAMJ. Sci. Statist. Comput.,7 (1986), pp. 11601174]. The algorithm is based on a Newton iteration involving a matrix inverse. It is shown how, with the use of a preliminary complete orthogonal decomposition, the algorithm can be extended to arbitrary A. The use ofthe algorithm to compute the positive semidefinite square root ofa Hermitian positive semidefinite matrix is also described. A hybrid algorithm that adaptively switches from the matrix inversion based iteration to a matrix multiplication based iteration due to Kovarik, and to Bj6rck and Bowie, is formulated. The decision when to switch is made using a condition estimator. This "matrix multiplication rich " algorithm is shown to be more efficient on machines for which matrix multiplication can be executed 1.5 times faster than matrix inversion.
New Serial and Parallel Recursive QR Factorization Algorithms for SMP Systems
, 1998
"... . We present a new recursive algorithm for the QR factorization of an m by n matrix A. The recursion leads to an automatic variable blocking that allow us to replace a level 2 part in a standard block algorithm by level 3 operations. However, there are some additional costs for performing the update ..."
Abstract

Cited by 31 (6 self)
 Add to MetaCart
. We present a new recursive algorithm for the QR factorization of an m by n matrix A. The recursion leads to an automatic variable blocking that allow us to replace a level 2 part in a standard block algorithm by level 3 operations. However, there are some additional costs for performing the updates which prohibits the efficient use of the recursion for large n. This obstacle is overcome by using a hybrid recursive algorithm that outperforms the LAPACK algorithm DGEQRF by 78% to 21% as m = n increases from 100 to 1000. A successful parallel implementation on a PowerPC 604 based IBM SMP node based on dynamic load balancing is presented. For 2, 3, 4 processors and m = n = 2000 it shows speedups of 1.96, 2.99, and 3.92 compared to our uniprocessor algorithm. 1 Introduction LAPACK algorithm DGEQRF requires more floating point operations than LAPACK algorithm DGEQR2, see [1]. Yet, DGEQRF outperforms DGEQR2 on a RS/6000 workstation by nearly a factor of 3 on large matrices. Dongarra, Kaufm...
ScaLAPACK: A Linear Algebra Library for MessagePassing Computers
 In SIAM Conference on Parallel Processing
, 1997
"... This article outlines the content and performance of some of the ScaLAPACK software. ScaLAPACK is a collection of mathematical software for linear algebra computations on distributedmemory computers. The importance of developing standards for computational and messagepassing interfaces is discusse ..."
Abstract

Cited by 27 (3 self)
 Add to MetaCart
This article outlines the content and performance of some of the ScaLAPACK software. ScaLAPACK is a collection of mathematical software for linear algebra computations on distributedmemory computers. The importance of developing standards for computational and messagepassing interfaces is discussed. We present the different components and building blocks of ScaLAPACK and provide initial performance results for selected PBLAS routines and a subset of ScaLAPACK driver routines.
Computing RankRevealing QR Factorizations of Dense Matrices
 Argonne Preprint ANLMCSP5590196, Argonne National Laboratory
, 1996
"... this paper, and we give only a brief synopsis here. For details, the reader is referred to the code. Test matrices 1 through 5 were designed to exercise column pivoting. Matrix 6 was designed to test the behavior of the condition estimation in the presence of clusters for the smallest singular value ..."
Abstract

Cited by 26 (1 self)
 Add to MetaCart
this paper, and we give only a brief synopsis here. For details, the reader is referred to the code. Test matrices 1 through 5 were designed to exercise column pivoting. Matrix 6 was designed to test the behavior of the condition estimation in the presence of clusters for the smallest singular value. For the other cases, we employed the LAPACK matrix generator xLATMS, which generates random symmetric matrices by multiplying a diagonal matrix with prescribed singular values by random orthogonal matrices from the left and right. For the break1 distribution, all singular values are 1.0 except for one. In the arithmetic and geometric distributions, they decay from 1.0 to a specified smallest singular value in an arithmetic and geometric fashion, respectively. In the "reversed" distributions, the order of the diagonal entries was reversed. For test cases 7 though 12, we used xLATMS to generate a matrix of order
Fast linear algebra is stable
 In preparation
, 2006
"... In [23] we showed that a large class of fast recursive matrix multiplication algorithms is stable in a normwise sense, and that in fact if multiplication of nbyn matrices can be done by any algorithm in O(n ω+η) operations for any η> 0, then it can be done stably in O(n ω+η) operations for any η> ..."
Abstract

Cited by 25 (15 self)
 Add to MetaCart
In [23] we showed that a large class of fast recursive matrix multiplication algorithms is stable in a normwise sense, and that in fact if multiplication of nbyn matrices can be done by any algorithm in O(n ω+η) operations for any η> 0, then it can be done stably in O(n ω+η) operations for any η> 0. Here we extend this result to show that essentially all standard linear algebra operations, including LU decomposition, QR decomposition, linear equation solving, matrix inversion, solving least squares problems, (generalized) eigenvalue problems and the singular value decomposition can also be done stably (in a normwise sense) in O(n ω+η) operations. 1