Parallel tiled QR factorization for multicore architectures, 2007
Cited by 49 (26 self)
As multicore systems continue to gain ground in the High Performance Computing world, linear algebra algorithms have to be reformulated, or new algorithms developed, to take advantage of the architectural features of these new processors. Fine-grain parallelism becomes a major requirement and introduces the necessity of loose synchronization in the parallel execution of an operation. This paper presents an algorithm for the QR factorization where the operations can be represented as a sequence of small tasks that operate on square blocks of data. These tasks can be dynamically scheduled for execution based on the dependencies among them and on the availability of computational resources. This may result in an out-of-order execution of the tasks, which will completely hide the presence of intrinsically sequential tasks in the factorization. Performance comparisons are presented with the LAPACK algorithm for QR factorization, where parallelism can only be exploited at the level of the BLAS operations.
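The task decomposition described in this abstract can be sketched as follows. This is a minimal Python illustration, not the authors' code: the function name `tile_qr_tasks` and the task labels (`factor_diag`, `apply_diag`, `factor_pair`, `apply_pair`) are hypothetical, and the sketch only enumerates the tasks in one valid sequential order for a p x q grid of square tiles. A dynamic scheduler would be free to reorder them as long as the data dependencies are respected.

```python
def tile_qr_tasks(p, q):
    """List the tasks of a tiled QR factorization on a p x q grid of tiles.

    Each task is a tuple (label, k, m, n): the label is a hypothetical
    kernel name, k is the panel step, and (m, n) is the tile the task
    writes to. The order returned is one valid sequential order.
    """
    tasks = []
    for k in range(min(p, q)):
        # Factor the diagonal tile of step k.
        tasks.append(("factor_diag", k, k, k))
        # Apply that factorization to the tiles to its right.
        for n in range(k + 1, q):
            tasks.append(("apply_diag", k, k, n))
        for m in range(k + 1, p):
            # Factor the diagonal tile's R factor stacked on tile (m, k).
            tasks.append(("factor_pair", k, m, k))
            # Apply that factorization to tile pairs (k, n) and (m, n).
            for n in range(k + 1, q):
                tasks.append(("apply_pair", k, m, n))
    return tasks


if __name__ == "__main__":
    for task in tile_qr_tasks(3, 3):
        print(task)
```

On a 3 x 3 grid this lists 14 tasks, only 3 of which are diagonal-tile factorizations; the rest are the fine-grained work that a scheduler can interleave with them.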
Applying recursion to serial and parallel QR factorization leads to better performance
Cited by 47 (4 self)
this paper may be copied or distributed royalty free without further permission by computer-based and other information-service systems. Permission to republish any other portion of this paper must be obtained from the Editor.
Stability of Block Algorithms with Fast Level 3 BLAS - ACM Trans. Math. Soft, 1992
Cited by 33 (14 self)
Block algorithms are becoming increasingly popular in matrix computations. Since their basic unit of data is a submatrix rather than a scalar, they have a higher level of granularity than point algorithms, and this makes them well suited to high-performance computers. The numerical stability of the block algorithms in the new linear algebra program library LAPACK is investigated here. It is shown that these algorithms have backward error analyses in which the backward error bounds are commensurate with the error bounds for the underlying level 3 BLAS (BLAS3). One implication is that the block algorithms are as stable as the corresponding point algorithms when conventional BLAS3 are used. A second implication is that the use of BLAS3 based on fast matrix multiplication techniques affects the stability only insofar as it increases the constant terms in the normwise backward error bounds. For linear equation solvers employing LU factorization it is shown that fixed precision iterative re...
The Design of a Parallel Dense Linear Algebra Software Library: Reduction to Hessenberg, Tridiagonal, and Bidiagonal Form, 1995
Cited by 30 (5 self)
This paper discusses issues in the design of ScaLAPACK, a software library for performing dense linear algebra computations on distributed memory concurrent computers. These issues are illustrated using the ScaLAPACK routines for reducing matrices to Hessenberg, tridiagonal, and bidiagonal forms. These routines are important in the solution of eigenproblems. The paper focuses on how building blocks are used to create higher-level library routines. Results are presented that demonstrate the scalability of the reduction routines. The most commonly used building blocks in ScaLAPACK are the sequential BLAS, the Parallel BLAS (PBLAS) and the Basic Linear Algebra Communication Subprograms (BLACS). Each of the matrix reduction algorithms consists of a series of steps, in each of which one block column (or panel), and/or block row, of the matrix is reduced, followed by an update of the portion of the matrix that has not been factorized so far. This latter phase is performed usin...
A New O(n²) Algorithm for the Symmetric Tridiagonal Eigenvalue/Eigenvector Problem - In progress, 1997
New Serial and Parallel Recursive QR Factorization Algorithms for SMP Systems, 1998
Cited by 29 (6 self)
We present a new recursive algorithm for the QR factorization of an m by n matrix A. The recursion leads to an automatic variable blocking that allows us to replace a level 2 part in a standard block algorithm by level 3 operations. However, there are some additional costs for performing the updates, which prohibit the efficient use of the recursion for large n. This obstacle is overcome by using a hybrid recursive algorithm that outperforms the LAPACK algorithm DGEQRF by 78% to 21% as m = n increases from 100 to 1000. A successful parallel implementation on a PowerPC 604 based IBM SMP node based on dynamic load balancing is presented. For 2, 3, 4 processors and m = n = 2000 it shows speedups of 1.96, 2.99, and 3.92 compared to our uniprocessor algorithm. 1 Introduction. LAPACK algorithm DGEQRF requires more floating point operations than LAPACK algorithm DGEQR2, see [1]. Yet, DGEQRF outperforms DGEQR2 on an RS/6000 workstation by nearly a factor of 3 on large matrices. Dongarra, Kaufm...
Fast polar decomposition of an arbitrary matrix - SIAM J. Sci. Stat. Comput, 1990
Cited by 26 (8 self)
The polar decomposition of an m x n matrix A of full rank, where m ≥ n, can be computed using a quadratically convergent algorithm of Higham [SIAM J. Sci. Statist. Comput., 7 (1986), pp. 1160-1174]. The algorithm is based on a Newton iteration involving a matrix inverse. It is shown how, with the use of a preliminary complete orthogonal decomposition, the algorithm can be extended to arbitrary A. The use of the algorithm to compute the positive semidefinite square root of a Hermitian positive semidefinite matrix is also described. A hybrid algorithm that adaptively switches from the matrix inversion based iteration to a matrix multiplication based iteration due to Kovarik, and to Björck and Bowie, is formulated. The decision when to switch is made using a condition estimator. This "matrix multiplication rich" algorithm is shown to be more efficient on machines for which matrix multiplication can be executed 1.5 times faster than matrix inversion.
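For orientation, the two iterations mentioned in this abstract are commonly written as below. This is a sketch under stated assumptions rather than a quotation from the paper: it assumes a square nonsingular A (the full-rank rectangular case is usually reduced to this one through a preliminary factorization), and the second formula is the simplest member of the multiplication-based family, which is expected to converge only once the iterate is already close to unitary.

```latex
% Inversion-based Newton iteration (square, nonsingular A assumed):
\[
  X_0 = A, \qquad
  X_{k+1} = \tfrac{1}{2}\left( X_k + X_k^{-*} \right), \qquad
  X_k \to U, \qquad H = U^{*} A .
\]
% Multiplication-based iteration (simplest form, assumed here):
\[
  X_{k+1} = X_k \left( I + \tfrac{1}{2}\left( I - X_k^{*} X_k \right) \right).
\]
```

Here A = UH is the polar decomposition, with U unitary and H Hermitian positive semidefinite. In the scalar case x > 0 the first iteration reads x_{k+1} = (x_k + 1/x_k)/2, which converges to 1.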
A Framework for Symmetric Band Reduction, 1999
Cited by 26 (6 self)
In this paper, we generalize the ideas behind the RS-algorithms and the MHL-algorithm. We develop a band reduction algorithm that eliminates d subdiagonals of a symmetric banded matrix with semibandwidth b (d < b), in a fashion akin to the MHL tridiagonalization algorithm. Then, like the Rutishauser algorithm, the band reduction algorithm is repeatedly used until the reduced matrix is tridiagonal. If d = b − 1, it is the MHL-algorithm; and if d = 1 is used for each reduction step, it results in the Rutishauser algorithm. However, d need not be chosen this way; indeed, exploiting the freedom we have in choosing d leads to a class of algorithms for banded reduction and tridiagonalization with favorable computational properties. In particular, we can derive algorithms with
The Design and Implementation of the ScaLAPACK LU, QR, and Cholesky Factorization Routines, 1994
Cited by 24 (11 self)
This paper discusses the core factorization routines included in the ScaLAPACK library. These routines allow the factorization and solution of a dense system of linear equations via LU, QR, and Cholesky. They are implemented using a block cyclic data distribution, and are built using de facto standard kernels for matrix and vector operations (BLAS and its parallel counterpart PBLAS) and message passing communication (BLACS). In implementing the ScaLAPACK routines, a major objective was to parallelize the corresponding sequential LAPACK using the BLAS, BLACS, and PBLAS as building blocks, leading to straightforward parallel implementations without a significant loss in performance. We present the details of the implementation of the ScaLAPACK factorization routines, as well as performance and scalability results on the Intel iPSC/860, Intel Touchstone Delta, and Intel Paragon systems.
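The block cyclic data distribution mentioned in this abstract can be illustrated with a small index-mapping sketch. It is a generic illustration under stated assumptions, not ScaLAPACK code: the helper names `owner_and_local` and `owner_2d` are hypothetical, indices are 0-based, and the first block is assumed to live on process 0 in each grid dimension.

```python
def owner_and_local(i, nb, nprocs):
    """Map global index i to (process, local index) in one dimension.

    Blocks of nb consecutive indices are dealt out to nprocs processes
    in round-robin order (0-based; first block on process 0 is assumed).
    """
    block = i // nb                      # which global block holds i
    proc = block % nprocs                # round-robin owner of that block
    local = (block // nprocs) * nb + i % nb  # position in owner's storage
    return proc, local


def owner_2d(i, j, mb, nb, prow, pcol):
    """Map matrix entry (i, j) onto a prow x pcol process grid.

    Rows use block size mb and columns use block size nb; the two
    dimensions are distributed independently.
    """
    pr, li = owner_and_local(i, mb, prow)
    pc, lj = owner_and_local(j, nb, pcol)
    return (pr, pc), (li, lj)


if __name__ == "__main__":
    # Entry (5, 7) of a matrix split into 2 x 2 blocks on a 3 x 3 grid.
    print(owner_2d(5, 7, 2, 2, 3, 3))
```

Because consecutive blocks go to different processes, each process keeps a share of the trailing submatrix as the factorization advances, which is the usual motivation for this layout.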
Parallel Tridiagonalization through Two-Step Band Reduction - In Proceedings of the Scalable High-Performance Computing Conference, 1994
Cited by 22 (12 self)
We present a two-step variant of the "successive band reduction" paradigm for the tridiagonalization of symmetric matrices. Here we reduce a full matrix first to narrow-banded form and then to tridiagonal form. The first step allows easy exploitation of block orthogonal transformations. In the second step, we employ a new blocked version of a banded matrix tridiagonalization algorithm by Lang. In particular, we are able to express the update of the orthogonal transformation matrix in terms of block transformations. This expression leads to an algorithm that is almost entirely based on BLAS-3 kernels and has greatly improved data movement and communication characteristics. We also present some performance results on the Intel Touchstone DELTA and the IBM SP1. 1 Introduction. Reduction to tridiagonal form is a major step in eigenvalue computations for symmetric matrices. If the matrix is full, the conventional Householder tridiagonalization approach thereof [8] is the method of This work...

