The spectral decomposition of nonsymmetric matrices on distributed memory parallel computers
SIAM J. Sci. Comput., 1997
Cited by 29 (10 self)
Abstract. The implementation and performance of a class of divide-and-conquer algorithms for computing the spectral decomposition of nonsymmetric matrices on distributed memory parallel computers are studied in this paper. After presenting a general framework, we focus on a spectral divide-and-conquer (SDC) algorithm with Newton iteration. Although the algorithm requires several times as many floating point operations as the best serial QR algorithm, it can be simply constructed from a small set of highly parallelizable matrix building blocks within Level 3 basic linear algebra subroutines (BLAS). Efficient implementations of these building blocks are available on a wide range of machines. In some ill-conditioned cases, the algorithm may lose numerical stability, but this can easily be detected and compensated for. The algorithm reached 31% efficiency with respect to the underlying PUMMA matrix multiplication and 82% efficiency with respect to the underlying ScaLAPACK matrix inversion on a 256 processor Intel Touchstone Delta system, and 41% efficiency with respect to the matrix multiplication in CMSSL on a 32 node Thinking Machines CM-5 with vector units. Our performance model predicts the performance reasonably accurately. To take advantage of the geometric nature of SDC algorithms, we have designed a graphical user interface to let the user choose the spectral decomposition according to specified regions in the complex plane.
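A minimal serial sketch of one spectral divide-and-conquer step may help make the idea concrete. It assumes a real matrix with no eigenvalues on the imaginary axis and uses the unscaled Newton iteration for the matrix sign function. The function names are illustrative only, and the SVD is used here to obtain an orthonormal basis of the invariant subspace purely for convenience in NumPy; this is not claimed to be the rank-revealing step of the paper's parallel implementation.

```python
import numpy as np

def matrix_sign_newton(A, tol=1e-12, max_iter=100):
    """Newton iteration X <- (X + inv(X)) / 2, which converges to sign(A)
    when A has no eigenvalues on the imaginary axis (assumed here)."""
    X = np.array(A, dtype=float)
    for _ in range(max_iter):
        X_next = 0.5 * (X + np.linalg.inv(X))
        if np.linalg.norm(X_next - X, 1) <= tol * np.linalg.norm(X_next, 1):
            return X_next
        X = X_next
    raise RuntimeError("Newton iteration did not converge")

def split_spectrum(A):
    """Split A into two diagonal blocks: eigenvalues with negative real part
    (first block) and positive real part (second block)."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    S = matrix_sign_newton(A)
    # Spectral projector onto the invariant subspace for Re(lambda) < 0.
    P = 0.5 * (np.eye(n) - S)
    k = int(round(np.trace(P)))      # dimension of that subspace
    # Leading k left singular vectors of P span range(P); the full Q is orthogonal.
    Q, _, _ = np.linalg.svd(P)
    B = Q.T @ A @ Q                  # block upper triangular up to rounding
    coupling = np.linalg.norm(B[k:, :k])
    return B[:k, :k], B[k:, k:], coupling
```

The returned `coupling` norm is the size of the block that should vanish; a value that is not small relative to the norm of A is one simple way to notice that the step has lost accuracy. Applying the same step recursively to the two diagonal blocks (after shifting or transforming them so that a new dividing line separates their eigenvalues) gives the divide-and-conquer structure.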
A Parallel Version of the Quasi-Minimal Residual Method Based on Coupled Two-Term Recurrences
1996
Cited by 12 (1 self)
For the solution of linear systems of equations with unsymmetric coefficient matrix, Freund and Nachtigal (SIAM J. Sci. Comput. 15 (1994), 313--337) proposed a Krylov subspace method called Quasi-Minimal Residual method (QMR). The two main ingredients of QMR are the unsymmetric Lanczos algorithm and the quasi-minimal residual approach that minimizes a factor of the residual vector rather than the residual itself. The Lanczos algorithm spans a Krylov subspace by generating two sequences of biorthogonal vectors called Lanczos vectors. Due to the orthogonalization and scaling of the Lanczos vectors, algorithms that make use of the Lanczos process contain inner products leading to global communication and synchronization on parallel processors. For massively parallel computers, these effects cause delays preventing scalability of the implementation. Consequently, parallel algorithms should avoid global synchronization as far as possible. We propose a new version of QMR with the followin...
Communication-avoiding parallel and sequential QR factorizations
2008
Cited by 10 (4 self)
We present parallel and sequential dense QR factorization algorithms that are optimized to avoid communication. Some of these are novel, and some extend earlier work. Communication includes both messages between processors (in the parallel case), and data movement between slow and fast memory (in either the sequential or parallel cases). Our first algorithm, Tall Skinny QR (TSQR), factors m × n matrices in a one-dimensional (1-D) block cyclic row layout, storing the Q factor (if desired) implicitly as a tree of blocks of Householder reflectors. TSQR is optimized for matrices with many more rows than columns (hence the name). In the parallel case, TSQR requires no more than the minimum number of messages Θ(log P) between P processors. In the sequential case, TSQR transfers 2mn + o(mn) words between slow and fast memory, which is the theoretical lower bound, and performs Θ(mn/W) block reads and writes (as a function of the fast memory size W), which is
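The reduction-tree idea behind TSQR can be sketched serially as follows. This is only an illustration under simplifying assumptions: the matrix is split into contiguous row blocks rather than a block cyclic layout, only the R factor is computed (the implicit tree representation of Q is omitted), and the function name `tsqr_r` and the `num_blocks` parameter are made up for this example.

```python
import numpy as np

def tsqr_r(A, num_blocks=4):
    """Return the R factor of a tall-skinny matrix A using a binary
    reduction tree over row blocks."""
    blocks = np.array_split(np.asarray(A, dtype=float), num_blocks, axis=0)
    # Leaf stage: each block is factored independently (one per processor
    # in a parallel setting).
    rs = [np.linalg.qr(b, mode="r") for b in blocks]
    # Reduction stage: stack pairs of R factors and factor them again,
    # halving the number of factors at each tree level.
    while len(rs) > 1:
        rs = [np.linalg.qr(np.vstack(rs[i:i + 2]), mode="r")
              for i in range(0, len(rs), 2)]
    return rs[0]
```

With P leaf blocks the loop runs about log2(P) times, which corresponds to the number of message rounds in the parallel setting. The result agrees with a direct QR factorization up to the signs of the rows of R.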
A Parallel Algorithm for Computing the Polar Decomposition
1994
Cited by 8 (3 self)
The polar decomposition A = UH of a rectangular matrix A, where U is unitary and H is Hermitian positive semidefinite, is an important tool in various applications, including aerospace computations, factor analysis and signal processing. We consider a pth order iteration for computing U that involves p independent matrix inversions per step and which is hence very amenable to parallel computation. We show that scaling the iterates speeds convergence of the iteration but makes the iteration only conditionally stable, with the backward error typically κ₂(A) times bigger than the unit roundoff. In our implementation of the iteration on the Kendall Square Research KSR1 virtual shared memory MIMD computer we take p to be the number of processors (p ≤ 16 in our experiments). Our code is found to be significantly faster than two existing techniques for computing the polar decomposition: one a Newton iteration, the other based on the singular value decomposition. Key words. polar d...
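For orientation, the classical unscaled Newton iteration for the unitary polar factor, which the abstract names as one of the existing techniques, can be sketched as below. This is not the pth order parallel iteration of the paper; it assumes a square nonsingular A, and the function name is illustrative.

```python
import numpy as np

def polar_newton(A, tol=1e-12, max_iter=100):
    """Polar decomposition A = U H of a square nonsingular matrix via the
    Newton iteration X <- (X + inv(X)^H) / 2, starting from X = A."""
    A = np.asarray(A, dtype=complex)
    X = A.copy()
    for _ in range(max_iter):
        X_next = 0.5 * (X + np.linalg.inv(X).conj().T)
        converged = (np.linalg.norm(X_next - X, 1)
                     <= tol * np.linalg.norm(X_next, 1))
        X = X_next
        if converged:
            break
    else:
        raise RuntimeError("Newton iteration did not converge")
    U = X
    H = U.conj().T @ A
    H = 0.5 * (H + H.conj().T)   # symmetrize to remove rounding noise
    return U, H
```

Each step needs a single matrix inversion, whereas the iteration studied in the paper performs p independent inversions per step so that they can be distributed over p processors.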
On the error analysis and implementation of some eigenvalue decomposition and singular value decomposition algorithms
1996
Cited by 2 (0 self)
Many algorithms exist for computing the symmetric eigendecomposition, the singular value decomposition and the generalized singular value decomposition. In this thesis, we present several new algorithms and improvements on old algorithms, analyzing them with respect to their speed, accuracy, and storage requirements. We first discuss the variations on the bisection algorithm for finding eigenvalues of symmetric tridiagonal matrices. We show the challenges in implementing a correct algorithm with floating point arithmetic. We show how reasonable looking but incorrect implementations can fail. We carefully define correctness, and present several implementations that we rigorously prove correct. We then discuss a fast implementation of bisection using parallel prefix. We show many numerical examples of the instability of this algorithm, and then discuss its forward error and backward error analysis. We also discuss possible ways to stabilize it by using iterative refinement. Finally, we discuss how to use a divide-and-conquer algorithm to compute the singular value decomposition and solve the linear least squares problem, and how to implement
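A textbook form of bisection for symmetric tridiagonal matrices is sketched below to show what is being analyzed. It is a plain serial version based on the Sturm count, not one of the thesis's implementations, and it carries none of their correctness proofs. The function names and the `pivmin` safeguard value are assumptions made for this example.

```python
import math

def count_below(diag, off, x, pivmin=1e-300):
    """Number of eigenvalues less than x of the symmetric tridiagonal matrix
    with diagonal `diag` and off-diagonal `off` (Sturm count)."""
    count = 0
    d = 1.0
    for i, a in enumerate(diag):
        b2 = off[i - 1] ** 2 if i > 0 else 0.0
        d = (a - x) - b2 / d
        if abs(d) < pivmin:      # avoid dividing by zero in the next step
            d = -pivmin
        if d < 0.0:
            count += 1
    return count

def gershgorin_bounds(diag, off):
    """Interval guaranteed to contain all eigenvalues."""
    n = len(diag)
    lo, hi = math.inf, -math.inf
    for i in range(n):
        r = (abs(off[i - 1]) if i > 0 else 0.0) + (abs(off[i]) if i < n - 1 else 0.0)
        lo = min(lo, diag[i] - r)
        hi = max(hi, diag[i] + r)
    return lo, hi

def kth_eigenvalue(diag, off, k, tol=1e-12):
    """k-th smallest eigenvalue (k = 0, 1, ...) by bisection."""
    lo, hi = gershgorin_bounds(diag, off)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mid == lo or mid == hi:   # interval cannot be split further
            break
        if count_below(diag, off, mid) > k:
            hi = mid                 # eigenvalue k lies below mid
        else:
            lo = mid
    return 0.5 * (lo + hi)
```

Bisection relies on `count_below` being monotone in `x`. In exact arithmetic it always is, but rounding in the recurrence can break this for a carelessly written count, which is the kind of failure the abstract refers to.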
Communication-efficient parallel generic
Cited by 1 (0 self)
The model of bulk-synchronous parallel (BSP) computation is an emerging paradigm of general-purpose parallel computing. In this paper, we consider the parallel complexity of generic pairwise elimination, special cases of which include Gaussian elimination with pairwise pivoting, Gaussian elimination over a finite field, generic Neville elimination and Givens reduction. We develop a new block-recursive, communication-efficient BSP algorithm for generic pairwise elimination.

