Results 1–10 of 20
Communication-avoiding parallel and sequential QR factorizations
, 2008
"... We present parallel and sequential dense QR factorization algorithms that are optimized to avoid communication. Some of these are novel, and some extend earlier work. Communication includes both messages between processors (in the parallel case), and data movement between slow and fast memory (in ei ..."
Abstract

Cited by 19 (10 self)
 Add to MetaCart
We present parallel and sequential dense QR factorization algorithms that are optimized to avoid communication. Some of these are novel, and some extend earlier work. Communication includes both messages between processors (in the parallel case), and data movement between slow and fast memory (in either the sequential or parallel cases). Our first algorithm, Tall Skinny QR (TSQR), factors m × n matrices in a one-dimensional (1D) block cyclic row layout, storing the Q factor (if desired) implicitly as a tree of blocks of Householder reflectors. TSQR is optimized for matrices with many more rows than columns (hence the name). In the parallel case, TSQR requires no more than the minimum number of messages Θ(log P) between P processors. In the sequential case, TSQR transfers 2mn + o(mn) words between slow and fast memory, which is the theoretical lower bound, and performs Θ(mn/W) block reads and writes (as a function of the fast memory size W), which is
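The tree reduction the TSQR abstract describes is compact enough to sketch: each block row gets a local QR, and the stacked R factors are then reduced pairwise up a binary tree. A minimal NumPy sketch, where the function name and block count are illustrative (the real algorithm also keeps the tree of local Q blocks so Q can be applied implicitly):

```python
# Hedged sketch of a TSQR-style tree reduction; not the paper's reference code.
import numpy as np

def tsqr(A, n_blocks=4):
    """Compute the R factor of a tall-skinny A (m >> n) via a tree of local QRs."""
    blocks = np.array_split(A, n_blocks, axis=0)
    # Leaf level: one local Householder QR per block row, keep only R.
    rs = [np.linalg.qr(b, mode="r") for b in blocks]
    # Reduce pairwise up the tree: stack two R factors, re-factor, repeat.
    while len(rs) > 1:
        rs = [np.linalg.qr(np.vstack(rs[i:i + 2]), mode="r")
              for i in range(0, len(rs), 2)]
    return rs[0]  # n x n upper triangular
```

Each tree level communicates only small n × n triangles, which is what lets the parallel variant meet the Θ(log P) message bound quoted above.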
Efficient Parallel Out-of-Core Implementation of the Cholesky Factorization
, 1999
"... In this paper we describe two efficient parallel outofcore implementations of the Cholesky factorization. We use the Parallel OutofCore Linear Algebra Package (POOCLAPACK) as an extension to the Parallel Linear Algebra Package (PLAPACK) to implement our outofcore algorithms. The first algorith ..."
Abstract

Cited by 7 (1 self)
 Add to MetaCart
In this paper we describe two efficient parallel out-of-core implementations of the Cholesky factorization. We use the Parallel Out-of-Core Linear Algebra Package (POOCLAPACK) as an extension to the Parallel Linear Algebra Package (PLAPACK) to implement our out-of-core algorithms. The first algorithm uses in-core kernels with additional code to manage the I/O. This is the classical approach to out-of-core implementations of the Cholesky factorization. Our second algorithm adds an out-of-core implementation of the triangular solve with multiple right-hand sides, which does not simply bring the data in-core and run the in-core algorithm. This algorithm has the added benefit of requiring fewer copies of matrix blocks to be in-core at once, thus allowing a larger portion of the matrix to reside in core at one time. Despite the extreme simplicity of POOCLAPACK and our out-of-core algorithm, the out-of-core Cholesky factorization implementation is shown to achieve in excess of 80% of peak performance on a 64-node configuration of the Cray T3E-600.
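The classical approach the abstract mentions, in-core kernels plus explicit code that manages the I/O, can be illustrated with a toy model in which tiles live in a dict standing in for disk. The tile layout, helper names, and I/O counter are assumptions for this sketch, not POOCLAPACK's API:

```python
# Illustrative out-of-core blocked Cholesky (lower triangular, tile granularity).
import numpy as np

def ooc_cholesky(disk, nb):
    """Factor the SPD matrix stored tile-wise in disk[(i, j)] over an nb x nb grid."""
    io = {"reads": 0, "writes": 0}
    def read(i, j):
        io["reads"] += 1
        return disk[(i, j)].copy()          # explicit "disk -> memory" move
    def write(i, j, tile):
        io["writes"] += 1
        disk[(i, j)] = tile                 # explicit "memory -> disk" move
    for k in range(nb):
        akk = np.linalg.cholesky(read(k, k))            # in-core kernel (POTRF)
        write(k, k, akk)
        for i in range(k + 1, nb):                      # triangular solves (TRSM)
            write(i, k, np.linalg.solve(akk, read(i, k).T).T)
        for j in range(k + 1, nb):                      # trailing update (SYRK/GEMM)
            ljk = read(j, k)
            for i in range(j, nb):
                write(i, j, read(i, j) - read(i, k) @ ljk.T)
    return io
```

Counting reads and writes this way makes visible why block size choice dominates out-of-core performance: every trailing-update tile is re-read at each step k.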
Packed Storage Extension for ScaLAPACK
"... We describe a new extension to ScaLAPACK [2] for computing with symmetric (Hermitian) matrices stored in a packed form. The new code is built upon the ScaLAPACK routines for full dense storage for a high degree of software reuse. The original ScaLAPACK stores a symmetric matrix as a full matrix but ..."
Abstract

Cited by 6 (1 self)
 Add to MetaCart
We describe a new extension to ScaLAPACK [2] for computing with symmetric (Hermitian) matrices stored in a packed form. The new code is built upon the ScaLAPACK routines for full dense storage for a high degree of software reuse. The original ScaLAPACK stores a symmetric matrix as a full matrix but accesses only the lower or upper triangular part. The new code enables more efficient use of memory by storing only the lower or upper triangular part of a symmetric (Hermitian) matrix. The packed storage scheme distributes the matrix by block column panels. Within each panel, the matrix is stored as a regular ScaLAPACK matrix. This storage arrangement simplifies the subroutine interface and code reuse. Routines PxPPTRF/PxPPTRS implement the Cholesky factorization and solution for symmetric (Hermitian) linear systems in packed storage. Routines PxSPEV/PxSPEVX (PxHPEV/PxHPEVX) implement the computation of eigenvalues and eigenvectors for symmetric (Hermitian) matrices in packed storage. Routines PxSPGVX (PxHPGVX) implement the expert driver for the generalized eigenvalue problem for symmetric (Hermitian) matrices in packed storage. Performance results on the Intel Paragon suggest that the packed storage scheme incurs only a small time overhead over the full storage scheme.
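The panel-wise packed layout described above is easy to picture: only the lower-trapezoidal part of each block column panel is kept, and each panel is itself an ordinary dense array. A small sketch with hypothetical helper names (these are not the ScaLAPACK extension's routines), showing the index mapping and the roughly halved storage:

```python
# Hedged sketch of lower-triangular packed storage by block column panels.
import numpy as np

def pack_lower(A, bs):
    """Keep, per block column panel, only the rows on or below the diagonal block."""
    n = A.shape[0]
    return [A[j:, j:min(j + bs, n)].copy() for j in range(0, n, bs)]

def packed_get(panels, bs, i, j):
    """Read A[i, j] from packed storage, reflecting across the diagonal by symmetry."""
    if i < j:
        i, j = j, i                # symmetric: A[i, j] == A[j, i]
    p = j // bs                    # panel holding global column j
    col0 = p * bs                  # first global column (and row) of that panel
    return panels[p][i - col0, j - col0]
```

Because each panel is a contiguous dense submatrix, existing full-storage kernels can run on a panel unchanged, which is the software-reuse point the abstract makes.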
A parallel distributed solver for large dense symmetric systems: applications to geodesy and electromagnetism problems
Int. J. of High Performance Computing Applications
"... In this paper we describe the parallel distributed implementation of a linear solver for largescale applications involving real symmetric positive definite or complex symmetric nonHermitian dense systems. The advantage of this routine is that it performs a Cholesky factorization by requiring half ..."
Abstract

Cited by 4 (2 self)
 Add to MetaCart
In this paper we describe the parallel distributed implementation of a linear solver for large-scale applications involving real symmetric positive definite or complex symmetric non-Hermitian dense systems. The advantage of this routine is that it performs a Cholesky factorization while requiring half the storage needed by the standard parallel libraries ScaLAPACK and PLAPACK. Our solver uses a J-variant Cholesky algorithm and a one-dimensional block-cyclic column data distribution, but gives similar Gigaflops performance when applied to problems that can be solved on moderately parallel computers with up to 32 processors. Experiments and performance comparisons with ScaLAPACK and PLAPACK on our target applications are presented. These applications arise from the Earth’s gravity field recovery and computational electromagnetics.
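A J-variant (left-looking, column-by-column) Cholesky only ever reads the lower triangle, which is what makes the half-storage scheme workable. A serial NumPy sketch of the variant, assuming nothing about the paper's distributed implementation:

```python
# Minimal left-looking ("j-variant") Cholesky; only the lower triangle is read.
import numpy as np

def chol_j_variant(A):
    """Return lower triangular L with L @ L.T == A for SPD A."""
    n = A.shape[0]
    L = np.tril(A).astype(float)          # upper triangle is never touched
    for j in range(n):
        # Update column j with the already-factored columns to its left.
        L[j:, j] -= L[j:, :j] @ L[j, :j]
        L[j, j] = np.sqrt(L[j, j])        # diagonal element of L
        L[j + 1:, j] /= L[j, j]           # scale the subdiagonal of column j
    return L
```

With a one-dimensional block-cyclic column distribution, each of these column updates becomes a broadcast of the finished columns followed by local work, which is why the storage can be halved without restructuring the communication pattern much.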
Solving “Large” Dense Matrix Problems on Multi-Core Processors and GPUs
, 2009
"... Few realize that, for large matrices, many dense matrix computations achieve nearly the same performance when the matrices are stored on disk as when they are stored in a very large main memory. Similarly, few realize that, given the right programming abstractions, coding OutofCore (OOC) implement ..."
Abstract

Cited by 3 (0 self)
 Add to MetaCart
Few realize that, for large matrices, many dense matrix computations achieve nearly the same performance when the matrices are stored on disk as when they are stored in a very large main memory. Similarly, few realize that, given the right programming abstractions, coding Out-of-Core (OOC) implementations of dense linear algebra operations (where data resides on disk and has to be explicitly moved in and out of main memory) is no more difficult than programming high-performance implementations for the case where the matrix is in memory. Finally, few realize that on a contemporary eight-core architecture or a platform equipped with a graphics processor (GPU) one can solve a 100,000 × 100,000 symmetric positive definite linear system in about one hour. Thus, for problems that used to be considered large, it is not necessary to utilize distributed-memory architectures with massive memories if one is willing to wait longer for the solution to be computed on a fast multithreaded architecture like a multicore computer or a GPU. This paper provides evidence in support of these claims.
Prospectus for the Next LAPACK and ScaLAPACK Libraries
"... Dense linear algebra (DLA) forms the core of many scientific computing applications. Consequently, there is continuous interest and demand for the development of increasingly better algorithms in the field. Here ’better ’ has a broad meaning, and includes improved reliability, accuracy, robustness, ..."
Abstract

Cited by 2 (0 self)
 Add to MetaCart
Dense linear algebra (DLA) forms the core of many scientific computing applications. Consequently, there is continuous interest and demand for the development of increasingly better algorithms in the field. Here ‘better’ has a broad meaning, and includes improved reliability, accuracy, robustness, ease of use, and
IMPLEMENTING COMMUNICATION-OPTIMAL PARALLEL AND SEQUENTIAL QR FACTORIZATIONS
, 809
"... Abstract. We present parallel and sequential dense QR factorization algorithms for tall and skinny matrices and general rectangular matrices that both minimize communication, and are as stable as Householder QR. The sequential and parallel algorithms for tall and skinny matrices lead to significant ..."
Abstract

Cited by 2 (1 self)
 Add to MetaCart
We present parallel and sequential dense QR factorization algorithms for tall and skinny matrices and general rectangular matrices that both minimize communication and are as stable as Householder QR. The sequential and parallel algorithms for tall and skinny matrices lead to significant speedups in practice over some of the existing algorithms, including LAPACK and ScaLAPACK, for example up to 6.7x over ScaLAPACK. The parallel algorithm for general rectangular matrices is estimated to show significant speedups over ScaLAPACK, up to 22x.
Effective out-of-core parallel Delaunay mesh refinement using off-the-shelf software
 In 20th IEEE International Parallel and Distributed Processing Symposium
, 2006
"... Software ..."
Implementation of Out-of-Core Cholesky and QR Factorizations with POOCLAPACK
, 2000
"... In this paper parallel implementation of outofcore Cholesky factorization is used to introduce the Parallel OutofCore Linear Algebra Package (POOCLAPACK), a flexible infrastructure for parallel implementation of outofcore linear algebra operations. POOCLAPACK builds on the Parallel Linear A ..."
Abstract

Cited by 1 (0 self)
 Add to MetaCart
In this paper parallel implementation of out-of-core Cholesky factorization is used to introduce the Parallel Out-of-Core Linear Algebra Package (POOCLAPACK), a flexible infrastructure for parallel implementation of out-of-core linear algebra operations. POOCLAPACK builds on the Parallel Linear Algebra Package (PLAPACK) for in-core parallel dense linear algebra computation. Despite the extreme simplicity of POOCLAPACK, the out-of-core Cholesky factorization implementation is shown to achieve in excess of 80% of peak performance on a 64-node configuration of the Cray T3E-600. The insights gained from examining the Cholesky factorization have been applied to the much more difficult and important QR factorization operation. Preliminary results for parallel implementation of the resulting OOC QR factorization algorithm are included.