Results 1-10 of 28
The Design and Implementation of the ScaLAPACK LU, QR, and Cholesky Factorization Routines
, 1994
Abstract

Cited by 23 (10 self)
This paper discusses the core factorization routines included in the ScaLAPACK library. These routines allow the factorization and solution of a dense system of linear equations via LU, QR, and Cholesky. They are implemented using a block cyclic data distribution, and are built using de facto standard kernels for matrix and vector operations (BLAS and its parallel counterpart PBLAS) and message passing communication (BLACS). In implementing the ScaLAPACK routines, a major objective was to parallelize the corresponding sequential LAPACK using the BLAS, BLACS, and PBLAS as building blocks, leading to straightforward parallel implementations without a significant loss in performance. We present the details of the implementation of the ScaLAPACK factorization routines, as well as performance and scalability results on the Intel iPSC/860, Intel Touchstone Delta, and Intel Paragon systems.
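The two-dimensional block cyclic data distribution mentioned above can be sketched in a few lines; the function name and parameters below are illustrative, not ScaLAPACK's actual interface.

```python
# Minimal sketch of the 2D block cyclic distribution used by ScaLAPACK:
# the matrix is split into nb x nb blocks, and block (I, J) is assigned to
# process (I mod p_rows, J mod p_cols). Names here are illustrative only.

def owner(i, j, nb, p_rows, p_cols):
    """Return the (process-row, process-col) that owns matrix entry (i, j)."""
    return ((i // nb) % p_rows, (j // nb) % p_cols)

# Example: an 8x8 matrix with 2x2 blocks on a 2x2 process grid.
grid = [[owner(i, j, 2, 2, 2) for j in range(8)] for i in range(8)]
```

Entries two rows (or columns) apart land on different process rows (or columns), which is what balances the shrinking trailing submatrix across the grid during factorization.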
A Parallel Version of the Unsymmetric Lanczos Algorithm and its Application to QMR
, 1996
Abstract

Cited by 16 (3 self)
A new version of the unsymmetric Lanczos algorithm without lookahead is described combining elements of numerical stability and parallel algorithm design. Firstly, stability is obtained by a coupled two-term procedure that generates Lanczos vectors scaled to unit length. Secondly, the algorithm is derived by making all inner products of a single iteration step independent such that global synchronization on parallel distributed memory computers is reduced. Among the algorithms using the Lanczos process as a major component, the quasi-minimal residual (QMR) method for the solution of systems of linear equations is illustrated by an elegant derivation. The resulting QMR algorithm maintains the favorable properties of the Lanczos algorithm while not increasing computational costs as compared with its corresponding original version.
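The synchronization-reduction idea can be illustrated in miniature: when the inner products of one iteration step are independent, each process can compute all its partial sums in one sweep and combine them in a single global reduction rather than one per inner product. The following pure-Python stand-in for an MPI allreduce is a sketch under that assumption; all names are illustrative.

```python
# Sketch: two independent inner products per Lanczos step are reduced with
# ONE global reduction over a length-2 buffer instead of two separate
# reductions. Pure-Python stand-in for MPI; all names are illustrative.

def local_partial_dots(v_local, w_local, z_local):
    # Both inner products computed from this process's local slices.
    d1 = sum(a * b for a, b in zip(v_local, w_local))
    d2 = sum(a * b for a, b in zip(v_local, z_local))
    return (d1, d2)

def allreduce_sum(pairs):
    # Stand-in for a single MPI_Allreduce over the packed partial sums.
    return (sum(p[0] for p in pairs), sum(p[1] for p in pairs))

# Two "processes", each holding half of the vectors v, w, z.
parts = [local_partial_dots([1, 2], [3, 4], [5, 6]),
         local_partial_dots([3, 4], [1, 2], [0, 1])]
vw, vz = allreduce_sum(parts)
```

On a real distributed-memory machine the payoff is halving the number of global synchronization points per iteration, which is exactly the latency cost the abstract targets.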
Cache efficient bidiagonalization using BLAS 2.5 operators
, 2003
Abstract

Cited by 12 (1 self)
On cache based computer architectures using current standard algorithms, Householder bidiagonalization requires a significant portion of the execution time for computing matrix singular values and vectors. In this paper we reorganize the sequence of operations for Householder bidiagonalization of a general m × n matrix, so that two (GEMV) vector-matrix multiplications can be done with one pass of the unreduced trailing part of the matrix through cache. Two new BLAS 2.5 operations approximately cut in half the transfer of data from main memory to cache. We give detailed algorithm descriptions and compare timings with the current LAPACK bidiagonalization algorithm.
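The fusion idea can be sketched as follows: computing y = A x and z = Aᵀ w in a single sweep means each row of A is read from memory once instead of twice. The function name and signature below are illustrative, not the paper's interface.

```python
# Sketch of the BLAS 2.5 fusion: two matrix-vector products (y = A x and
# z = A^T w) share one pass over the rows of A, so the matrix crosses the
# memory-cache boundary once. Illustrative code, not the paper's kernel.

def fused_gemv(A, x, w):
    m, n = len(A), len(A[0])
    y = [0.0] * m          # y = A x
    z = [0.0] * n          # z = A^T w
    for i in range(m):     # single pass over the rows of A
        row = A[i]
        y[i] = sum(row[j] * x[j] for j in range(n))
        for j in range(n):
            z[j] += row[j] * w[i]
    return y, z
```

Since a large trailing matrix does not fit in cache, halving the number of passes over it is what approximately halves the main-memory traffic the abstract refers to.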
Fault Tolerant Matrix Operations Using Checksum and Reverse Computation
, 1996
Abstract

Cited by 11 (6 self)
In this paper, we present a technique, based on checksum and reverse computation, that enables high-performance matrix operations to be fault-tolerant with low overhead. We have implemented this technique on five matrix operations: matrix multiplication, Cholesky factorization, LU factorization, QR factorization and Hessenberg reduction. The overhead of checkpointing and recovery is analyzed both theoretically and experimentally. These analyses confirm that our technique can provide fault tolerance for these high-performance matrix operations with low overhead.

1 Introduction
The price and performance of uniprocessor workstations and off-the-shelf networking have made networks of workstations (NOWs) a cost-effective parallel processing platform that is competitive with supercomputers. The popularity of NOW programming environments like PVM [14] and MPI [17, 30] and the availability of high-performance numerical libraries like ScaLAPACK (Scalable Linear Algebra PACKage) [7] for scienti...
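The checksum encoding behind such schemes can be shown in a toy form: append a checksum row holding column sums, and a lost row of data can be rebuilt from the checksum and the surviving rows. This is an illustration of the general idea only, not the paper's implementation.

```python
# Toy sketch of checksum-based recovery: a checksum row stores column sums,
# so any single lost data row can be reconstructed by subtraction.
# Illustrative only; the paper applies this to distributed factorizations.

def add_checksum_row(A):
    ncols = len(A[0])
    checksum = [sum(row[j] for row in A) for j in range(ncols)]
    return A + [checksum]

def recover_row(Ac, lost):
    """Rebuild data row `lost` from the checksum row and surviving rows."""
    ncols = len(Ac[0])
    data, checksum = Ac[:-1], Ac[-1]
    return [checksum[j] - sum(data[i][j] for i in range(len(data)) if i != lost)
            for j in range(ncols)]

A = [[1.0, 2.0], [3.0, 4.0]]
Ac = add_checksum_row(A)        # checksum row is [4.0, 6.0]
restored = recover_row(Ac, 0)   # rebuilds the first data row
```

The appeal for factorizations is that many of them preserve such checksum relationships as the computation proceeds, so recovery needs no full checkpoint of the matrix.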
Fault Tolerant Matrix Operations for Parallel and Distributed Systems
, 1996
Abstract

Cited by 8 (1 self)
With the proliferation of parallel and distributed systems, it is an increasingly important problem to render parallel applications fault-tolerant because such applications are more prone to failures with an increasing number of processors. This dissertation explores fault tolerance in a wide variety of matrix operations for parallel and distributed scientific computing. It proposes a novel computing paradigm to provide fault tolerance for numerical algorithms. This fault-tolerant computing paradigm relies on checkpointing and rollback recovery using processor and memory redundancy. The paradigm is an algorithm-based approach, in which fault tolerance techniques are tailored into each numerical algorithm without redesigning the algorithm and replicating the processes. The paradigm tolerates the changing and failure-prone nature of a computing platform, thereby allowing users to run their parallel codes dynamically and efficiently. This dissertation describes the fault-tolerant implemen...
A novel parallel QR algorithm for hybrid distributed memory HPC systems, Technical report 2009-15, Seminar for applied mathematics
, 2009
Abstract

Cited by 8 (3 self)
A novel variant of the parallel QR algorithm for solving dense nonsymmetric eigenvalue problems on hybrid distributed high performance computing (HPC) systems is presented. For this purpose, we introduce the concept of multiwindow bulge chain chasing and parallelize aggressive early deflation. The multiwindow approach ensures that most computations when chasing chains of bulges are performed in level 3 BLAS operations, while the aim of aggressive early deflation is to speed up the convergence of the QR algorithm. Mixed MPI-OpenMP coding techniques are utilized for porting the codes to distributed memory platforms with multithreaded nodes, such as multicore processors. Numerous numerical experiments confirm the superior performance of our parallel QR algorithm in comparison with the existing ScaLAPACK code, leading to an implementation that is one to two orders of magnitude faster for sufficiently large problems, including a number of examples from applications.
Parallel Application Software on High Performance Computers - Parallel Diagonalisation Routines.
, 1996
Abstract

Cited by 6 (1 self)
In this report we list diagonalisation routines available for parallel computers. The methodology of each routine is outlined together with benchmark results on a typical matrix where available. Storage requirements and advantages and disadvantages of the method are also compared. The vast majority of these routines are available for real dense symmetric matrices only, although there is a known requirement for other data types, such as Hermitian or structured sparse matrices. We will report on new codes as they become available. This report is available from http://www.dl.ac.uk/TCSC/HPCI/ © 1996, Daresbury Laboratory. We do not accept any responsibility for loss or damage arising from the use of information contained in any of our reports or in any communication about our tests or investigations.
Parallel eigenvalue reordering in real Schur forms
Abstract

Cited by 6 (3 self)
A parallel variant of the standard eigenvalue reordering method for the real Schur form is presented and discussed. The novel parallel algorithm adopts computational windows and delays multiple outside-window updates until each window has been completely reordered locally. By using multiple concurrent windows the parallel algorithm has a high level of concurrency, and most work is level 3 BLAS operations. The presented algorithm is also extended to the generalized real Schur form. Experimental results for ScaLAPACK-style Fortran 77 implementations on a Linux cluster confirm the efficiency and scalability of our algorithms, with a parallel speedup of more than 16 using 64 processors for large scale problems. Even on a single processor our implementation is demonstrated to perform significantly better than the state-of-the-art serial implementation.
A Parallel Algorithm for the Reduction to Tridiagonal Form for Eigendecomposition
, 1995
Abstract

Cited by 5 (3 self)
A new algorithm for the orthogonal reduction of a symmetric matrix to tridiagonal form is developed and analysed. It uses a Cholesky factorization of the original matrix and the rotations are applied to the factors. The idea is similar to the one used for the one-sided Jacobi algorithms [B. Zhou and R. Brent, A Parallel Ordering Algorithm for Efficient One-Sided Jacobi SVD Computations,