Results 1–10 of 28
Parallel Numerical Linear Algebra
, 1993
"... We survey general techniques and open problems in numerical linear algebra on parallel architectures. We first discuss basic principles of parallel processing, describing the costs of basic operations on parallel machines, including general principles for constructing efficient algorithms. We illust ..."
Abstract

Cited by 575 (26 self)
We survey general techniques and open problems in numerical linear algebra on parallel architectures. We first discuss basic principles of parallel processing, describing the costs of basic operations on parallel machines, including general principles for constructing efficient algorithms. We illustrate these principles using current architectures and software systems, and by showing how one would implement matrix multiplication. Then, we present direct and iterative algorithms for solving linear systems of equations, linear least squares problems, the symmetric eigenvalue problem, the nonsymmetric eigenvalue problem, and the singular value decomposition. We consider dense, band and sparse matrices.
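The survey's matrix-multiplication illustration can be rendered as a block-partitioned serial routine; distributing the tiles across processors is where the parallel variants diverge. This is a minimal sketch, assuming a simple square tiling, not any particular library's implementation:

```python
import numpy as np

def blocked_matmul(A, B, bs=64):
    """Block-partitioned matrix multiply C = A @ B.

    Operating on bs-by-bs tiles keeps each tile resident in fast
    memory (or local to one processor), which is the basic principle
    the survey describes. The tile size bs is a tuning assumption;
    ragged trailing blocks are handled by slice clipping.
    """
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((n, m))
    for i in range(0, n, bs):
        for j in range(0, m, bs):
            for p in range(0, k, bs):
                C[i:i+bs, j:j+bs] += A[i:i+bs, p:p+bs] @ B[p:p+bs, j:j+bs]
    return C
```

On a distributed machine, the inner tile product becomes a local computation and the loop ordering determines the communication pattern.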
Software libraries for linear algebra computations on high performance computers
 SIAM REVIEW
, 1995
"... This paper discusses the design of linear algebra libraries for high performance computers. Particular emphasis is placed on the development of scalable algorithms for MIMD distributed memory concurrent computers. A brief description of the EISPACK, LINPACK, and LAPACK libraries is given, followed b ..."
Abstract

Cited by 67 (16 self)
This paper discusses the design of linear algebra libraries for high performance computers. Particular emphasis is placed on the development of scalable algorithms for MIMD distributed memory concurrent computers. A brief description of the EISPACK, LINPACK, and LAPACK libraries is given, followed by an outline of ScaLAPACK, which is a distributed memory version of LAPACK currently under development. The importance of block-partitioned algorithms in reducing the frequency of data movement between different levels of hierarchical memory is stressed. The use of such algorithms helps reduce the message startup costs on distributed memory concurrent computers. Other key ideas in our approach are the use of distributed versions of the Level 3 Basic Linear Algebra Subprograms (BLAS) as computational building blocks, and the use of Basic Linear Algebra Communication Subprograms (BLACS) as communication building blocks. Together the distributed BLAS and the BLACS can be used to construct highe...
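The idea of casting a factorization in terms of Level 3 BLAS-style building blocks (panel factorization, triangular solve, rank-k update) can be sketched serially with a blocked Cholesky; the block size is illustrative, and the distributed BLAS/BLACS machinery of ScaLAPACK is not modeled here:

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def blocked_cholesky(A, bs=4):
    """Right-looking blocked Cholesky, A = L @ L.T, for SPD A.

    Each step maps to a Level-3-style operation: a small dense
    factorization of the diagonal block, a triangular solve for the
    panel below it (TRSM), and a symmetric rank-bs update of the
    trailing submatrix (SYRK). This mirrors how LAPACK-family
    libraries structure factorizations; bs is an assumption.
    """
    A = A.astype(float).copy()
    n = A.shape[0]
    for k in range(0, n, bs):
        e = min(k + bs, n)
        A[k:e, k:e] = cholesky(A[k:e, k:e], lower=True)   # panel factor
        if e < n:
            # L21 = A21 @ inv(L11).T, via a triangular solve (TRSM)
            A[e:, k:e] = solve_triangular(A[k:e, k:e], A[e:, k:e].T,
                                          lower=True).T
            # trailing update A22 -= L21 @ L21.T (SYRK)
            A[e:, e:] -= A[e:, k:e] @ A[e:, k:e].T
    return np.tril(A)
```

In the distributed setting each of these three steps becomes a call into the distributed BLAS, with the BLACS handling the broadcasts of panels along process rows and columns.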
The Design of Linear Algebra Libraries for High Performance Computers
, 1993
"... This paper discusses the design of linear algebra libraries for high performance computers. Particular emphasis is placed on the development of scalable algorithms for MIMD distributed memory concurrent computers. A brief description of the EISPACK, LINPACK, and LAPACK libraries is given, followe ..."
Abstract

Cited by 16 (1 self)
This paper discusses the design of linear algebra libraries for high performance computers. Particular emphasis is placed on the development of scalable algorithms for MIMD distributed memory concurrent computers. A brief description of the EISPACK, LINPACK, and LAPACK libraries is given, followed by an outline of ScaLAPACK, which is a distributed memory version of LAPACK currently under development. The importance of block-partitioned algorithms in reducing the frequency of data movement between different levels of hierarchical memory is stressed. The use of such algorithms helps reduce the message startup costs on distributed memory concurrent computers. Other key ideas in our approach are the use of distributed versions of the Level 3 Basic Linear Algebra Subprograms (BLAS) as computational building blocks, and the use of Basic Linear Algebra Communication Subprograms (BLACS) as communication building blocks. Together the distributed BLAS and the BLACS can be used to construct ...
Templates for Linear Algebra Problems
, 1995
"... The increasing availability of advancedarchitecture computers is having a very significant effect on all spheres of scientific computation, including algorithm research and software development in numerical linear algebra. Linear algebra  in particular, the solution of linear systems of equation ..."
Abstract

Cited by 5 (1 self)
The increasing availability of advanced-architecture computers is having a very significant effect on all spheres of scientific computation, including algorithm research and software development in numerical linear algebra. Linear algebra, in particular the solution of linear systems of equations and eigenvalue problems, lies at the heart of most calculations in scientific computing. This chapter discusses some of the recent developments in linear algebra designed to help the user on advanced-architecture computers. Much of the work in developing linear algebra software for advanced-architecture computers is motivated by the need to solve large problems on the fastest computers available. In this chapter, we focus on four basic issues: (1) the motivation for the work; (2) the development of standards for use in linear algebra and the building blocks for a library; (3) aspects of templates for the solution of large sparse systems of linear equations; and (4) templates for the solu...
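One of the iterative "templates" for large sparse linear systems is the conjugate gradient method; a minimal sketch for a symmetric positive definite system follows (a dense matrix-vector product stands in for the sparse one a real application would supply):

```python
import numpy as np

def cg(A, b, tol=1e-10, maxit=1000):
    """Conjugate gradient template for an SPD system A x = b.

    Only matrix-vector products with A are required, which is why
    the template formulation decouples the algorithm from the sparse
    matrix data structure. tol and maxit are illustrative defaults.
    """
    x = np.zeros_like(b, dtype=float)
    r = b - A @ x                 # initial residual
    p = r.copy()                  # initial search direction
    rs = r @ r
    for _ in range(maxit):
        Ap = A @ p
        alpha = rs / (p @ Ap)     # optimal step along p
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p # new A-conjugate direction
        rs = rs_new
    return x
```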
Application of a High Performance Parallel Eigensolver to Electronic Structure Calculations
"... In this paper we report the development of a very high performance parallel eigensolver based on the portable ScaLAPACK library, and its application to electronic structure calculations in the MPQuest code. This work was done on ASCIRed, a supercomputer based on over 4600 dualprocessor Pentiu ..."
Abstract

Cited by 5 (0 self)
In this paper we report the development of a very high performance parallel eigensolver based on the portable ScaLAPACK library, and its application to electronic structure calculations in the MPQuest code. This work was done on ASCI Red, a supercomputer based on over 4600 dual-processor Pentium Pro nodes at Sandia National Laboratories. We report sustained performance in the code of 605 GFlops and peak performance in the eigensolver of 684 GFlops. This is comparable to performance obtained from MP Linpack on a similarly sized problem. For a smaller problem we have sustained performance of 420 GFlops in the application and peak performance in the eigensolver of 563 GFlops. The impact of this work on the specific application is important, but the development of significant improvements to a portable eigensolver and other libraries will also benefit a number of applications.
1 Introduction
MPQuest is a parallel electronic structure program which is used extensively in productio...
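The computational kernel here is the dense symmetric eigenproblem. A serial sketch with numpy stands in for the ScaLAPACK eigensolver the paper benchmarks; the random matrix and the residual check are purely illustrative, not the paper's test problem:

```python
import numpy as np

# Dense symmetric eigenproblem: find w, V with A @ V = V @ diag(w).
rng = np.random.default_rng(0)
n = 200
A = rng.standard_normal((n, n))
A = (A + A.T) / 2.0                      # symmetrize
w, V = np.linalg.eigh(A)                 # eigenvalues w, eigenvectors V

# A standard accuracy check: the relative residual ||A V - V diag(w)||.
resid = np.linalg.norm(A @ V - V * w) / np.linalg.norm(A)
assert resid < 1e-10
```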
A Summary of Block Schemes for Reducing a General Matrix to Hessenberg Form
, 1993
"... 1 1 Introduction 1 2 Block Representations for Representing Products of Householder Matrices 1 3 Block Schemes for Reducing a General Matrix to Hessenberg Form 2 4 Discussion of the Various Update Schemes 4 References 6 ii A Summary of Block Schemes for Reducing a General Matrix to Hessenberg Form ..."
Abstract

Cited by 4 (0 self)
Contents: 1 Introduction; 2 Block Representations for Representing Products of Householder Matrices; 3 Block Schemes for Reducing a General Matrix to Hessenberg Form; 4 Discussion of the Various Update Schemes; References.
A Summary of Block Schemes for Reducing a General Matrix to Hessenberg Form, by Christian H. Bischof.
Abstract. Various strategies have been proposed for arriving at block algorithms for reducing a general matrix to Hessenberg form by means of orthogonal similarity transformations. This paper reviews and systematically categorizes the various strategies and discusses their computational characteristics.
1 Introduction
Let A be an n × n symmetric matrix. Our goal is to compute an orthogonal matrix Q, with Q^T Q = I, such that Q^T A Q = H, where H is of upper Hessenberg form. The standard algorithm [10] reduces A one column at a time through Householder transformations at a cost of O(4/3 n^3) flops. It mainly employs matrix-vector multiplications and symmetric r...
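The column-at-a-time Householder reduction described in the introduction can be written, for a general real matrix, as a short unblocked routine; this serial sketch shows the baseline that the block schemes in the paper reorganize, and all names in it are illustrative:

```python
import numpy as np

def hessenberg(A):
    """Unblocked Householder reduction to upper Hessenberg form.

    Returns (H, Q) with Q.T @ A @ Q = H, Q orthogonal, and the
    entries of H below the first subdiagonal numerically zero.
    One Householder reflector annihilates each column below the
    subdiagonal, using matrix-vector products, as the paper notes.
    """
    H = A.astype(float).copy()
    n = H.shape[0]
    Q = np.eye(n)
    for k in range(n - 2):
        x = H[k+1:, k].copy()
        v = x.copy()
        v[0] += (np.sign(x[0]) or 1.0) * np.linalg.norm(x)  # stable sign
        nv = np.linalg.norm(v)
        if nv == 0.0:
            continue                     # column already reduced
        v /= nv                          # unit reflector: P = I - 2 v v^T
        H[k+1:, k:] -= 2.0 * np.outer(v, v @ H[k+1:, k:])   # apply P from left
        H[:, k+1:] -= 2.0 * np.outer(H[:, k+1:] @ v, v)     # apply P from right
        Q[:, k+1:] -= 2.0 * np.outer(Q[:, k+1:] @ v, v)     # accumulate Q
    return H, Q
```

The block schemes the paper surveys aggregate several such reflectors into a compact representation so the updates become matrix-matrix products.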
A Dense Complex Symmetric Indefinite Solver for the Fujitsu AP3000
, 1999
"... This paper describes the design, implementation and performance of a parallel direct dense symmetricindefinite solver routine. Such a solver is required for the large complex systems arising from electromagnetic field analysis, such as are generated from the AccuField application. The primary targ ..."
Abstract

Cited by 4 (1 self)
This paper describes the design, implementation and performance of a parallel direct dense symmetric-indefinite solver routine. Such a solver is required for the large complex systems arising from electromagnetic field analysis, such as are generated from the AccuField application. The primary target architecture for the solver is the Fujitsu AP3000, a distributed memory machine based on the UltraSPARC processor. The routine is written entirely in terms of the DBLAS Distributed Library, recently extended for complex precision. It uses the Bunch-Kaufman diagonal pivoting method and is based on the LAPACK algorithm, with several modifications required for efficient parallel implementation and one modification to reduce the amount of symmetric pivoting. Currently the routine uses a standard BLAS computational interface and can use either the MPI, BLACS or VPPLib communication interfaces (the latter is only available under the AP runtime V2.0 system for the AP3000). The routine outperforms its e...
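The Bunch-Kaufman diagonal pivoting method the routine uses can be illustrated serially with scipy's LDL factorization, which handles the symmetric indefinite case where plain Cholesky fails; this is a serial, real-valued illustration, not the paper's parallel complex DBLAS routine:

```python
import numpy as np
from scipy.linalg import ldl

# A small symmetric indefinite matrix (its eigenvalues have mixed
# signs, so Cholesky would fail, but diagonal pivoting succeeds).
A = np.array([[2.0, 4.0,  1.0],
              [4.0, 1.0,  0.0],
              [1.0, 0.0, -3.0]])

# Diagonal-pivoting factorization A = L @ D @ L.T, where D is block
# diagonal with 1x1 and 2x2 pivot blocks and L[perm] is unit
# lower triangular.
L, D, perm = ldl(A, lower=True)
assert np.allclose(L @ D @ L.T, A)
```

The 2x2 pivot blocks in D are what let the method stay symmetric and stable on indefinite matrices; the paper's modifications reduce how often the associated symmetric row/column interchanges occur in the distributed setting.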
Implementing a Blocked Aasen’s Algorithm with a Dynamic Scheduler on Multicore Architectures
"... Abstract—Factorization of a dense symmetric indefinite matrix is a key computational kernel in many scientific and engineering simulations. However, there is no scalable factorization algorithm that takes advantage of the symmetry and guarantees numerical stability through pivoting at the same time. ..."
Abstract

Cited by 4 (3 self)
Abstract—Factorization of a dense symmetric indefinite matrix is a key computational kernel in many scientific and engineering simulations. However, there is no scalable factorization algorithm that takes advantage of the symmetry and guarantees numerical stability through pivoting at the same time. This is because such an algorithm exhibits many of the fundamental challenges in parallel programming, like irregular data accesses and irregular task dependencies. In this paper, we address these challenges in a tiled implementation of a blocked Aasen’s algorithm using a dynamic scheduler. To fully exploit the limited parallelism in this left-looking algorithm, we study several performance-enhancing techniques; e.g., parallel reduction to update a panel, tall-skinny LU factorization algorithms to factorize the panel, and a parallel implementation of symmetric pivoting. Our performance results on up to 48 AMD Opteron processors demonstrate that our implementation obtains speedups of up to 2.8 over MKL, while losing only one or two digits in the computed residual norms.
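The "parallel reduction to update a panel" idea, accumulating independent tile contributions concurrently and then combining them, can be sketched with a thread pool; the function names and the thread-pool realization are illustrative assumptions, not the authors' tiled runtime:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def panel_update_parallel(panel, contribs, nthreads=4):
    """Parallel reduction of a left-looking panel update.

    Each element of `contribs` is a (W, Y) pair whose product W @ Y
    must be subtracted from the panel. The partial products are
    independent, so they can be computed concurrently and summed
    afterwards, which exposes parallelism the plain left-looking
    loop lacks. All names here are hypothetical.
    """
    with ThreadPoolExecutor(max_workers=nthreads) as ex:
        partials = list(ex.map(lambda wy: wy[0] @ wy[1], contribs))
    return panel - sum(partials)
```

In the paper's setting a dynamic scheduler dispatches these tile products as tasks and the reduction tree determines how their results are merged.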