Results 1–10 of 15
A Scalable Linear Algebra Library for Distributed Memory Concurrent Computers
, 1992
Abstract

Cited by 159 (31 self)
This paper describes ScaLAPACK, a distributed memory version of the LAPACK software package for dense and banded matrix computations. Key design features are the use of distributed versions of the Level 3 BLAS as building blocks, and an object-based interface to the library routines. The square block scattered decomposition is described. The implementation of a distributed memory version of the right-looking LU factorization algorithm on the Intel Delta multicomputer is discussed, and performance results are presented that demonstrate the scalability of the algorithm.
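The right-looking LU factorization named in the abstract can be sketched in its unblocked sequential form; the actual ScaLAPACK routine adds partial pivoting and block-partitioned, distributed trailing-matrix updates, so this is only the data-flow skeleton.

```python
# Sketch: unblocked right-looking LU factorization (no pivoting).
# After step k, column k of L is formed and a rank-1 update is
# applied to the trailing submatrix -- the "right-looking" pattern.

def lu_right_looking(a):
    """Factor square matrix a (list of lists) in place into L\\U."""
    n = len(a)
    for k in range(n):
        for i in range(k + 1, n):
            a[i][k] /= a[k][k]                 # multiplier: entry of L
            for j in range(k + 1, n):          # rank-1 trailing update
                a[i][j] -= a[i][k] * a[k][j]
    return a

a = [[4.0, 3.0], [6.0, 3.0]]
lu_right_looking(a)   # -> [[4.0, 3.0], [1.5, -1.5]], i.e. L\U packed
```

In the distributed version it is exactly this trailing-matrix update that dominates the work and is cast in terms of distributed Level 3 BLAS.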
Software libraries for linear algebra computations on high performance computers
 SIAM REVIEW
, 1995
Abstract

Cited by 67 (16 self)
This paper discusses the design of linear algebra libraries for high performance computers. Particular emphasis is placed on the development of scalable algorithms for MIMD distributed memory concurrent computers. A brief description of the EISPACK, LINPACK, and LAPACK libraries is given, followed by an outline of ScaLAPACK, which is a distributed memory version of LAPACK currently under development. The importance of block-partitioned algorithms in reducing the frequency of data movement between different levels of hierarchical memory is stressed. The use of such algorithms helps reduce the message startup costs on distributed memory concurrent computers. Other key ideas in our approach are the use of distributed versions of the Level 3 Basic Linear Algebra Subprograms (BLAS) as computational building blocks, and the use of the Basic Linear Algebra Communication Subprograms (BLACS) as communication building blocks. Together the distributed BLAS and the BLACS can be used to construct highe...
An Object Oriented Design for High Performance Linear Algebra on Distributed Memory Architectures
, 1993
Abstract

Cited by 26 (10 self)
We describe the design of ScaLAPACK++, an object-oriented C++ library for implementing linear algebra computations on distributed memory multicomputers. This package, when complete, will support distributed matrix operations for the symmetric, positive-definite, and nonsymmetric cases. In ScaLAPACK++ we have employed object-oriented design methods to enhance scalability, portability, flexibility, and ease of use. We illustrate some of these points by describing the implementation of basic algorithms and comment on trade-offs between elegance, generality, and performance.
Scalability Issues Affecting the Design of a Dense Linear Algebra Library
 JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING
, 1994
Abstract

Cited by 22 (12 self)
This paper discusses the scalability of Cholesky, LU, and QR factorization routines on MIMD distributed memory concurrent computers. These routines form part of the ScaLAPACK mathematical software library that extends the widely-used LAPACK library to run efficiently on scalable concurrent computers. To ensure good scalability and performance, the ScaLAPACK routines are based on block-partitioned algorithms that reduce the frequency of data movement between different levels of the memory hierarchy, and particularly between processors. The block cyclic data distribution, which is used in all three factorization algorithms, is described. An outline of the sequential and parallel block-partitioned algorithms is given. Approximate models of the algorithms' performance are presented to indicate which factors in the design of the algorithm have an impact upon scalability. These models are compared with timing results on a 128-node Intel iPSC/860 hypercube. It is shown that the routines are highl...
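The block cyclic (square block scattered) distribution used by these routines can be illustrated with a small sketch. The block size and process-grid shape below are illustrative choices, not the library's defaults.

```python
# Sketch: which process owns global matrix entry (i, j) under a 2D
# block-cyclic distribution.  Blocks of size nb x nb are dealt out
# cyclically over a p_rows x p_cols logical process grid, so every
# process holds blocks from all over the matrix -- the property that
# gives the factorizations their load balance.

def owner(i, j, nb=2, p_rows=2, p_cols=3):
    """Return the (process-row, process-col) owning entry (i, j)."""
    return ((i // nb) % p_rows, (j // nb) % p_cols)

print(owner(0, 0))   # block (0, 0) lands on process (0, 0)
print(owner(4, 6))   # block (2, 3) wraps around to (0, 0) again
print(owner(2, 2))   # block (1, 1) lands on process (1, 1)
```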
Parallel Tridiagonalization through Two-Step Band Reduction
 In Proceedings of the Scalable High-Performance Computing Conference
, 1994
Abstract

Cited by 22 (12 self)
We present a two-step variant of the "successive band reduction" paradigm for the tridiagonalization of symmetric matrices. Here we reduce a full matrix first to narrow-banded form and then to tridiagonal form. The first step allows easy exploitation of block orthogonal transformations. In the second step, we employ a new blocked version of a banded matrix tridiagonalization algorithm by Lang. In particular, we are able to express the update of the orthogonal transformation matrix in terms of block transformations. This expression leads to an algorithm that is almost entirely based on BLAS-3 kernels and has greatly improved data movement and communication characteristics. We also present some performance results on the Intel Touchstone DELTA and the IBM SP1. 1 Introduction: Reduction to tridiagonal form is a major step in eigenvalue computations for symmetric matrices. If the matrix is full, the conventional Householder tridiagonalization approach [8] is the method of choice. This work...
The Design of Linear Algebra Libraries for High Performance Computers
, 1993
Abstract

Cited by 16 (1 self)
This paper discusses the design of linear algebra libraries for high performance computers. Particular emphasis is placed on the development of scalable algorithms for MIMD distributed memory concurrent computers. A brief description of the EISPACK, LINPACK, and LAPACK libraries is given, followed by an outline of ScaLAPACK, which is a distributed memory version of LAPACK currently under development. The importance of block-partitioned algorithms in reducing the frequency of data movement between different levels of hierarchical memory is stressed. The use of such algorithms helps reduce the message startup costs on distributed memory concurrent computers. Other key ideas in our approach are the use of distributed versions of the Level 3 Basic Linear Algebra Subprograms (BLAS) as computational building blocks, and the use of Basic Linear Algebra Communication Subprograms (BLACS) as communication building blocks. Together the distributed BLAS and the BLACS can be used to construct ...
Two Dimensional Basic Linear Algebra Communication Subprograms
, 1991
Abstract

Cited by 15 (5 self)
In this paper, we describe extensions to a proposed set of linear algebra communication routines for communicating and manipulating data structures that are distributed among the memories of a distributed memory MIMD computer. In particular, recent experience shows that higher performance can be attained on such architectures when parallel dense matrix algorithms utilize a data distribution that views the computational nodes as a logical two-dimensional mesh. The motivation for the BLACS continues to be to increase portability, efficiency, and modularity at a high level. The audience of the BLACS are mathematical software experts and people with large-scale scientific computation to perform. A systematic effort must be made to achieve a de facto standard for the BLACS.
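The logical two-dimensional mesh view of the nodes amounts to a simple mapping between linear node ranks and grid coordinates. A row-major layout is assumed below for illustration; the actual BLACS let the caller choose the grid's ordering.

```python
# Sketch: viewing P computational nodes as a logical n_rows x n_cols
# mesh, as in the 2D BLACS.  Row-major numbering is an assumed
# convention for this example.

def rank_to_coords(rank, n_cols):
    """Map a linear node rank to (row, col) on the logical grid."""
    return divmod(rank, n_cols)

def coords_to_rank(row, col, n_cols):
    """Map logical grid coordinates back to a linear node rank."""
    return row * n_cols + col

# A 2 x 3 grid: ranks 0..5 laid out as rows (0, 1, 2) and (3, 4, 5).
print(rank_to_coords(4, 3))      # rank 4 sits at (1, 1)
print(coords_to_rank(1, 1, 3))   # and (1, 1) maps back to rank 4
```

Row- and column-scoped operations (broadcasts, sums along one grid dimension) then become communication among nodes sharing one coordinate.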
Parallel Many-Body Simulations Without All-to-All Communication
, 1993
Abstract

Cited by 11 (2 self)
Simulations of interacting particles are common in science and engineering, appearing in such diverse disciplines as astrophysics, fluid dynamics, molecular physics, and materials science. These simulations are often computationally intensive and so are natural candidates for massively parallel computing. Many-body simulations that directly compute interactions between pairs of particles, be they short-range or long-range interactions, have been parallelized in several standard ways. The simplest approaches require all-to-all communication, an expensive communication step. The fastest methods assign a group of nearby particles to a processor, which can lead to load imbalance and be difficult to implement efficiently. We present a new approach, suitable for direct simulations, that avoids all-to-all communication without requiring any geometric clustering. For some computations we find the new method to be the fastest parallel algorithm available; we demonstrate its utility...
Massively Parallel LINPACK Benchmark on the Intel Touchstone DELTA and iPSC/860 Systems
, 1991
Abstract

Cited by 6 (0 self)
We describe an effort to implement the LINPACK Benchmark on two massively parallel distributed memory MIMD computers, the Intel iPSC/860 and DELTA systems. 1 Introduction: For over a decade, the LINPACK benchmark has provided a measure of comparison between computers [6]. This benchmark reflects the performance of computers ranging from the home PC to the most powerful supercomputers when solving a dense system of linear equations, which is a component of many technical applications. The original LINPACK benchmark is reported as three numbers: the performance, in MFLOPS, of the standard LINPACK code when applied to a 100 × 100 problem; the performance when a 1000 × 1000 problem is solved using an equivalent method; and the theoretical peak performance. The MFLOPS attained is computed by dividing the number of floating point operations required, 2n³/3 + 2n², by the execution time. More recently, the benchmark has been extended for massively parallel architectures to allow...
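The MFLOPS computation described above is simple arithmetic and can be written out directly; the timing value used below is a made-up illustrative number, not a measured result.

```python
# Sketch: the LINPACK MFLOPS rating.  The benchmark charges
# 2n^3/3 + 2n^2 floating point operations for factoring and solving
# an n x n dense system Ax = b, regardless of the code actually run.

def linpack_mflops(n, seconds):
    """MFLOPS rating for an n x n solve completed in `seconds`."""
    flops = 2 * n**3 / 3 + 2 * n**2
    return flops / seconds / 1e6

# Hypothetical example for the n = 1000 case of the benchmark:
rate = linpack_mflops(1000, 12.3)
print(f"{rate:.1f} MFLOPS")
```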
Optimal Broadcasting in Mesh-Connected Architectures
, 1991
Abstract

Cited by 6 (0 self)
In this paper, we disprove the common assumption that the time for broadcasting in a mesh is at best proportional to the square root of the number of processors, at least in the presence of wormhole routing. We present an optimal algorithm for broadcasting in mesh-connected distributed-memory architectures with wormhole routing. By organizing the processing nodes in a logical spanning tree, the algorithm executes in time proportional to the logarithm of the number of nodes without inducing contention in the communication network. We restrict the number of nodes in each dimension of the processor mesh to be a power of two. Our method provides insight into how to avoid and/or reduce network contention on meshes for other communication operations. Experimental results on the Intel Touchstone Delta system are included. Keywords: distributed-memory, mesh-connected, broadcast, parallel processing, wormhole routing. 1 Introduction: We investigate broadcast algorithms for meshco...
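The logarithmic-time spanning-tree idea can be sketched as a binomial-tree broadcast schedule: in each round every node that already holds the message forwards it to one new node, so the set of informed nodes doubles. This sketch uses purely logical node numbering and ignores the paper's contention-free mapping onto physical mesh links, which is the hard part of the actual algorithm.

```python
# Sketch: a binomial (spanning-tree) broadcast over P = 2^d nodes,
# completing in d = log2(P) rounds.  Node 0 is the root; in round r,
# node `src` forwards to node `src + 2^r`.

import math

def broadcast_schedule(p):
    """Return, per round, the list of (sender, receiver) pairs."""
    rounds = []
    for r in range(int(math.log2(p))):
        step = 1 << r
        rounds.append([(src, src + step) for src in range(step)])
    return rounds

# With 8 nodes the root reaches everyone in 3 rounds:
#   round 0: 0->1
#   round 1: 0->2, 1->3
#   round 2: 0->4, 1->5, 2->6, 3->7
sched = broadcast_schedule(8)
```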