A Scalable Linear Algebra Library for Distributed Memory Concurrent Computers
1992
Cited by 151 (33 self)
This paper describes ScaLAPACK, a distributed memory version of the LAPACK software package for dense and banded matrix computations. Key design features are the use of distributed versions of the Level 3 BLAS as building blocks, and an object-based interface to the library routines. The square block scattered decomposition is described. The implementation of a distributed memory version of the right-looking LU factorization algorithm on the Intel Delta multicomputer is discussed, and performance results are presented that demonstrate the scalability of the algorithm.
Software libraries for linear algebra computations on high performance computers
SIAM REVIEW, 1995
Cited by 66 (17 self)
This paper discusses the design of linear algebra libraries for high performance computers. Particular emphasis is placed on the development of scalable algorithms for MIMD distributed memory concurrent computers. A brief description of the EISPACK, LINPACK, and LAPACK libraries is given, followed by an outline of ScaLAPACK, which is a distributed memory version of LAPACK currently under development. The importance of block-partitioned algorithms in reducing the frequency of data movement between different levels of hierarchical memory is stressed. The use of such algorithms helps reduce the message startup costs on distributed memory concurrent computers. Other key ideas in our approach are the use of distributed versions of the Level 3 Basic Linear Algebra Subprograms (BLAS) as computational building blocks, and the use of Basic Linear Algebra Communication Subprograms (BLACS) as communication building blocks. Together the distributed BLAS and the BLACS can be used to construct highe...
An Object Oriented Design for High Performance Linear Algebra on Distributed Memory Architectures
1993
Cited by 26 (10 self)
We describe the design of ScaLAPACK++, an object oriented C++ library for implementing linear algebra computations on distributed memory multicomputers. This package, when complete, will support distributed matrix operations for symmetric, positive-definite, and non-symmetric cases. In ScaLAPACK++ we have employed object oriented design methods to enhance scalability, portability, flexibility, and ease-of-use. We illustrate some of these points by describing the implementation of basic algorithms and comment on tradeoffs between elegance, generality, and performance.
Scalability Issues Affecting the Design of a Dense Linear Algebra Library
JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, 1994
Cited by 23 (12 self)
This paper discusses the scalability of Cholesky, LU, and QR factorization routines on MIMD distributed memory concurrent computers. These routines form part of the ScaLAPACK mathematical software library that extends the widely-used LAPACK library to run efficiently on scalable concurrent computers. To ensure good scalability and performance, the ScaLAPACK routines are based on block-partitioned algorithms that reduce the frequency of data movement between different levels of the memory hierarchy, and particularly between processors. The block cyclic data distribution, which is used in all three factorization algorithms, is described. An outline of the sequential and parallel block-partitioned algorithms is given. Approximate models of the algorithms' performance are presented to indicate which factors in the design of the algorithm have an impact upon scalability. These models are compared with timing results on a 128-node Intel iPSC/860 hypercube. It is shown that the routines are highl...
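The abstract above names the block cyclic data distribution without giving the mapping. As a minimal sketch, assuming the usual textbook definition (blocks of a fixed size dealt out round-robin to processes, applied independently to rows and columns), the index arithmetic could look like the following. The function names and the 0-based indexing are illustrative assumptions, not taken from the paper or from ScaLAPACK itself.

```python
def block_cyclic_owner(g, nb, nprocs):
    """Map a 0-based global index g to (owning process, local index).

    Assumes a 1D block cyclic layout: blocks of nb consecutive indices
    are assigned to processes 0, 1, ..., nprocs-1 in round-robin order.
    """
    block = g // nb                 # global block that contains g
    proc = block % nprocs           # blocks are dealt out cyclically
    local_block = block // nprocs   # earlier blocks held by this process
    return proc, local_block * nb + g % nb


def owner_2d(i, j, mb, nb, prows, pcols):
    """2D layout (hypothetical helper): apply the 1D map to rows and columns.

    Returns ((process row, process column), (local row, local column)).
    """
    p, li = block_cyclic_owner(i, mb, prows)
    q, lj = block_cyclic_owner(j, nb, pcols)
    return (p, q), (li, lj)
```

For example, with 2 x 2 blocks on a 2 x 3 process grid, `owner_2d(5, 6, 2, 2, 2, 3)` places global entry (5, 6) on process (0, 0) at local position (3, 2).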
Parallel Tridiagonalization through Two-Step Band Reduction
In Proceedings of the Scalable High-Performance Computing Conference, 1994
Cited by 22 (12 self)
We present a two-step variant of the "successive band reduction" paradigm for the tridiagonalization of symmetric matrices. Here we reduce a full matrix first to narrow-banded form and then to tridiagonal form. The first step allows easy exploitation of block orthogonal transformations. In the second step, we employ a new blocked version of a banded matrix tridiagonalization algorithm by Lang. In particular, we are able to express the update of the orthogonal transformation matrix in terms of block transformations. This expression leads to an algorithm that is almost entirely based on BLAS-3 kernels and has greatly improved data movement and communication characteristics. We also present some performance results on the Intel Touchstone DELTA and the IBM SP1.
1 Introduction
Reduction to tridiagonal form is a major step in eigenvalue computations for symmetric matrices. If the matrix is full, the conventional Householder tridiagonalization approach ... thereof [8] is the method of ... This work...
Two Dimensional Basic Linear Algebra Communication Subprograms
1991
Cited by 14 (5 self)
In this paper, we describe extensions to a proposed set of linear algebra communication routines for communicating and manipulating data structures that are distributed among the memories of a distributed memory MIMD computer. In particular, recent experience shows that higher performance can be attained on such architectures when parallel dense matrix algorithms utilize a data distribution that views the computational nodes as a logical two dimensional mesh. The motivation for the BLACS continues to be to increase portability, efficiency and modularity at a high level. The audience for the BLACS is mathematical software experts and people with large scale scientific computations to perform. A systematic effort must be made to achieve a de facto standard for the BLACS.
The Design of Linear Algebra Libraries for High Performance Computers
1993
Cited by 13 (1 self)
This paper discusses the design of linear algebra libraries for high performance computers. Particular emphasis is placed on the development of scalable algorithms for MIMD distributed memory concurrent computers. A brief description of the EISPACK, LINPACK, and LAPACK libraries is given, followed by an outline of ScaLAPACK, which is a distributed memory version of LAPACK currently under development. The importance of block-partitioned algorithms in reducing the frequency of data movement between different levels of hierarchical memory is stressed. The use of such algorithms helps reduce the message startup costs on distributed memory concurrent computers. Other key ideas in our approach are the use of distributed versions of the Level 3 Basic Linear Algebra Subprograms (BLAS) as computational building blocks, and the use of Basic Linear Algebra Communication Subprograms (BLACS) as communication building blocks. Together the distributed BLAS and the BLACS can be used to construct ...
Parallel Many-Body Simulations Without All-to-All Communication
1993
Cited by 10 (2 self)
Simulations of interacting particles are common in science and engineering, appearing in such diverse disciplines as astrophysics, fluid dynamics, molecular physics, and materials science. These simulations are often computationally intensive and so natural candidates for massively parallel computing. Many-body simulations that directly compute interactions between pairs of particles, be they short-range or long-range interactions, have been parallelized in several standard ways. The simplest approaches require all-to-all communication, an expensive communication step. The fastest methods assign a group of nearby particles to a processor, which can lead to load imbalance and be difficult to implement efficiently. We present a new approach, suitable for direct simulations, that avoids all-to-all communication without requiring any geometric clustering. For some computations we find the new method to be the fastest parallel algorithm available; we demonstrate its utility...
Massively Parallel LINPACK Benchmark on the Intel Touchstone DELTA and iPSC/860 Systems
1991
Cited by 6 (0 self)
We describe an effort to implement the LINPACK Benchmark on two massively parallel distributed memory MIMD computers, the Intel iPSC/860 and DELTA Systems.
1 Introduction
For over a decade, the LINPACK benchmark has provided a measure of comparison between computers [6]. This benchmark reflects the performance of computers ranging from the home-used PC to the most powerful supercomputers when solving a dense system of linear equations, which is a component of many technical applications. The original LINPACK benchmark is reported as three numbers: the performance, in MFLOPS, of the standard LINPACK code when applied to a 100 × 100 problem; the performance when a 1000 × 1000 problem is solved using an equivalent method; and the theoretical peak performance. The MFLOPS attained is computed by dividing the number of floating point operations required, 2n^3/3 + 2n^2, by the execution time. More recently, the benchmark has been extended for massively parallel architectures to allow...
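The MFLOPS rule stated in this abstract (operation count 2n^3/3 + 2n^2 divided by execution time) is simple arithmetic, so a short sketch may help. The function names below are hypothetical, and the timing passed in is whatever wall-clock measurement the user supplies; nothing here comes from the benchmark's actual source code.

```python
def linpack_flops(n):
    """Nominal operation count for solving an n x n dense system,
    using the 2n^3/3 + 2n^2 formula quoted in the abstract."""
    return 2.0 * n**3 / 3.0 + 2.0 * n**2


def linpack_mflops(n, seconds):
    """Rate in MFLOPS: operation count over elapsed time, scaled by 1e6."""
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return linpack_flops(n) / seconds / 1.0e6
```

For instance, `linpack_mflops(1000, 2.5)` gives the rate for a 1000 × 1000 solve that took a hypothetical 2.5 seconds.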
Optimal Broadcasting in Mesh-Connected Architectures
1991
Cited by 6 (0 self)
In this paper, we disprove the common assumption that the time for broadcasting in a mesh is at best proportional to the square root of the number of processors, at least in the presence of worm-hole routing. We present an optimal algorithm for broadcasting in mesh-connected distributed-memory architectures with worm-hole routing. By organizing the processing nodes in a logical spanning tree, the algorithm executes in time proportional to the logarithm of the number of nodes without inducing contention in the communication network. We restrict the number of nodes in each dimension of the processor mesh to be a power of two. Our method provides insight into how to avoid and/or reduce network contention on meshes for other communication operations. Experimental results on the Intel Touchstone Delta system are included.
Keywords: distributed-memory, mesh-connected, broadcast, parallel processing, worm-hole routing
1 Introduction
We investigate broadcast algorithms for mesh-co...
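To illustrate why a logical spanning tree gives a broadcast time proportional to the logarithm of the node count, here is a generic binomial-tree schedule in which the set of nodes holding the data doubles each round. This is only a sketch of the general idea: it is not the paper's algorithm, and it ignores the mesh placement and worm-hole routing details the paper relies on to avoid contention. The function name and the (source, destination) pair representation are assumptions made for the example.

```python
def binomial_broadcast_schedule(nprocs, root=0):
    """Return a list of rounds; each round is a list of (src, dst) sends.

    In every round, each node that already holds the data forwards it
    to one node that does not, so the informed set doubles and the
    schedule finishes in ceil(log2(nprocs)) rounds.
    """
    rounds = []
    have = 1  # informed nodes, counted in numbering relative to root
    while have < nprocs:
        sends = []
        for rel in range(have):
            dst = rel + have
            if dst < nprocs:
                # Translate relative ranks back to absolute node ids.
                sends.append(((rel + root) % nprocs, (dst + root) % nprocs))
        rounds.append(sends)
        have *= 2
    return rounds
```

Calling `binomial_broadcast_schedule(8)` yields three rounds, with the last one sending from nodes 0 through 3 to nodes 4 through 7.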

