Results 1–10 of 13
A Survey of Collective Communication in Wormhole-Routed Massively Parallel Computers
 IEEE Computer
, 1994
Abstract

Cited by 97 (6 self)
Massively parallel computers (MPC) are characterized by the distribution of memory among an ensemble of nodes. Since memory is physically distributed, MPC nodes communicate by sending data through a network. In order to program an MPC, the user may directly invoke low-level message passing primitives, may use a higher-level communications library, or may write the program in a data parallel language and rely on the compiler to translate language constructs into communication operations. Whichever method is used, the performance of communication operations directly affects the total computation time of the parallel application. Communication operations may be either point-to-point, which involves a single source and a single destination, or collective, in which more than two processes participate. This paper discusses the design of collective communication operations for current systems that use the wormhole routing switching strategy, in which messages are divided into small pieces and...
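The point-to-point/collective distinction the abstract draws can be illustrated with a small sketch: a recursive-doubling broadcast, one common way a collective operation is built out of point-to-point sends. The function and its structure are illustrative, not taken from the survey:

```python
def broadcast_rounds(p, root=0):
    """Simulate a recursive-doubling broadcast among p nodes.

    Returns, per round, the list of (sender, receiver) point-to-point
    messages; every node that already holds the data forwards it to a
    partner at distance `step` (a hypercube edge), so the collective
    completes in ceil(log2(p)) rounds.
    """
    have = {root}          # nodes that already hold the data
    rounds = []
    step = 1
    while len(have) < p:
        msgs = []
        for s in sorted(have):
            r = s ^ step   # partner at XOR-distance `step`
            if r < p and r not in have:
                msgs.append((s, r))
        for _, r in msgs:
            have.add(r)
        rounds.append(msgs)
        step *= 2
    return rounds
```

For p nodes the broadcast finishes in ⌈log2 p⌉ rounds, as opposed to the p−1 sequential sends of a naive loop in which the root messages every other node itself, which is why collectives are worth designing carefully on top of the routing layer.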
The Design and Implementation of the Parallel Out-of-Core ScaLAPACK LU, QR, and Cholesky Factorization Routines. LAPACK Working Note 118, CS-97-247
, 1997
Abstract

Cited by 28 (5 self)
This paper describes the design and implementation of three core factorization routines — LU, QR and Cholesky — included in the out-of-core extension of ScaLAPACK. These routines allow the factorization and solution of a dense system that is too large to fit entirely in physical memory. The full matrix is stored on disk and the factorization routines transfer submatrix panels into memory. The ‘left-looking’ column-oriented variant of the factorization algorithm is implemented to reduce the disk I/O traffic. The routines are implemented using a portable I/O interface and utilize high-performance ScaLAPACK factorization routines as in-core computational kernels. We present the details of the implementation for the out-of-core ScaLAPACK factorization routines, as well as performance and scalability results on a Beowulf Linux cluster.
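The left-looking organization the abstract refers to can be sketched in a few lines. This toy version factors an in-memory matrix column by column without pivoting (the real routines work on disk-resident panels and do pivot), so it only illustrates the update order that makes the out-of-core variant I/O-friendly:

```python
def lu_left_looking(A):
    """In-place left-looking LU without pivoting (illustrative only).

    Column j is written only during step j; earlier columns are
    read-only. That access pattern is what lets an out-of-core
    variant keep a single panel in memory and stream the already
    factored panels in from disk to apply their updates.
    """
    n = len(A)
    for j in range(n):
        # apply all pending updates from columns 0..j-1 to column j
        for k in range(j):
            for i in range(k + 1, n):
                A[i][j] -= A[i][k] * A[k][j]
        # scale the subdiagonal entries to form column j of L
        for i in range(j + 1, n):
            A[i][j] /= A[j][j]
    return A
```

After the call, A holds U on and above the diagonal and the strictly lower part of L below it, the usual packed LU layout.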
A User's Guide to the BLACS v1.0
, 1995
Abstract

Cited by 23 (5 self)
The BLACS (Basic Linear Algebra Communication Subprograms) project is an ongoing investigation whose purpose is to create a linear algebra oriented message passing interface that is implemented efficiently and uniformly across a large range of distributed memory platforms. The length of time required to implement efficient distributed memory algorithms makes it impractical to rewrite programs for every new parallel machine. The BLACS exist in order to make linear algebra applications both easier to program and more portable. It is for this reason that the BLACS are used as the communication layer for the ScaLAPACK project, which involves implementing the LAPACK library on distributed memory MIMD machines. This report describes the library which has arisen from this project. This work was supported in part by DARPA and ARO under contract number DAAL03-91-C-0047, and in part by the National Science Foundation Science and Technology Center Cooperative Agreement No. CCR-8809615.
LOCCS: Low Overhead Communication and Computation Subroutines
, 1993
Abstract

Cited by 12 (9 self)
Our aim is to provide one set of efficient basic subroutines for scientific computing which include both communications and computations. The overlap of communications and computations is done using asynchronous pipelining to minimize the overhead due to communications. With this set of routines, we provide to the user of parallel machines an easy and efficient SPMD-style way of programming. The main purpose of these routines is to be used in linear algebra applications, but also in other fields like image processing or neural networks. This work was partially supported by ARCHIPEL S.A. under contract 820542, by the CNRS and the DRET.
1 Introduction
Libraries of routines have proven to be the only way for efficient and secure programming. In scientific parallel computing, the most commonly used libraries are the BLAS, BLACS, PICL and the ones provided by vendors. These building blocks allow the portability of codes and an efficient implementation on different machines. The devel...
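The benefit of the pipelining the abstract invokes can be made concrete with a simple step-count model. The function below is an illustration of ours, not from LOCCS: it counts chunk-transfer steps for pushing a message of k chunks down a chain of p nodes, with and without pipelined forwarding:

```python
def chain_steps(p, k, pipelined):
    """Chunk-transfer steps to push a k-chunk message down a p-node chain.

    Without pipelining, each hop forwards the whole message before the
    next hop starts; with pipelining, a node forwards chunk c while it
    is receiving chunk c+1, so the hops overlap in time.
    """
    hops = p - 1
    if pipelined:
        return hops + k - 1   # fill the pipe, then one chunk per step
    return hops * k           # store-and-forward, hop after hop
```

For large k the pipelined cost approaches k steps instead of k·(p−1), i.e. the per-hop overhead is almost entirely hidden, which is the kind of gain asynchronous pipelining of communication and computation is after.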
A User's Guide to the BLACS v1.1
, 1997
Abstract

Cited by 12 (5 self)
The BLACS (Basic Linear Algebra Communication Subprograms) project is an ongoing investigation whose purpose is to create a linear algebra oriented message passing interface that is implemented efficiently and uniformly across a large range of distributed memory platforms. The length of time required to implement efficient distributed memory algorithms makes it impractical to rewrite programs for every new parallel machine. The BLACS exist in order to make linear algebra applications both easier to program and more portable. It is for this reason that the BLACS are used as the communication layer for the ScaLAPACK project, which involves implementing the LAPACK library on distributed memory MIMD machines. This report describes the library which has arisen from this project. This work was supported in part by DARPA and ARO under contract number DAAL03-91-C-0047, and in part by the National Science Foundation Science and Technology Center Cooperative Agreement No. CCR-8809615...
Whaley, LAPACK Working Note 94: A User's Guide to the BLACS v1.0
, 1995
Abstract

Cited by 11 (1 self)
The BLACS (Basic Linear Algebra Communication Subprograms) project is an ongoing investigation whose purpose is to create a linear algebra oriented message passing interface that is implemented efficiently and uniformly across a large range of distributed memory platforms. The length of time required to implement efficient distributed memory algorithms makes it impractical to rewrite programs for every new parallel machine. The BLACS exist in order to make linear algebra applications both easier to program and more portable. It is for this reason that the BLACS are used as the communication layer for the ScaLAPACK project, which involves implementing the LAPACK library on distributed memory MIMD machines.
Efficient Block Cyclic Data Redistribution
 In Euro-Par'96
, 1996
Abstract

Cited by 4 (0 self)
Implementing linear algebra kernels on distributed memory parallel computers raises the problem of data distribution of matrices and vectors among the processors. Block-cyclic distribution suits most algorithms well, but one has to choose a good compromise for the size of the blocks (to achieve good computation and communication efficiency and good load balancing). This choice heavily depends on each operation, so it is essential to be able to go from one distribution to another very quickly. We present here the algorithms we implemented in the ScaLAPACK library. A complexity study is made that proves the efficiency of our solution. Timing results on the Intel Paragon and the Cray T3D corroborate the results. We show the gain that can be obtained using the right data distribution with 3 numerical kernels and our redistribution routines. Keywords: parallel computing, parallel linear algebra, personalized all-to-all communication, data redistribution, HPF, block-cycli...
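The block-cyclic layout these redistribution papers rely on maps a global index to a (process, local index) pair; a redistribution routine is essentially this map evaluated for two different block sizes, followed by the personalized all-to-all exchange of the intersections. A minimal sketch of the standard one-dimensional mapping (parameter names are ours):

```python
def block_cyclic(g, nb, p):
    """Map global index g to (process, local index) under a
    block-cyclic distribution with block size nb over p processes."""
    block = g // nb                       # which block g falls in
    proc = block % p                      # blocks are dealt out cyclically
    local = (block // p) * nb + g % nb    # position inside that process
    return proc, local
```

With nb = 2 and p = 3, indices 0–1 land on process 0, 2–3 on process 1, 4–5 on process 2, and index 6 wraps back to process 0 at local position 2. Going from block size nb1 to nb2 means computing source and destination pairs for every index and exchanging the overlaps.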
Block cyclic array redistribution
 Journal of Parallel and Distributed Computing
, 1997
Abstract

Cited by 1 (1 self)
Implementing linear algebra kernels on distributed memory parallel computers raises the problem of data distribution of matrices and vectors among the processors. Block-cyclic distribution suits most algorithms well, but one has to choose a good compromise for the size of the blocks (to achieve good efficiency and good load balancing). This choice heavily depends on each operation, so it is essential to be able to go from one distribution to another very quickly. We present here the algorithms we implemented in the ScaLAPACK library. A complexity study is then made that proves the efficiency of our solution. Timing results on a network of SUN workstations and the Cray T3D using PVM corroborate the results.
Experiences of Parallelising Finite-Element Problems in a Functional Style
, 1995
Abstract

Cited by 1 (0 self)
In this paper we demonstrate: (a) the relative simplicity of the functional approach for parallelizing a complex program compared with the conventional procedural approach; (b) the suitability of functional languages for prototyping parallel algorithms to improve an implementation; and (c) the considerable assistance provided by the simulator