Results 1–10 of 16
A Survey of Collective Communication in Wormhole-Routed Massively Parallel Computers
 IEEE COMPUTER
, 1994
Abstract

Cited by 98 (6 self)
Massively parallel computers (MPC) are characterized by the distribution of memory among an ensemble of nodes. Since memory is physically distributed, MPC nodes communicate by sending data through a network. In order to program an MPC, the user may directly invoke low-level message-passing primitives, may use a higher-level communications library, or may write the program in a data-parallel language and rely on the compiler to translate language constructs into communication operations. Whichever method is used, the performance of communication operations directly affects the total computation time of the parallel application. Communication operations may be either point-to-point, which involves a single source and a single destination, or collective, in which more than two processes participate. This paper discusses the design of collective communication operations for current systems that use the wormhole routing switching strategy, in which messages are divided into small pieces and...
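To make the point-to-point/collective distinction concrete, here is a small Python sketch (an illustration, not code from the survey) of the recursive-doubling schedule commonly used to build a broadcast collective out of point-to-point sends; the node and round numbering are assumptions chosen for the example.

```python
def broadcast_schedule(p):
    """Recursive-doubling broadcast schedule for p nodes.

    Returns a list of rounds; each round is a list of (sender, receiver)
    point-to-point transfers. After round k, nodes 0 .. 2**(k+1)-1 hold
    the message, so a power-of-two broadcast finishes in log2(p) rounds.
    """
    rounds = []
    have = 1  # nodes 0 .. have-1 already hold the message
    while have < p:
        rounds.append([(src, src + have)
                       for src in range(min(have, p - have))])
        have *= 2
    return rounds

# 8 nodes: 3 rounds, doubling the informed set each time
for k, r in enumerate(broadcast_schedule(8)):
    print(k, r)
# 0 [(0, 1)]
# 1 [(0, 2), (1, 3)]
# 2 [(0, 4), (1, 5), (2, 6), (3, 7)]
```

On a wormhole-routed network the attraction of such a schedule is that the number of communication steps, rather than the physical distance each message travels, dominates the cost.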
The design and implementation of the parallel out-of-core ScaLAPACK LU, QR, and Cholesky factorization routines. LAPACK Working Note 118, CS-97-247
, 1997
Abstract

Cited by 28 (5 self)
This paper describes the design and implementation of three core factorization routines — LU, QR and Cholesky — included in the out-of-core extension of ScaLAPACK. These routines allow the factorization and solution of a dense system that is too large to fit entirely in physical memory. The full matrix is stored on disk and the factorization routines transfer submatrix panels into memory. The ‘left-looking’, column-oriented variant of the factorization algorithm is implemented to reduce the disk I/O traffic. The routines are implemented using a portable I/O interface and utilize high-performance ScaLAPACK factorization routines as in-core computational kernels. We present the details of the implementation of the out-of-core ScaLAPACK factorization routines, as well as performance and scalability results on a Beowulf Linux cluster.
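The I/O advantage of the left-looking variant can be illustrated with a toy panel-traffic count (a sketch under simplifying assumptions, not the paper's actual cost model): left-looking reads all previously factored panels but writes each panel only once, whereas a right-looking variant rewrites every trailing panel at each step.

```python
def panel_io(n_panels, variant):
    """Count panel reads/writes for an out-of-core blocked factorization.

    Toy model: the matrix lives on disk as n_panels column panels and only
    a couple of panels fit in memory at once; the initial read of the
    panel being factored is ignored in both variants for simplicity.
    """
    reads = writes = 0
    for k in range(n_panels):
        if variant == "left-looking":
            reads += k                   # read each factored panel j < k
            writes += 1                  # panel k is written exactly once
        elif variant == "right-looking":
            writes += 1                  # write the factored panel k
            reads += n_panels - 1 - k    # read every trailing panel j > k
            writes += n_panels - 1 - k   # ... update it and write it back
    return reads, writes

print(panel_io(10, "left-looking"))   # (45, 10)
print(panel_io(10, "right-looking"))  # (45, 55)
```

Both variants perform the same number of panel reads, but left-looking cuts the write traffic from quadratic to linear in the number of panels, which is the motivation the abstract gives for choosing it.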
LOCCS: Low Overhead Communication and Computation Subroutines
, 1993
Abstract

Cited by 12 (9 self)
Our aim is to provide one set of efficient basic subroutines for scientific computing which include both communications and computations. The overlap of communications and computations is done using asynchronous pipelining to minimize the overhead due to communications. With this set of routines, we provide to the user of parallel machines an easy and efficient SPMD-style way of programming. The main purpose of these routines is to be used in linear algebra applications, but also in other fields like image processing or neural networks. This work was partially supported by ARCHIPEL S.A. under contract 820542, by the CNRS and the DRET.

1 Introduction

Libraries of routines have proven to be the only way to achieve efficient and secure programming. In scientific parallel computing, the most commonly used libraries are the BLAS, BLACS, PICL and those provided by vendors. These building blocks allow the portability of codes and an efficient implementation on different machines. The devel...
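The overlap idea can be sketched in a few lines of Python (a toy stand-in, assuming a bounded queue plays the role of asynchronous message arrival; LOCCS itself targets message-passing hardware, not threads):

```python
import threading
import queue

def pipelined_sum(chunks):
    """Overlap 'communication' (fetching the next chunk) with computation.

    Double-buffering sketch: a background thread stands in for the
    asynchronous receive, pushing chunks into a two-slot queue while the
    main thread computes on the chunk delivered in the previous step.
    """
    q = queue.Queue(maxsize=2)  # double buffer: at most 2 chunks in flight

    def receiver():
        for c in chunks:
            q.put(c)            # models an asynchronous message arrival
        q.put(None)             # end-of-stream marker

    threading.Thread(target=receiver, daemon=True).start()
    total = 0
    while (c := q.get()) is not None:
        total += sum(c)         # computation overlaps further arrivals
    return total

print(pipelined_sum([[1, 2], [3, 4], [5, 6]]))  # 21
```

The bounded queue is what makes this a pipeline rather than a bulk transfer: the receiver can run at most two chunks ahead, so communication and computation proceed in lockstep with minimal buffering.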
Efficient Block Cyclic Data Redistribution
 In Euro-Par'96
, 1996
Abstract

Cited by 5 (0 self)
Implementing linear algebra kernels on distributed-memory parallel computers raises the problem of data distribution of matrices and vectors among the processors. Block-cyclic distribution seems well suited to most algorithms, but one has to choose a good compromise for the size of the blocks (to achieve good computation and communication efficiency and good load balancing). This choice depends heavily on each operation, so it is essential to be able to go from one distribution to another very quickly. We present here the algorithms we implemented in the ScaLAPACK library. A complexity study is made that proves the efficiency of our solution. Timing results on the Intel Paragon and the Cray T3D corroborate the analysis. We show the gain that can be obtained by using a good data distribution with 3 numerical kernels and our redistribution routines. Keywords: parallel computing, parallel linear algebra, personalized all-to-all communication, data redistribution, HPF, block-cycli...
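For reference, the owner computation behind a 1-D block-cyclic distribution can be sketched as follows (a generic illustration, not ScaLAPACK's actual routine; the function name and parameters are assumptions):

```python
def block_cyclic_owner(i, block, p):
    """Map global index i to (process, local index) under a 1-D
    block-cyclic distribution with block size `block` over p processes."""
    b = i // block                 # global block number
    proc = b % p                   # blocks are dealt out cyclically
    local_block = b // p           # position of that block on its owner
    return proc, local_block * block + i % block

# 10 elements, block size 2, 3 processes: blocks 0..4 go to procs 0,1,2,0,1
print([block_cyclic_owner(i, 2, 3)[0] for i in range(10)])
# [0, 0, 1, 1, 2, 2, 0, 0, 1, 1]
```

Redistributing from one block size to another amounts to evaluating this mapping twice per element, once for the source layout and once for the target, and grouping elements into the personalized all-to-all messages the abstract mentions.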
Array Redistribution in ScaLAPACK using PVM
 EUROPVM'95: SECOND EUROPEAN PVM USERS' GROUP MEETING
, 1995
Abstract

Cited by 2 (0 self)
Linear algebra on distributed-memory parallel computers raises the problem of data distribution of matrices and vectors among the processes. Block-cyclic distribution works well for most algorithms. The block size must be chosen carefully, however, in order to achieve good efficiency and good load balancing. This choice depends heavily on each operation; hence, it is essential to be able to go from one distribution to another very quickly. We present here the algorithms implemented in the ScaLAPACK library, and we discuss timing results on a network of workstations and on a Cray T3D using PVM.
Experiences of Parallelising Finite-element Problems in a Functional Style
, 1995
Abstract

Cited by 2 (0 self)
In this paper we demonstrate: (a) the relative simplicity of the functional approach for parallelizing a complex program compared with the conventional procedural approach; (b) the suitability of functional languages for prototyping parallel algorithms to improve an implementation; and (c) the considerable assistance provided by the simulator