Results 1 - 10 of 12
Optimal Grain Size Computation for Pipelined Algorithms
 In Euro-Par'96 Parallel Processing
, 1996
Abstract

Cited by 22 (4 self)
In this paper, we present a method for overlapping communications on parallel computers for pipelined algorithms. We first introduce a general theoretical model which leads to a generic computation scheme for the optimal packet size. Then we use the OPIUM library, which provides an easy-to-use and efficient way to compute this optimal packet size in the general case, on the column LU factorization; the implementation and performance measurements are made on an Intel Paragon.
Keywords: communications overlap, pipelined algorithms, optimal packet size computation.
1 Introduction
Parallel distributed-memory machines improve performance and memory capacity, but their use adds an overhead due to communications. To obtain programs that perform and scale well, this overhead must be hidden. Several solutions exist. Choosing a good data distribution is the first step that can be taken to lower the number and size of communications. Depending on the dependences within the cod...
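The abstract does not spell out the model, but the optimal-packet-size idea can be illustrated with a standard linear cost sketch (names and formula are illustrative, not the OPIUM library's actual API): if each packet of size b costs beta + b*tau to transmit and n items traverse p pipeline stages, the total time is roughly T(b) = (ceil(n/b) + p - 1)*(beta + b*tau), minimized near b* = sqrt(n*beta/((p - 1)*tau)).

```python
import math

def optimal_packet_size(n, p, beta, tau):
    """Packet size minimizing pipelined transfer time under a linear
    cost model (illustrative sketch, not the OPIUM library's formula):
    each packet of size b costs beta + b*tau, and n items traverse
    p pipeline stages, so T(b) = (ceil(n/b) + p - 1) * (beta + b*tau)."""
    if p < 2:
        return n  # a single stage cannot be pipelined: send everything at once
    return math.sqrt(n * beta / ((p - 1) * tau))

def pipelined_time(n, p, beta, tau, b):
    """Total time to push n items through p stages in packets of size b."""
    return (math.ceil(n / b) + p - 1) * (beta + b * tau)
```

Small packets are startup-dominated, one huge packet gives no overlap; the square-root optimum balances the two.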
The Nestor library: a tool for implementing Fortran source to source transformations
, 1998
Abstract

Cited by 10 (2 self)
We describe Nestor, a library for easily manipulating Fortran programs through a high-level internal representation based on C++ classes. Nestor is a research tool that can be used to quickly implement source-to-source transformations. The input of the library is Fortran 77, Fortran 90, or HPF 2.0. Its current output supports the same languages plus some dialects such as Petit, OpenMP, and CrayMP. Compared to SUIF 2.0, which is still only announced, Nestor is less ambitious, but it is lightweight, ready to use (http://www.ens-lyon.fr/~gsilber/nestor), fully documented, and better suited for Fortran-to-Fortran transformations.
Implementing Pipelined Computation and Communication in an HPF Compiler
 In Euro-Par'96 Parallel Processing
, 1996
Abstract

Cited by 6 (3 self)
Many scientific applications can benefit from pipelining computation and communication. Our aim is to provide compiler and runtime support for High Performance Fortran applications that could benefit from these techniques. This paper describes the integration of a library for pipelined computations into the runtime system. Results on some application kernels are given.
1 Introduction
With the introduction of High Performance Fortran (HPF) [KLS+94], it is possible to use the data-parallel programming paradigm in a very convenient way for scientific applications. With current compilation technology, these programs execute phases of computations and communications on different sets of data, and no overlap exists between communications and computations. Moreover, communication phases are synchronous, i.e., each processor executes these phases at the same time and waits until the last processor completes its communication phase. An important task of the HPF compiler is to detect th...
HPFIT: A Set of Integrated Tools for the Parallelization of Applications Using High Performance Fortran
, 1996
Abstract

Cited by 6 (5 self)
In this report, we present the HPFIT project, whose aim is to provide a set of interactive tools, integrated in a single environment, to help users parallelize scientific applications to be run on distributed-memory parallel computers. HPFIT is built around a restructuring tool called TransTOOL, which includes an editor, a parser, a dependence analysis tool, and an optimization kernel. Moreover, we provide users with a clean interface, so that developers of tools around High Performance Fortran can easily integrate their software within our tool.
Array Redistribution in ScaLAPACK using PVM
 In EuroPVM'95: Second European PVM Users' Group Meeting
, 1995
Abstract

Cited by 2 (0 self)
Linear algebra on distributed-memory parallel computers raises the problem of the data distribution of matrices and vectors among the processes. Block-cyclic distribution works well for most algorithms. The block size must be chosen carefully, however, in order to achieve good efficiency and good load balancing. This choice depends heavily on each operation; hence, it is essential to be able to go from one distribution to another very quickly. We present here the algorithms implemented in the ScaLAPACK library, and we discuss timing results on a network of workstations and on a Cray T3D using PVM.
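For a concrete picture of the block-cyclic layout this abstract refers to, the standard 1-D global-to-local index mapping can be sketched as follows (a minimal illustration in the style of ScaLAPACK's layout, not its actual API; function names are made up here):

```python
def block_cyclic_owner(i, nb, p):
    """Process owning global index i in a 1-D block-cyclic layout with
    block size nb over p processes (sketch only, not ScaLAPACK's API)."""
    return (i // nb) % p

def block_cyclic_local(i, nb, p):
    """Local index of global index i on its owning process: full cycles
    of nb*p elements contribute nb local slots each, plus the offset
    inside the current block."""
    return (i // (nb * p)) * nb + i % nb
```

With nb = 2 and p = 2, global indices 0, 1, 4, 5, 8, ... land on process 0 and 2, 3, 6, 7, ... on process 1; redistributing between two block sizes amounts to composing one such mapping with the inverse of another.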
Parallel FFT Algorithms with Reduced Communication Overhead
, 1998
Abstract

Cited by 2 (0 self)
One of the primary objectives in the programming of parallel algorithms is to reduce the effects of the overhead introduced when a given problem is parallelized. A key contributor to overhead is communication time. One way to reduce the communication overhead is to minimize the actual time for communication. Another approach is to hide communication by overlapping it with computation. In this paper a new approach is presented in order to overlap all communication-intensive steps appearing in the four-step FFT algorithm (initial data distribution, matrix transpose, and final data collection) with computation. The presented method is based on a Kronecker product factorization of the four-step FFT algorithm.
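A minimal sequential sketch of the four-step FFT may help fix ideas: assuming the usual factorization of a length n = n1*n2 transform, the data rearrangements below are exactly the communication-intensive steps a parallel version would overlap (the structure is the textbook algorithm, not the paper's code):

```python
import numpy as np

def four_step_fft(x, n1, n2):
    """Sequential sketch of the four-step FFT of length n = n1*n2.
    In a parallel setting, the matrix transposes / reshapes are the
    communication-intensive data redistributions."""
    n = n1 * n2
    w = np.exp(-2j * np.pi / n)
    # Step 1: view x as an n1 x n2 matrix A[s1, s2] = x[s1 + n1*s2]
    # (this corresponds to the initial data distribution)
    A = np.asarray(x, dtype=complex).reshape(n2, n1).T
    # Step 2: n1 independent FFTs of length n2, along the rows
    B = np.fft.fft(A, axis=1)
    # Step 3: pointwise twiddle-factor multiplication w**(s1*t2)
    s1 = np.arange(n1)[:, None]
    t2 = np.arange(n2)[None, :]
    B = B * w ** (s1 * t2)
    # Step 4: n2 independent FFTs of length n1 along the columns,
    # then read the result out row-major (the final data collection)
    return np.fft.fft(B, axis=0).reshape(-1)
```

Checking the sketch against a library FFT (e.g. `np.fft.fft`) on random input is a quick way to validate the index bookkeeping.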
Efficient Communication Operations On Passive Optical Star Networks
, 1994
Abstract

Cited by 1 (0 self)
In this paper, we show how to use the Wavelength Division Multiple Access capabilities of Passive Optical Star Networks to efficiently implement communication operations that are widely used in parallel applications. We propose algorithms for multiple broadcasting, scattering, gossiping, and multi-scattering, which are very close to the lower bound for these problems.
1 Introduction
For massively parallel architectures, the hardware complexity of the interconnection network is much higher than that of the processing units, accounting for most of the hardware involved [17]. In many optical communication networks, Passive Optical Star technology using wavelength division multiple access (WDMA) offers optical multiple-access channels that allow a significant reduction in the complexity of the specialized routers connected to the processors, along ...
(Part of this work was done during a postdoc period at the C.S. dept. of the University of Tennessee, Knoxville, TN 37996-1301, USA.)
A Parallel Performance Study of Jacobi-like Eigenvalue Solution
, 1994
Abstract

Cited by 1 (0 self)
In this report we focus on Jacobi-like resolution of the eigenproblem for a real symmetric matrix from a parallel performance point of view: we try to optimize the algorithm by working on the communication-intensive part of the code. We discuss several parallel implementations and propose one which overlaps communications with computations to reach better efficiency. We show that the overlapping implementation can lead to significant improvements. We conclude by presenting our future work.
Towards Mixed Computation/Communication in Parallel Scientific Libraries
Abstract
This paper presents a technique for overlapping communications with computations based on pipelined communications, which improves the execution time of most parallel numerical algorithms. Some simple examples, namely global combine, matrix-vector product, and two-dimensional Fast Fourier Transform, are developed to illustrate the efficiency of this technique. Moreover, we propose a unified formalism to express the pipelined versions of these algorithms easily. Finally, we report some experiments on various parallel machines.
Keywords: parallel numerical algorithms, communications.
1 Introduction
The large development of parallel scientific computers leads common users to adapt their programs to take into account the potential parallelism of their applications. Since efficient parallel compilers are not available today, the simplest solution is to use parallel numerical libraries [1]. Given an initial distribution of the data (vectors, matrices, etc.), the computational ke...
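As an illustration of a pipelined global combine, here is a sequential simulation of the classic segmented ring all-reduce: the vector is cut into packets so that, on a real machine, communication of one packet can overlap combination of the next. This is a generic textbook scheme, not the paper's own formalism.

```python
def pipelined_ring_allreduce(vectors):
    """Sequential simulation of a pipelined global combine (sum) over a
    ring of p processes, each holding a vector whose length is a
    multiple of p.  The vector is cut into p segments; in each step
    every process forwards one segment to its right neighbour.
    (Generic segmented ring scheme, for illustration only.)"""
    p = len(vectors)
    seg = len(vectors[0]) // p
    bufs = [list(v) for v in vectors]
    # Phase 1, reduce-scatter: after p-1 steps, process r holds the
    # fully combined segment (r + 1) % p.
    for s in range(p - 1):
        moves = [(r, (r - s) % p) for r in range(p)]
        # snapshot outgoing segments to model simultaneous exchanges
        chunks = [bufs[r][j * seg:(j + 1) * seg] for r, j in moves]
        for (r, j), chunk in zip(moves, chunks):
            dst = (r + 1) % p
            for k, v in enumerate(chunk):
                bufs[dst][j * seg + k] += v
    # Phase 2, allgather: completed segments circulate the ring.
    for s in range(p - 1):
        moves = [(r, (r + 1 - s) % p) for r in range(p)]
        chunks = [bufs[r][j * seg:(j + 1) * seg] for r, j in moves]
        for (r, j), chunk in zip(moves, chunks):
            bufs[(r + 1) % p][j * seg:(j + 1) * seg] = chunk
    return bufs
```

After 2(p - 1) steps every simulated process holds the full element-wise sum, and each step moves only one segment per process, which is what makes overlap possible.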
A Library for Coarse Grain Macro-Pipelining in Distributed Memory Architectures
, 1994
Abstract
Introduction
The most natural way of programming parallel machines is data parallelism, where programs execute phases of computations and communications on different sets of data and no overlap exists between communications and computations. Moreover, communication phases are synchronous, i.e., every processor executes these phases at the same time and waits until the last processor completes its communication phase. From the perspective of program correctness, these data-parallel programs are much easier to prove correct than asynchronous CSP-based parallel programs [Hoa85]. Unfortunately, the performance of such programs is bounded by the communication cost. To avoid this problem, the solution is to overlap communications with computations. This is not always possible because of the dependences within the code. If data dependences prevent the use of simple overlap, a solution consists in using a pipelined data-parallel algorithm, decreasing the grain of computations and overlappi...
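A minimal sketch of the grain-based overlap described above, using a Python worker thread as a stand-in for a nonblocking send (names are illustrative; a real implementation would use nonblocking message passing):

```python
from concurrent.futures import ThreadPoolExecutor

def send(chunk, out):
    """Stand-in for a nonblocking send of one grain to the next processor."""
    out.extend(chunk)

def pipelined_stage(data, grain, compute, out):
    """Process `data` in grains of the given size, overlapping the send
    of grain k (done by the worker thread) with the computation of
    grain k+1 (done by the main thread).  Sketch of coarse-grain
    macro-pipelining, not the library's actual interface."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = None
        for start in range(0, len(data), grain):
            # compute the current grain while the previous one is in flight
            chunk = [compute(v) for v in data[start:start + grain]]
            if pending is not None:
                pending.result()  # wait for the previous grain's send
            pending = pool.submit(send, chunk, out)
        if pending is not None:
            pending.result()
    return out
```

Shrinking `grain` increases overlap but pays more per-message startup cost, which is exactly the grain-size trade-off studied in the first paper of this listing.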