Results 1–9 of 9
Optimal Grain Size Computation for Pipelined Algorithms
 In Euro-Par'96 Parallel Processing
, 1996
Abstract

Cited by 19 (3 self)
In this paper, we present a method for overlapping communications on parallel computers for pipelined algorithms. We first introduce a general theoretical model which leads to a generic computation scheme for the optimal packet size. Then we use the OPIUM library, which provides an easy-to-use and efficient way to compute this optimal packet size in the general case, on the column LU factorization; the implementation and performance measurements are made on an Intel Paragon. Keywords: communications overlap, pipelined algorithms, optimal packet size computation. 1 Introduction Parallel distributed-memory machines improve performance and memory capacity, but their use adds an overhead due to communications. To obtain programs that perform and scale well, this overhead must be hidden. Several solutions exist. Choosing a good data distribution is the first step that can be taken to lower the number and size of communications. Depending on the dependences within the cod...
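The optimal packet size the abstract refers to has a well-known closed form under the standard linear communication-cost model; the sketch below uses that textbook model with made-up constants (the cost formula, parameter values, and function names are illustrative assumptions, not the model actually derived in the paper):

```python
import math

def pipeline_time(n, k, beta, tau, p):
    """Time to push n words through k pipeline stages in packets of p words,
    assuming each packet transfer costs beta + tau*p (illustrative linear
    model, not the paper's derived one)."""
    packets = math.ceil(n / p)
    return (packets + k - 1) * (beta + tau * p)

def optimal_packet_size(n, k, beta, tau):
    """Minimiser of the continuous relaxation of pipeline_time:
    p* = sqrt(n * beta / ((k - 1) * tau))."""
    return math.sqrt(n * beta / ((k - 1) * tau))

n, k, beta, tau = 10_000, 8, 50.0, 0.5        # made-up machine parameters
p_star = optimal_packet_size(n, k, beta, tau)
best = min(range(1, n + 1), key=lambda p: pipeline_time(n, k, beta, tau, p))
print(round(p_star), best)  # closed form vs. brute-force discrete optimum
```

Because the discrete cost is staircase-shaped (the packet count is a ceiling), the closed form lands near, not exactly on, the brute-force optimum, but its cost is within a few percent.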
The Nestor library: a tool for implementing Fortran source to source transformations
, 1998
Abstract

Cited by 9 (2 self)
We describe Nestor, a library for easily manipulating Fortran programs through a high-level internal representation based on C++ classes. Nestor is a research tool that can be used to quickly implement source-to-source transformations. The input of the library is Fortran 77, Fortran 90, and HPF 2.0. Its current output supports the same languages plus some dialects such as Petit, OpenMP, and CrayMP. Compared to SUIF 2.0, which is still only announced, Nestor is less ambitious, but it is light, ready to use (http://www.ens-lyon.fr/~gsilber/nestor), fully documented, and better suited for Fortran-to-Fortran transformations.
Implementing Pipelined Computation and Communication in an HPF Compiler
 In Euro-Par'96 Parallel Processing
, 1996
Abstract

Cited by 6 (3 self)
Many scientific applications can benefit from pipelining computation and communication. Our aim is to provide compiler and runtime support for High Performance Fortran applications that could benefit from these techniques. This paper describes the integration of a library for pipelined computations in the runtime system. Results on some application kernels are given. 1 Introduction With the introduction of High Performance Fortran (HPF) [KLS+94], it is possible to use the data-parallel programming paradigm in a very convenient way for scientific applications. With current compilation technology, these programs execute phases of computations and communications on different sets of data, and no overlap exists between communications and computations. Moreover, communication phases are synchronous, i.e. each processor executes these phases at the same time and waits until the last processor completes its communication phase. An important task of the HPF compiler is to detect th...
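The contrast the abstract draws between synchronous phases and pipelined overlap can be made concrete with a toy cost model (both formulas and all numbers below are illustrative assumptions, not measurements or formulas from the paper):

```python
def phased_time(chunks, t_comp, t_comm):
    """Synchronous phases: each chunk is computed, then communicated,
    with no overlap (times in illustrative units)."""
    return chunks * (t_comp + t_comm)

def pipelined_time(chunks, t_comp, t_comm):
    """Pipelined execution: communication of chunk i overlaps computation
    of chunk i+1; once the pipeline is full, each step costs the max."""
    return t_comp + (chunks - 1) * max(t_comp, t_comm) + t_comm

print(phased_time(100, 2.0, 1.5))     # 350.0
print(pipelined_time(100, 2.0, 1.5))  # 201.5
```

As the chunk count grows, the pipelined time approaches `chunks * max(t_comp, t_comm)`: the cheaper of the two activities is hidden entirely.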
HPFIT: A Set of Integrated Tools for the Parallelization of Applications Using High Performance Fortran
, 1996
Abstract

Cited by 5 (4 self)
In this report, we present the HPFIT project, whose aim is to provide a set of interactive tools integrated in a single environment to help users parallelize scientific applications to be run on distributed-memory parallel computers. HPFIT is built around a restructuring tool called TransTOOL, which includes an editor, a parser, a dependence analysis tool, and an optimization kernel. Moreover, we provide users with a clean interface, so that developers of tools around High Performance Fortran can easily integrate their software within our tool.
Parallel FFT Algorithms with Reduced Communication Overhead
, 1998
Abstract

Cited by 2 (0 self)
One of the primary objectives in the programming of parallel algorithms is to reduce the effects of the overhead introduced when a given problem is parallelized. A key contributor to overhead is communication time. One way to reduce the communication overhead is to minimize the actual time for communication. Another approach is to hide communication by overlapping it with computation. In this paper, a new approach is presented to overlap all communication-intensive steps appearing in the four-step FFT algorithm (initial data distribution, matrix transpose, and final data collection) with computation. The presented method is based on a Kronecker product factorization of the four-step FFT algorithm. Contents: 1 Introduction; 2 Kronecker Products; 2.1 Stride Permutations; 3 DFT Matrix Representation; 4 Four-Step FFT Computation; 4.1 Multi-Row Four-Step FFT Algorithm; 4.2 Multi-Column Four-S...
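The four-step factorization mentioned above can be sketched in a few lines. This is the textbook Cooley-Tukey/four-step splitting (column DFTs, twiddle multiplication, transpose, row DFTs) with naive O(n^2) DFTs standing in for optimized FFT kernels; the index convention is one common choice, not necessarily the paper's:

```python
import cmath

def dft(x):
    """Naive O(n^2) DFT; stands in for an optimized FFT kernel."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def four_step_dft(x, p, q):
    """Length-(p*q) DFT via the four-step factorization:
    (1) p-point DFTs on the q strided subsequences, (2) twiddle factors,
    (3) transpose (implicit in the indexing), (4) q-point DFTs.
    Index convention: input n = q*a + b, output k = c + p*d."""
    n = p * q
    assert len(x) == n
    # step 1: for each residue b, DFT the stride-q subsequence x[b::q]
    y = [dft(x[b::q]) for b in range(q)]          # y[b][c]
    # step 2: multiply by the twiddle factors w_n^(b*c)
    for b in range(q):
        for c in range(p):
            y[b][c] *= cmath.exp(-2j * cmath.pi * b * c / n)
    # steps 3+4: transpose, then q-point DFTs: X[c + p*d] = DFT_b(y[b][c])[d]
    out = [0j] * n
    for c in range(p):
        col = dft([y[b][c] for b in range(q)])
        for d in range(q):
            out[c + p * d] = col[d]
    return out

x = [complex(i, 0) for i in range(12)]            # small illustrative input
direct = dft(x)
fast = four_step_dft(x, 3, 4)
assert all(abs(a - b) < 1e-9 for a, b in zip(direct, fast))
```

The inner and outer DFT stages touch independent rows and columns, which is what makes it possible to overlap the transpose communication with the per-row/per-column computation.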
Efficient Communication Operations On Passive Optical Star Networks
, 1994
Abstract

Cited by 1 (0 self)
In this paper, we show how to use the Wavelength Division Multiple Access capabilities of Passive Optical Star Networks for efficiently implementing communication operations that are widely used in parallel applications. We propose algorithms for multiple broadcasting, scattering, gossiping, and multi-scattering which are very close to the lower bound for these problems. 1 Introduction For massively parallel architectures, the hardware complexity of the interconnection network is much higher than that of the processing units, employing most of the hardware involved [17]. In many optical communication networks, Passive Optical Star technology using wavelength division multiple access (WDMA) offers optical multiple-access channels that allow a significant reduction in the complexity of the specialized routers connected to the processors, along... [Footnote: Part of this work was done during a postdoc period at the C.S. dept. of the University of Tennessee, Knoxville, TN 37996-1301, USA.]
Towards Mixed Computation/Communication in Parallel Scientific Libraries
Abstract
This paper presents a technique for overlapping communications with computations based on pipelined communications, which makes it possible to improve the execution time of most parallel numerical algorithms. Some simple examples are developed, namely global combine, matrix-vector product, and two-dimensional Fast Fourier Transform, to illustrate the efficiency of this technique. Moreover, we propose a unified formalism to express easily the pipelined versions of these algorithms. Finally, we report some experiments on various parallel machines. Keywords: parallel numerical algorithms, communications. 1 Introduction The large development of parallel scientific computers leads common users to adapt their programs to take into account the potential parallelism of their applications. Since efficient parallel compilers are not available today, the simplest solution is to use parallel numerical libraries [1]. Given an initial distribution of the data (vectors, matrices, etc.), the computational ke...
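As one concrete instance of a pipelined global combine, here is a standard segmented ring all-reduce (reduce-scatter followed by all-gather), simulated sequentially in plain Python. This is a textbook illustration, not the formalism or implementation proposed in the paper; the process count and input data are made up:

```python
def ring_allreduce_sum(data):
    """Segmented ring all-reduce (global sum), simulated sequentially.
    data: P equal-length vectors, one per simulated process. Each vector is
    split into P segments that circulate around the ring; in a real
    implementation each send overlaps with the local additions."""
    P = len(data)
    n = len(data[0])
    assert n % P == 0, "vector length must be a multiple of P"
    seg = n // P
    bufs = [list(v) for v in data]

    def sl(s):                      # slice covering segment s (mod P)
        s %= P
        return slice(s * seg, (s + 1) * seg)

    # reduce-scatter: after P-1 steps, process r owns the full sum of
    # segment (r+1) mod P
    for t in range(P - 1):
        sends = [bufs[r][sl(r - t)] for r in range(P)]   # simultaneous sends
        for r in range(P):
            dst = (r + 1) % P
            bufs[dst][sl(r - t)] = [a + b for a, b in
                                    zip(bufs[dst][sl(r - t)], sends[r])]
    # all-gather: circulate the completed segments until all hold everything
    for t in range(P - 1):
        sends = [bufs[r][sl(r + 1 - t)] for r in range(P)]
        for r in range(P):
            bufs[(r + 1) % P][sl(r + 1 - t)] = sends[r]
    return bufs

data = [[float(r * 8 + i) for i in range(8)] for r in range(4)]  # made-up input
result = ring_allreduce_sum(data)
expected = [sum(col) for col in zip(*data)]
assert all(row == expected for row in result)
```

Because only one segment is in flight per process per step, the combine is naturally pipelined: each link carries a small packet while the previously received packet is being added locally.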
A Library for Coarse Grain Macro-Pipelining in Distributed Memory Architectures
, 1994
Abstract
Introduction The most natural way of programming parallel machines is data parallelism, where programs execute phases of computations and communications on different sets of data and no overlap exists between communications and computations. Moreover, communication phases are synchronous, i.e. every processor executes these phases at the same time and waits until the last processor completes its communication phase. From the perspective of program correctness, these data-parallel programs are much easier to prove than asynchronous CSP-based parallel programs [Hoa85]. Unfortunately, the performance of such programs is bounded by the communication cost. To avoid this problem, the solution is to overlap communications with computations. This is not always possible because of the dependences within the code. If data dependences prevent the use of simple overlap, a solution consists in using a pipelined data-parallel algorithm, decreasing the grain of computations and overlappi...
Parallel 3D air flow simulation on workstation clusters
, 1998
Abstract
Thesee is a 3D panel-method code which calculates the characteristics of a wing in an inviscid, incompressible, irrotational, and steady airflow, in order to design new paragliders and sails. In this paper, we present the parallelization of Thesee for low-cost workstation/PC clusters. Thesee has been parallelized using the ScaLAPACK library routines in a systematic manner that led to a low development cost. The code, written in C, is thus very portable since it uses only high-level libraries. This design was very efficient in terms of manpower and gave good performance results. The code's performance was measured on three clusters of computers connected by different LANs: an Ethernet LAN of SUN SPARCstations, an ATM LAN of SUN SPARCstations, and a Myrinet LAN of PCs. The last was the least expensive and gave the best timing results, with superlinear speedup.