Results 11–20 of 875
Efficient Algorithms for All-to-All Communications in Multi-Port Message-Passing Systems
 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS
, 1997
Abstract

Cited by 85 (0 self)
We present efficient algorithms for two all-to-all communication operations in message-passing systems: index (or all-to-all personalized communication) and concatenation (or all-to-all broadcast). We assume a model of a fully connected message-passing system, in which the performance of any point-to-point communication is independent of the sender-receiver pair. We also assume that each processor has k ≥ 1 ports, through which it can send and receive k messages in every communication round. The complexity measures we use are independent of the particular system topology and are based on the communication start-up time and on the communication bandwidth. In the index operation among n processors, initially, each processor has n blocks of data, and the goal is to exchange the i-th block of processor j with the j-th block of processor i. We present a class of index algorithms that is designed for all values of n and that features a trade-off between the communication start-up time and the data transfer time. This class of algorithms includes two special cases: an algorithm that is optimal with respect to the measure of the start-up time, and an algorithm that is optimal with respect to the measure of the data transfer time. We also present experimental results featuring the performance tunability of our index algorithms on the IBM SP1 parallel system. In the concatenation operation among n processors, initially, each processor has one block of data, and the goal is to concatenate the n blocks of data from the n processors and to make the concatenation result known to all the processors. We present a concatenation algorithm that is optimal, for most values of n, in the number of communication rounds and in the amount of data transferred.
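The index operation described in this abstract is, in effect, a transpose of an n × n matrix of blocks. A minimal single-process Python sketch of the resulting data movement (not the paper's multi-port, round-based algorithm) might look like:

```python
# Toy model of the index (all-to-all personalized) operation:
# processor j starts with blocks[j], where blocks[j][i] is the block
# it intends for processor i; afterwards processor i holds block i of
# every processor j -- i.e. a block transpose.

def index_exchange(blocks):
    """Return after[i][j] = blocks[j][i] for an n x n block matrix."""
    n = len(blocks)
    return [[blocks[j][i] for j in range(n)] for i in range(n)]

# 3 processors; block (j, i) labelled "j->i" for readability
before = [[f"{j}->{i}" for i in range(3)] for j in range(3)]
after = index_exchange(before)
assert after[1] == ["0->1", "1->1", "2->1"]  # proc 1 holds every j's block 1
```

A real implementation would schedule these transfers across the k ports over multiple communication rounds; the sketch only captures which block ends up where.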
Scalable Computing
 Computer Science Today: Recent Trends and Developments
, 1996
Abstract

Cited by 83 (3 self)
Scalable computing will, over the next few years, become the normal form of computing. In this paper we present a unified framework, based on the BSP model, which aims to serve as a foundation for this evolutionary development. A number of important techniques, tools and methodologies for the design of sequential algorithms and programs have been developed over the past few decades. In the transition from sequential to scalable computing we will find that new requirements such as universality and predictable performance will necessitate significant changes of emphasis in these areas. Programs for scalable computing, in addition to being fully portable, will have to be efficiently universal, offering high performance, in a predictable way, on any general-purpose parallel architecture. The BSP model provides a discipline for the design of scalable programs of this kind. We outline the approach and discuss some of the issues involved.
BSPlib: The BSP Programming Library
, 1998
Abstract

Cited by 83 (6 self)
BSPlib is a small communications library for bulk synchronous parallel (BSP) programming which consists of only 20 basic operations. This paper presents the full definition of BSPlib in C, motivates the design of its basic operations, and gives examples of their use. The library enables programming in two distinct styles: direct remote memory access using put or get operations, and bulk synchronous message passing. Currently, implementations of BSPlib exist for a variety of modern architectures, including massively parallel computers with distributed memory, shared memory multiprocessors, and networks of workstations. BSPlib has been used in several scientific and industrial applications; this paper briefly describes applications in benchmarking, Fast Fourier Transforms, sorting, and molecular dynamics.
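BSPlib's put-style remote memory access follows the bulk synchronous superstep discipline: data written with a put becomes visible at the destination only after the next barrier (bsp_put / bsp_sync in BSPlib's C interface). The toy Python model below illustrates those semantics only; it is not BSPlib's actual API.

```python
# Toy model of BSPlib-style superstep semantics: puts issued during a
# superstep are buffered and become visible only after the barrier.
# Illustration of the semantics only, not BSPlib's C interface.

class BSPMachine:
    def __init__(self, nprocs):
        self.mem = [dict() for _ in range(nprocs)]  # per-processor memory
        self.pending = []                            # puts buffered this superstep

    def put(self, dest, key, value):
        self.pending.append((dest, key, value))      # buffered, not yet visible

    def sync(self):
        # Barrier ending the superstep: deliver all buffered puts.
        for dest, key, value in self.pending:
            self.mem[dest][key] = value
        self.pending = []

m = BSPMachine(2)
m.put(1, "x", 42)
assert "x" not in m.mem[1]   # not visible before the barrier
m.sync()
assert m.mem[1]["x"] == 42   # visible after the bsp_sync-style barrier
```

The deferred delivery is what makes a superstep's communication pattern analyzable as a single h-relation, independent of the order in which individual puts were issued.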
Space-Efficient Scheduling of Multithreaded Computations
 SIAM Journal on Computing
, 1993
Abstract

Cited by 82 (14 self)
This paper considers the problem of scheduling dynamic parallel computations to achieve linear speedup without using significantly more space per processor than that required for a single-processor execution. Utilizing a new graph-theoretic model of multithreaded computation, execution efficiency is quantified by three important measures: T_1 is the time required for executing the computation on one processor, T_∞ is the time required by an infinite number of processors, and S_1 is the space required to execute the computation on one processor. A computation executed on P processors is time-efficient if the time is O(T_1/P + T_∞), that is, it achieves linear speedup when P = O(T_1/T_∞), and it is space-efficient if it uses O(S_1 · P) total space, that is, the space per processor is within a constant factor of that required for a one-processor execution. The first result derived from this model shows that there exist multithreaded computations such that no execution schedule can simultan...
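The time bound in this abstract can be made concrete with a small numeric sketch; the work and critical-path values below are hypothetical:

```python
# Work/span time bound from the abstract: a computation with work T_1 and
# critical-path length T_inf runs on P processors in time O(T_1/P + T_inf),
# giving near-linear speedup while P = O(T_1/T_inf).

def time_bound(t1, tinf, p):
    """Ignore the hidden constant and return T_1/P + T_inf."""
    return t1 / p + tinf

t1, tinf = 1_000_000, 100          # hypothetical work and span
assert time_bound(t1, tinf, 1) == t1 + tinf
# With P well below T_1/T_inf (= 10_000), the T_1/P term dominates:
assert time_bound(t1, tinf, 100) == 10_100   # speedup of roughly 99x
```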
Special Purpose Parallel Computing
 Lectures on Parallel Computation
, 1993
Abstract

Cited by 77 (5 self)
A vast amount of work has been done in recent years on the design, analysis, implementation and verification of special purpose parallel computing systems. This paper presents a survey of various aspects of this work. A long, but by no means complete, bibliography is given. 1. Introduction Turing [365] demonstrated that, in principle, a single general purpose sequential machine could be designed which would be capable of efficiently performing any computation which could be performed by a special purpose sequential machine. The importance of this universality result for subsequent practical developments in computing cannot be overstated. It showed that, for a given computational problem, the additional efficiency advantages which could be gained by designing a special purpose sequential machine for that problem would not be great. Around 1944, von Neumann produced a proposal [66, 389] for a general purpose stored-program sequential computer which captured the fundamental principles of...
Compiler Support for Machine-Independent Parallel Programming in Fortran D
, 1991
Abstract

Cited by 75 (16 self)
Because of the complexity and variety of parallel architectures, an efficient machine-independent parallel programming model is needed to make parallel computing truly usable for scientific programmers. We believe that Fortran D, a version of Fortran enhanced with data decomposition specifications, can provide such a programming model. This paper presents the design of a prototype Fortran D compiler for the iPSC/860, a MIMD distributed-memory machine. Issues addressed include data decomposition analysis, guard introduction, communications generation and optimization, program transformations, and storage assignment. A test suite of scientific programs will be used to evaluate the effectiveness of both the compiler technology and programming model for the Fortran D compiler.
Scientific Computing on Bulk Synchronous Parallel Architectures
Abstract

Cited by 70 (13 self)
We theoretically and experimentally analyse the efficiency with which a wide range of important scientific computations can be performed on bulk synchronous parallel architectures.
Communication-Efficient Parallel Sorting
, 1996
Abstract

Cited by 64 (2 self)
We study the problem of sorting n numbers on a p-processor bulk-synchronous parallel (BSP) computer, which is a parallel multicomputer that allows for general processor-to-processor communication rounds provided each processor sends and receives at most h items in any round. We provide parallel sorting methods that use internal computation time that is O((n log n)/p) and a number of communication rounds that is O(log n / log(h+1)) for h = Θ(n/p). The internal computation bound is optimal for any comparison-based sorting algorithm. Moreover, the number of communication rounds is bounded by a constant for the (practical) situations when p ≤ n^(1−1/c) for a constant c ≥ 1. In fact, we show that our bound on the number of communication rounds is asymptotically optimal for the full range of values for p, for we show that just computing the "or" of n bits distributed evenly to the first O(n/h) of an arbitrary number of processors in a BSP computer requires Ω(log n / log(h...
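The claimed round bound can be sanity-checked numerically; the parameter choices below are illustrative:

```python
import math

# Illustrative check of the round bound O(log n / log(h+1)) from the
# abstract: when h = n/p and p <= n^(1-1/c), the bound works out to a
# small constant (roughly c rounds, up to the hidden constant).

def rounds(n, h):
    return math.log(n) / math.log(h + 1)

n = 2**20
p = 2**10            # p = n^(1/2), i.e. the case c = 2
h = n // p           # h = n/p = 1024
assert rounds(n, h) < 3   # constant number of rounds in this regime
```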