Results 1 - 10 of 17
Efficient Algorithms for All-to-All Communications in Multi-Port Message-Passing Systems
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 1997
Abstract

Cited by 83 (0 self)
We present efficient algorithms for two all-to-all communication operations in message-passing systems: index (or all-to-all personalized communication) and concatenation (or all-to-all broadcast). We assume a model of a fully connected message-passing system, in which the performance of any point-to-point communication is independent of the sender-receiver pair. We also assume that each processor has k ≥ 1 ports, through which it can send and receive k messages in every communication round. The complexity measures we use are independent of the particular system topology and are based on the communication startup time and on the communication bandwidth. In the index operation among n processors, initially, each processor has n blocks of data, and the goal is to exchange the i-th block of processor j with the j-th block of processor i. We present a class of index algorithms that is designed for all values of n and that features a trade-off between the communication startup time and the data transfer time. This class of algorithms includes two special cases: an algorithm that is optimal with respect to the measure of the startup time, and an algorithm that is optimal with respect to the measure of the data transfer time. We also present experimental results featuring the performance tuneability of our index algorithms on the IBM SP1 parallel system. In the concatenation operation among n processors, initially, each processor has one block of data, and the goal is to concatenate the n blocks of data from the n processors and to make the concatenation result known to all the processors. We present a concatenation algorithm that is optimal, for most values of n, in the number of communication rounds and in the amount of data transferred.
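The index operation described above amounts to a blockwise transpose of the data across processors. As a minimal illustration, here is a simulation of one classic round schedule (XOR pairing, assuming n is a power of two and single-port processors, k = 1); this is a generic textbook schedule, not necessarily one of the paper's algorithms:

```python
def index_exchange(blocks):
    """Simulate the index (all-to-all personalized) operation.

    blocks[j][i] is the i-th block held by processor j; the goal is to
    exchange block i of processor j with block j of processor i, i.e.,
    a blockwise transpose of the data across processors.
    """
    n = len(blocks)
    assert n & (n - 1) == 0, "this sketch assumes n is a power of two"
    result = [list(row) for row in blocks]
    # n - 1 communication rounds; in round r, processor j exchanges one
    # block with partner j XOR r.  Each round is a perfect matching, so
    # every processor sends and receives exactly one message per round.
    for r in range(1, n):
        for j in range(n):
            partner = j ^ r
            if j < partner:  # swap each pair once
                result[j][partner], result[partner][j] = \
                    result[partner][j], result[j][partner]
    return result

initial = [[f"b{i}@{j}" for i in range(4)] for j in range(4)]
final = index_exchange(initial)
# final[j][i] == initial[i][j]: processor j now holds the j-th block
# of every other processor.
assert final[1][3] == "b1@3"
```

This schedule uses n - 1 start-ups of one block each; the paper's trade-off is between schedules like this one and schedules with fewer, larger rounds.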
An Architecture for Optimal All-to-All Personalized Communication
1994
Abstract

Cited by 33 (7 self)
In all-to-all personalized communication (AAPC), every node of a parallel system sends a potentially unique packet to every other node. AAPC is an important primitive operation for modern parallel compilers, since it is used to redistribute data structures during parallel computations. As an extremely dense communication pattern, AAPC causes congestion in many types of networks and therefore executes very poorly on general-purpose, asynchronous message-passing routers. We present and evaluate a network architecture that executes all-to-all communication optimally on a two-dimensional torus. The router combines optimal partitions of the AAPC step with a self-synchronizing switching mechanism integrated into a conventional wormhole router. Optimality is achieved by routing along shortest paths while fully utilizing all links. A simple hardware addition for synchronized message switching can guarantee optimal AAPC routing in many existing network architectures. The flexible communication agent of the iWarp VLSI component allowed us to implement an efficient prototype for the evaluation of the hardware complexity as well as possible software overheads. The measured performance on an 8 × 8 torus exceeded 2 GigaBytes/sec, or 80% of the limit set by the raw speed of the interconnects. We make a quantitative comparison of the AAPC router with a conventional message-passing system. The potential gain of such a router for larger parallel programs is illustrated with the example of a two-dimensional Fast Fourier Transform.
Optimizing FORTRAN90 Programs for Data Motion on Massively Parallel Systems
1992
Abstract

Cited by 25 (1 self)
This paper describes a general compiler optimization technique that reduces communication overhead for FORTRAN90 (and High Performance FORTRAN currently being drafted) implementations on massively parallel machines. The main sources of communication, or data motion, for the parallel implementation of a FORTRAN90 program are from array assignments (using the index triplet notation and vector indexing), array operators (e.g. CSHIFT, TRANSPOSE, etc.), and array parameter passing to and from subroutines. Coupled with the variety of ways arrays can be distributed, a FORTRAN90 implementor faces a rich space in which data motion can be organized. A model of data motion and an algebraic representation of data motion and data layout are presented. Yale Extension, a set of layout declarations for directing the compiler in distributing the data, is described. An array reference or an array operation extracted from the source FORTRAN90 program, given a particular data layout specified in Yale E...
Practical Parallel Algorithms for Personalized Communication and Integer Sorting
ACM JOURNAL OF EXPERIMENTAL ALGORITHMICS, 1995
Abstract

Cited by 23 (5 self)
A fundamental challenge for parallel computing is to obtain high-level, architecture-independent algorithms which execute efficiently on general-purpose parallel machines. With the emergence of message-passing standards such as MPI, it has become easier to design efficient and portable parallel algorithms by making use of these communication primitives. While existing primitives allow an assortment of collective communication routines, they do not handle an important communication event when most or all processors have non-uniformly sized personalized messages to exchange with each other. We focus in this paper on the h-relation personalized communication, whose efficient implementation will allow high-performance implementations of a large class of algorithms. While most previous h-relation algorithms use randomization, this paper presents a new deterministic approach for h-relation personalized communication with asymptotically optimal complexity for h ≥ p². As an application, we ...
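A common deterministic strategy for this problem is a two-phase exchange; the sketch below simulates a generic scheme in that spirit (not necessarily the paper's exact algorithm): phase 1 deals each sender's items round-robin across intermediate processors so every intermediate receives a balanced share regardless of the original message-size skew, and phase 2 forwards each item to its true destination.

```python
def h_relation(outgoing):
    """Simulate a two-phase personalized exchange on p processors.

    outgoing[s][d] is the list of items processor s must send to d
    (sizes may be highly non-uniform).  Phase 1 balances the load by
    dealing items round-robin to intermediates; phase 2 forwards each
    item to its destination.  Generic two-phase scheme, by simulation.
    """
    p = len(outgoing)
    # Phase 1: sender s deals its t-th item for destination d to
    # intermediate (d + t) % p, smoothing out non-uniform sizes.
    staged = [[[] for _ in range(p)] for _ in range(p)]  # staged[m][d]
    for s in range(p):
        for d in range(p):
            for t, item in enumerate(outgoing[s][d]):
                staged[(d + t) % p][d].append(item)
    # Phase 2: each intermediate m forwards staged[m][d] to d.
    received = [[] for _ in range(p)]
    for m in range(p):
        for d in range(p):
            received[d].extend(staged[m][d])
    return received

# Highly skewed message sizes: s sends (s + d) items to d.
out = [[[f"{s}->{d}"] * (s + d) for d in range(3)] for s in range(3)]
got = h_relation(out)
assert sorted(got[2]) == sorted(sum((out[s][2] for s in range(3)), []))
```

Each phase is a regular all-to-all with near-uniform message sizes, which is why such schemes achieve deterministic performance when h is large relative to p.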
C³: A parallel model for coarse-grained machines
1995
Abstract

Cited by 13 (2 self)
In this paper, we propose a model for parallel computation, the C³ model. The C³ model evaluates, for a given parallel algorithm and target architecture, the complexity of computation, the pattern of communication, and the potential congestion arising during communication. A metric for estimating the effect of link and processor congestion on the performance of a communication operation is developed. This metric allows the evaluation of arbitrary communication operations without the user having to specify fine scheduling details. We describe how the C³ model can serve as a platform for the development of coarse-grained algorithms sensitive to the parameters of a parallel machine. The initial validation of the C³ model is discussed for the Intel Touchstone Delta. We compare predicted and actual performance of different solutions for communication operations and of various divide-and-conquer approaches for contour ranking on images.
Exchange of Messages of Different Sizes
 In IRREGULAR '98
Abstract

Cited by 12 (4 self)
In this paper, we study the exchange of messages among a set of processors linked through an interconnection network. We focus on general, non-uniform versions of all-to-all (or complete) exchange problems in asynchronous systems with a linear cost model and messages of arbitrary sizes. We extend previous complexity results to show that the general asynchronous problems are NP-complete. We present several approximation algorithms and determine which heuristics are best suited to several parallel systems. We conclude with experimental results that show that our algorithms outperform the native all-to-all exchange algorithm on an IBM SP2 when the number of processors is odd.
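One simple member of the heuristic family such experiments explore (a generic greedy sketch, not any specific algorithm from the paper): schedule the exchange as a sequence of rounds, each a matching in which every processor takes part in at most one transfer, pairing the heaviest remaining messages first; under a linear cost model a round costs roughly its longest message.

```python
def greedy_rounds(sizes):
    """Greedily schedule a non-uniform all-to-all exchange.

    sizes[(s, d)] is the message size from s to d.  Each round is a
    matching (every processor appears in at most one transfer); pairing
    the heaviest remaining messages in the same round lets them share
    that round's cost.  Generic greedy heuristic for illustration.
    """
    pending = sorted(sizes.items(), key=lambda kv: -kv[1])
    rounds = []
    while pending:
        busy, this_round, rest = set(), [], []
        for (s, d), size in pending:
            if s not in busy and d not in busy:
                busy.update((s, d))
                this_round.append(((s, d), size))
            else:
                rest.append(((s, d), size))
        rounds.append(this_round)
        pending = rest
    return rounds

sizes = {(0, 1): 9, (1, 0): 8, (2, 3): 7, (0, 2): 5, (1, 3): 4}
sched = greedy_rounds(sizes)
# Under the linear cost model, total cost = sum of per-round maxima.
cost = sum(max(sz for _, sz in rnd) for rnd in sched)
```

Since the underlying scheduling problem is NP-complete, heuristics like this trade a provable optimum for predictable, near-optimal round counts in practice.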
Portable and scalable algorithms for irregular all-to-all communication
In 16th ICDCS, 1996
Abstract

Cited by 11 (3 self)
In irregular all-to-all communication, messages are exchanged between every pair of processors. The message sizes vary from processor to processor and are known only at run time. This is a fundamental communication primitive in parallelizing irregularly structured scientific computations. Our algorithm reduces the total number of message startups. It also reduces node contention by smoothing out the lengths of the messages communicated. Compared to earlier approaches, our algorithm provides deterministic performance and also reduces the buffer space needed at the nodes during message passing. The performance of the algorithm is characterised using a simple communication model of high-performance computing (HPC) platforms. We show the implementation on the T3D and SP2 using C and the Message Passing Interface standard. These can be easily ported to other HPC platforms. The results show the effectiveness of the proposed technique as well as the interplay among the machine size, the variance in message length, and the network ...
Communication Operations on Coarse-Grained Mesh Architectures
Parallel Computing, 1994
Abstract

Cited by 9 (5 self)
In this paper we consider three frequently arising communication operations: one-to-all, all-to-one, and all-to-all. We describe architecture-independent solutions for each operation, as well as solutions tailored towards the mesh architecture. We show how the relationship among the parameters of a parallel machine, and the relationship of these parameters to the message size, determines the best solution. We discuss performance and scalability issues of our solutions on the Intel Touchstone Delta. Our results show that in order to cover a broad range of scalability for a particular operation, multiple solutions should be employed. Keywords: parallel processing, coarse-grained machines, communication operations, scalability. Research supported in part by ARPA under contract DABT63-92-C-0022-ONR. The views and conclusions contained in this paper are those of the authors and should not be interpreted as representing official policies, expressed or implied, of the U.S. government.
Multi-Phase Redistribution: A Communication-Efficient Approach to Array Redistribution
1994
Abstract

Cited by 6 (3 self)
Distributed-memory implementations of several scientific applications require array redistribution. Array redistribution is used in languages such as High Performance Fortran to dynamically change the distribution of arrays across processors. Performing array redistribution incurs two overheads: an indexing overhead for determining the set of processors to communicate with and the array elements to be communicated, and a communication overhead for performing the necessary irregular all-to-many personalized communication. In this paper, efficient runtime methods for performing array redistribution are presented. To reduce the indexing overhead, precise closed forms for enumerating the processors to communicate with and the array elements to be communicated are developed for two special cases of array redistribution involving block-cyclically distributed arrays. The general array redistribution problem for block-cyclically distributed arrays can be expressed in terms of these special case...
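Such closed forms rest on simple ownership arithmetic for block-cyclic layouts. The following sketch (hypothetical helper names, and a brute-force scan rather than the paper's closed-form enumeration) shows the basic computation that the indexing step optimizes away:

```python
def owner(i, b, p):
    """Owner of global element i under a block-cyclic(b) distribution
    over p processors: blocks of b consecutive elements are dealt to
    processors round-robin."""
    return (i // b) % p

def redistribution_pairs(n, b_old, b_new, p):
    """Enumerate which elements move between which processor pairs when
    changing block-cyclic(b_old) into block-cyclic(b_new).  A real
    redistribution would use closed forms instead of scanning every
    element; this scan just makes the communication pattern visible.
    """
    pairs = {}
    for i in range(n):
        s, d = owner(i, b_old, p), owner(i, b_new, p)
        if s != d:
            pairs.setdefault((s, d), []).append(i)
    return pairs

# cyclic(2) -> cyclic(3) over 4 processors, 24 elements:
moves = redistribution_pairs(24, 2, 3, 4)
```

The resulting `moves` map is exactly the irregular all-to-many pattern the abstract mentions: most, but not all, processor pairs exchange a few elements each.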
Scalable s-to-p broadcasting on message-passing MPPs
IEEE Transactions on Parallel and Distributed Systems, 1998
Abstract

Cited by 5 (0 self)
In s-to-p broadcasting, s processors in a p-processor machine contain a message to be broadcast to all the processors, 1 ≤ s ≤ p. We present a number of different broadcasting algorithms that handle all ranges of s. We show how the performance of each algorithm is influenced by the distribution of the s source processors and by the relationships between the distribution and the characteristics of the interconnection network. For the Intel Paragon we show that for each algorithm and machine dimension there exist ideal distributions and distributions on which the performance degrades. For the Cray T3D we also demonstrate dependencies between distributions and machine sizes. To reduce the dependence of the performance on the distribution of sources, we propose a repositioning approach. In this approach, the initial distribution is turned into an ideal distribution for the target broadcasting algorithm. We report experimental results for the Intel Paragon and Cray T3D and discuss scalability and performance. Index Terms—Broadcasting, communication operations, message-passing MPPs, scalability.
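For reference, the s = 1 base case has a well-known logarithmic schedule; a minimal sketch (generic binomial-tree broadcast with root 0, not one of the paper's s-to-p algorithms, which must also handle many sources and their placement):

```python
def binomial_broadcast_rounds(p):
    """Round schedule for broadcasting from a single source (s = 1):
    in each round every processor that already holds the message
    forwards it to one partner, doubling the informed set, so
    ceil(log2 p) rounds suffice.  Root is processor 0.
    """
    rounds = []
    informed = 1  # processors 0..informed-1 hold the message
    while informed < p:
        sends = [(j, j + informed) for j in range(informed)
                 if j + informed < p]
        rounds.append(sends)
        informed *= 2
    return rounds

sched = binomial_broadcast_rounds(6)
# 3 rounds for p = 6: [(0, 1)], [(0, 2), (1, 3)], [(0, 4), (1, 5)]
```

With s sources the problem becomes one of coordinating s such trees, which is where the source distribution, and the repositioning step the abstract proposes, come into play.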