Efficient Algorithms for All-to-All Communications in Multi-Port Message-Passing Systems
- IEEE Transactions on Parallel and Distributed Systems
, 1997
Cited by 60 (0 self)
We present efficient algorithms for two all-to-all communication operations in message-passing systems: index (or all-to-all personalized communication) and concatenation (or all-to-all broadcast). We assume a model of a fully connected message-passing system, in which the performance of any point-to-point communication is independent of the sender-receiver pair. We also assume that each processor has k ≥ 1 ports, through which it can send and receive k messages in every communication round. The complexity measures we use are independent of the particular system topology and are based on the communication start-up time and on the communication bandwidth. In the index operation among n processors, initially, each processor has n blocks of data, and the goal is to exchange the ith block of processor j with the jth block of processor i. We present a class of index algorithms that is designed for all values of n and that features a trade-off between the communication start-up time and the data transfer time. This class of algorithms includes two special cases: an algorithm that is optimal with respect to the measure of the start-up time, and an algorithm that is optimal with respect to the measure of the data transfer time. We also present experimental results featuring the performance tuneability of our index algorithms on the IBM SP-1 parallel system. In the concatenation operation among n processors, initially, each processor has one block of data, and the goal is to concatenate the n blocks of data from the n processors and to make the concatenation result known to all the processors. We present a concatenation algorithm that is optimal, for most values of n, in the number of communication rounds and in the amount of data transferred. Index Terms: All-to-all broadcast, all-to-all personalized communication, complete exchange, concatenation operation, distributed-memory system, index operation, message-passing system, multiscatter/gather, parallel system.
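The two operations defined in this abstract can be illustrated with a small simulation. The sketch below is only an assumption-laden illustration of their semantics: each "processor" is a Python list inside one process, the function names index_operation and concatenation_operation are hypothetical, and none of the paper's multi-port scheduling or cost trade-offs are modeled.

```python
# Semantics only: processors are simulated as lists in a single process.
# blocks[j][i] stands for block i held by processor j before the operation.

def index_operation(blocks):
    """All-to-all personalized exchange (index).

    Block i of processor j is exchanged with block j of processor i,
    so the result is the transpose of the n x n block layout.
    """
    n = len(blocks)
    return [[blocks[j][i] for j in range(n)] for i in range(n)]


def concatenation_operation(blocks):
    """All-to-all broadcast (concatenation).

    Each processor starts with one block; afterwards every processor
    holds all n blocks in processor order.
    """
    return [list(blocks) for _ in blocks]


if __name__ == "__main__":
    n = 3
    before = [[f"p{j}b{i}" for i in range(n)] for j in range(n)]
    for rank, held in enumerate(index_operation(before)):
        print("index, processor", rank, "holds", held)
    for rank, held in enumerate(concatenation_operation(["a", "b", "c"])):
        print("concatenation, processor", rank, "holds", held)
```

In MPI terms these semantics correspond to what MPI_Alltoall and MPI_Allgather provide; the abstract itself is about how to schedule the underlying messages efficiently, which this sketch does not attempt.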
An Architecture for Optimal All-to-All Personalized Communication
, 1994
Cited by 32 (7 self)
In all-to-all personalized communication (AAPC), every node of a parallel system sends a potentially unique packet to every other node. AAPC is an important primitive operation for modern parallel compilers, since it is used to redistribute data structures during parallel computations. As an extremely dense communication pattern, AAPC causes congestion in many types of networks and therefore executes very poorly on general purpose, asynchronous message passing routers. We present and evaluate a network architecture that executes all-to-all communication optimally on a two-dimensional torus. The router combines optimal partitions of the AAPC step with a self-synchronizing switching mechanism integrated into a conventional wormhole router. Optimality is achieved by routing along shortest paths while fully utilizing all links. A simple hardware addition for synchronized message switching can guarantee optimal AAPC routing in many existing network architectures. The flexible communication agent of the iWarp VLSI component allowed us to implement an efficient prototype for the evaluation of the hardware complexity as well as possible software overheads. The measured performance on an 8 × 8 torus exceeded 2 GigaBytes/sec or 80% of the limit set by the raw speed of the interconnects. We make a quantitative comparison of the AAPC router with a conventional message passing system. The potential gain of such a router for larger parallel programs is illustrated with the example of a two-dimensional Fast Fourier Transform.
Optimizing FORTRAN-90 Programs for Data Motion on Massively Parallel Systems
, 1992
Cited by 24 (1 self)
This paper describes a general compiler optimization technique that reduces communication overhead for FORTRAN-90 (and High Performance FORTRAN, currently being drafted) implementations on massively parallel machines. The main sources of communication, or data motion, for the parallel implementation of a FORTRAN-90 program are array assignments (using the index triplet notation and vector indexing), array operators (e.g., CSHIFT and TRANSPOSE), and array parameter passing to and from subroutines. Coupled with the variety of ways arrays can be distributed, a FORTRAN-90 implementor faces a rich space in which data motion can be organized. A model of data motion and an algebraic representation of data motion and data layout are presented. Yale Extension, a set of layout declarations for directing the compiler in distributing the data, is described. An array reference or an array operation extracted from the source FORTRAN-90 program, given a particular data layout specified in Yale E...
Practical Parallel Algorithms for Personalized Communication and Integer Sorting
- ACM JOURNAL OF EXPERIMENTAL ALGORITHMICS
, 1995
Cited by 21 (5 self)
A fundamental challenge for parallel computing is to obtain high-level, architecture-independent algorithms which efficiently execute on general-purpose parallel machines. With the emergence of message passing standards such as MPI, it has become easier to design efficient and portable parallel algorithms by making use of these communication primitives. While existing primitives allow an assortment of collective communication routines, they do not handle an important communication event when most or all processors have non-uniformly sized personalized messages to exchange with each other. We focus in this paper on the h-relation personalized communication whose efficient implementation will allow high performance implementations of a large class of algorithms. While most previous h-relation algorithms use randomization, this paper presents a new deterministic approach for h-relation personalized communication with asymptotically optimal complexity for h ≥ p². As an application, we ...
C³: A parallel model for coarse-grained machines
, 1995
Cited by 13 (2 self)
In this paper, we propose a model for parallel computation, the C³-model. The C³-model evaluates, for a given parallel algorithm and target architecture, the complexity of computation, the pattern of communication, and the potential congestion arising during communication. A metric for estimating the effect of link and processor congestion on the performance of a communication operation is developed. This metric allows the evaluation of arbitrary communication operations without the user having to specify fine scheduling details. We describe how the C³-model can serve as a platform for the development of coarse-grained algorithms sensitive to the parameters of a parallel machine. The initial validation of the C³-model is discussed for the Intel Touchstone Delta. We compare predicted and actual performance of different solutions for communication operations and of various divide-and-conquer approaches for contour ranking on images.
Exchange of Messages of Different Sizes
- In IRREGULAR '98
Cited by 11 (4 self)
In this paper, we study the exchange of messages among a set of processors linked through an interconnection network. We focus on general, non-uniform versions of all-to-all (or complete) exchange problems in asynchronous systems with a linear cost model and messages of arbitrary sizes. We extend previous complexity results to show that the general asynchronous problems are NP-complete. We present several approximation algorithms and determine which heuristics are best suited to several parallel systems. We conclude with experimental results that show that our algorithms outperform the native all-to-all exchange algorithm on an IBM SP2 when the number of processors is odd.
Communication Operations on Coarse-Grained Mesh Architectures
- Parallel Computing
, 1994
Cited by 9 (5 self)
In this paper we consider three frequently arising communication operations: one-to-all, all-to-one, and all-to-all. We describe architecture-independent solutions for each operation, as well as solutions tailored towards the mesh architecture. We show how the relationship among the parameters of a parallel machine and the relationship of these parameters to the message size determine the best solution. We discuss performance and scalability issues of our solutions on the Intel Touchstone Delta. Our results show that in order to cover a broad range of scalability for a particular operation, multiple solutions should be employed. Keywords: Parallel processing, coarse-grained machines, communication operations, scalability. Research supported in part by ARPA under contract DABT63-92-C-0022ONR. The views and conclusions contained in this paper are those of the authors and should not be interpreted as representing official policies, expressed or implied, of the U.S. government.
Portable and scalable algorithms for irregular all-to-all communication
- In 16th ICDCS
, 1996
Cited by 9 (3 self)
In irregular all-to-all communication, messages are exchanged between every pair of processors. The message sizes vary from processor to processor and are known only at run time. This is a fundamental communication primitive in parallelizing irregularly structured scientific computations. Our algorithm reduces the total number of message start-ups. It also reduces node contention by smoothing out the lengths of the messages communicated. As compared to the earlier approaches, our algorithm provides deterministic performance and also reduces the buffer space at the nodes during message passing. The performance of the algorithm is characterised using a simple communication model of high-performance computing (HPC) platforms. We show the implementation on T3D and SP2 using C and the message passing interface standard. These can be easily ported to other HPC platforms. The results show the effectiveness of the proposed technique as well as the interplay among the machine size, the variance in message length, and the network
Multi-Phase Redistribution: A Communication-Efficient Approach to Array Redistribution
, 1994
Cited by 6 (3 self)
Distributed-memory implementations of several scientific applications require array redistribution. Array redistribution is used in languages such as High Performance Fortran to dynamically change the distribution of arrays across processors. Performing array redistribution incurs two overheads: an indexing overhead for determining the set of processors to communicate with and the array elements to be communicated, and a communication overhead for performing the necessary irregular all-to-many personalized communication. In this paper, efficient runtime methods for performing array redistribution are presented. To reduce the indexing overhead, precise closed forms for enumerating the processors to communicate with and the array elements to be communicated are developed for two special cases of array redistribution involving block-cyclically distributed arrays. The general array redistribution problem for block-cyclically distributed arrays can be expressed in terms of these special case...
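To make the indexing overhead concrete, the sketch below enumerates, by brute force, which array elements each processor must send when a one-dimensional array moves from one block-cyclic distribution to another. It is not the closed-form method this abstract refers to. The helper names owner and communication_sets are hypothetical, and the ownership rule (element g of a distribution with block size b on P processors belongs to processor (g // b) mod P) is the usual definition of a block-cyclic layout, assumed here rather than taken from the paper.

```python
from collections import defaultdict


def owner(global_index, block_size, num_procs):
    """Processor that owns an element under a block-cyclic layout.

    Blocks of `block_size` consecutive elements are dealt out to the
    processors round-robin.
    """
    return (global_index // block_size) % num_procs


def communication_sets(n, num_procs, src_block, dst_block):
    """Map (sender, receiver) -> global indices that must be transferred.

    Brute-force O(n) enumeration; elements that stay on the same
    processor are omitted.
    """
    sends = defaultdict(list)
    for g in range(n):
        src = owner(g, src_block, num_procs)
        dst = owner(g, dst_block, num_procs)
        if src != dst:
            sends[(src, dst)].append(g)
    return dict(sends)


if __name__ == "__main__":
    # Redistribute 8 elements on 2 processors from block size 1 to block size 2.
    for (src, dst), indices in sorted(communication_sets(8, 2, 1, 2).items()):
        print(f"processor {src} sends elements {indices} to processor {dst}")
```

Scanning every element like this costs time proportional to the array size on each redistribution, which is the kind of indexing cost that closed-form enumeration is meant to avoid.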
Scalable s-to-p broadcasting on message-passing MPPs
- IEEE Transactions on Parallel and Distributed Systems
, 1998
Cited by 5 (0 self)
Abstract—In s-to-p broadcasting, s processors in a p-processor machine contain a message to be broadcast to all the processors, 1 ≤ s ≤ p. We present a number of different broadcasting algorithms that handle all ranges of s. We show how the performance of each algorithm is influenced by the distribution of the s source processors and by the relationships between the distribution and the characteristics of the interconnection network. For the Intel Paragon we show that for each algorithm and machine dimension there exist ideal distributions and distributions on which the performance degrades. For the Cray T3D we also demonstrate dependencies between distributions and machine sizes. To reduce the dependence of the performance on the distribution of sources, we propose a repositioning approach. In this approach, the initial distribution is turned into an ideal distribution of the target broadcasting algorithm. We report experimental results for the Intel Paragon and Cray T3D and discuss scalability and performance. Index Terms—Broadcasting, communication operations, message-passing MPPs, scalability.

