Results 1  10
of
18
Efficient Algorithms for AlltoAll Communications in MultiPort MessagePassing Systems
 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS
, 1997
"... We present efficient algorithms for two alltoall communication operations in messagepassing systems: index (or alltoall personalized communication) and concatenation (or alltoall broadcast). We assume a model of a fully connected messagepassing system, in which the performance of any pointto ..."
Abstract

Cited by 84 (0 self)
 Add to MetaCart
We present efficient algorithms for two alltoall communication operations in messagepassing systems: index (or alltoall personalized communication) and concatenation (or alltoall broadcast). We assume a model of a fully connected messagepassing system, in which the performance of any pointtopoint communication is independent of the senderreceiver pair. We also assume that each processor has k ≥ 1 ports, through which it can send and receive k messages in every communication round. The complexity measures we use are independent of the particular system topology and are based on the communication startup time, and on the communication bandwidth. In the index operation among n processors, initially, each processor has n blocks of data, and the goal is to exchange the i th block of processor j with the j th block of processor i. We present a class of index algorithms that is designed for all values of n and that features a tradeoff between the communication startup time and the data transfer time. This class of algorithms includes two special cases: an algorithm that is optimal with respect to the measure of the startup time, and an algorithm that is optimal with respect to the measure of the data transfer time. We also present experimental results featuring the performance tuneability of our index algorithms on the IBM SP1 parallel system. In the concatenation operation, among n processors, initially, each processor has one block of data, and the goal is to concatenate the n blocks of data from the n processors, and to make the concatenation result known to all the processors. We present a concatenation algorithm that is optimal, for most values of n, in the number of communication rounds and in the amount of data transferred.
Scheduling Strategies for MasterSlave Tasking on Heterogeneous Processor Grids
, 2002
"... In this paper, we consider the problem of allocating a large number of independent, equalsized tasks to a heterogeneous "grid" computing platform. We use a nonoriented graph to model a grid, where resources can have different speeds of computation and communication, as well as different overlap ca ..."
Abstract

Cited by 81 (36 self)
 Add to MetaCart
In this paper, we consider the problem of allocating a large number of independent, equalsized tasks to a heterogeneous "grid" computing platform. We use a nonoriented graph to model a grid, where resources can have different speeds of computation and communication, as well as different overlap capabilities. We show how to determine the optimal steadystate scheduling strategy for each processor (the fraction of time spent computing and the fraction of time spent communicating with each neighbor). This result holds for a quite general framework, allowing for cycles and multiple paths in the interconnection graph, and allowing for several masters. Because
CCL: A Portable and Tunable Collective Communication Library for Scalable Parallel Computers
 IEEE Transactions on Parallel and Distributed Systems
, 1995
"... AbstractA collective communication library for parallel computers includes frequently used operations such as broadcast, reduce, scatter, gather, concatenate, synchronize, and shift. Such a library provides users with a convenient programming interface, efficient communication operations, and the a ..."
Abstract

Cited by 65 (7 self)
 Add to MetaCart
AbstractA collective communication library for parallel computers includes frequently used operations such as broadcast, reduce, scatter, gather, concatenate, synchronize, and shift. Such a library provides users with a convenient programming interface, efficient communication operations, and the advantage of portability. A library of this nature, the Collective Communication Library (CCL), intended for the line of scalable parallel amputer products by IBM, has been designed. CCL is pact of the parallel application programming interface of the recently announced IBM 9076 Scalable POWERparallel System 1 (SP1). In this paper, we examine several issues related to the functionality, correctness, and performance of a portable collective communication library while focusing on three novel aspects in the design and implementation of CCL: 1) the introduction of process groups, 2) the definition of semantics that ensures correctness, and 3) the design of new and tunable algorithms based on a realistic pointtopoint communication model. Index Terms Collective communication algorithms, collective communication semantics, messagepassing parallel systems, portable library, process group, tunable algorithms. I.
Nearest Neighbor Algorithms for Load Balancing in Parallel Computers
, 1995
"... With nearest neighbor load balancing algorithms, a processor makes balancing decisions based on localized workload information and manages workload migrations within its neighborhood. This paper compares a couple of fairly wellknown nearest neighbor algorithms, the dimensionexchange (DE, for shor ..."
Abstract

Cited by 19 (2 self)
 Add to MetaCart
With nearest neighbor load balancing algorithms, a processor makes balancing decisions based on localized workload information and manages workload migrations within its neighborhood. This paper compares a couple of fairly wellknown nearest neighbor algorithms, the dimensionexchange (DE, for short) and the diffusion (DF, for short) methods and their several variantsthe average dimensionexchange (ADE), the optimallytuned dimensionexchange (ODE), the local average diffusion (ADF) and the optimallytuned diffusion (ODF). The measures of interest are their efficiency in driving any initial workload distribution to a uniform distribution and their ability in controlling the growth of the variance among the processors' workloads. The comparison is made with respect to both oneport and allport communication architectures and in consideration of various implementation strategies including synchronous/asynchronous invocation policies and static/dynamic random workload behaviors. It t...
A PolynomialTime Algorithm for Allocating Independent Tasks on Heterogeneous ForkGraphs
, 2002
"... In this paper, we consider the problem of allocating a large number of independent, equalsized tasks to a heterogeneous processor farm. The master processor P 0 can process a task within w 0 timeunits; it communicates a task in d i timeunits to the ith slave P i , 1 i p, which requires w i ..."
Abstract

Cited by 15 (9 self)
 Add to MetaCart
In this paper, we consider the problem of allocating a large number of independent, equalsized tasks to a heterogeneous processor farm. The master processor P 0 can process a task within w 0 timeunits; it communicates a task in d i timeunits to the ith slave P i , 1 i p, which requires w i timeunits to process it. We assume communicationcomputation overlap capabilities for each slave (and for the master), but the communication medium is exclusive: the master can only communicate with a single slave at each timestep. We give a
Communication and Matrix Computations on Large Message Passing Systems
, 1990
"... This paper is concerned with the consequences for matrix computations of having a rather large number of general purpose processors, say ten or twenty thousand, connected in a network in such a way that a processor can communicate only with its immediate neighbors. Certain communication tasks associ ..."
Abstract

Cited by 14 (0 self)
 Add to MetaCart
This paper is concerned with the consequences for matrix computations of having a rather large number of general purpose processors, say ten or twenty thousand, connected in a network in such a way that a processor can communicate only with its immediate neighbors. Certain communication tasks associated with most matrix algorithms are defined and formulas developed for the time required to perform them under several communication regimes. The results are compared with the times for a nominal n
Computing Global Combine Operations in the MultiPort Postal Model
, 1996
"... Consider a messagepassing system of n processors, in which each processor holds one piece of data initially. The goal is to compute an associative and commutative reduction function on the n distributed pieces of data and to make the result known to all the n processors. This operation is frequent ..."
Abstract

Cited by 14 (0 self)
 Add to MetaCart
Consider a messagepassing system of n processors, in which each processor holds one piece of data initially. The goal is to compute an associative and commutative reduction function on the n distributed pieces of data and to make the result known to all the n processors. This operation is frequently used in many messagepassing systems and is typically referred to as global combine, census computation, or gossiping. This paper explores the problem of global combine in the multiport postal model for messagepassing systems. This model is characterized by three parameters: n  the number of processors, k  the number of ports per processor, and  the communication latency. In this model, in every round r, each processor can send k distinct messages to k other processors, and it can receive k messages that were sent out from k other processors \Gamma 1 rounds earlier. This paper provides an optimal algorithm for the global combine problem that requires the least number of comm...
An Analytical Comparison of Nearest Neighbor Algorithms for Load Balancing in Parallel Computers
, 1995
"... With nearest neighbor load balancing algorithms, a processor makes balancing decisions based on its local information and manages workload migrations within its neighborhood. This paper compares a couple of fairly wellknown nearest neighbor algorithms, the dimension exchange and the diffusion metho ..."
Abstract

Cited by 13 (2 self)
 Add to MetaCart
With nearest neighbor load balancing algorithms, a processor makes balancing decisions based on its local information and manages workload migrations within its neighborhood. This paper compares a couple of fairly wellknown nearest neighbor algorithms, the dimension exchange and the diffusion methods and their variants in terms of their performances in both oneport and allport communication architectures. It turns out that the dimension exchange method outperforms the diffusion method in the oneport communication model, and that the strength of the diffusion method is in asynchronous implementations in the allport communication model. The underlying communication networks considered assume the most popular topologies, the mesh and the torus and their special cases: the hypercube and the kary ncube. 1 Introduction Massively parallel computers have been shown to be very efficient at solving problems that can be partitioned into tasks with static computation and communication patt...
Scheduling strategies for mixed data and task parallelism on heterogeneous clusters and grids
, 2003
"... ..."
Communication Primitives for Unstructured Finite Element Simulations on Data Parallel Architectures
 Computing Systems in Engineering, 3(1  4):6372
, 1992
"... Efficient data motion is critical for high performance computing on distributed memory architectures. The value of some techniques for efficient data motion is illustrated by identifying generic communication primitives. Further, the efficiency of these primitives is demonstrated on three different ..."
Abstract

Cited by 9 (8 self)
 Add to MetaCart
Efficient data motion is critical for high performance computing on distributed memory architectures. The value of some techniques for efficient data motion is illustrated by identifying generic communication primitives. Further, the efficiency of these primitives is demonstrated on three different applications using the finite element method for unstructured grids and sparse solvers with different communication requirements. For the applications presented, the techniques advocated reduced the communication times by a factor of between 1.5  3. 1 Introduction The finite element method is a popular technique for solving boundary and initial value problems. Moderate sized engineering problems have been successfully simulated using this technique. The primary bottleneck for the simulation of large problems has been available computational resources. With the advent of massively parallel architectures, simulating significantly larger 1 Also affiliated with the Division of Applied Scien...