Results 1-10 of 21
Scheduling Strategies for Master-Slave Tasking on Heterogeneous Processor Grids
, 2002
Abstract

Cited by 102 (33 self)
In this paper, we consider the problem of allocating a large number of independent, equal-sized tasks to a heterogeneous "grid" computing platform. We use a non-oriented graph to model a grid, where resources can have different speeds of computation and communication, as well as different overlap capabilities. We show how to determine the optimal steady-state scheduling strategy for each processor (the fraction of time spent computing and the fraction of time spent communicating with each neighbor). This result holds for a quite general framework, allowing for cycles and multiple paths in the interconnection graph, and allowing for several masters.
Efficient Algorithms for All-to-All Communications in Multiport Message-Passing Systems
 IEEE Transactions on Parallel and Distributed Systems
, 1997
Abstract

Cited by 101 (0 self)
We present efficient algorithms for two all-to-all communication operations in message-passing systems: index (or all-to-all personalized communication) and concatenation (or all-to-all broadcast). We assume a model of a fully connected message-passing system, in which the performance of any point-to-point communication is independent of the sender-receiver pair. We also assume that each processor has k ≥ 1 ports, through which it can send and receive k messages in every communication round. The complexity measures we use are independent of the particular system topology and are based on the communication startup time and on the communication bandwidth. In the index operation among n processors, initially each processor has n blocks of data, and the goal is to exchange the i-th block of processor j with the j-th block of processor i. We present a class of index algorithms that is designed for all values of n and that features a tradeoff between the communication startup time and the data transfer time. This class of algorithms includes two special cases: an algorithm that is optimal with respect to the measure of the startup time, and an algorithm that is optimal with respect to the measure of the data transfer time. We also present experimental results featuring the performance tunability of our index algorithms on the IBM SP1 parallel system. In the concatenation operation among n processors, initially each processor has one block of data, and the goal is to concatenate the n blocks of data from the n processors and to make the concatenation result known to all the processors. We present a concatenation algorithm that is optimal, for most values of n, in the number of communication rounds and in the amount of data transferred.
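The index operation described in this abstract amounts to a transpose of the block matrix: the i-th block of processor j ends up as the j-th block of processor i. A minimal sketch, with plain Python lists standing in for distributed memory (the names are illustrative, not from the paper, and no communication schedule is modeled):

```python
def index_op(blocks):
    """All-to-all personalized exchange ('index' operation).

    blocks[j][i] is the i-th block held by processor j before the
    exchange; afterwards processor i holds it as its j-th block,
    i.e. the result is the transpose of the block matrix.
    """
    n = len(blocks)
    return [[blocks[j][i] for j in range(n)] for i in range(n)]

# Each block is labeled "src->dst" so the exchange is easy to check:
# after the operation, processor dst holds "src->dst" in slot src.
before = [[f"{j}->{i}" for i in range(4)] for j in range(4)]
after = index_op(before)
```

The algorithms in the paper realize this same data movement in a bounded number of communication rounds; the sketch only captures the input/output relation they must satisfy.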
CCL: A Portable and Tunable Collective Communication Library for Scalable Parallel Computers
 IEEE Transactions on Parallel and Distributed Systems
, 1995
Abstract

Cited by 68 (8 self)
A collective communication library for parallel computers includes frequently used operations such as broadcast, reduce, scatter, gather, concatenate, synchronize, and shift. Such a library provides users with a convenient programming interface, efficient communication operations, and the advantage of portability. A library of this nature, the Collective Communication Library (CCL), intended for the line of scalable parallel computer products by IBM, has been designed. CCL is part of the parallel application programming interface of the recently announced IBM 9076 Scalable POWERparallel System 1 (SP1). In this paper, we examine several issues related to the functionality, correctness, and performance of a portable collective communication library while focusing on three novel aspects in the design and implementation of CCL: 1) the introduction of process groups, 2) the definition of semantics that ensures correctness, and 3) the design of new and tunable algorithms based on a realistic point-to-point communication model. Index Terms: Collective communication algorithms, collective communication semantics, message-passing parallel systems, portable library, process group, tunable algorithms.
On the Design and Implementation of Broadcast and Global Combine Operations Using the Postal Model
 IEEE Transactions on Parallel and Distributed Systems
, 1996
Abstract

Cited by 22 (1 self)
There are a number of models that were proposed in recent years for message-passing parallel systems. Examples are the postal model and its generalization, the LogP model. In the postal model a parameter h is used to model the communication latency of the message-passing system. Each node during each round can send a fixed-size message and, simultaneously, receive a message of the same size. Furthermore, a message sent out during round r will incur a latency of h and will arrive at the receiving node at round r + h - 1. Our goal in this paper is to bridge the gap between theoretical modeling and practical implementation. In particular, we investigate a number of practical issues related to the design and implementation of two collective communication operations, namely, the broadcast operation and the global combine operation. These practical issues include, for example, 1) techniques for measuring the value of h on a given machine, 2) creating efficient broadcast algorithms that take the latency h and the number of nodes n as parameters, and 3) creating efficient global combine algorithms for parallel machines with h which is not an integer. We propose solutions that address these practical issues and present results of an experimental study of the new algorithms on the Intel Delta machine. Our main conclusion is that the postal model can help in performance prediction and tuning; for example, a properly tuned broadcast improves on the known implementation by more than 20%. Index Terms: Broadcast, global combine, postal model, complete graph, collective communication
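The postal-model timing described above (a node can start one send per round; a send started at round r lands at round r + h - 1) determines the maximum number of nodes a broadcast can reach after t rounds. A hedged sketch of that reachability recurrence for integer h only; the non-integer h case that this paper also handles is not covered here:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def postal_reach(t, h):
    """Max nodes informed after t broadcast rounds in the postal model
    with integer latency h.  Every informed node starts a fresh send
    each round, and a send started at round r arrives at round r + h - 1:
        P(t) = 1                    for t < h  (first send still in flight)
        P(t) = P(t-1) + P(t-h)      otherwise
    """
    if t < h:
        return 1
    return postal_reach(t - 1, h) + postal_reach(t - h, h)

# With h = 1 this collapses to recursive doubling (2**t nodes after
# t rounds); with h = 2 it grows like the Fibonacci numbers.
```

Inverting this bound (the smallest t with P(t) >= n) gives the round count a latency-aware broadcast tree should aim for on n nodes.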
Nearest Neighbor Algorithms for Load Balancing in Parallel Computers
, 1995
Abstract

Cited by 22 (2 self)
With nearest neighbor load balancing algorithms, a processor makes balancing decisions based on localized workload information and manages workload migrations within its neighborhood. This paper compares two fairly well-known nearest neighbor algorithms, the dimension-exchange (DE, for short) and the diffusion (DF, for short) methods, and several of their variants: the average dimension-exchange (ADE), the optimally-tuned dimension-exchange (ODE), the local average diffusion (ADF), and the optimally-tuned diffusion (ODF). The measures of interest are their efficiency in driving any initial workload distribution to a uniform distribution and their ability to control the growth of the variance among the processors' workloads. The comparison is made with respect to both one-port and all-port communication architectures, and in consideration of various implementation strategies including synchronous/asynchronous invocation policies and static/dynamic random workload behaviors.
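The two families being compared can each be sketched in a few lines. This is an illustrative simulation, not the paper's implementation: the exchange parameters used here (1/2 for DE, a uniform alpha for DF) are the simplest choices, not the optimally tuned ones the paper analyzes:

```python
def diffusion_step(load, neighbors, alpha=0.25):
    """One synchronous diffusion (DF) round: each processor moves the
    fraction alpha of every pairwise load difference across each of
    its edges.  Total load is conserved because each edge's transfer
    appears with opposite signs at its two endpoints."""
    return [load[i] + sum(alpha * (load[j] - load[i]) for j in neighbors[i])
            for i in range(len(load))]

def dimension_exchange(load):
    """One dimension-exchange (DE) sweep on a hypercube of 2**k
    processors: in dimension d, node i pairs with node i XOR 2**d and
    the two average their loads.  With exchange parameter 1/2 a single
    sweep over all k dimensions balances the hypercube exactly."""
    load = list(load)
    n = len(load)
    for d in range(n.bit_length() - 1):
        for i in range(n):
            j = i ^ (1 << d)
            if i < j:
                load[i] = load[j] = (load[i] + load[j]) / 2
    return load
```

DE serializes one pairwise exchange at a time, which suits the one-port model; DF touches all neighbors in one round, which is why its strength shows in all-port architectures.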
A Polynomial-Time Algorithm for Allocating Independent Tasks on Heterogeneous Fork-Graphs
, 2002
Abstract

Cited by 19 (11 self)
In this paper, we consider the problem of allocating a large number of independent, equal-sized tasks to a heterogeneous processor farm. The master processor P_0 can process a task within w_0 time units; it communicates a task in d_i time units to the i-th slave P_i, 1 ≤ i ≤ p, which requires w_i time units to process it. We assume communication-computation overlap capabilities for each slave (and for the master), but the communication medium is exclusive: the master can only communicate with a single slave at each time step.
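Under the one-port, full-overlap assumptions above, the steady-state throughput of the star can be sketched as a fractional-knapsack allocation of bus time: slave P_i consumes at most 1/w_i tasks per time unit, the exclusive bus imposes sum(d_i * alpha_i) <= 1, and granting bus time to the slaves with the smallest d_i first maximizes the total rate. This is the bandwidth-centric ordering rule from the steady-state scheduling literature, offered here as a hedged sketch (it ignores the master's own processing rate 1/w_0), not as this paper's algorithm:

```python
def steady_state_throughput(d, w):
    """Tasks per time unit sustainable by a one-port master-slave star.

    d[i]: time for the master to send one task to slave i.
    w[i]: time for slave i to process one task.
    Bus time is granted to cheap-to-feed slaves (smallest d[i]) first,
    each capped at its own compute rate 1/w[i].
    """
    bus = 1.0                       # fraction of bus time free per time unit
    total = 0.0
    for di, wi in sorted(zip(d, w)):
        rate = min(1.0 / wi, bus / di)   # tasks/unit this slave can receive
        total += rate
        bus -= rate * di
    return total
```

For example, with d = [1, 2] and w = [2, 2] the first slave saturates its compute rate 1/2 using half the bus, and the remaining bus time feeds the second slave at rate 1/4, for a total of 3/4 task per time unit.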
Communication and Matrix Computations on Large Message Passing Systems
, 1990
Abstract

Cited by 16 (0 self)
This paper is concerned with the consequences for matrix computations of having a rather large number of general purpose processors, say ten or twenty thousand, connected in a network in such a way that a processor can communicate only with its immediate neighbors. Certain communication tasks associated with most matrix algorithms are defined and formulas developed for the time required to perform them under several communication regimes. The results are compared with the times for a nominal n³
Computing Global Combine Operations in the Multiport Postal Model
, 1996
Abstract

Cited by 14 (1 self)
Consider a message-passing system of n processors, in which each processor holds one piece of data initially. The goal is to compute an associative and commutative reduction function on the n distributed pieces of data and to make the result known to all n processors. This operation is frequently used in many message-passing systems and is typically referred to as global combine, census computation, or gossiping. This paper explores the problem of global combine in the multiport postal model for message-passing systems. This model is characterized by three parameters: n, the number of processors; k, the number of ports per processor; and λ, the communication latency. In this model, in every round r, each processor can send k distinct messages to k other processors, and it can receive k messages that were sent out from k other processors λ - 1 rounds earlier. This paper provides an optimal algorithm for the global combine problem that requires the least number of communication rounds.
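The single-port postal recurrence generalizes naturally to the k-port model described above: an informed node can start k sends per round, and a send started at round r is received at round r + λ - 1. A hedged sketch of the resulting reachability bound for integer λ (illustrative only; the paper's optimal combine algorithm is more involved than counting reachable nodes):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def multiport_reach(t, k, lam):
    """Upper bound on the number of processors that can hold a value
    after t rounds in the k-port postal model with integer latency lam:
        P(t) = 1                       for t < lam
        P(t) = P(t-1) + k * P(t-lam)   otherwise
    (the k sends started lam rounds ago land now, joining everyone
    informed a round earlier).  k = 1 recovers the single-port model.
    """
    if t < lam:
        return 1
    return multiport_reach(t - 1, k, lam) + k * multiport_reach(t - lam, k, lam)
```

A lower bound on the rounds needed for a global combine among n processors follows from the smallest t with multiport_reach(t, k, lam) >= n, since the result must be able to reach every processor.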
An Analytical Comparison of Nearest Neighbor Algorithms for Load Balancing in Parallel Computers
, 1995
Abstract

Cited by 13 (2 self)
With nearest neighbor load balancing algorithms, a processor makes balancing decisions based on its local information and manages workload migrations within its neighborhood. This paper compares two fairly well-known nearest neighbor algorithms, the dimension exchange and the diffusion methods, and their variants in terms of their performance in both one-port and all-port communication architectures. It turns out that the dimension exchange method outperforms the diffusion method in the one-port communication model, and that the strength of the diffusion method is in asynchronous implementations in the all-port communication model. The underlying communication networks considered assume the most popular topologies, the mesh and the torus, and their special cases: the hypercube and the k-ary n-cube.