Results 1–10 of 17
Communication-Efficient Parallel Sorting
, 1996
Cited by 74 (5 self)
We study the problem of sorting n numbers on a p-processor bulk-synchronous parallel (BSP) computer, which is a parallel multicomputer that allows for general processor-to-processor communication rounds provided each processor sends and receives at most h items in any round. We provide parallel sorting methods that use internal computation time that is O((n log n)/p) and a number of communication rounds that is O(log n / log(h+1)) for h = Θ(n/p). The internal computation bound is optimal for any comparison-based sorting algorithm. Moreover, the number of communication rounds is bounded by a constant for the (practical) situations when p ≤ n^(1−1/c) for a constant c ≥ 1. In fact, we show that our bound on the number of communication rounds is asymptotically optimal for the full range of values for p, for we show that just computing the "or" of n bits distributed evenly to the first O(n/h) of an arbitrary number of processors in a BSP computer requires Ω(log n / log(h+1)) ...
Can a Shared-Memory Model Serve as a Bridging Model for Parallel Computation?
, 1999
Cited by 40 (12 self)
There has been a great deal of interest recently in the development of general-purpose bridging models for parallel computation. Models such as the BSP and LogP have been proposed as more realistic alternatives to the widely used PRAM model. The BSP and LogP models imply a rather different style for designing algorithms when compared with the PRAM model. Indeed, while many consider data parallelism a convenient style and the shared-memory abstraction an easy-to-use platform, the bandwidth limitations of current machines have diverted much attention to message-passing and distributed-memory models (such as the BSP and LogP) that account more properly for these limitations. In this paper we consider the question of whether a shared-memory model can serve as an effective bridging model for parallel computation. In particular, can a shared-memory model be as effective as, say, the BSP? As a candidate for a bridging model, we introduce the Queuing Shared-Memory (QSM) model, which accounts for limited communication bandwidth while still providing a simple shared-memory abstraction. We substantiate the ability of the QSM to serve as a bridging model by providing a simple work-preserving emulation of the QSM on both the BSP and on a related model, the (d, x)-BSP. We present evidence that the features of the QSM are essential to its effectiveness as a bridging model. In addition, we describe scenarios ...
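The QSM's accounting can be made concrete with a small cost function. The rule below is our reading of queuing shared-memory cost accounting, with parameter names of our choosing: a phase is charged the maximum of local work, g times the largest per-processor request count, and the worst queue contention at any location.

```python
from collections import Counter

def qsm_step_cost(local_ops, mem_requests, g=1):
    """Cost of one QSM phase under the accounting we assume here:
    the maximum of (a) the largest local-operation count on any
    processor, (b) g times the largest number of shared-memory
    requests by any one processor, and (c) the largest queue
    (contention) at any single shared-memory location.
    local_ops[i]   : local operations performed by processor i
    mem_requests[i]: list of shared locations processor i accesses"""
    work = max(local_ops)
    h = max(len(r) for r in mem_requests)
    contention = max(
        Counter(loc for r in mem_requests for loc in r).values())
    return max(work, g * h, contention)
```

For example, ten processors all reading one hot location pay a contention of 10 even though each issues a single request.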
Parallel Sorting With Limited Bandwidth
 in Proc. 7th ACM Symp. on Parallel Algorithms and Architectures
, 1995
Cited by 27 (5 self)
We study the problem of sorting on a parallel computer with limited communication bandwidth. By using the recently proposed PRAM(m) model, where p processors communicate through a small, globally shared memory consisting of m bits, we focus on the tradeoff between the amount of local computation and the amount of interprocessor communication required for parallel sorting algorithms. We prove a lower bound of Ω((n log m)/m) on the time to sort n numbers in an exclusive-read variant of the PRAM(m) model. We show that Leighton's Columnsort can be used to give an asymptotically matching upper bound in the case where m grows as a fractional power of n. The bounds are of a surprising form, in that they have little dependence on the parameter p. This implies that attempting to distribute the workload across more processors while holding the problem size and the size of the shared memory fixed will not improve the optimal running time of sorting in this model. We also show that bot ...
Parallel Balanced Allocations
 in Proceedings of the 8th Annual ACM Symposium on Parallel Algorithms and Architectures
, 1996
Cited by 26 (1 self)
We study the well-known problem of throwing m balls into n bins. If each ball in the sequential game is allowed to select more than one bin, the maximum load of the bins can be exponentially reduced compared to the `classical balls into bins' game. We consider a static and a dynamic variant of a randomized parallel allocation where each ball can choose a constant number of bins. All results hold with high probability. In the static case all m balls arrive at the same time. We analyze for m = n a very simple optimal class of protocols achieving maximum load O((log n / log log n)^(1/r)) if r rounds of communication are allowed. This matches the lower bound of [ACMR95]. Furthermore, we generalize the protocols to the case of m > n balls. An optimal load of O(m/n) can be achieved using (log log n)/log(m/n) rounds of communication. Hence, for m = n · (log log n)/(log log log n) balls this slackness allows us to hide the amount of communication. In the `classical balls into bins' game this op ...
Modeling parallel bandwidth: Local vs. global restrictions
Cited by 13 (4 self)
Recently there has been an increasing interest in models of parallel computation that account for the bandwidth limitations in communication networks. Some models (e.g., BSP and LogP) account for bandwidth limitations using a per-processor parameter g ≥ 1, such that each processor can send/receive at most h messages in g · h time. Other models (e.g., PRAM(m)) account for bandwidth limitations as an aggregate parameter m < p, such that the p processors can send at most m messages in total at each step. This paper provides the first detailed study of the algorithmic implications of modeling parallel bandwidth as a per-processor (local) limitation versus an aggregate (global) limitation. We consider a number of basic problems ...
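The distinction can be phrased as two tiny cost functions (parameter names ours; a sketch of the two accountings, not the paper's formal definitions):

```python
def local_cost(sends, g):
    """BSP/LogP-style local charge: a communication round in which
    processor i sends/receives sends[i] messages costs g * max(sends)."""
    return g * max(sends)

def aggregate_cost(sends, m):
    """PRAM(m)-style global charge: at most m messages cross the shared
    memory per step, so a round needs ceil(total / m) steps."""
    total = sum(sends)
    return -(-total // m)  # ceiling division
```

One processor sending 8 messages while the rest send none costs 8g under the local charge but only ⌈8/m⌉ under the global one; spreading the same 8 messages over 8 processors reverses the comparison, which is exactly the tradeoff the paper studies.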
New Coding Techniques for Improved Bandwidth Utilization
 In Proc. 37th IEEE Symp. on Foundations of Computer Science
, 1996
Cited by 11 (4 self)
In this paper, we introduce a new coding technique for transmitting the XOR of carefully selected patterns of bits to be communicated, which greatly reduces bandwidth requirements in some settings. This technique has broader applications. For example, we demonstrate that the coding technique has a surprising application to a simple I/O (Input/Output) complexity problem related to finding the transpose of a matrix. Our main results are developed in the PRAM(m) model, a limited-bandwidth PRAM model where p processors communicate through a small globally shared memory of m bits. We provide new algorithms for the problems of sorting and permutation routing. For the concurrent-read PRAM(m), as p grows with m ...
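A minimal instance of the idea (our own toy example, far simpler than the bit patterns selected in the paper): if receiver j needs bit j but already holds every other bit, one broadcast of the XOR of all the bits replaces sending each bit individually.

```python
def xor_broadcast(bits):
    """Deliver bit j to receiver j using a single broadcast word,
    assuming receiver j already knows bits[i] for every i != j.
    Returns the list of bits the receivers recover."""
    coded = 0
    for b in bits:          # the one transmitted value
        coded ^= b
    recovered = []
    for j in range(len(bits)):
        known = 0
        for i, b in enumerate(bits):
            if i != j:
                known ^= b  # receiver j's side information
        recovered.append(coded ^ known)
    return recovered
```

One word on the channel replaces len(bits) point-to-point messages, at the price of assuming the receivers' side information.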
A General-Purpose Shared-Memory Model for Parallel Computation
, 1997
Cited by 8 (3 self)
We describe a general-purpose shared-memory model for parallel computation, called the QSM [21], which provides a high-level shared-memory abstraction for parallel algorithm design, as well as the ability to be emulated in an effective manner on the BSP, a lower-level, distributed-memory model. We present new emulation results that show that very little generality is lost by not having a `gap parameter' at memory.
Parallel Algorithms for Database Operations and a Database Operation for Parallel Algorithms
, 1995
Cited by 4 (0 self)
This paper establishes some significant links between two areas: (i) relational parallel database systems; and (ii) the design and analysis of parallel algorithms. The paper begins with a fundamental but very simple observation: implementing a Join operation in the context of relational parallel database systems is at least as expensive as implementing an arbitrary PRAM computation. Thus, the efficiency with which a given parallel computer can support a parallel relational database where Joins are fairly frequent is strongly related to the efficiency with which that computer can support the PRAM as one of its programmer's models. The main technical contribution is an efficient parallel algorithm for the Join operation on a model where, in order to use the available bandwidth effectively, communication has to be performed in large blocks. A key performance bottleneck for various database applications on serial computers has been high latency and low bandwidth while ...
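For readers unfamiliar with the operation at the center of this link, an equi-join can be sketched in a few lines. This is a textbook serial hash join, not the paper's block-communication algorithm:

```python
from collections import defaultdict

def hash_join(r, s, r_key, s_key):
    """Equi-join relations r and s (lists of tuples) on the given
    key positions, via a hash index built over r."""
    index = defaultdict(list)
    for t in r:
        index[t[r_key]].append(t)
    return [rt + st for st in s for rt in index[st[s_key]]]
```

In a parallel setting, every matching pair of tuples has to be routed between processors, which is where the abstract's PRAM-hardness observation bites.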
Communication-Processor Tradeoffs in Limited Resources PRAM
Cited by 2 (0 self)
We consider a simple restriction of the PRAM model (called P-PRAM), where the input is arbitrarily partitioned between a fixed set of p processors and the shared memory is restricted to m cells. This model allows for investigating the tradeoffs/bottlenecks with respect to the communication bandwidth (modeled by the shared-memory size m) and the number of processors p. It is quite simple and allows the design of optimal algorithms without losing the effect of communication bottlenecks. We have focused on the P-PRAM complexity of problems that have O(n) sequential solutions (where n is the input size), and where m ≤ p ≤ n. We show tight time bounds in this model for several problems such as summing, Boolean threshold, routing, list reversal and k-selection. We typically get one of two sorts of complexity behaviors for these problems: either Õ(n/p + p/m), which means that the time scales with the number of processors and with the memory size (in the appropriate range) but not with both, or Õ(n/m), which does not scale with p and reflects a communication bottleneck (as long as m < p). We are not aware of any problem whose complexity scales with both p and m (e.g., O(n/(p·m))). This might explain why in actual implementations ...
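Ignoring logarithmic factors, the Õ(n/p + p/m) behaviour can be tabulated with a two-term model. This is our back-of-the-envelope reading, not the paper's exact bound: n/p local additions, then the p partial sums funnelled through the m shared cells at m per step.

```python
import math

def ppram_sum_steps(n, p, m):
    """Rough step count for summing n values on the restricted PRAM:
    ceil(n/p) local work plus ceil(p/m) steps to push the p partial
    sums through the m shared cells (log factors ignored)."""
    return math.ceil(n / p) + math.ceil(p / m)
```

For n = 10^6 and m = 10: p = 100 gives 10010 steps, p = 1000 gives 1100, and p = 10^5 gives 10010 again; past a point, extra processors only congest the shared memory, which is the non-scaling behaviour described above.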
Compression using efficient multicasting
 in Proceedings of the Thirty-Second Annual ACM Symposium on Theory of Computing
, 2000
Cited by 2 (0 self)
Many multiprocessor systems have the ability to broadcast and/or multicast information efficiently. However, this ability is often overlooked when designing algorithms for these systems. In this paper, we introduce a new compression technique that uses efficient multicasting to significantly reduce the amount of information communicated during parallel and distributed computation, resulting in significantly faster algorithms for Fast Fourier Transforms and sorting on shared-memory parallel models with limited bandwidth. These algorithms demonstrate the importance of taking advantage of efficient multicasting. The compression technique uses a new, natural variant of Ramsey theory, which may be of independent interest.