Results 1 - 9 of 9
Provably efficient scheduling for languages with fine-grained parallelism
 IN PROC. SYMPOSIUM ON PARALLEL ALGORITHMS AND ARCHITECTURES
, 1995
Abstract

Cited by 84 (24 self)
Many high-level parallel programming languages allow for fine-grained parallelism. As in the popular work-time framework for parallel algorithm design, programs written in such languages can express the full parallelism in the program without specifying the mapping of program tasks to processors. A common concern in executing such programs is to schedule tasks to processors dynamically so as to minimize not only the execution time, but also the amount of space (memory) needed. Without careful scheduling, the parallel execution on p processors can use a factor of p or more additional space compared with a sequential implementation of the same program. This paper first identifies a class of parallel schedules that are provably efficient in both time and space. For any
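The space blowup the abstract warns about can be seen in a toy model (not the paper's schedule class): unfolding a complete binary task tree breadth-first keeps a whole level of tasks live at once, while a depth-first order keeps only one root-to-leaf path live. The function below is a hypothetical illustration, counting peak live tasks under each order.

```python
from collections import deque

def schedule_space(num_levels, order):
    """Peak number of live (spawned, unfinished) tasks when unfolding a
    complete binary task tree; a made-up model, only meant to illustrate
    why schedule order matters for space."""
    total = 2 ** num_levels - 1        # nodes in a complete binary tree
    frontier = deque([0])              # node ids, root = 0
    peak = 0
    while frontier:
        peak = max(peak, len(frontier))
        # bfs takes the oldest spawned task, dfs the newest
        node = frontier.popleft() if order == "bfs" else frontier.pop()
        child = 2 * node + 1
        if child < total:              # internal node: spawn two children
            frontier.append(child)
            frontier.append(child + 1)
    return peak

print(schedule_space(10, "dfs"))   # 10: proportional to tree depth
print(schedule_space(10, "bfs"))   # 512: proportional to number of leaves
```

With 10 levels, the depth-first order peaks at 10 live tasks versus 512 for breadth-first, a gap that grows like the factor-of-p space overhead the paper's scheduling results are designed to avoid.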
Shared Memory Simulations with Triple-Logarithmic Delay (Extended Abstract)
, 1995
Abstract

Cited by 21 (4 self)
Artur Czumaj (1), Friedhelm Meyer auf der Heide (2), and Volker Stemann (1). (1) Heinz Nixdorf Institute, University of Paderborn, D-33095 Paderborn, Germany; (2) Heinz Nixdorf Institute and Department of Computer Science, University of Paderborn, D-33095 Paderborn, Germany. Abstract. We consider the problem of simulating a PRAM on a distributed memory machine (DMM). Our main result is a randomized algorithm that simulates each step of an n-processor CRCW PRAM on an n-processor DMM with O(log log log n · log* n) delay, with high probability. This is an exponential improvement on all previously known simulations. It can be extended to a simulation of an (n log log log n · log* n)-processor EREW PRAM on an n-processor DMM with optimal delay O(log log log n · log* n), with high probability. Finally, a lower bound of Ω(log log log n / log log log log n) expected time is proved for a large class of randomized simulations that includes all known simulations.
Modeling parallel bandwidth: Local vs. global restrictions
Abstract

Cited by 15 (4 self)
Recently there has been an increasing interest in models of parallel computation that account for the bandwidth limitations in communication networks. Some models (e.g., bsp and logp) account for bandwidth limitations using a per-processor parameter g > 1, such that each processor can send/receive at most h messages in g·h time. Other models (e.g., pram(m)) account for bandwidth limitations as an aggregate parameter m < p, such that the p processors can send at most m messages in total at each step. This paper provides the first detailed study of the algorithmic implications of modeling parallel bandwidth as a per-processor (local) limitation versus an aggregate (global) limitation. We consider a number of basic problems
CONTENTION RESOLUTION IN HASHING BASED SHARED MEMORY SIMULATIONS
, 2000
Abstract

Cited by 9 (3 self)
In this paper we study the problem of simulating shared memory on the distributed memory machine (DMM). Our approach uses multiple copies of shared memory cells, distributed among the memory modules of the DMM via universal hashing. The main aim is to design strategies that resolve contention at the memory modules. Extending results and methods from random graphs and very fast randomized algorithms, we present new simulation techniques that enable us to improve the previously best results exponentially. In particular, we show that an n-processor CRCW PRAM can be simulated by an n-processor DMM with delay O(log log log n · log* n), with high probability. Next we describe a general technique that can be used to turn these simulations into time-processor optimal ones, in the case of EREW PRAMs to be simulated. We obtain a time-processor optimal simulation of an (n log log log n · log* n)-processor EREW PRAM on an n-processor DMM with delay O(log log log n · log* n), with high probability. When an (n log log log n · log* n)-processor CRCW PRAM is simulated, the delay is only by a log* n factor larger. We further demonstrate that the simulations presented cannot be significantly improved using our techniques. We show an Ω(log log log n / log log log log n) lower bound on the expected delay for a class of PRAM simulations, called topological simulations, that covers all previously known simulations as well as the simulations presented in the paper.
Simple Fast Parallel Hashing by Oblivious Execution
 AT&T Bell Laboratories
, 1994
Abstract

Cited by 4 (2 self)
A hash table is a representation of a set in a linear-size data structure that supports constant-time membership queries. We show how to construct a hash table for any given set of n keys in O(lg lg n) parallel time with high probability, using n processors on a weak version of a crcw pram. Our algorithm uses a novel approach of hashing by "oblivious execution" based on probabilistic analysis to circumvent the parity lower-bound barrier at the near-logarithmic time level. The algorithm is simple and is sketched by the following:
1. Partition the input set into buckets by a random polynomial of constant degree.
2. For t := 1 to O(lg lg n) do
   (a) Allocate M_t memory blocks, each of size K_t.
   (b) Let each bucket select a block at random, and try to injectively map its keys into the block using a random linear function. Buckets that fail carry on to the next iteration.
The crux of the algorithm is a careful a priori selection of the parameters M_t and K_t. The algorithm uses only O(lg lg...
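The two steps sketched in the abstract can be written down directly. The sketch below follows that outline sequentially (the real algorithm runs the buckets in parallel), and the schedules for M_t and K_t are made-up placeholders; the paper's analysis chooses them a priori and carefully.

```python
import random

def build_hash_table(keys):
    """Sequential sketch of hashing by 'oblivious execution'."""
    n = len(keys)
    p = 2**31 - 1                     # prime larger than any key
    # Step 1: partition keys into buckets with a random degree-2 polynomial
    a0, a1, a2 = (random.randrange(1, p) for _ in range(3))
    buckets = {}
    for k in keys:
        h = (a2 * k * k + a1 * k + a0) % p % n
        buckets.setdefault(h, []).append(k)

    table = {}                        # key -> ((round, block id), slot)
    pending = list(buckets.values())
    t = 0
    while pending:                    # Step 2: O(lg lg n) rounds in the paper
        t += 1
        M_t = max(1, 2 * len(pending))                 # hypothetical schedule
        K_t = 4 * max(len(b) for b in pending) ** 2    # hypothetical schedule
        still_pending = []
        for bucket in pending:
            block = random.randrange(M_t)              # pick a block at random
            u, v = random.randrange(1, p), random.randrange(p)
            slots = {(u * k + v) % p % K_t for k in bucket}
            if len(slots) == len(bucket):  # random linear map was injective
                for k in bucket:
                    table[k] = ((t, block), (u * k + v) % p % K_t)
            else:                          # collision: retry next iteration
                still_pending.append(bucket)
        pending = still_pending
    return table

random.seed(1)
keys = random.sample(range(10**6), 500)
table = build_hash_table(keys)
assert set(table) == set(keys)        # every key has a placement
```

With K_t about four times the squared bucket size, a birthday-bound argument makes each bucket's linear map injective with constant probability, so the number of surviving buckets shrinks quickly from round to round.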
Fast, Efficient Mutual and Self Simulations for Shared Memory and Reconfigurable Mesh
 in Proceedings of the 7th IEEE Symposium on Parallel and Distributed Processing
, 1995
Abstract

Cited by 2 (0 self)
This paper studies relations between the parallel random access machine (pram) model and the reconfigurable mesh (rmesh) model, by providing mutual simulations between the models. We present an algorithm simulating one step of an (n lg lg n)-processor crcw pram on an n × n rmesh with delay O(lg lg n) with high probability. We use our pram simulation to obtain the first efficient self-simulation algorithm of an rmesh with general switches: an algorithm running on an n × n rmesh is simulated on a p × p rmesh with delay O((n/p)² + lg n lg lg p) with high probability, which is optimal for all p ≤ n/√(lg n lg lg n). Finally, we consider the simulation of the rmesh on the pram. We show that a 2 × n rmesh can be optimally simulated on a crcw pram in Θ(α(n)) time, where α(·) is the slow-growing inverse Ackermann function. In contrast, a pram with a polynomial number of processors cannot simulate the 3 × n rmesh in less than Ω(lg n / lg lg n) ...
Modeling Parallel Bandwidth: Local versus Global Restrictions
 ALGORITHMICA
, 1999
Abstract

Cited by 1 (0 self)
Recently there has been an increasing interest in models of parallel computation that account for the bandwidth limitations in communication networks. Some models (e.g., BSP, LOGP, and QSM) account for bandwidth limitations using a per-processor parameter g > 1, such that each processor can send/receive at most h messages in g·h time. Other models (e.g., PRAM(m)) account for bandwidth limitations as an aggregate parameter m < p, such that the p processors can send at most m messages in total at each step. This paper provides the first detailed study of the algorithmic implications of modeling parallel bandwidth as a per-processor (local) limitation versus an aggregate (global) limitation. We consider a number of basic problems such as broadcasting, parity, summation, and sorting, and give several new upper and lower time bounds that demonstrate the advantage of globally limited models over locally limited models given the same aggregate bandwidth (i.e., p·1/g = m). In general, globally limited models have a possible advantage whenever there is an imbalance in the number of messages sent/received by the processors. To exploit this advantage, the processors must schedule the sending of messages so as to respect the aggregate bandwidth limit. We present a new parallel scheduling algorithm for globally limited models that enables an unknown, arbitrarily unbalanced set of messages to be sent through the limited bandwidth within a (1+ε) factor of the optimal offline schedule with high probability, even if the penalty for overloading the network is an exponential function of the overload. We also
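The imbalance advantage described in the abstract reduces to simple arithmetic on a single communication step. The sketch below compares the two cost models on a balanced and an unbalanced message pattern with the same aggregate bandwidth (p·1/g = m); the cost formulas follow the abstract's definitions, and the idealized global cost assumes a perfect schedule.

```python
from math import ceil

def local_time(msgs_per_proc, g):
    # Locally limited (BSP/LOGP style): h messages at one processor cost
    # g*h time, so a step is paced by the busiest processor.
    return g * max(msgs_per_proc)

def global_time(msgs_per_proc, m):
    # Globally limited (PRAM(m) style): at most m messages in total per
    # step, so a perfectly scheduled step costs ceil(total/m).
    return ceil(sum(msgs_per_proc) / m)

p, g = 8, 4
m = p // g                           # same aggregate bandwidth: m = p/g = 2
balanced = [4] * p                   # each processor sends 4 messages
unbalanced = [32] + [0] * (p - 1)    # one processor sends all 32

print(local_time(balanced, g), global_time(balanced, m))      # 16 16
print(local_time(unbalanced, g), global_time(unbalanced, m))  # 128 16
```

With balanced traffic the two models agree (16 time units for 32 messages); with all traffic at one processor, the local model pays g·32 = 128 while the global model still finishes in 16, which is exactly the gap the scheduling algorithm is designed to realize online.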
SIMPLE FAST PARALLEL HASHING BY OBLIVIOUS EXECUTION
Abstract
Abstract. A hash table is a representation of a set in a linear-size data structure that supports constant-time membership queries. We show how to construct a hash table for any given set of n keys in O(lg lg n) parallel time with high probability, using n processors on a weak version of a concurrent-read concurrent-write parallel random access machine (crcw pram). Our algorithm uses a novel approach of hashing by "oblivious execution" based on probabilistic analysis. The algorithm is simple and has the following structure:
1. Partition the input set into buckets by a random polynomial of constant degree.
2. For t := 1 to O(lg lg n) do
   (a) Allocate M_t memory blocks, each of size K_t.
   (b) Let each bucket select a block at random, and try to injectively map its keys into the block using a random linear function. Buckets that fail carry on to the next iteration.
The crux of the algorithm is a careful a priori selection of the parameters M_t and K_t. The algorithm uses only O(lg lg n) random words and can be implemented in a work-efficient manner.