Efficient PRAM Simulation on a Distributed Memory Machine
 In Proceedings of the Twenty-Fourth ACM Symposium on Theory of Computing
, 1992
Cited by 17 (0 self)
We present algorithms for the randomized simulation of a shared memory machine (PRAM) on a Distributed Memory Machine (DMM). In a PRAM, memory conflicts occur only through concurrent access to the same cell, whereas the memory of a DMM is divided into modules, one for each processor, and concurrent accesses to the same module create a conflict. The delay of a simulation is the time needed to simulate a parallel memory access of the PRAM. Any general simulation of an m-processor PRAM on an n-processor DMM will necessarily have delay at least m/n. A randomized simulation is called time-processor optimal if the delay is O(m/n) with high probability. Using a novel simulation scheme based on hashing we obtain a time-processor optimal simulation with delay O(log log(n) log*(n)). The best previous simulations use a simpler scheme based on hashing and have much larger delay: Θ(log(n)/log log(n)) for the simulation of an n-processor PRAM on an n-processor DMM, and Θ(log(n)) in the case ...
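The difference between cell conflicts and module conflicts described in this abstract can be illustrated with a small sketch. It is not the simulation scheme of the paper: the hash family h(x) = ((a*x + b) mod prime) mod n_modules and the names `module_loads` and `contention` are assumptions chosen only for illustration. The sketch counts how many distinct requested cells fall into the same module, which is the quantity a DMM has to serialize.

```python
from collections import Counter


def module_loads(addresses, n_modules, a, b, prime):
    """Count the distinct requested cells that land in each module.

    Cells are placed with the illustrative (assumed) hash
    h(x) = ((a * x + b) mod prime) mod n_modules.
    """
    return Counter(((a * x + b) % prime) % n_modules for x in set(addresses))


def contention(addresses, n_modules, a, b, prime):
    """Largest number of distinct requested cells sharing one module."""
    loads = module_loads(addresses, n_modules, a, b, prime)
    return max(loads.values(), default=0)


# Four processors request the distinct cells 0, 4, 8, 12 on a 4-module DMM.
# On a PRAM these accesses do not conflict; on a DMM they may.
requests = [0, 4, 8, 12]
interleaved = contention(requests, 4, a=1, b=0, prime=13)  # all in module 0
hashed = contention(requests, 4, a=5, b=2, prime=13)       # spread out
```

With the plain interleaved placement all four cells share module 0, so the step costs four module accesses; with the second choice of hash parameters the worst module holds two cells.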
Approximate and Exact Deterministic Parallel Selection
, 1993
Cited by 15 (3 self)
The selection problem of size n is, given a set of n elements drawn from an ordered universe and an integer k with 1 ≤ k ≤ n, to identify the kth smallest element in the set. We study approximate and exact selection on deterministic concurrent-read concurrent-write parallel RAMs, where approximate selection with relative accuracy λ > 0 asks for any element whose true rank differs from k by at most λn. Our main results are: (1) Exact selection problems of size n can be solved in O(log n/log log n) time with O(n log log n/log n) processors. This running time is the best possible (using only a polynomial number of processors), and the number of processors is optimal for the given running time (optimal speedup); the best previous algorithm achieves optimal speedup with a running time of O(log n log* n/log log n). (2) For all t ≥ (log log n)^4 log* n, approximate selection problems of size n can be solved in O(t) time with optimal speedup with relative accuracy 2^(−t log log log n/(log log n) ...
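The two problem definitions in this abstract can be made concrete with a sequential reference sketch. It is only a specification check, not the parallel algorithm of the paper; the function names and the parameter name `lam` (standing for the relative accuracy) are assumptions introduced here.

```python
def exact_select(items, k):
    """Return the k-th smallest element (1 <= k <= n) by sorting."""
    if not 1 <= k <= len(items):
        raise ValueError("k must satisfy 1 <= k <= n")
    return sorted(items)[k - 1]


def is_approximate_answer(items, k, candidate, lam):
    """Check whether `candidate` has a rank within lam * n of k."""
    n = len(items)
    smaller = sum(1 for x in items if x < candidate)
    equal = sum(1 for x in items if x == candidate)
    if equal == 0:
        return False  # the answer must be an element of the input
    # With duplicates, the candidate occupies ranks smaller+1 .. smaller+equal.
    low, high = smaller + 1, smaller + equal
    distance = 0 if low <= k <= high else min(abs(low - k), abs(high - k))
    return distance <= lam * n


data = [7, 1, 9, 4, 3, 8, 2, 6, 5, 10]
third_smallest = exact_select(data, 3)
```

For this input, with k = 3 and lam = 0.25, any element of rank 1 to 5 is an acceptable approximate answer, since its rank differs from 3 by at most 2.5.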
Optimal Deterministic Approximate Parallel Prefix Sums and Their Applications
 In Proc. Israel Symp. on Theory of Computing and Systems (ISTCS'95)
, 1995
Cited by 14 (0 self)
We show that extremely accurate approximation to the prefix sums of a sequence of n integers can be computed deterministically in O(log log n) time using O(n/log log n) processors in the Common CRCW PRAM model. This complements randomized approximation methods obtained recently by Goodrich, Matias and Vishkin and improves previous deterministic results obtained by Hagerup and Raman. Furthermore, our results completely match a lower bound obtained recently by Chaudhuri. Our results have many applications. Using them we improve upon the best known time bounds for deterministic approximate selection and for deterministic padded sorting. 1 Introduction The computation of prefix sums is one of the most basic tools in the design of fast parallel algorithms (see Blelloch [9] and JáJá [33]). Prefix sums can be computed in O(log n) time and linear work in the EREW PRAM model (Ladner and Fischer [34]) and in O(log n/log log n) time and linear work in the Common CRCW PRAM model (Cole and Vishkin...
An Optimal Randomized Logarithmic Time Connectivity Algorithm for the EREW PRAM
, 1996
Cited by 12 (1 self)
Improving a long chain of works we obtain a randomised EREW PRAM algorithm for finding the connected components of a graph G = (V, E) with n vertices and m edges in O(log n) time using an optimal number of O((m + n)/log n) processors. The result returned by the algorithm is always correct. The probability that the algorithm will not complete in O(log n) time is o(n^(−c)) for any c > 0. 1 Introduction Finding the connected components of an undirected graph is perhaps the most basic algorithmic graph problem. While the problem is trivial in the sequential setting, it seems that elaborate methods should be used to solve the problem efficiently in the parallel setting. A considerable number of researchers investigated the complexity of the problem in various parallel models including, in particular, various members of the PRAM family. In this work we consider the EREW PRAM model, the weakest member of this family, and obtain, for the first time, a parallel connectivity algorith...
Contention Resolution in Hashing Based Shared Memory Simulations
Cited by 9 (3 self)
In this paper we study the problem of simulating shared memory on the Distributed Memory Machine (DMM). Our approach uses multiple copies of shared memory cells, distributed among the memory modules of the DMM via universal hashing. Thus the main problem is to design strategies that resolve contention at the memory modules. Developing ideas from random graphs and very fast randomized algorithms, we present new simulation techniques that enable us to improve the previously best results exponentially. In particular, we show that an n-processor CRCW PRAM can be simulated by an n-processor DMM with delay O(log log log n log* n), with high probability. Next we show a general technique that can be used to turn these simulations into time-processor optimal ones, in the case of EREW PRAMs to be simulated. We obtain a time-processor optimal simulation of an (n log log log n log* n)-processor EREW PRAM on an n-processor DMM with O(log log log n log* n) delay. When a CRCW PRAM with (n...
Randomization helps to perform independent tasks reliably
 Random Structures and Algorithms
Cited by 6 (1 self)
This paper is about algorithms that schedule tasks to be performed in a distributed failure-prone environment, when processors communicate by message passing, and when tasks are independent and of unit length. The processors work under synchrony and may fail by crashing. Failure patterns are imposed by adversaries. The question of how the power of adversaries affects the optimality of randomized algorithmic solutions is among the problems studied. Linearly-bounded adversaries may fail up to a constant fraction of the processors. Weakly-adaptive adversaries have to select, prior to the start of an execution, a subset of processors to be failure-prone, and then may fail only the selected processors, at arbitrary steps, in the course of the execution. Strongly-adaptive adversaries have a total number of failures as the only restriction on failure patterns. The measures of complexity are work, measured as the available processor steps, and communication, measured as the number of point-to-point messages. A randomized algorithm is developed that attains both O(n log* n) expected work and O(n log* n) expected communication against weakly-adaptive linearly-bounded adversaries, in the case when the numbers of tasks and processors are both equal to n. This is in contrast with the performance of algorithms against strongly-adaptive linearly-bounded adversaries, which has to be Ω(n log n / log log n) in terms of work. Key words: distributed algorithm, randomized algorithm, message passing, crash failures, adaptive adversary, independent tasks, load balancing, lower bound.
Retrieval of scattered information by EREW, CREW, and CRCW PRAMs
 In Proc. 3rd Scand. Workshop on Alg. Theory
, 1992
Cited by 5 (1 self)
The k-compaction problem arises when k out of n cells in an array are non-empty and the contents of these cells must be moved to the first k locations in the array. Parallel algorithms for k-compaction have obvious applications in processor allocation and load balancing; k-compaction is also an important subroutine in many recently developed parallel algorithms. We show that any EREW PRAM that solves the k-compaction problem requires Ω(√(log n)) time, even if the number of processors is arbitrarily large and k = 2. On the CREW PRAM, we show that every n-processor algorithm for the k-compaction problem requires Ω(log log n) time, even if k = 2. Finally, we show that O(log k) time can be achieved on the ROBUST PRAM, a very weak CRCW PRAM model.
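As a point of reference for the problem statement above, a sequential sketch of compaction is given here. It only specifies the input/output behaviour and says nothing about the parallel bounds in the abstract; the function name `compact` and the use of `None` as the empty marker are assumptions. The sketch keeps the original order of the non-empty cells, which the problem as stated does not require.

```python
def compact(cells, empty=None):
    """Move the non-empty entries to the front of the array.

    The relative order of the non-empty entries is preserved here,
    although the problem statement only asks that they occupy the
    first k locations.
    """
    filled = [c for c in cells if c is not empty]
    return filled + [empty] * (len(cells) - len(filled))


# n = 6 cells, k = 2 of them non-empty.
compacted = compact([None, "a", None, None, "b", None])
```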
Tight Bounds for Parallel Randomized Load Balancing
 Computing Research Repository
, 1992
Cited by 5 (0 self)
We explore the fundamental limits of distributed balls-into-bins algorithms, i.e., algorithms where balls act in parallel, as separate agents. This problem was introduced by Adler et al., who showed that non-adaptive and symmetric algorithms cannot reliably perform better than a maximum bin load of Θ(log log n/log log log n) within the same number of rounds. We present an adaptive symmetric algorithm that achieves a bin load of two in log* n + O(1) communication rounds using O(n) messages in total. Moreover, larger bin loads can be traded in for smaller time complexities. We prove a matching lower bound of (1−o(1)) log* n on the time complexity of symmetric algorithms that guarantee small bin loads at an asymptotically optimal message complexity of O(n). The essential preconditions of the proof are (i) a limit of O(n) on the total number of messages sent by the algorithm and (ii) anonymity of bins, i.e., the port numberings of balls are not globally consistent. In order to show that our technique indeed yields tight bounds, we provide for each assumption an algorithm violating it, in turn achieving a constant maximum bin load in constant time. As an application, we consider the following problem. Given a fully connected graph of n nodes, where each node needs to send and receive up to n messages, and in each round each node may send one message over each link, deliver all messages as quickly as possible to their destinations. We give a simple and robust algorithm of time complexity O(log* n) for this task and provide a generalization to the case where all nodes initially hold arbitrary sets of messages. Completing the picture, we give a less practical, but asymptotically optimal algorithm terminating within O(1) rounds. All these bounds hold with high probability.
A Lower Bound for Linear Approximate Compaction
, 1993
Cited by 4 (2 self)
The approximate compaction problem is: given an input array of n values, each either 0 or 1, place each value in an output array so that all the 1's are in the first (1 + λ)k array locations, where k is the number of 1's in the input. λ is an accuracy parameter. This problem is of fundamental importance in parallel computation because of its applications to processor allocation and approximate counting. When λ is a constant, the problem is called Linear Approximate Compaction (LAC). On the CRCW PRAM model, there is an algorithm that solves approximate compaction in O((log log n)^3) time for λ = 1/log log n, using n/(log log n)^3 processors. Our main result shows that this is close to the best possible. Specifically, we prove that LAC requires Ω(log log log n) time using O(n) processors. We also give a tradeoff between λ and the processing time. For ε < 1 and λ = n^ε, the time required is Ω(log(1/ε)).
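The output condition of approximate compaction can be written as a short checker. This is a sketch of the specification only, not of any algorithm from the paper; the function name and the parameter name `lam` (standing for the accuracy parameter) are assumptions, and the bound (1 + lam) * k is rounded down here, which is one possible reading of the definition.

```python
def is_approximately_compacted(output, k, lam):
    """True if `output` holds exactly k ones, all of them within the
    first floor((1 + lam) * k) positions."""
    limit = int((1 + lam) * k)  # rounding down is an assumption
    return sum(output) == k and all(bit == 0 for bit in output[limit:])


# k = 2 ones; with lam = 0.5 they must sit in the first 3 positions.
tight = is_approximately_compacted([1, 0, 1, 0, 0, 0], k=2, lam=0.5)
too_spread = is_approximately_compacted([1, 0, 0, 1, 0, 0], k=2, lam=0.5)
# A larger accuracy parameter admits the second layout.
relaxed = is_approximately_compacted([1, 0, 0, 1, 0, 0], k=2, lam=1.0)
```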
Balanced PRAM Simulations via Moving Threads and Hashing
Cited by 4 (0 self)
We present a novel approach to parallel computing, where (virtual) PRAM processors are represented as lightweight threads, and each physical processor is capable of managing several threads. Instead of moving read and write requests and replies between processor&memory pairs (and caches), we move the lightweight threads. Consequently, the processor load balancing problem reduces to the problem of producing evenly distributed memory references. In PRAM computations, this can be achieved by properly hashing the shared memory into the processor&memory pairs. We describe the idea of moving threads, and show that the moving threads framework provides a natural validation for Brent's theorem in work-optimal PRAM simulation situations on mesh of trees, coated mesh, and OCPC based distributed memory machines (DMMs). We prove that an EREW PRAM computation C requiring work W and time T can be implemented work-optimally on those p-processor DMMs with high probability, if W = Ω(p \De...