Results 1-10 of 24
Provably efficient scheduling for languages with fine-grained parallelism
 IN PROC. SYMPOSIUM ON PARALLEL ALGORITHMS AND ARCHITECTURES
, 1995
Abstract

Cited by 79 (23 self)
Many high-level parallel programming languages allow for fine-grained parallelism. As in the popular work-time framework for parallel algorithm design, programs written in such languages can express the full parallelism in the program without specifying the mapping of program tasks to processors. A common concern in executing such programs is to schedule tasks to processors dynamically so as to minimize not only the execution time, but also the amount of space (memory) needed. Without careful scheduling, the parallel execution on p processors can use a factor of p or larger more space than a sequential implementation of the same program. This paper first identifies a class of parallel schedules that are provably efficient in both time and space. For any
Improved Parallel Integer Sorting without Concurrent Writing
, 1992
Abstract

Cited by 41 (4 self)
We show that $n$ integers in the range $1..n$ can be sorted stably on an EREW PRAM using $O(t)$ time and $O(n(\sqrt{\log n \log\log n} + (\log n)^2/t))$ operations, for arbitrary given $t \ge \log n \log\log n$, and on a CREW PRAM using $O(t)$ time and $O(n(\sqrt{\log n} + \log n / 2^{t/\log n}))$ operations, for arbitrary given $t \ge \log n$. In addition, we are able to sort $n$ arbitrary integers on a randomized CREW PRAM within the same resource bounds with high probability. In each case our algorithm is a factor of almost $\Theta(\sqrt{\log n})$ closer to optimality than all previous algorithms for the stated problem in the stated model, and our third result matches the operation count of the best previous sequential algorithm. We also show that $n$ integers in the range $1..m$ can be sorted in $O((\log n)^2)$ time with $O(n)$ operations on an EREW PRAM using a nonstandard word length of $O(\log n \log\log n \log m)$ bits, thereby greatly improving the upper bound on the word length necessary to sort integers with a linear t...
Shared Memory Simulations with Triple-Logarithmic Delay (Extended Abstract)
, 1995
Abstract

Cited by 21 (4 self)
Artur Czumaj (1), Friedhelm Meyer auf der Heide (2), and Volker Stemann (1). (1) Heinz Nixdorf Institute, University of Paderborn, D-33095 Paderborn, Germany. (2) Heinz Nixdorf Institute and Department of Computer Science, University of Paderborn, D-33095 Paderborn, Germany. Abstract. We consider the problem of simulating a PRAM on a distributed memory machine (DMM). Our main result is a randomized algorithm that simulates each step of an n-processor CRCW PRAM on an n-processor DMM with $O(\log\log\log n \cdot \log^* n)$ delay, with high probability. This is an exponential improvement on all previously known simulations. It can be extended to a simulation of an $(n \log\log\log n \cdot \log^* n)$-processor EREW PRAM on an n-processor DMM with optimal delay $O(\log\log\log n \cdot \log^* n)$, with high probability. Finally, a lower bound of $\Omega(\log\log\log n / \log\log\log\log n)$ expected time is proved for a large class of randomized simulations that includes all known simulations. 1 Introduction Para...
Optimal Deterministic Approximate Parallel Prefix Sums and Their Applications
 In Proc. Israel Symp. on Theory and Computing Systems (ISTCS'95
, 1995
Abstract

Cited by 14 (0 self)
We show that extremely accurate approximation to the prefix sums of a sequence of $n$ integers can be computed deterministically in $O(\log\log n)$ time using $O(n / \log\log n)$ processors in the Common CRCW PRAM model. This complements randomized approximation methods obtained recently by Goodrich, Matias and Vishkin and improves previous deterministic results obtained by Hagerup and Raman. Furthermore, our results completely match a lower bound obtained recently by Chaudhuri. Our results have many applications. Using them we improve upon the best known time bounds for deterministic approximate selection and for deterministic padded sorting. 1 Introduction The computation of prefix sums is one of the most basic tools in the design of fast parallel algorithms (see Blelloch [9] and JáJá [33]). Prefix sums can be computed in $O(\log n)$ time and linear work in the EREW PRAM model (Ladner and Fischer [34]) and in $O(\log n / \log\log n)$ time and linear work in the Common CRCW PRAM model (Cole and Vishkin...
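For readers unfamiliar with the exact problem that this entry approximates, the following is a minimal Python sketch of a recursive pairwise prefix-sum scheme with logarithmic recursion depth and linear total work. It is a sequential illustration only, not the algorithm of this paper or of the works it cites. The function name `prefix_sums` is hypothetical, and each comprehension or loop stands in for what would be one parallel step on a PRAM.

```python
def prefix_sums(xs):
    """Return the inclusive prefix sums of xs.

    Illustrative sketch (hypothetical helper, not taken from the paper):
    adjacent elements are summed pairwise, the half-length problem is
    solved recursively, and the results are expanded back. The recursion
    depth is O(log n) and the total number of additions is O(n).
    """
    n = len(xs)
    if n <= 1:
        return list(xs)

    # Step 1: combine adjacent pairs (all pairs are independent of each other).
    pairs = [xs[i] + xs[i + 1] for i in range(0, n - 1, 2)]

    # Step 2: solve the half-size problem.
    sub = prefix_sums(pairs)

    # Step 3: expand. Odd positions read their answer directly from sub;
    # even positions add their own element to the preceding pair prefix.
    out = []
    for i, x in enumerate(xs):
        if i % 2 == 1:
            out.append(sub[i // 2])
        elif i == 0:
            out.append(x)
        else:
            out.append(sub[i // 2 - 1] + x)
    return out


if __name__ == "__main__":
    print(prefix_sums([1, 2, 3, 4, 5]))  # prints [1, 3, 6, 10, 15]
```

An approximate variant, as studied in this entry, would relax the requirement that every output equal the exact sum; the sketch above computes exact sums only.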
Efficient Communication Using Total-Exchange
Abstract

Cited by 10 (0 self)
... programs using high-level, general-purpose, and architecture-independent programming language and have them executed on a variety of parallel and distributed architectures without sacrificing efficiency. A large body of research suggests that, at least in theory, general-purpose parallel computing is indeed possible provided certain conditions are met: an excess of logical parallelism in the program, and the ability of the target architecture to efficiently realize balanced communication patterns. The canonical example of a balanced communication pattern is an h-relation, in which each processor is the origin and destination of at most h messages. A plethora of protocols has been designed for routing h-relations in a variety of networks. The goal has been to minimize the value of h while guaranteeing delivery of the messages within time a constant factor from optimal. In this paper we describe protocols that meet the most stringent efficiency requirement, namely delivery of messages within time that is a lower-order additive term from the best achievable. Such protocols are called 1-optimal. While these protocols achieve 1-optimality only for heavily loaded networks, that is, for large values of h, they are remarkable for their simplicity in that they only use the total-exchange communication primitive. The total-exchange can be realized in many networks using very simple, contention-free, and extremely efficient schemes. The technical contribution of this paper is a protocol to route random h-relations in an N-processor network using $\frac{h}{N}(1+o(1)) + O(\log\log N)$ total-exchange rounds with high probability. Using message duplication, we can improve the bound to $\frac{h}{N}(1+o(1)) + O(\log^* N)$. This improves upon the $\frac{h}{N}(1+o(1)) + O(\log N)$ bound of Gerbessiotis and Valiant. While our theoretical improvements are modest, our experimental results show an improvement over the protocol of Gerbessiotis and Valiant.
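The definition given in this abstract (each processor is the origin and destination of at most h messages) can be made concrete with a short Python sketch. The function name `relation_h` and the representation of messages as `(source, destination)` pairs are assumptions made for illustration; they do not come from the paper, and the sketch says nothing about how the routing protocols themselves work.

```python
from collections import Counter


def relation_h(messages, num_processors):
    """Return the smallest h for which `messages` forms an h-relation.

    Hypothetical helper for illustration: `messages` is an iterable of
    (source, destination) pairs with processor ids in
    range(num_processors). The result is the largest number of messages
    that any single processor sends or receives.
    """
    sent = Counter()
    received = Counter()
    for src, dst in messages:
        if not (0 <= src < num_processors and 0 <= dst < num_processors):
            raise ValueError(f"processor id out of range: {(src, dst)}")
        sent[src] += 1
        received[dst] += 1
    return max(max(sent.values(), default=0),
               max(received.values(), default=0))


if __name__ == "__main__":
    # Processor 2 receives three messages, so the pattern is a 3-relation.
    pattern = [(0, 1), (0, 2), (1, 2), (3, 2)]
    print(relation_h(pattern, num_processors=4))  # prints 3
```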
Contention Resolution in Hashing Based Shared Memory Simulations
Abstract

Cited by 9 (3 self)
In this paper we study the problem of simulating shared memory on the Distributed Memory Machine (DMM). Our approach uses multiple copies of shared memory cells, distributed among the memory modules of the DMM via universal hashing. Thus the main problem is to design strategies that resolve contention at the memory modules. Developing ideas from random graphs and very fast randomized algorithms, we present new simulation techniques that enable us to improve the previously best results exponentially. In particular, we show that an n-processor CRCW PRAM can be simulated by an n-processor DMM with delay $O(\log\log\log n \cdot \log^* n)$, with high probability. Next we show a general technique that can be used to turn these simulations into time-processor optimal ones, in the case where EREW PRAMs are to be simulated. We obtain a time-processor optimal simulation of an $(n \log\log\log n \cdot \log^* n)$-processor EREW PRAM on an n-processor DMM with $O(\log\log\log n \cdot \log^* n)$ delay. When a CRCW PRAM with (n...
Improved Optimal Shared Memory Simulations, and the Power of Reconfiguration
 In Proceedings of the 3rd Israel Symposium on Theory of Computing and Systems
Abstract

Cited by 8 (6 self)
We present time-processor optimal randomized algorithms for simulating a shared memory machine (EREW PRAM) on a distributed memory machine (DMM). The first algorithm simulates each step of an n-processor EREW PRAM on an n-processor DMM with $O(\log\log n / \log\log\log n)$ delay with high probability. This simulation is work optimal and can be made time-processor optimal. The best previous optimal simulations require $O(\log\log n)$ delay. We also study reconfigurable DMMs, which are a "complete network version" of the well-studied reconfigurable meshes. We show an algorithm that simulates each step of an n-processor EREW PRAM on an n-processor reconfigurable DMM with only $O(\log^* n)$ delay with high probability. We further show how to make this simulation time-processor optimal. 1 Introduction Parallel machines that communicate via a shared memory (Parallel Random Access Machines, PRAMs) are the most commonly used machine model for describing parallel algorithms [J92]. The PRAM is relative...
Optimal Parallel Approximation Algorithms for Prefix Sums and Integer Sorting (Extended Abstract)
Abstract

Cited by 8 (5 self)
Parallel prefix computation is perhaps the most frequently used subroutine in parallel algorithms today. Its time complexity on the CRCW PRAM is $\Theta(\lg n / \lg\lg n)$ using a polynomial number of processors, even in a randomized setting. Nevertheless, there are a number of nontrivial applications that have been shown to be solvable using only an approximate version of the prefix sums problem. In this paper we resolve the issue of approximating parallel prefix by introducing an algorithm that runs in $O(\lg^* n)$ time with very high probability, using $n / \lg^* n$ processors, which is optimal in terms of both work and running time. Our approximate prefix sums are guaranteed to come within a factor of $(1 + \epsilon)$ of the values of the true sums in a "consistent fashion", where $\epsilon$ is $o(1)$. We achieve this result through the use of a number of interesting new techniques, such as over-certification and estimate-focusing, as well ...
Lower Bounds for Randomized Exclusive Write PRAMs
 in Proc. 7th ACM Symp. on Parallel Algorithms and Architectures, (ACM
, 1995
Abstract

Cited by 7 (0 self)
In this paper we study the question: How useful is randomization in speeding up Exclusive Write PRAM computations? Our results give further evidence that randomization is of limited use in these types of computations. First we examine a compaction problem on both the CREW and EREW PRAM models, and we present randomized lower bounds which match the best deterministic lower bounds known. (For the CREW PRAM model, the lower bound is asymptotically optimal.) These are the first nontrivial randomized lower bounds known for the compaction problem on these models. We show that our lower bounds also apply to the problem of approximate compaction. Next we examine the problem of computing boolean functions on the CREW PRAM model, and we present a randomized lower bound which improves on the previous best randomized lower bound for many boolean functions, including the OR function. (The previous lower bounds for these functions were asymptotically optimal, but we improve the constant multiplicat...
Buckets strike back: Improved Parallel Shortest-Paths
 Proc. 16th Intl. Par. Distr. Process. Symp. (IPDPS
, 2002
Abstract

Cited by 6 (2 self)
We study the average-case complexity of the parallel single-source shortest-path (SSSP) problem, assuming arbitrary directed graphs with $n$ nodes, $m$ edges, and independent random edge weights uniformly distributed in $[0, 1]$. We provide a new bucket-based parallel SSSP algorithm that runs in $T = O(\log^2 n \cdot \min_i \{2^i \cdot L + |V_i|\})$ average-case time using $O(n + m + T)$ work on a PRAM, where $L$ denotes the maximum shortest-path weight and $|V_i|$ is the number of graph vertices with in-degree at least $2^i$. All previous algorithms either required more time or more work. The minimum performance gain is a logarithmic factor improvement; on certain graph classes, accelerations by factors of more than $n^{0.4}$ can be achieved. The algorithm allows adaptation to distributed memory machines, too.