Results 1 - 10
of
24
Provably efficient scheduling for languages with fine-grained parallelism
- IN PROC. SYMPOSIUM ON PARALLEL ALGORITHMS AND ARCHITECTURES
, 1995
"... Many high-level parallel programming languages allow for fine-grained parallelism. As in the popular work-time framework for parallel algorithm design, programs written in such languages can express the full parallelism in the program without specifying the mapping of program tasks to processors. A ..."
Abstract
-
Cited by 68 (22 self)
- Add to MetaCart
Many high-level parallel programming languages allow for fine-grained parallelism. As in the popular work-time framework for parallel algorithm design, programs written in such languages can express the full parallelism in the program without specifying the mapping of program tasks to processors. A common concern in executing such programs is to schedule tasks to processors dynamically so as to minimize not only the execution time, but also the amount of space (memory) needed. Without careful scheduling, the parallel execution on p processors can use a factor of p or larger more space than a sequential implementation of the same program. This paper first identifies a class of parallel schedules that are provably efficient in both time and space. For any
Improved Parallel Integer Sorting without Concurrent Writing
, 1992
"... We show that n integers in the range 1 : : n can be sorted stably on an EREW PRAM using O(t) time and O(n( p log n log log n + (log n) 2 =t)) operations, for arbitrary given t log n log log n, and on a CREW PRAM using O(t) time and O(n( p log n + log n=2 t=logn )) operations, for arbitrary ..."
Abstract
-
Cited by 38 (4 self)
- Add to MetaCart
We show that n integers in the range 1 : : n can be sorted stably on an EREW PRAM using O(t) time and O(n( p log n log log n + (log n) 2 =t)) operations, for arbitrary given t log n log log n, and on a CREW PRAM using O(t) time and O(n( p log n + log n=2 t=logn )) operations, for arbitrary given t log n. In addition, we are able to sort n arbitrary integers on a randomized CREW PRAM within the same resource bounds with high probability. In each case our algorithm is a factor of almost \Theta( p log n) closer to optimality than all previous algorithms for the stated problem in the stated model, and our third result matches the operation count of the best previous sequential algorithm. We also show that n integers in the range 1 : : m can be sorted in O((log n) 2 ) time with O(n) operations on an EREW PRAM using a nonstandard word length of O(log n log log n log m) bits, thereby greatly improving the upper bound on the word length necessary to sort integers with a linear t...
Shared Memory Simulations with Triple-Logarithmic Delay (Extended Abstract)
, 1995
"... ) Artur Czumaj 1 , Friedhelm Meyer auf der Heide 2 , and Volker Stemann 1 1 Heinz Nixdorf Institute, University of Paderborn, D-33095 Paderborn, Germany 2 Heinz Nixdorf Institute and Department of Computer Science, University of Paderborn, D-33095 Paderborn, Germany Abstract. We conside ..."
Abstract
-
Cited by 20 (4 self)
- Add to MetaCart
) Artur Czumaj 1 , Friedhelm Meyer auf der Heide 2 , and Volker Stemann 1 1 Heinz Nixdorf Institute, University of Paderborn, D-33095 Paderborn, Germany 2 Heinz Nixdorf Institute and Department of Computer Science, University of Paderborn, D-33095 Paderborn, Germany Abstract. We consider the problem of simulating a PRAM on a distributed memory machine (DMM). Our main result is a randomized algorithm that simulates each step of an n-processor CRCW PRAM on an n-processor DMM with O(log log log n log n) delay, with high probability. This is an exponential improvement on all previously known simulations. It can be extended to a simulation of an (n log log log n log n)- processor EREW PRAM on an n-processor DMM with optimal delay O(log log log n log n), with high probability. Finally a lower bound of \Omega (log log log n=log log log log n) expected time is proved for a large class of randomized simulations that includes all known simulations. 1 Introduction Para...
Optimal Deterministic Approximate Parallel Prefix Sums and Their Applications
- In Proc. Israel Symp. on Theory and Computing Systems (ISTCS'95
, 1995
"... We show that extremely accurate approximation to the prefix sums of a sequence of n integers can be computed deterministically in O(log log n) time using O(n= log log n) processors in the Common CRCW PRAM model. This complements randomized approximation methods obtained recently by Goodrich, Matias ..."
Abstract
-
Cited by 14 (0 self)
- Add to MetaCart
We show that extremely accurate approximation to the prefix sums of a sequence of n integers can be computed deterministically in O(log log n) time using O(n= log log n) processors in the Common CRCW PRAM model. This complements randomized approximation methods obtained recently by Goodrich, Matias and Vishkin and improves previous deterministic results obtained by Hagerup and Raman. Furthermore, our results completely match a lower bound obtained recently by Chaudhuri. Our results have many applications. Using them we improve upon the best known time bounds for deterministic approximate selection and for deterministic padded sorting. 1 Introduction The computation of prefix sums is one of the most basic tools in the design of fast parallel algorithms (see Blelloch [9] and J'aJ'a [33]). Prefix-sums can be computed in O(logn) time and linear work in the EREW PRAM model (Ladner and Fischer [34]) and in O(log n= log log n) and linear work in the Common CRCW PRAM model (Cole and Vishkin...
Efficient Communication Using Total-Exchange
"... ... programs using high-level, general-purpose, and architecture-independent programming language and have them executedonavarietyofparallelanddistributed architectureswithout sacricing efficiency. Alargebodyofresearchsuggeststhat,atleastintheory, general-purposeparallelcomputingisindeedpossiblepro ..."
Abstract
-
Cited by 10 (0 self)
- Add to MetaCart
... programs using high-level, general-purpose, and architecture-independent programming language and have them executedonavarietyofparallelanddistributed architectureswithout sacricing efficiency. Alargebodyofresearchsuggeststhat,atleastintheory, general-purposeparallelcomputingisindeedpossibleprovided certainconditionsaremet: anexcessoflogicalparallelismin the program,andtheabilityofthetargetarchitectureto efficientlyrealizebalancedcommunication patterns. Thecanonicalexampleofabalancedcommunicationpatternisan h-relation, inwhicheachprocessoristheorigin and destination of at most h messages. A plethoraofprotocolshasbeendesigned forrouting h-relations inavarietyofnetworks. Thegoalhasbeentominimizethevalueofhwhile guaranteeingdeliveryofthemessageswithintime aconstantfactorfromoptimal.Inthispaperwe describeprotocolsthatmeetthemoststringent efficiency requirement, namely deliveryofmessages withintimethatisalowerorderadditivetermfrom thebestachievable. Suchprotocolsarecalled 1-optimal. Whiletheseprotocolsachieve1-optimality only forheavilyloadednetworks,thatis,for largevaluesofh, theyareremarkablefortheirsimplicityinthattheyonly usethetotal-exchange communication primitive. The total-exchange canberealizedinmanynetworksusingverysimple, contention-free,andextremely efficient schemes. Thetechnicalcontributionofthispaperisaprotocol torouterandomh-relationsinan N-processor networkusing hN(1+o(1))+O(loglogN) total-exchange roundswithhighprobability. Usingmessageduplication, wecanimprovetheboundto hN(1+o(1))+O(logN). This improves upon the hN(1+o(1))+O(logN) bound of Gerbessiotis and Valiant. While our theoretical improvements are modest, our experimental results show an improvement over the protocol of Gerebessiotis and Valiant.
Contention Resolution in Hashing Based Shared Memory Simulations
"... In this paper we study the problem of simulating shared memory on the Distributed Memory Machine (DMM). Our approach uses multiple copies of shared memory cells, distributed among the memory modules of the DMM via universal hashing. Thus the main problem is to design strategies that resolve cont ..."
Abstract
-
Cited by 9 (3 self)
- Add to MetaCart
In this paper we study the problem of simulating shared memory on the Distributed Memory Machine (DMM). Our approach uses multiple copies of shared memory cells, distributed among the memory modules of the DMM via universal hashing. Thus the main problem is to design strategies that resolve contention at the memory modules. Developing ideas from random graphs and very fast randomized algorithms, we present new simulation techniques that enable us to improve the previously best results exponentially. Particularly, we show that an n-processor CRCW PRAM can be simulated by an n-processor DMM with delay O(log log log n log n), with high probability. Next we show a general technique that can be used to turn these simulations to time-processor optimal ones, in the case of EREW PRAMs to be simulated. We obtain a time-processor optimal simulation of an (n log log log n log n)-processor EREW PRAM on an n-processor DMM with O(log log log n log n) delay. When a CRCW PRAM with (n...
Improved Optimal Shared Memory Simulations, and the Power of Reconfiguration
- In Proceedings of the 3rd Israel Symposium on Theory of Computing and Systems
"... We present time-processor optimal randomized algorithms for simulating a shared memory machine (EREW PRAM) on a distributed memory machine (DMM). The first algorithm simulates each step of an n-processor EREW PRAM on an n-processor DMM with O( log log n log log log n ) delay with high probability. ..."
Abstract
-
Cited by 8 (6 self)
- Add to MetaCart
We present time-processor optimal randomized algorithms for simulating a shared memory machine (EREW PRAM) on a distributed memory machine (DMM). The first algorithm simulates each step of an n-processor EREW PRAM on an n-processor DMM with O( log log n log log log n ) delay with high probability. This simulation is work optimal and can be made timeprocessor optimal. The best previous optimal simulations require O(log log n) delay. We also study reconfigurable DMMs which are a "complete network version " of the well studied reconfigurable meshes. We show an algorithm that simulates each step of an n- processor EREW PRAM on an n-processor reconfigurable DMM with only O(log n) delay with high probability. We further show how to make this simulation time-processor optimal. 1 Introduction Parallel machines that communicate via a shared memory (Parallel Random Access Machines, PRAMs) are the most commonly used machine model for describing parallel algorithms [J92]. The PRAM is relative...
Optimal Parallel Approximation Algorithms for Prefix Sums and Integer Sorting (Extended Abstract)
"... Parallel prefix computation is perhaps the most frequently used subroutine in parallel algorithms today. Its time complexity on the CRCWPRAM is \Theta(lg n= lg lg n) using a polynomial number of processors, even in a randomized setting. Nevertheless, there are a number of non-trivial applications t ..."
Abstract
-
Cited by 8 (5 self)
- Add to MetaCart
Parallel prefix computation is perhaps the most frequently used subroutine in parallel algorithms today. Its time complexity on the CRCWPRAM is \Theta(lg n= lg lg n) using a polynomial number of processors, even in a randomized setting. Nevertheless, there are a number of non-trivial applications that have been shown to be solvable using only an approximate version of the prefix sums problem. In this paper we resolve the issue of approximating parallel prefix by introducing an algorithm that runs in O(lg n) time with very high probability, using n= lg n processors, which is optimal in terms of both work and running time. Our approximate prefix sums are guaranteed to come within a factor of (1 + ffl) of the values of the true sums in a "consistent fashion", where ffl is o(1). We achieve this result through the use of a number of interesting new techniques, such as overcertification and estimate-focusing, as well ...
Lower Bounds for Randomized Exclusive Write PRAMs
- in Proc. 7th ACM Symp. on Parallel Algorithms and Architectures, (ACM
, 1995
"... In this paper we study the question: How useful is randomization in speeding up Exclusive Write PRAM computations? Our results give further evidence that randomization is of limited use in these types of computations. First we examine a compaction problem on both the CREW and EREW PRAM models, and w ..."
Abstract
-
Cited by 7 (0 self)
- Add to MetaCart
In this paper we study the question: How useful is randomization in speeding up Exclusive Write PRAM computations? Our results give further evidence that randomization is of limited use in these types of computations. First we examine a compaction problem on both the CREW and EREW PRAM models, and we present randomized lower bounds which match the best deterministic lower bounds known. (For the CREW PRAM model, the lower bound is asymptotically optimal.) These are the first non-trivial randomized lower bounds known for the compaction problem on these models. We show that our lower bounds also apply to the problem of approximate compaction. Next we examine the problem of computing boolean functions on the CREW PRAM model, and we present a randomized lower bound which improves on the previous best randomized lower bound for many boolean functions, including the OR function. (The previous lower bounds for these functions were asymptotically optimal, but we improve the constant multiplicat...
Simulating shared memory in real time: On the computation power of reconfigurable meshes
- in ``Proceedings of the 2nd IEEE Workshop on Reconfigurable Architectures
, 1995
"... We consider randomized simulations of shared memory on a distributed memory machine (DMM) where the n processors and the n memory modules of the DMM are connected via a reconfigurable architecture. We first present a randomized simulation of a CRCW PRAM on a reconfigurable DMM having a complete reco ..."
Abstract
-
Cited by 6 (1 self)
- Add to MetaCart
We consider randomized simulations of shared memory on a distributed memory machine (DMM) where the n processors and the n memory modules of the DMM are connected via a reconfigurable architecture. We first present a randomized simulation of a CRCW PRAM on a reconfigurable DMM having a complete reconfigurable interconnection. It guarantees delay O(log *n), with high probability. Next we study a reconfigurable mesh DMM (RM-DMM). Here the n processors and n modules are connected via an n_n reconfigurable mesh. It was already known that an n_m reconfigurable mesh can simulate in constant time an n-processor CRCW PRAM with shared memory of size m. In this paper we present a randomized step by step simulation of a CRCW PRAM with arbitrarily large shared memory on an RM-DMM. It guarantees constant delay with high probability, i.e., it simulates in real time. Finally we prove a lower bound showing that size 0(n 2) for the reconfigurable mesh is necessary for real time simulations.] 1997 Academic Press * Supported by DFG-Graduiertenkolleg ``Parallele Rechnernetzwerke in der Produktionstechnik,''

