## Optimal Speedup on a Low-Degree Multi-Core Parallel Architecture (LoPRAM) (2008)

Citations: 2 (0 self-citations)

### BibTeX

@MISC{Dorrigiv08optimalspeedup,

author = {Reza Dorrigiv and Alejandro López-Ortiz and Alejandro Salinger},

title = {Optimal Speedup on a Low-Degree Multi-Core Parallel Architecture (LoPRAM)},

year = {2008}

}

### Abstract

Over the last five years, major microprocessor manufacturers have released plans for a rapidly increasing number of cores per microprocessor, with upwards of 64 cores by 2015. In this setting, a sequential RAM computer will no longer accurately reflect the architecture on which algorithms are being executed. In this paper we propose a model of low-degree parallelism (LoPRAM) which builds upon the RAM and PRAM models yet better reflects recent advances in parallel (multi-core) architectures. This model supports a high level of abstraction that simplifies the design and analysis of parallel programs. More importantly, we show that in many instances it naturally leads to work-optimal parallel algorithms via simple modifications to sequential algorithms.

### Citations

8530 |
Introduction to Algorithms
- Cormen, Leiserson, et al.
- 1990
Citation Context ...hose time complexity $T(n)$ is a recurrence of the form

$$T(n) = aT(n/b) + f(n), \qquad (1)$$

where $a \ge 1$ and $b > 1$ are constants, and $f(n)$ is a nonnegative function. By the Master theorem, $T(n)$ is such that [9]:

$$
T(n) = \begin{cases}
\Theta(n^{\log_b a}) & \text{if } f(n) = O(n^{\log_b a - \epsilon}) \text{ (Case 1)},\\
\Theta(n^{\log_b a} \log n) & \text{if } f(n) = \Theta(n^{\log_b a}) \text{ (Case 2)},\\
\Theta(f(n)) & \text{if } f(n) = \Omega(n^{\log_b a + \epsilon}) \text{ and } af(n/b) \le cf(n) \text{ for some } c < 1 \text{ (Case 3)}.
\end{cases} \qquad (2)
$$

...
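When the driving function is a plain polynomial, the three cases can be checked mechanically. The sketch below assumes $f(n) = \Theta(n^k)$, a restriction not made in the text; for such $f$ the regularity condition of Case 3 holds automatically, because $a f(n/b) = (a/b^k) f(n)$ and $a/b^k < 1$ exactly when $k > \log_b a$. The function name `master_case` is hypothetical.

```python
import math


def master_case(a: float, b: float, k: float) -> tuple[int, str]:
    """Classify T(n) = a*T(n/b) + Theta(n^k) by the Master theorem.

    Assumes a >= 1, b > 1 and a polynomial driving function f(n) = Theta(n^k).
    Returns (case number, asymptotic bound as a string).
    """
    if a < 1 or b <= 1:
        raise ValueError("the Master theorem requires a >= 1 and b > 1")
    crit = math.log(a, b)  # critical exponent log_b(a)
    if math.isclose(k, crit, abs_tol=1e-9):
        # Case 2: f(n) = Theta(n^{log_b a})
        return 2, f"Theta(n^{crit:g} log n)"
    if k < crit:
        # Case 1: f(n) = O(n^{log_b a - eps}) with eps = crit - k > 0
        return 1, f"Theta(n^{crit:g})"
    # Case 3: f(n) = Omega(n^{log_b a + eps}); regularity holds since a / b^k < 1
    return 3, f"Theta(n^{k:g})"
```

For example, the mergesort recurrence `master_case(2, 2, 1)` falls into Case 2, and Strassen's recurrence `master_case(7, 2, 2)` falls into Case 1. A general $f(n)$ (for example $n \log n$) is outside this sketch's assumption and needs the theorem itself.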

1130 |
A Bridging Model for Parallel Computation
- Valiant
- 1990
Citation Context ...gn to what could effectively be achieved in practice. Among the alternatives introduced were, to name a few examples, the LogP model [10, 18], the LogGP model [3], the bulk-synchronous parallel model [24], and the Asynchronous PRAM [16], among others [20, 23, 1, 2, 7]. In practice there were various important drawbacks of the PRAM model, such as the cost of synchronization, the cost of interprocessor ...

635 |
An Introduction to Parallel Algorithms
- JaJa
- 1992
Citation Context ... n) processors running in multiple-instruction multiple-data (MIMD) mode. The read and write model, while architecture-dependent, can generally be assumed to be Concurrent-Read Exclusive-Write (CREW) [15, 17]. To support this model, semaphores and automatic serialization on shared variables are available (either hardware or software based) transparently to the programmer. If an unserialized variable...

497 |
LogP: Towards a realistic model of parallel computation
- Culler, Karp, et al.
- 1993
Citation Context ...ul from a theoretical perspective, proved unrealistic and various attempts were made to refine it in a way that would better align to what could effectively be achieved in practice (see, for example, [12, 3, 21, 17, 19, 20, 1, 2]). In addition to its lack of fidelity, an important drawback of the PRAM is the enormous difficulty in developing and implementing work-optimal algorithms (i.e., linear speedup) for a computer with Θ(...

286 |
Parallel Merge Sort
- Cole
- 1988
Citation Context ...bset of problems and show that we can readily obtain optimal speedups. This is in contrast to the PRAM model, in which even a work-optimal sorting algorithm proved to be a difficult research question [8]. More explicitly, we show that a large class of dynamic programming and divide-and-conquer algorithms can be parallelized using the high-level LoPRAM thread model while achieving optimal speedup. Int...

282 |
Parallelism in random access machines
- Fortune, Wyllie
Citation Context ...e believe this will be a key factor in the adoption of new parallel computation models. 2 Previous Work. The dominant model for previous theoretical research on parallel computations is the PRAM model [13], which generally assumed Θ(n) processors working synchronously with zero communication delay and often with infinite bandwidth among them. If the number of processors available in practice was smalle...

252 |
A decomposition theorem for partially ordered sets
- Dilworth
- 1950
Citation Context ... then move on to the next antichain. A dual of Dilworth’s theorem states that the size of the largest chain in a poset equals the smallest number of antichains into which the poset may be partitioned [11, 21]. Suppose that $c_1 c_2 \ldots c_l$ is a largest chain in the poset. At step $i$ we process $c_i$ together with other elements in its antichain, i.e., elements that are incomparable with $c_i$. These antichains capt...
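The partition promised by this dual (Mirsky's theorem) can be built directly: label each element with the length of the longest chain ending at it. Two comparable elements always get different labels, so each label class is an antichain, and the number of classes equals the length of the longest chain. The sketch below assumes the poset is given as a predecessor map over a dependency DAG (a hypothetical input format, and the name `antichain_levels` is likewise hypothetical); each returned antichain is a set of mutually independent tasks that could be processed in one parallel step.

```python
from graphlib import TopologicalSorter


def antichain_levels(preds: dict[str, set[str]]) -> list[list[str]]:
    """Partition a DAG (poset) into antichains by longest-chain height.

    preds maps each element to the set of elements that must precede it.
    height(x) = 1 + max height of its predecessors (1 for minimal elements).
    """
    height: dict[str, int] = {}
    # static_order() yields every node after all of its predecessors.
    for x in TopologicalSorter(preds).static_order():
        height[x] = 1 + max((height[y] for y in preds.get(x, ())), default=0)
    levels: list[list[str]] = [[] for _ in range(max(height.values(), default=0))]
    for x, h in height.items():
        levels[h - 1].append(x)  # same height => pairwise incomparable
    return [sorted(level) for level in levels]
```

For instance, with `preds = {"b": {"a"}, "c": {"a"}, "d": {"b", "c"}, "e": {"a"}}` the longest chain is `a < b < d`, and the elements split into three antichains: `["a"]`, `["b", "c", "e"]`, `["d"]`.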

237 |
The Parallel Evaluation of General Arithmetic Expressions
- Brent
- 1974
Citation Context ...zero communication delay and often with infinite bandwidth among them. If the number of processors available in practice was smaller, the Θ(n)-processor solution could be emulated using Brent’s Lemma [6]. The PRAM model, while fruitful from a theoretical perspective, proved unrealistic and various attempts were made to refine it in a way that would better align to what could effectively be achieved i...
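The emulation rests on Brent's scheduling argument: a computation with $W$ total operations and depth $D$ can be executed on $p$ processors in at most $W/p + D$ steps. The sketch below assumes the computation is described as a list of per-level operation counts (a hypothetical input format), runs each level's independent operations $p$ at a time in $\lceil w_i/p \rceil$ steps, and checks the result against the bound. Both function names are hypothetical.

```python
import math


def brent_schedule_time(level_work: list[int], p: int) -> int:
    """Steps needed to run a leveled computation on p processors.

    level_work[i] is the number of independent operations at depth i.
    Operations within a level run p at a time; levels run in order.
    """
    if p < 1:
        raise ValueError("need at least one processor")
    return sum(math.ceil(w / p) for w in level_work)


def brent_bound(level_work: list[int], p: int) -> float:
    """Brent's upper bound W/p + D (total work over p, plus depth)."""
    return sum(level_work) / p + len(level_work)


# Summing 16 numbers with a balanced binary tree: 8 + 4 + 2 + 1 additions.
levels = [8, 4, 2, 1]
for p in (1, 2, 4, 8):
    steps = brent_schedule_time(levels, p)
    assert steps <= brent_bound(levels, p)
    print(p, steps)
```

With one processor the tree sum takes all 15 additions as steps; with 4 processors it takes 5 steps, within the bound $15/4 + 4 = 7.75$.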

235 |
LogGP: Incorporating long messages into the LogP model for parallel computation
- Alexandrov, Ionescu, et al.
- 1997
Citation Context ... refine it in a way that would better align to what could effectively be achieved in practice. Among the alternatives introduced were, to name a few examples, the LogP model [10, 18], the LogGP model [3], the bulk-synchronous parallel model [24], and the Asynchronous PRAM [16], among others [20, 23, 1, 2, 7]. In practice there were various important drawbacks of the PRAM model, such as the cost of sy...

180 |
Efficient Parallel Algorithms
- Gibbons, Rytter
- 1988
Citation Context ... n) processors running in multiple-instruction multiple-data (MIMD) mode. The read and write model, while architecture-dependent, can generally be assumed to be Concurrent-Read Exclusive-Write (CREW) [15, 17]. To support this model, semaphores and automatic serialization on shared variables are available (either hardware or software based) transparently to the programmer. If an unserialized variable...

136 |
Towards an architecture-independent analysis of parallel algorithms
- Papadimitriou, Yannakakis
- 1990
Citation Context ...ul from a theoretical perspective, proved unrealistic and various attempts were made to refine it in a way that would better align to what could effectively be achieved in practice (see, for example, [12, 3, 21, 17, 19, 20, 1, 2]). In addition to its lack of fidelity, an important drawback of the PRAM is the enormous difficulty in developing and implementing work-optimal algorithms (i.e., linear speedup) for a computer with Θ(...

108 |
Randomized and deterministic simulations of PRAMs by parallel machines with restricted granularity of parallel memories
- Mehlhorn, Vishkin
- 1984
Citation Context ...e. Among the alternatives introduced were, to name a few examples, the LogP model [10, 18], the LogGP model [3], the bulk-synchronous parallel model [24], and the Asynchronous PRAM [16], among others [20, 23, 1, 2, 7]. In practice there were various important drawbacks of the PRAM model, such as the cost of synchronization, the cost of interprocessor communication, the cost-effectiveness of a massively parallel ma...

103 |
LogP: Towards a realistic model of parallel computation
- Culler, Karp, et al.
- 1993
Citation Context ...ious attempts were made to refine it in a way that would better align to what could effectively be achieved in practice. Among the alternatives introduced were, to name a few examples, the LogP model [10, 18], the LogGP model [3], the bulk-synchronous parallel model [24], and the Asynchronous PRAM [16], among others [20, 23, 1, 2, 7]. In practice there were various important drawbacks of the PRAM model, s...

95 |
Communication complexity of PRAMs
- Aggarwal, Chandra, et al.
- 1990
Citation Context ...e. Among the alternatives introduced were, to name a few examples, the LogP model [10, 18], the LogGP model [3], the bulk-synchronous parallel model [24], and the Asynchronous PRAM [16], among others [20, 23, 1, 2, 7]. In practice there were various important drawbacks of the PRAM model, such as the cost of synchronization, the cost of interprocessor communication, the cost-effectiveness of a massively parallel ma...

89 |
A more practical PRAM model
- Gibbons
- 1989
Citation Context ...achieved in practice. Among the alternatives introduced were, to name a few examples, the LogP model [10, 18], the LogGP model [3], the bulk-synchronous parallel model [24], and the Asynchronous PRAM [16], among others [20, 23, 1, 2, 7]. In practice there were various important drawbacks of the PRAM model, such as the cost of synchronization, the cost of interprocessor communication, the cost-effectiv...

78 |
Executing functional programs on a virtual tree of processors
- Burton, Sleep
- 1981
Citation Context ...his is important to avoid potential deadlock). Pending pal-threads are activated in an order consistent with their order of creation as resources become available, in a fashion reminiscent of work stealing [8]. While primitives are provided for ad hoc ordering of pal-thread activation, by default threads are inserted into an ordered tree. The root of the tree is the main thread, and new threads are attach...
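A minimal Python sketch of the underlying idea, turning a sequential divide-and-conquer algorithm into a parallel one by spawning a thread for a recursive call only when a core is free. This is a simplification and an assumption, not the LoPRAM runtime: a call that finds no free core runs inline in its parent instead of waiting as a pending pal-thread, and no ordered tree of threads is maintained. The names `pal_mergesort` and `parallel_sort` are hypothetical. Under CPython's global interpreter lock this shows the control structure rather than a real speedup.

```python
import threading


def merge(left: list, right: list) -> list:
    """Standard sequential merge of two sorted lists."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    out.extend(left[i:])
    out.extend(right[j:])
    return out


def pal_mergesort(xs: list, free_cores: threading.Semaphore) -> list:
    """Mergesort whose left recursive call gets its own thread if a core is free."""
    if len(xs) <= 1:
        return list(xs)
    mid = len(xs) // 2
    if free_cores.acquire(blocking=False):  # a core is idle: spawn a thread
        box: dict = {}

        def left_task() -> None:
            try:
                box["left"] = pal_mergesort(xs[:mid], free_cores)
            finally:
                free_cores.release()  # hand the core back when the thread ends

        worker = threading.Thread(target=left_task)
        worker.start()
        right = pal_mergesort(xs[mid:], free_cores)
        worker.join()
        left = box["left"]
    else:  # no idle core: run both halves inline, exactly as the sequential code
        left = pal_mergesort(xs[:mid], free_cores)
        right = pal_mergesort(xs[mid:], free_cores)
    return merge(left, right)


def parallel_sort(xs: list, p: int = 4) -> list:
    """Sort with at most p threads alive at once (the caller counts as one)."""
    return pal_mergesort(list(xs), threading.Semaphore(max(p - 1, 0)))
```

Acquiring with `blocking=False` means no thread ever waits for a core, so the sketch cannot deadlock on the semaphore; the only blocking call is `join()` on a child that already holds its own core.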

51 |
Efficient parallel algorithms for string editing and related problems
- Apostolico, Atallah, et al.
- 1990
Citation Context ...blem itself, and hence a certain degree of parallelism is achievable. In the past, parallel versions of certain dynamic programming algorithms have been proposed. In a seminal paper, Apostolico et al. [4] studied parallel algorithms for the string editing problem and other related problems by considering the Directed Acyclic Graph (DAG) corresponding to the problem and computing this graph in parall...
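For string editing, the DAG has one node per cell $(i, j)$ of the dynamic programming table, with edges into it from $(i-1, j)$, $(i, j-1)$ and $(i-1, j-1)$. Cells on the same anti-diagonal $i + j = d$ therefore never depend on one another, so each anti-diagonal can be filled in parallel. The sketch below computes the unit-cost (Levenshtein) edit distance in this wavefront order; it only illustrates the dependency structure and is not the algorithm of Apostolico et al. The function name is hypothetical, and the inner loop is the part a multicore version would split across cores.

```python
def edit_distance_wavefront(s: str, t: str) -> int:
    """Levenshtein distance, filling one anti-diagonal (i + j = d) at a time."""
    m, n = len(s), len(t)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for d in range(m + n + 1):
        # Every cell with i + j = d depends only on diagonals d - 1 and d - 2,
        # so the iterations of this inner loop are mutually independent.
        for i in range(max(0, d - n), min(m, d) + 1):
            j = d - i
            if i == 0:
                D[i][j] = j
            elif j == 0:
                D[i][j] = i
            else:
                cost = 0 if s[i - 1] == t[j - 1] else 1
                D[i][j] = min(
                    D[i - 1][j] + 1,         # delete s[i-1]
                    D[i][j - 1] + 1,         # insert t[j-1]
                    D[i - 1][j - 1] + cost,  # match or substitute
                )
    return D[m][n]
```

With $p$ cores each anti-diagonal of length $L$ takes about $\lceil L/p \rceil$ steps, so the available parallelism is limited by the short diagonals near the corners of the table.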

49 |
On Communication Latency in PRAM Computations
- Aggarwal, Chandra, et al.
Citation Context ...e. Among the alternatives introduced were, to name a few examples, the LogP model [10, 18], the LogGP model [3], the bulk-synchronous parallel model [24], and the Asynchronous PRAM [16], among others [20, 23, 1, 2, 7]. In practice there were various important drawbacks of the PRAM model, such as the cost of synchronization, the cost of interprocessor communication, the cost-effectiveness of a massively parallel ma...

46 |
Relations between concurrent-write models of parallel computation
- Fich, Ragde, et al.
- 1988

34 |
Provably good multicore cache performance for divide-and-conquer algorithms
- Blelloch, Chowdhury, et al.
- 2008
Citation Context ...iangulation, polygon triangulation, and convex hull, among others. Experiments for mergesort and matrix multiplication implementations in the LoPRAM model are presented in the Appendix. A similar work [5] studies a cache model for multicore computation for a general class of divide-and-conquer algorithms; however, it assumes that the merging phase can always be done in parallel. 4.2 Dynamic Programmin...

28 |
Bulk synchronous parallel computing: a paradigm for transportable software
- Cheatham, Fahmy, et al.
- 1995

20 |
Cache-efficient dynamic programming algorithms for multicores
- Chowdhury, Ramachandran
- 2008
Citation Context ...can always be done in parallel. 4.2 Dynamic Programming. In the past, parallel versions of certain dynamic programming algorithms have been proposed (see, for example, [4], [16], [6], and more recently [9]). Most of these studies provide parallel algorithms that are specific to a few dynamic programming problems, and assume a classical PRAM model with Θ(n) processors. In our case we restrict ourselves ...

18 |
Parallel dynamic programming
- Galil, Park
- 1991
Citation Context ...lgorithms for the string editing problem and other related problems by considering the Directed Acyclic Graph (DAG) corresponding to the problem and computing this graph in parallel. Galil and Park [14] studied various dynamic programming problems, presenting a unified framework for the parallel computation of these problems using the closure method and the matrix product method as general tools f...

14 |
Efficient PRAM simulation on a distributed memory machine (Algorithmica)
- Karp, Luby, Meyer auf der Heide
- 1996
Citation Context ...ious attempts were made to refine it in a way that would better align to what could effectively be achieved in practice. Among the alternatives introduced were, to name a few examples, the LogP model [10, 18], the LogGP model [3], the bulk-synchronous parallel model [24], and the Asynchronous PRAM [16], among others [20, 23, 1, 2, 7]. In practice there were various important drawbacks of the PRAM model, s...

7 |
Delayed side-effects ease multi-core programming
- Lokhmotov, Mycroft, et al.
- 2007
Citation Context ...can be computed ahead of time. Lokhmotov et al. introduced the concept of sieves, which are blocks of code in which all side-effects within them are delayed until the end of the scope and side-effects [19]. Such primitives naturally exploit the parallelism present in the sequential program with minimal effort from the programmer. 5 Conclusions. We introduced a new model for parallel computation, LoPRAM, t...

6 |
Parallel dynamic programming
- Bradford
- 1994
Citation Context ...enting a unified framework for the parallel computation of these problems using the closure method and the matrix product method as general tools for developing parallel algorithms. Later, Bradford [5] developed a characterization that models dynamic programming tables by graphs, leading to polylogarithmic-time algorithms for optimal matrix chain ordering, the optimal construction of binary trees a...

5 |
A dual of Dilworth’s decomposition theorem
- Mirsky
- 1971
Citation Context ... then move on to the next antichain. A dual of Dilworth’s theorem states that the size of the largest chain in a poset equals the smallest number of antichains into which the poset may be partitioned [11, 21]. Suppose that $c_1 c_2 \ldots c_l$ is a largest chain in the poset. At step $i$ we process $c_i$ together with other elements in its antichain, i.e., elements that are incomparable with $c_i$. These antichains capt...

4 |
Relations between concurrent-write models of parallel computation
- Fich, Ragde, Wigderson
- 1988
Citation Context ... model assumed. For CREW, a serialization mechanism is needed to update this value concurrently. This can be done with a $\log p$ overhead using standard techniques for simulating a CRCW PRAM on a CREW PRAM [12]. The speedup factor in this case is, as noted by Apostolico et al. [4], heavily dependent on the amount of parallelism imbued in the recursive structure of the solution, which we shall discuss in the ...
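One standard way to obtain a $\log p$ overhead for a maximum-style concurrent update is a binary combining tree. A minimal sketch, assuming the concurrent update is a maximum (the snippet does not say which operation is combined): in each round, adjacent pairs of surviving proposals are reduced by one processor that writes a single new cell, so no two writes collide, and after $\lceil \log_2 p \rceil$ rounds one value remains. The function `tree_combine_max` is hypothetical; each round's comparisons are independent and would run simultaneously on a PRAM, while here they run sequentially.

```python
def tree_combine_max(proposals: list[int]) -> tuple[int, int]:
    """Combine p concurrent proposals into their maximum with exclusive writes.

    Returns (maximum, number of rounds); rounds == ceil(log2 p) for p >= 1.
    """
    if not proposals:
        raise ValueError("need at least one proposal")
    values = list(proposals)
    rounds = 0
    while len(values) > 1:
        # Each adjacent pair is reduced into its own output cell: one writer per cell.
        values = [max(values[i:i + 2]) for i in range(0, len(values), 2)]
        rounds += 1
    return values[0], rounds
```

For five proposals `[3, 9, 2, 7, 5]` the tree needs three rounds, matching $\lceil \log_2 5 \rceil = 3$.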

4 |
Towards an Architecture-Independent Analysis of Parallel Algorithms
- Papadimitriou, Yannakakis
- 1988

2 |
Parallel algorithms and serial data structures
- Munro, Robertson
- 1979
Citation Context ...ssumption of as many as Θ(n) processors being available, there is previous work in the literature considering a smaller number of processors for certain specific cases. For example, Munro and Robertson [22] proved in 1979 that a priority queue algorithm with optimal speedup exists so long as $p = O(\log n)$. Structure of the paper. In Section 3 we introduce the LoPRAM, a formal model for multicore computin...