## Optimal Schedules for Parallel Prefix Computation with Bounded Resources (1991)

Venue: Proceedings of the Third ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '91)

Citations: 11 (6 self)

### BibTeX

    @INPROCEEDINGS{Nicolau91optimalschedules,
      author    = {Alexandru Nicolau and Haigeng Wang},
      title     = {Optimal Schedules for Parallel Prefix Computation with Bounded Resources},
      booktitle = {Proceedings of the Third ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming},
      year      = {1991},
      pages     = {21--24},
      publisher = {ACM Press}
    }

### Abstract

Given x_1, ..., x_N, parallel prefix computes x_1 ⊕ x_2 ⊕ ... ⊕ x_k, for 1 ≤ k ≤ N, where ⊕ is an associative operation. We show optimal schedules for parallel prefix computation with a fixed number of resources p ≥ 2 for a prefix of size N ≥ p(p+1)/2. The time of the optimal schedules with p resources is ⌈2N/(p+1)⌉ for N ≥ p(p+1)/2, which we prove to be a strict lower bound (i.e., no schedule with p resources can do better). We then present a pipelined form of the optimal schedules with ⌈2N/(p+1)⌉ + ⌈(p−1)/2⌉ − 1 time, a constant overhead of ⌈(p−1)/2⌉ − 1 steps over the optimal schedules. Parallel prefix is an important common operation in many algorithms, including the evaluation of polynomials, general Horner expressions, carry look-ahead circuits, and ranking and packing problems. A most important application of parallel prefix is loop parallelizing transformation.
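As a concrete illustration of the problem statement and the claimed bound (not of the paper's actual schedule construction), here is a minimal Python sketch: a sequential prefix scan under an associative operation, plus the paper's optimal-time formula ⌈2N/(p+1)⌉, valid when N ≥ p(p+1)/2. Function names are ours, not the paper's.

```python
import math
from itertools import accumulate
from operator import add

def prefix(xs, op=add):
    """Inclusive prefix: result[k] = xs[0] op xs[1] op ... op xs[k]."""
    return list(accumulate(xs, op))

def optimal_time(N, p):
    """The paper's strict lower bound ceil(2N/(p+1)) on schedule length,
    stated for p >= 2 resources and prefix size N >= p(p+1)/2."""
    assert p >= 2 and N >= p * (p + 1) // 2
    return math.ceil(2 * N / (p + 1))

print(prefix([1, 2, 3, 4]))   # -> [1, 3, 6, 10]
print(optimal_time(12, 3))    # ceil(24/4) -> 6
```

Note that `op` may be any associative operation (sum, product, max, carry propagation), which is what makes the scheduling result broadly applicable.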

### Citations

304 | Advanced compiler optimizations for supercomputers
- Padua, Wolfe
- 1986
Citation Context: …carry lookahead circuits and ranking and packing problems. One of its most important applications is loop parallelization. Loop parallelization techniques have been extensively studied. These techniques [10, 2] have demonstrated good performance subject to preserving dependences. To understand how to parallelize loops with loop-carried true dependence beyond these techniques, it is essential to understand t…

275 | Parallel Prefix Computation
- Ladner, Fischer
- 1980
Citation Context: …factor) even with enough processors, and its requirement for a number of processors that is a significant fraction of problem size becomes unrealistic when the problem size increases. Ladner and Fischer [7] found optimal circuits for a prefix of size N = 2^k for k > 0, assuming enough processors are available. Two problems have since been open: finding optimal circuits for N not a power of two and findin…
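The unbounded-processor setting Ladner and Fischer studied can be illustrated with the textbook recursive-doubling scan: each synchronous step doubles the stride, so the scan finishes in ⌈log2 N⌉ steps. This is a PRAM-style sketch of log-depth prefix in general, not their specific circuit construction, and the function name is ours.

```python
from operator import add

def scan_recursive_doubling(xs, op=add):
    """Log-depth inclusive scan (recursive doubling). Each pass simulates
    one synchronous PRAM step in which all positions update in parallel."""
    ys = list(xs)
    d = 1
    while d < len(ys):
        # position k >= d combines with the partial result d slots behind it
        ys = [ys[k] if k < d else op(ys[k - d], ys[k]) for k in range(len(ys))]
        d *= 2
    return ys

print(scan_recursive_doubling([1, 2, 3, 4, 5, 6, 7, 8]))
# -> [1, 3, 6, 10, 15, 21, 28, 36]
```

The catch the surrounding text points out: this uses on the order of N processors per step, which is exactly the assumption the paper's bounded-resource schedules remove.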

119 | Optimal loop parallelization
- Aiken, Nicolau
Citation Context: …ratio in every p steps in an iteration. If a schedule can achieve this ratio in fewer steps than p, then it has a smaller loop body than H and is still optimal. Applying the idea of Perfect Pipelining [2, 8], we can indeed derive a simpler, more concise and more program-space efficient schedule than H, called Pipelined Schedule or schedule P. Schedule P has in its loop body one statement with (p+1)/2 fin…

61 | Perfect Pipelining: A new loop parallelization technique
- Aiken, Nicolau
- 1988
Citation Context: …carry lookahead circuits and ranking and packing problems. One of its most important applications is loop parallelization. Loop parallelization techniques have been extensively studied. These techniques [10, 2] have demonstrated good performance subject to preserving dependences. To understand how to parallelize loops with loop-carried true dependence beyond these techniques, it is essential to understand t…

53 | The Power of Parallel Prefix
- Kruskal, Rudolph, et al.
- 1985
Citation Context: …are also concerned with other properties of these optimal schedules, such as clarity, simplicity of implementation, and extendibility. We assume the parallel random-access machine (PRAM) computation model [4, 5]: a PRAM consists of p autonomous processors, executing synchronously, all having access to a common memory. We consider two practical PRAM models, concurrent read, exclusive write (CREW) and exclusive…

45 | The Structure of Computers and Computations
- Kuck
- 1978
Citation Context: …d. We then present a simpler, more concise and more program-space efficient schedule than the previous one with ⌈2N/(p+1)⌉ + ⌈(p+1)/2⌉ − 1 time. All these schedules run on CREW machines. Chen and Kuck [3, 6] showed that linear recurrences can be computed in 1 + 2 log N time with p ≤ N/2. This result is not optimal (by a constant factor) even with enough processors, and its requirement for a number of processors…
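The linear-recurrence connection mentioned here can be sketched by noting that a first-order recurrence x_k = a_k·x_{k−1} + b_k is a prefix computation over affine maps, composed by the associative rule (a2, b2) ∘ (a1, b1) = (a2·a1, a2·b1 + b2). This is a standard reduction, shown here only to motivate why prefix schedules apply to recurrences; the function name is illustrative, not from the cited works.

```python
from itertools import accumulate

def solve_linear_recurrence(a, b, x0):
    """Solve x[k] = a[k]*x[k-1] + b[k] (1-indexed in the math, 0-indexed
    here) by a prefix scan over affine maps (slope, intercept)."""
    # compose(f, g): apply f first, then g => g(f(x)) = g0*f0*x + g0*f1 + g1
    compose = lambda f, g: (g[0] * f[0], g[0] * f[1] + g[1])
    prefixes = accumulate(zip(a, b), compose)
    # evaluate each prefix map at the initial value x0
    return [pa * x0 + pb for pa, pb in prefixes]

# x_k = 2*x_{k-1} + 1 with x_0 = 0 gives x_k = 2^k - 1
print(solve_linear_recurrence([2, 2, 2, 2], [1, 1, 1, 1], 0))
# -> [1, 3, 7, 15]
```

Because map composition is associative, any parallel prefix schedule, including the bounded-resource ones in this paper, evaluates all N recurrence values at once.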

41 | Faster Optimal Parallel Prefix Sums and List Ranking
- Cole, Vishkin
- 1989
Citation Context: …are also concerned with other properties of these optimal schedules, such as clarity, simplicity of implementation, and extendibility. We assume the parallel random-access machine (PRAM) computation model [4, 5]: a PRAM consists of p autonomous processors, executing synchronously, all having access to a common memory. We consider two practical PRAM models, concurrent read, exclusive write (CREW) and exclusive…

3 | Speedup of Iterative Programs in Multiprocessing Systems
- Chen
- 1975
Citation Context: …d. We then present a simpler, more concise and more program-space efficient schedule than the previous one with ⌈2N/(p+1)⌉ + ⌈(p+1)/2⌉ − 1 time. All these schedules run on CREW machines. Chen and Kuck [3, 6] showed that linear recurrences can be computed in 1 + 2 log N time with p ≤ N/2. This result is not optimal (by a constant factor) even with enough processors, and its requirement for a number of processors…