## Pipelining with Futures (1997)


Citations: 8 (5 self)

### BibTeX

```bibtex
@MISC{Blelloch97pipeliningwith,
  author = {G. E. Blelloch and M. Reid-Miller},
  title  = {Pipelining with Futures},
  year   = {1997}
}
```

### Abstract

Pipelining has been used in the design of many PRAM algorithms to reduce their asymptotic running time. Paul, Vishkin, and Wagener (PVW) used the approach in a parallel implementation of 2-3 trees. The approach was later used by Cole in the first O(lg n) time sorting algorithm on the PRAM not based on the AKS sorting network, and has since been used to improve the time of several other algorithms. Although the approach has improved the asymptotic time of many algorithms, there are two practical problems: maintaining the pipeline is quite complicated for the programmer, and the pipelining forces highly synchronous code execution. Synchronous execution is less practical on asynchronous machines and makes it difficult to modify a schedule to use less memory or to take better advantage of locality.
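The paper's core idea can be sketched with futures: a producer hands back each list head immediately together with a future for the rest of the list, so a consumer can begin working on early elements while later ones are still being computed. The sketch below is ours, in Python rather than the ML subset the paper uses; `produce` and `collect` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor()

def produce(xs):
    # A pipelined list: the head is available at once, while a forked
    # thread (the future) computes the rest of the list.
    if not xs:
        return None
    return (xs[0] * xs[0], pool.submit(produce, xs[1:]))

def collect(stream):
    # The consumer uses the head immediately and suspends on the
    # future only if the tail has not been written yet.
    out = []
    while stream is not None:
        head, tail = stream
        out.append(head)
        stream = tail.result()
    return out

print(collect(produce([1, 2, 3, 4])))  # → [1, 4, 9, 16]
```

Note that no explicit pipeline bookkeeping appears in the code; the overlap between producer and consumer falls out of the future construct itself.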

### Citations

1566 | The definition of Standard ML
- Milner, Tofte, et al.
- 1990
Citation Context: ...ict on the head but not the second or any other element. We will make significant use of this property in the algorithms in this paper. To describe the algorithms in this paper, we use a subset of ML [27] extended with futures. The syntax is defined in the appendix (see Figure 8). The subset we use is purely functional (no side effects), and we use arrays only for the 2-6 tree described in Section 6 a...

1132 | A bridging model for parallel computation
- Valiant
- 1990
Citation Context: ... · Ts(p)) time, where Ts(p) is the time for a scan operation (all-prefix-sums) used for load balancing the tasks. Our implementation also implies time bounds of O(gw/p + d(Ts(p) + L)) on the BSP [30], O(w/p + d lg p) on an Asynchronous EREW PRAM [20], and O(w/p + d) on the EREW Scan model [9]. The conversion to linear code is a simple manipulation that can be done by a compiler. Although this con...

447 | Multilisp: A language for concurrent symbolic computation - Halstead - 1985

401 | Scheduling multithreaded computations by work stealing
- Blumofe, Leiserson
- 1994
Citation Context: ...essors and not by the pipelining itself; in the PRAM the processor allocation needs to be done by the user and often requires significant effort. 2 The Model As with the work of Blumofe and Leiserson [12, 13] we model a computation as a set of threads and the cost as a directed acyclic graph (DAG). Threads can fork new threads using a future, and can synchronize by requesting a value written by another th...

284 | Parallel merge sort
- Cole
- 1988
Citation Context: ...n), by pipelining the tasks through the tree. The idea is that when task i is working on level j of the tree, task i + 1 can work on level j - 1, and so on. A similar idea was then used by Cole [19] to develop the first O(lg n) time sorting algorithm for the PRAM that was not based on the AKS sorting network [2], which has very large constants. The algorithm is based on parallel mergesort, and i...
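The payoff of pipelining tasks through tree levels, as in the context above, can be checked with a little arithmetic: if task i + 1 enters each level one step after task i, the last of t tasks finishes after levels + t - 1 steps rather than t · levels. A minimal sketch, with function names of our own and each "step" standing for the O(1) work a task does per level:

```python
def unpipelined_steps(tasks, levels):
    # each task traverses every level before the next task starts
    return tasks * levels

def pipelined_steps(tasks, levels):
    # task i + 1 trails task i by one level, so the tasks overlap:
    # the first finishes at `levels`, each later task one step after it
    return levels + tasks - 1

# e.g. 32 tasks through a tree of depth lg n = 20
print(unpipelined_steps(32, 20), pipelined_steps(32, 20))  # → 640 51
```

With t = O(lg m) tasks and lg n levels this is exactly the O(lg m · lg n) versus O(lg n + lg m) gap the surrounding contexts describe.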

281 | Computational interpretation of linear logic
- Abramsky
- 1990
Citation Context: ...near code has been studied extensively in the programming language community in the context of various memory optimizations, such as updating functional data in place or simplifying memory management [26, 31, 5, 1, 18]. Linearizing code does not affect the performance of any of the algorithms we have considered in this paper. For example, consider the body of the split code in Figure 2, lines 4-11. The only variab...

238 | The parallel evaluation of general arithmetic expressions
- Brent
- 1974
Citation Context: ...ases the work is O(m lg n). To complete the analysis we next consider implementations of the language-based model on various machines. The work and depth costs along with Brent's scheduling principle [14] imply that, given a computation with depth d and work w, there is a schedule of actions onto processors such that the computation will run in w/p + d time on a p-processor PRAM. This, however, does not...
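Brent's bound quoted above can be checked on a small DAG with a greedy scheduler: at every step run up to p ready actions; steps that run a full p actions number at most w/p, and the remaining steps each advance the critical path, so there are at most d of them. This simulation is our own illustration, not code from the paper:

```python
def greedy_steps(preds, p):
    # preds[v] = set of v's predecessors in the computation DAG
    indeg = {v: len(ps) for v, ps in preds.items()}
    succs = {v: [] for v in preds}
    for v, ps in preds.items():
        for u in ps:
            succs[u].append(v)
    ready = [v for v in preds if indeg[v] == 0]
    steps = 0
    while ready:
        steps += 1
        batch, ready = ready[:p], ready[p:]   # greedy: run up to p actions
        for v in batch:
            for w in succs[v]:
                indeg[w] -= 1
                if indeg[w] == 0:
                    ready.append(w)
    return steps

# a chain of depth 10 plus 20 independent actions: w = 30, d = 10
dag = {i: ({i - 1} if 0 < i < 10 else set()) for i in range(30)}
p = 4
assert greedy_steps(dag, p) <= 30 / p + 10   # Brent: at most w/p + d steps
```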

209 | An O(n log n) sorting network
- Ajtai, Komlós, et al.
- 1983
Citation Context: ...i + 1 can work on level j - 1, and so on. A similar idea was then used by Cole [19] to develop the first O(lg n) time sorting algorithm for the PRAM that was not based on the AKS sorting network [2], which has very large constants. The algorithm is based on parallel mergesort, and it uses a parallel merge that takes O(lg n) time. The natural implementation would therefore take O(lg² n) time---t...

193 | Programming parallel algorithms
- Blelloch
- 1996
Citation Context: ...language-based cost models, as opposed to machine-based models, and is an extension of our work on the NESL programming language and its corresponding cost model based on work and depth (summarized in [10]). Acknowledgements We would like to thank Jonathan Hardwick and Girija Narlikar for looking over drafts of this paper and making several useful comments. We would also like to thank Bob Harper for po...

157 | Scans as Primitive Parallel Operations
- Blelloch
- 1989
Citation Context: ...ad balancing the tasks. Our implementation also implies time bounds of O(gw/p + d(Ts(p) + L)) on the BSP [30], O(w/p + d lg p) on an Asynchronous EREW PRAM [20], and O(w/p + d) on the EREW Scan model [9]. The conversion to linear code is a simple manipulation that can be done by a compiler. Although this conversion can potentially increase the work and/or de...

137 | Randomized Search Trees
- Aragon, Seidel
- 1989
Citation Context: ...g the same code but implementing it with futures, the depth is reduced to O(lg n), which meets previous depth bounds. The next two algorithms use a parallel implementation of the treap data structure [3]. We show randomized algorithms for inserting m keys into and deleting m keys from a treap of size n in O(lg n + lg m) expected depth and O(m lg(n/m)) expected work. Like the merge algorithm, the code...

102 | Mul-T: A High-Performance Parallel Lisp
- Kranz, Halstead, et al.
- 1989
Citation Context: ...uct, alleviating these problems. The futures construct was developed in the late 70s for expressing parallelism in programming languages [21, 6] and has been included in several programming languages [24, 25, 15, 17, 16]. Conceptually the future construct forks a new thread t1 to calculate a value (evaluate an expression) and immediately returns a pointer to where the result of t1 will be written. This pointer can th...
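The construct described in the context above can be approximated in Python with `concurrent.futures`: `submit` forks the evaluation and immediately hands back a handle, and touching the handle with `.result()` suspends the toucher until the value has been written. This is an analogy only; the languages cited implement futures natively.

```python
from concurrent.futures import ThreadPoolExecutor

def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

pool = ThreadPoolExecutor()
f = pool.submit(fib, 20)   # fork thread t1; the call returns immediately
# ... the parent thread is free to do other work here ...
print(f.result())          # touch: suspend until t1 writes its value → 6765
```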

92 | Lively linear Lisp—look ma, no garbage
- Baker
- 1992
Citation Context: ...near code has been studied extensively in the programming language community in the context of various memory optimizations, such as updating functional data in place or simplifying memory management [26, 31, 5, 1, 18]. Linearizing code does not affect the performance of any of the algorithms we have considered in this paper. For example, consider the body of the split code in Figure 2, lines 4-11. The only variab...

91 | The linear abstract machine
- Lafont
- 1988
Citation Context: ...s based on ideas from linear logic [22]. In the context of this paper linearizing code implies that whenever a variable is referenced more than once in the code a copy is made implicitly for each use [26]. The copy must be a so-called deep copy which copies the full structure (e.g. if a variable refers to a list, the full list must be copied, not just the head). 3 Linearized code has the property that...
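In Python terms, linearizing a doubly used variable as described above means spending an explicit deep copy on the second use; afterwards either copy may be updated in place without observable sharing. This is our own illustration; the paper's linearization is performed on its ML subset by a compiler.

```python
import copy

xs = [1, [2, 3]]

# `xs` is referenced twice below, so linearized code inserts a deep
# copy for the second use: the full structure is copied, not just the head.
first_use = xs
second_use = copy.deepcopy(xs)

second_use[1][0] = 99        # update one copy in place...
print(first_use[1][0])       # ...the other use is unaffected → 2
```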

89 | Is there a use for linear logic
- Wadler
- 1991
Citation Context: ...near code has been studied extensively in the programming language community in the context of various memory optimizations, such as updating functional data in place or simplifying memory management [26, 31, 5, 1, 18]. Linearizing code does not affect the performance of any of the algorithms we have considered in this paper. For example, consider the body of the split code in Figure 2, lines 4-11. The only variab...

82 | Space-efficient scheduling of multithreaded computations
- Blumofe, Leiserson
- 1998
Citation Context: ...essors and not by the pipelining itself; in the PRAM the processor allocation needs to be done by the user and often requires significant effort. 2 The Model As with the work of Blumofe and Leiserson [12, 13] we model a computation as a set of threads and the cost as a directed acyclic graph (DAG). Threads can fork new threads using a future, and can synchronize by requesting a value written by another th...

81 | Provably efficient scheduling for languages with fine-grained parallelism
- Blelloch, Gibbons, et al.
- 1999
Citation Context: ...tation spawns n threads and places them in the set of active threads. Since creating n threads could take more than constant time on p processors, they are created lazily using a stub as described in [7]; threads are expanded when taken from S instead of when inserted. For each block of p or less threads that are scheduled from the set in a particular step, we can use the scan primitive assumed in t...

75 | The incremental garbage collection of processes
- Hewitt
- 1977
Citation Context: ...orithms can be automatically pipelined using the futures construct, alleviating these problems. The futures construct was developed in the late 70s for expressing parallelism in programming languages [21, 6] and has been included in several programming languages [24, 25, 15, 17, 16]. Conceptually the future construct forks a new thread t1 to calculate a value (evaluate an expression) and immediately retu...

75 | The APRAM: incorporating asynchrony into the PRAM model
- Cole, Zajicek
- 1989
Citation Context: ...scan operation (all-prefix-sums) used for load balancing the tasks. Our implementation also implies time bounds of O(gw/p + d(Ts(p) + L)) on the BSP [30], O(w/p + d lg p) on an Asynchronous EREW PRAM [20], and O(w/p + d) on the EREW Scan model [9]. The conversion to linear code is a simple manipulation that can be done by a compiler. Although this conversion can potentially increase the work and/or de...

71 | A provable time and space efficient implementation of NESL
- Blelloch, Greiner
- 1996
Citation Context: ...time stamps of the results in order to determine the depth of the computation. The model, as defined here, is basically the PSL (Parallel Speculative λ-Calculus) [23], augmented with arrays as in NESL [11]. Although the PSL only considered the pure λ-calculus with arithmetic operations, the syntactic sugar we have included only affects work and depth by a constant factor. In this paper we are actually...

58 | Cascading divide-and-conquer: a technique for designing parallel algorithms
- Atallah, Cole, et al.
- 1989
Citation Context: ...gorithm that takes O(lg n) time. The basic idea of Cole's mergesort was later used in a technique called cascading divide-and-conquer, which improved the time of many computational geometry algorithms [4]. Although pipelining has led to theoretical improvements in algorithms, from a practical point of view pipelining can be very cumbersome for the programmer: managing the pipeline involves careful t...

35 | Reference counting as a computational interpretation of linear logic
- Chirimar, Gunter, et al.
- 1996

30 | Parallel dictionaries on 2-3 trees
- Paul, Vishkin, et al.
- 1983
Citation Context: ... to improve the time of many parallel algorithms for shared-memory models. Paul, Vishkin and Wagener described a pipelined algorithm for inserting m new keys into a balanced 2-3 tree with n keys in it [28]. They first considered a nonpipelined algorithm that has O(lg m) tasks, each of which takes O(lg n) parallel time (steps), for a total time To appear in the Ninth Annual ACM Symposium on Parallel Alg...

27 | Space-efficient scheduling of parallelism with synchronization variables
- Blelloch, Gibbons, et al.
- 1997
Citation Context: ...rithms of what happens on what step. This gives freedom to the implementation as to how to schedule the tasks. The implementation, for example, could optimize the schedule for either space efficiency [12, 7, 8] or locality [13]. On a uniprocessor the implementation could run the code in a purely sequential mode without any need for synchronization. We are not yet sure how general the approach is. We have no...

20 | A future-based parallel language for a general-purpose highly-parallel computer
- Callahan, Smith
- 1990
Citation Context: ...uct, alleviating these problems. The futures construct was developed in the late 70s for expressing parallelism in programming languages [21, 6] and has been included in several programming languages [24, 25, 15, 17, 16]. Conceptually the future construct forks a new thread t1 to calculate a value (evaluate an expression) and immediately returns a pointer to where the result of t1 will be written. This pointer can th...

17 | A provably time-efficient parallel implementation of full speculation
- Greiner, Blelloch
- 1999
Citation Context: ...s and analyze the algorithms in this model. We then show universal bounds for implementing the model on various machine models. For the language-based model we use a slight variation of the PSL model [23]. In this model computations are viewed as dynamically unfolding DAGs where each node is a unit of computation (action) and each edge between nodes represents a dependence implied by the language. The...

15 | Fast set operations using treaps - Blelloch, Reid-Miller - 1998

14 | COOL: a language for parallel programming
- Chandra, Gupta, et al.
- 1990
Citation Context: ...uct, alleviating these problems. The futures construct was developed in the late 70s for expressing parallelism in programming languages [21, 6] and has been included in several programming languages [24, 25, 15, 17, 16]. Conceptually the future construct forks a new thread t1 to calculate a value (evaluate an expression) and immediately returns a pointer to where the result of t1 will be written. This pointer can th...

14 | Aspects of applicative programming for parallel processing
- Friedman, Wise
- 1978
Citation Context: ...orithms can be automatically pipelined using the futures construct, alleviating these problems. The futures construct was developed in the late 70s for expressing parallelism in programming languages [21, 6] and has been included in several programming languages [24, 25, 15, 17, 16]. Conceptually the future construct forks a new thread t1 to calculate a value (evaluate an expression) and immediately retu...

9 | Early experiences with OLDEN (parallel programming)
- Carlisle, Rogers, et al.
- 1993


1 | Expected work to meld two treaps. Unpublished manuscript
- Reid-Miller
- 1996
Citation Context: ...hts of the two treaps is O(lg n) and O(lg m) [3], the expected depth to meld them is O(lg n + lg m). Theorem 4.4 The expected work to meld two treaps of size n and m, m < n, is O(m lg(n/m)). Proof. See [29]. The proof of the depth to merge two trees follows directly from Corollary 4.3. The proof of the work bound for merge is easier than for meld because the input trees are balanced. Meld requires an e...