## Low-Contention Depth-First Scheduling of Parallel Computations with Write-Once Synchronization Variables (2001)

Venue: | In Proc. 13th ACM Symp. on Parallel Algorithms and Architectures (SPAA) |

Citations: | 2 - 0 self |

### BibTeX

@INPROCEEDINGS{Fatourou01low-contentiondepth-first,

author = {Panagiota Fatourou},

title = {Low-Contention Depth-First Scheduling of Parallel Computations with Write-Once Synchronization Variables},

booktitle = {Proc. 13th ACM Symp. on Parallel Algorithms and Architectures (SPAA)},

year = {2001},

pages = {189--198}

}

### Abstract

We present an efficient, randomized, online scheduling algorithm for a large class of programs with write-once synchronization variables. The algorithm combines the work-stealing paradigm with the depth-first scheduling technique, resulting in high space efficiency and good time complexity. By automatically increasing the granularity of the work scheduled on each processor, our algorithm achieves good locality, low contention and low scheduling overhead, improving upon a previous depth-first scheduling algorithm [6] published in SPAA'97. Moreover, it is provably efficient for the general class of multithreaded computations with write-once synchronization variables (as studied in [6]), improving upon algorithm DFDeques (published in SPAA'99 [24]), which is only for the more restricted class of nested parallel computations. More specifically, consider such a computation with work $T_1$, depth $T_\infty$ and $\sigma$ synchronizations, and suppose that space $S_1$ suffices to execute the computation on a single-processor computer. Then, on a $P$-processor shared-memory parallel machine, the expected space complexity of our algorithm is at most $S_1 + O(P T_\infty \log(P T_\infty))$, and its expected time complexity is $O(T_1/P + \sigma \log(P T_\infty)/P + T_\infty \log(P T_\infty))$. Moreover, for any $\epsilon > 0$, the space complexity of our algorithm is $S_1 + O(P (T_\infty + \ln(1/\epsilon)) \log(P (T_\infty + \ln(P (T_\infty + \ln(1/\epsilon))/\epsilon))))$ with probability at least $1 - \epsilon$. Thus, even for values of $\epsilon$ as small as $e^{-T_\infty}$, the space complexity of our algorithm is at most $S_1 + O(P T_\infty \log(P T_\infty))$ with probability at least $1 - e^{-T_\infty}$. These bounds include all time and space costs for both the computation and the scheduler.
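The last step of the abstract, from the general high-probability space bound to the simplified one, can be checked by direct substitution. The following is a sketch and is not taken from the paper. It writes $T_\infty$ for the depth and $\epsilon$ for the failure probability, and it assumes $P \ge 1$, $T_\infty \ge 1$ and $P T_\infty \ge 2$; these assumptions are added here only so that constants can be absorbed into the $O(\cdot)$.

```latex
% Sketch: substitute \epsilon = e^{-T_\infty} into the high-probability space bound.
% Assumptions (not from the source): P >= 1, T_\infty >= 1, P T_\infty >= 2.
\begin{align*}
\epsilon = e^{-T_\infty}
  &\implies \ln(1/\epsilon) = T_\infty,
  \qquad T_\infty + \ln(1/\epsilon) = 2T_\infty \\
\ln\!\left(\frac{P\,(T_\infty + \ln(1/\epsilon))}{\epsilon}\right)
  &= \ln\!\left(2 P T_\infty\, e^{T_\infty}\right)
   = \ln(2 P T_\infty) + T_\infty \\
\log\!\bigl(P\,(T_\infty + \ln(2 P T_\infty) + T_\infty)\bigr)
  &\le \log\!\bigl(P\,(2T_\infty + 2 P T_\infty)\bigr)
  && \text{since } \ln x \le x \\
  &\le \log\!\bigl(4\,(P T_\infty)^2\bigr)
   = O(\log(P T_\infty)) \\
S_1 + O\!\bigl(2 P T_\infty \cdot \log(P T_\infty)\bigr)
  &= S_1 + O(P T_\infty \log(P T_\infty))
\end{align*}
```

So the failure probability can be made exponentially small in the depth while the space bound changes only by a constant factor.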

### Citations

8531 | Introduction to Algorithms
- Cormen, Leiserson, et al.
- 2009
Citation Context ...tinue and spawn edges, we have a binary tree rooted at the first instruction of the root thread. The left-to-right depth-first order is then the order of nodes visited by a preorder walk of this tree [13]. Now considering the dependency edges again, the computation is depth-first if none of its dependency edges violate its left-to-right depth-first order. As in previous work on depth-first scheduling a... |

1879 | An Introduction to Probability Theory and its Applications
- Feller
- 1967
Citation Context ...the scheduling iteration and $|B_t|$ is the size of the working tree at timestep $t$; $|B_t|$ is a random variable, so that $\log(|B_t| + P)$ is also a random variable. We use Jensen's inequality (see e.g., [17]) to prove: (1) $E(\log(|B_t| + P)) \in O(\log(P T_\infty))$, and (2) for any $\epsilon > 0$, $\log(|B_t| + P) \in O(\log(P (T_\infty + \ln(1/\epsilon))))$, with probability at least $1 - \epsilon$. Theorem 4.5 Assume that G is any mult... |

1871 | Randomized Algorithms - Motwani, Raghavan - 1995 |

446 | Multilisp: A language for concurrent symbolic computation
- Halstead
Citation Context ...og($P T_\infty$)) with probability at least $1 - e^{-T_\infty}$. These bounds include all time and space costs for both the computation and the scheduler. 1 Introduction Many parallel programming languages [3, 7, 11, 12, 14, 18, 20, 22, 28] support dynamic threads. The multithreaded model of parallel computation is a general approach to model This work was done while the author was affiliated with the Max-Planck Institut für Informatik,... |

395 | Scheduling multithreaded computations by work stealing
- Blumofe, Leiserson
- 1999
Citation Context ...f a multithreaded computation, a thread may spawn child threads which can be executed in parallel, and it can synchronize with other currently executing threads. In most of the work in the literature [1, 4, 5, 6, 7, 9, 15, 16, 24, 25, 26, 27], a multithreaded computation is modeled as a directed acyclic graph (see Figure 1(a)). Of much concern is how a multithreaded computation can be executed efficiently on a parallel computer. A paralle... |

225 | I-structures: Data structures for parallel computing - Arvind, Pingali - 1989 |

176 | Implementation of a portable nested data-parallel language
- Blelloch, Chatterjee, et al.
- 1993
Citation Context ...og($P T_\infty$)) with probability at least $1 - e^{-T_\infty}$. These bounds include all time and space costs for both the computation and the scheduler. 1 Introduction Many parallel programming languages [3, 7, 11, 12, 14, 18, 20, 22, 28] support dynamic threads. The multithreaded model of parallel computation is a general approach to model This work was done while the author was affiliated with the Max-Planck Institut für Informatik,... |

165 | Thread scheduling for multiprogrammed multiprocessors
- Arora, Blumofe, Plaxton
Citation Context ...f a multithreaded computation, a thread may spawn child threads which can be executed in parallel, and it can synchronize with other currently executing threads. In most of the work in the literature [1, 4, 5, 6, 7, 9, 15, 16, 24, 25, 26, 27], a multithreaded computation is modeled as a directed acyclic graph (see Figure 1(a)). Of much concern is how a multithreaded computation can be executed efficiently on a parallel computer. A paralle... |

104 | An analysis of dag-consistent distributed shared-memory algorithms
- Blumofe, Frigo, et al.
- 1996
Citation Context ...) achieved by work-stealing algorithms [7, 9, 15, 16]. Moreover, depth-first schedulers cope with heap allocation [5] which is more general than the stack-based model assumed in work on work-stealing [7, 8, 9, 15, 16]. However, depth-first schedulers use a globally ordered centralized data structure of "active" threads and thus they are not as practical as work-stealing schedulers. Especially for fine-grained compu... |

102 | Mul-T: A High-Performance Parallel Lisp
- Kranz, Halstead, et al.
- 1989
Citation Context ...og($P T_\infty$)) with probability at least $1 - e^{-T_\infty}$. These bounds include all time and space costs for both the computation and the scheduler. 1 Introduction Many parallel programming languages [3, 7, 11, 12, 14, 18, 20, 22, 28] support dynamic threads. The multithreaded model of parallel computation is a general approach to model This work was done while the author was affiliated with the Max-Planck Institut für Informatik,... |

98 | Compositional C++: Compositional parallel programming - Chandy, Kesselman - 1992 |

97 | A report on the Sisal language project - Feo, Cann, et al. - 1990 |

88 | Jade: A High-Level, Machine-Independent Language for Parallel Programming - Rinard, Scales, et al. - 1993 |

80 | Provably efficient scheduling for languages with fine-grained parallelism
- Blelloch, Gibbons, et al.
- 1995
Citation Context ...f a multithreaded computation, a thread may spawn child threads which can be executed in parallel, and it can synchronize with other currently executing threads. In most of the work in the literature [1, 4, 5, 6, 7, 9, 15, 16, 24, 25, 26, 27], a multithreaded computation is modeled as a directed acyclic graph (see Figure 1(a)). Of much concern is how a multithreaded computation can be executed efficiently on a parallel computer. A paralle... |

70 | Executing multithreaded programs efficiently
- Blumofe
- 1995
Citation Context ...ore complicated data structures, as well as variable memory quota which are logarithmic functions of random variables. Proving our time bound employs new interesting techniques. Blumofe and Leiserson [7, 9] were the first to prove that work-stealing algorithms are both space and time efficient for fully-strict multithreaded computations. Fatourou et al. have presented provably efficient scheduling algor... |

65 | Resource requirements of dataflow programs - Culler, Arvind - 1990 |

40 | COOL: An object-based language for parallel programming - Chandra, Gupta, et al. |

27 | Space-efficient scheduling of parallelism with synchronization variables
- Blelloch, Gibbons, et al.
- 1997
Citation Context ...g the granularity of the work scheduled on each processor, our algorithm achieves good locality, low contention and low scheduling overhead, improving upon a previous depth-first scheduling algorithm [6] published in SPAA'97. Moreover, it is provably efficient for the general class of multithreaded computations with write-once synchronization variables (as studied in [6]), improving upon algorithm DFD... |

18 | Scheduling threads for low space requirement and good locality
- Narlikar
Citation Context ...ver, it is provably efficient for the general class of multithreaded computations with write-once synchronization variables (as studied in [6]), improving upon algorithm DFDeques (published in SPAA'99 [24]), which is only for the more restricted class of nested parallel computations. More specifically, consider such a computation with work $T_1$, depth $T_\infty$ and $\sigma$ synchronizations, and suppose that space S... |

17 | A provably time-efficient parallel implementation of full speculation - Greiner, Blelloch - 1999 |

15 | Prioritization in parallel symbolic computing
- Kale, Ramkumar, et al.
- 1993
Citation Context ...ope with the more general class of computations with synchronization variables, our algorithm employs more complicated data structures. Our algorithm uses a 2-3 parallel priority tree data structure [6, 21] and stores at its leaves lists of threads, which are globally ordered according to their depth-first execution order. Each processor owns one of these lists and works on it as long as there are ready... |

15 | Space-efficient implementation of nested parallel languages. Draft (available from the authors - Narlikar, Blelloch - 1996 |

12 | Pthreads for dynamic and irregular parallelism - Narlikar, Blelloch - 1998 |

4 | Guaranteeing good memory bounds for parallel programs
- Burton
- 1996
Citation Context ...emantics of the corresponding serial programs. A very nice discussion about treating non-determinism under the multithreaded model is provided by Burton in [10, Section 6]. However, all known results [1, 4, 5, 6, 7, 9, 10, 15, 16, 24, 25] in this research area have been proved under the assumption of deterministic computations. This paper is organized as follows. Section 2 presents our model. Our algorithm is presented in Section 3, w... |

4 | Space-Efficient Scheduling for Parallel, Multithreaded Computations
- Narlikar
- 1999
Citation Context ... scheduler is much better than the one of the algorithm in [6]. Although our algorithm works asynchronously, for the purpose of our analysis we assume, like in all previous work in this research area [1, 4, 5, 6, 7, 9, 15, 16, 24, 25], that the timesteps are synchronized across all the processors. In this work, we consider only deterministic parallel computations, that is, computations that always produce the same computation grap... |

2 | A new scheduling algorithm for general strict multithreaded computations - Fatourou, Spirakis - 1999 |

2 | Efficient scheduling of strict multithreaded computations
- Fatourou, Spirakis
- 2000
Citation Context ...emantics of the corresponding serial programs. A very nice discussion about treating non-determinism under the multithreaded model is provided by Burton in [10, Section 6]. However, all known results [1, 4, 5, 6, 7, 9, 10, 15, 16, 24, 25] in this research area have been proved under the assumption of deterministic computations. This paper is organized as follows. Section 2 presents our model. Our algorithm is presented in Section 3, w... |