Results 1–10 of 67
Scheduling Multithreaded Computations by Work Stealing
Cited by 398 (38 self)
Abstract:
This paper studies the problem of efficiently scheduling fully strict (i.e., well-structured) multithreaded computations on parallel computers. A popular and practical method of scheduling this kind of dynamic MIMD-style computation is "work stealing," in which processors needing work steal computational threads from other processors. In this paper, we give the first provably good work-stealing scheduler for multithreaded computations with dependencies. Specifically, …
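The work-stealing discipline the abstract describes can be sketched in a few lines. This is an illustrative toy, not the paper's scheduler: each worker owns a double-ended queue, works LIFO off the bottom of its own deque, and an idle worker steals FIFO from the top of a random victim's deque.

```python
from collections import deque
import random

def run(tasks, num_workers=4, seed=0):
    """Toy work-stealing loop: returns the tasks in completion order."""
    rng = random.Random(seed)
    deques = [deque() for _ in range(num_workers)]
    deques[0].extend(tasks)              # all work starts on worker 0
    done = []
    while any(deques):
        for w, dq in enumerate(deques):
            if dq:
                done.append(dq.pop())    # owner: pop LIFO from the bottom
            else:
                victim = rng.randrange(num_workers)
                if deques[victim]:
                    # thief: steal FIFO from the top of a random victim
                    dq.append(deques[victim].popleft())
    return done

print(sorted(run(list(range(10)))) == list(range(10)))  # → True
```

The bottom/top asymmetry is the point: the owner keeps the freshest (cache-warm, fine-grained) tasks, while thieves take the oldest (typically largest) tasks, which keeps steals rare.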
Thread scheduling for multiprogrammed multiprocessors
 In Proceedings of the Tenth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), Puerto Vallarta
, 1998
Cited by 166 (5 self)
Abstract:
We present a user-level thread scheduler for shared-memory multiprocessors, and we analyze its performance under multiprogramming. We model multiprogramming with two scheduling levels: our scheduler runs at user level and schedules threads onto a fixed collection of processes, while below, the operating system kernel schedules processes onto a fixed collection of processors. We consider the kernel to be an adversary, and our goal is to schedule threads onto processes such that we make efficient use of whatever processor resources are provided by the kernel. Our thread scheduler is a non-blocking implementation of the work-stealing algorithm. For any multithreaded computation with work T1 and critical-path length T∞, and for any number P of processes, our scheduler executes the computation in expected time O(T1/PA + T∞P/PA), where PA is the average number of processors allocated to the computation by the kernel. This time bound is optimal to within a constant factor, and achieves linear speedup whenever P is small relative to the parallelism T1/T∞.
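Plugging assumed numbers into the stated bound O(T1/PA + T∞P/PA) shows when the work term dominates, i.e., when speedup is linear in the processors actually granted. All values here are hypothetical.

```python
# Assumed (hypothetical) parameters of a multithreaded computation:
T1, T_inf = 10**8, 10**3   # total work and critical-path length
P, P_A = 64, 32            # processes requested, average processors granted

# The expected-time bound quoted above: O(T1/P_A + T_inf * P / P_A).
bound = T1 / P_A + T_inf * P / P_A
print(bound)  # → 3127000.0: the T1/P_A term (3.125e6) dominates,
              # since P = 64 is small relative to the parallelism T1/T_inf = 1e5
```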
An analysis of dag-consistent distributed shared-memory algorithms
 in Proceedings of the Eighth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA)
, 1996
Cited by 105 (22 self)
Abstract:
In this paper, we analyze the performance of parallel multithreaded algorithms that use dag-consistent distributed shared memory. Specifically, we analyze execution time, page faults, and space requirements for multithreaded algorithms executed by a work-stealing thread scheduler and the BACKER algorithm for maintaining dag consistency. We prove that if the accesses to the backing store are random and independent (the BACKER algorithm actually uses hashing), the expected execution time TP(C) of a "fully strict" multithreaded computation on P processors, each with an LRU cache of C pages, is O(T1(C)/P + mCT∞), where T1(C) is the total work of the computation including page faults, T∞ is its critical-path length excluding page faults, and m is the minimum page transfer time. …
A provable time and space efficient implementation of NESL
 In International Conference on Functional Programming
, 1996
Cited by 70 (7 self)
Abstract:
In this paper we prove time and space bounds for the implementation of the programming language NESL on various parallel machine models. NESL is a sugared typed λ-calculus with a set of array primitives and an explicit parallel map over arrays. Our results extend previous work on provable implementation bounds for functional languages by considering space and by including arrays. For modeling the cost of NESL we augment a standard call-by-value operational semantics to return two cost measures: a DAG representing the sequential dependence in the computation, and a measure of the space taken by a sequential implementation. We show that a NESL program with w work (nodes in the DAG), d depth (levels in the DAG), and s sequential space can be implemented on a p-processor butterfly network, hypercube, or CRCW PRAM using O(w/p + d log p) time and O(s + dp log p) reachable space. For programs with sufficient parallelism these bounds are optimal in that they give linear speedup and use space within a constant factor of the sequential space.
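The work/depth cost model used above can be made concrete with a tiny recursive sketch (my own illustration, assuming one unit of cost per DAG node): for a balanced parallel combine over n elements, work of parallel branches adds while depth takes the max, giving Θ(n) work and Θ(log n) depth.

```python
def map_reduce_costs(n):
    """Return (work, depth) of a balanced binary combine over n leaves,
    charging one unit per DAG node (an assumed, illustrative cost model)."""
    if n == 1:
        return 1, 1                       # a leaf costs one node
    lw, ld = map_reduce_costs(n // 2)
    rw, rd = map_reduce_costs(n - n // 2)
    # Parallel branches: work adds; depth is the max, plus the combine node.
    return lw + rw + 1, max(ld, rd) + 1

w, d = map_reduce_costs(1024)
print(w, d)  # → 2047 11, i.e. w = 2n-1 and d = log2(n)+1
```

With the paper's bound, such a program on p processors runs in O(n/p + log n · log p) time, so the speedup is linear once p ≪ n / log n.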
The data locality of work stealing
 Theory of Computing Systems
, 2000
Cited by 68 (14 self)
Abstract:
This paper studies the data locality of the work-stealing scheduling algorithm on hardware-controlled shared-memory machines. We present lower and upper bounds on the number of cache misses using work stealing, and introduce a locality-guided work-stealing algorithm along with experimental validation. As a lower bound, we show that there is a family of multithreaded computations Gn, each member of which requires Θ(n) total instructions (work), for which when using work stealing the number of cache misses on one processor is constant, while even on two processors the total number of cache misses is Ω(n). This implies that for general computations there is no useful bound relating multiprocessor to uniprocessor cache misses. For nested-parallel computations, however, we show that on P processors the expected additional number of cache misses beyond those on a single processor is bounded by O(C⌈m/s⌉PT∞), where m is the execution time of an instruction incurring a cache miss, s is the steal time, C is the size of the cache, and T∞ is the number of nodes on the longest chain of dependences. Based on this we give strong bounds on the total running time of nested-parallel computations using work stealing. For the second part of our results, we present a locality-guided work-stealing algorithm that improves the data locality of multithreaded computations by allowing a thread to have an affinity for a processor. Our initial experiments on iterative data-parallel applications show that the algorithm matches the performance of static partitioning under traditional workloads but improves the performance up to 50% over static partitioning under multiprogrammed workloads. Furthermore, locality-guided work stealing improves the performance of work stealing by up to 80%.
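Why schedule order matters for locality, as this abstract argues, can be seen with a simple LRU miss counter (a generic model of my own, not the paper's machinery): the same multiset of accesses can be almost all hits or all misses depending on how the work is interleaved.

```python
from collections import OrderedDict

def lru_misses(accesses, capacity):
    """Count misses of an LRU cache of `capacity` lines over an access trace."""
    cache, misses = OrderedDict(), 0
    for line in accesses:
        if line in cache:
            cache.move_to_end(line)        # hit: refresh recency
        else:
            misses += 1
            cache[line] = None
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used
    return misses

C = 4
good = [b for b in range(8) for _ in range(4)]  # blockwise: reuse before eviction
bad  = [b for _ in range(4) for b in range(8)]  # round-robin: evicted before reuse
print(lru_misses(good, C), lru_misses(bad, C))  # → 8 32
```

A locality-guided scheduler in the spirit of the paper steers repeated iterations of the same task to the same processor, pushing its trace toward the "good" ordering.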
Provably good multicore cache performance for divide-and-conquer algorithms
 In Proc. 19th ACM-SIAM Sympos. Discrete Algorithms
, 2008
Cited by 35 (12 self)
Abstract:
This paper presents a multicore-cache model that reflects the reality that multicore processors have both per-processor private (L1) caches and a large shared (L2) cache on chip. We consider a broad class of parallel divide-and-conquer algorithms and present a new online scheduler, controlled-PDF, that is competitive with the standard sequential scheduler in the following sense. Given any dynamically unfolding computation DAG from this class of algorithms, the cache complexity on the multicore-cache model under our new scheduler is within a constant factor of the sequential cache complexity for both L1 and L2, while the time complexity is within a constant factor of the sequential time complexity divided by the number of processors p. These are the first such asymptotically-optimal results for any multicore model. Finally, we show that a separator-based algorithm for sparse-matrix dense-vector multiply achieves provably good cache performance in the multicore-cache model, as well as in the well-studied sequential cache-oblivious model.
Effectively Sharing a Cache Among Threads
 In SPAA '04: Proceedings of the Sixteenth Annual ACM Symposium on Parallelism in Algorithms and Architectures
, 2004
Cited by 34 (10 self)
Abstract:
We compare the number of cache misses M1 for running a computation on a single processor with cache size C1 to the total number of misses Mp for the same computation when using p processors or threads and a shared cache of size Cp. We show that for any computation, and with an appropriate (greedy) parallel schedule, if Cp ≥ C1 + pd then Mp ≤ M1. The depth d of the computation is the length of the critical path of dependences. This gives the perhaps surprising result that for sufficiently parallel computations the shared cache need only be an additive size larger than the single-processor cache, and gives some theoretical justification for designing machines with shared caches. We model …
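To see how small the additive term Cp ≥ C1 + pd really is, here is the condition evaluated on assumed (hypothetical) numbers: the extra shared-cache capacity needed grows only with p·d, independent of the total work.

```python
# Hypothetical parameters: sequential cache size, core count, and depth.
C1 = 1 << 15       # 32768-line single-processor cache (assumed)
p, d = 8, 2000     # processors and critical-path length (assumed)

# Sufficient shared-cache size from the theorem quoted above:
# C_p >= C1 + p*d guarantees M_p <= M1 under a greedy schedule.
Cp_needed = C1 + p * d
print(Cp_needed)   # → 48768: only p*d = 16000 extra lines, ~1.5x C1
```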
Optimizing Parallel Applications for Wide-Area Clusters
 In IPPS '98: International Parallel Processing Symposium
, 1998
Cited by 34 (13 self)
Abstract:
Recent developments in networking technology have caused growing interest in connecting local-area clusters of workstations over wide-area links, creating multilevel clusters, or metacomputers. Often, latency and bandwidth of local-area and wide-area networks differ by two orders of magnitude or more. One would expect only very coarse-grain applications to achieve good performance. To test this intuition, we analyze the behavior of several existing medium-grain applications on a wide-area multicluster. We find that high performance can be obtained if the programs are optimized to take the multilevel network structure into account. The optimizations reduce inter-cluster traffic and hide inter-cluster latency, and substantially improve performance on wide-area multiclusters. As a result, the range of metacomputing applications is larger than previously assumed.
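The "two orders of magnitude" claim above can be made concrete with a standard latency-plus-bandwidth message cost model. The network parameters here are assumed round numbers, not measurements from the paper.

```python
def msg_cost(size_bytes, latency_s, bandwidth_Bps):
    """Simple per-message cost model: startup latency plus transfer time."""
    return latency_s + size_bytes / bandwidth_Bps

# Assumed round-number parameters for the two network levels:
lan = msg_cost(4096, 50e-6, 100e6)  # LAN: ~50 us latency, 100 MB/s
wan = msg_cost(4096, 5e-3, 1e6)     # WAN: ~5 ms latency, 1 MB/s

print(wan / lan)  # → ~100x per message: why reducing and hiding
                  # inter-cluster traffic dominates the optimizations
```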