Results 1  10
of
16
The data locality of work stealing
 Theory of Computing Systems
, 2000
"... This paper studies the data locality of the workstealing scheduling algorithm on hardwarecontrolled sharedmemory machines. We present lower and upper bounds on the number of cache misses using work stealing, and introduce a localityguided workstealing algorithm along with experimental validatio ..."
Abstract

Cited by 68 (14 self)
 Add to MetaCart
This paper studies the data locality of the workstealing scheduling algorithm on hardwarecontrolled sharedmemory machines. We present lower and upper bounds on the number of cache misses using work stealing, and introduce a localityguided workstealing algorithm along with experimental validation. As a lower bound, we show that there is a family of multithreaded computations Gn each member of which requires (n) total instructions (work), for which when using workstealing the number of cache misses on one processor is constant, while even on two processors the total number of cache misses is (n). This implies that for general computations there is no useful bound relating multiprocessor to uninprocessor cache misses. For nestedparallel computations, however, we show that on P processors the expected additional number of cache misses beyond those on a single processor is bounded by O(Cd m e PT1), where m is the execution time s of an instruction incurring a cache miss, s is the steal time, C is the size of cache, and T1 is the number of nodes on the longest chain of dependences. Based on this we give strong bounds on the total running time of nestedparallel computations using work stealing. For the second part of our results, we present a localityguided work stealing algorithm that improves the data locality of multithreaded computations by allowing a thread to have an affinity for a processor. Our initial experiments on iterative dataparallel applications show that the algorithm matches the performance of staticpartitioning under traditional work loads but improves the performance up to 50 % over static partitioning under multiprogrammed work loads. Furthermore, the localityguided work stealing improves the performance of workstealing up to 80%. 1
Carbon: architectural support for finegrained parallelism on chip multiprocessors
 In ISCA ’07: Proceedings of the 34th annual international symposium on Computer architecture
, 2007
"... ABSTRACT Chip multiprocessors (CMPs) are now commonplace, and the number of cores on a CMP is likely to grow steadily. However, in order ..."
Abstract

Cited by 39 (6 self)
 Add to MetaCart
ABSTRACT Chip multiprocessors (CMPs) are now commonplace, and the number of cores on a CMP is likely to grow steadily. However, in order
Provably good multicore cache performance for divideandconquer algorithms
 In Proc. 19th ACMSIAM Sympos. Discrete Algorithms
, 2008
"... This paper presents a multicorecache model that reflects the reality that multicore processors have both perprocessor private (L1) caches and a large shared (L2) cache on chip. We consider a broad class of parallel divideandconquer algorithms and present a new online scheduler, controlledpdf, t ..."
Abstract

Cited by 35 (12 self)
 Add to MetaCart
This paper presents a multicorecache model that reflects the reality that multicore processors have both perprocessor private (L1) caches and a large shared (L2) cache on chip. We consider a broad class of parallel divideandconquer algorithms and present a new online scheduler, controlledpdf, that is competitive with the standard sequential scheduler in the following sense. Given any dynamically unfolding computation DAG from this class of algorithms, the cache complexity on the multicorecache model under our new scheduler is within a constant factor of the sequential cache complexity for both L1 and L2, while the time complexity is within a constant factor of the sequential time complexity divided by the number of processors p. These are the first such asymptoticallyoptimal results for any multicore model. Finally, we show that a separatorbased algorithm for sparsematrixdensevectormultiply achieves provably good cache performance in the multicorecache model, as well as in the wellstudied sequential cacheoblivious model.
Effectively Sharing a Cache Among Threads
 IN SPAA ’04: PROCEEDINGS OF THE SIXTEENTH ANNUAL ACM SYMPOSIUM ON PARALLELISM IN
, 2004
"... We compare the number of cache misses M1 for running a computation on a single processor with cache size C1 to the total number of misses Mp for the same computation when using p processors or threads and a shared cache of size Cp . We show that for any computation, and with an appropriate (greedy) ..."
Abstract

Cited by 34 (10 self)
 Add to MetaCart
We compare the number of cache misses M1 for running a computation on a single processor with cache size C1 to the total number of misses Mp for the same computation when using p processors or threads and a shared cache of size Cp . We show that for any computation, and with an appropriate (greedy) parallel schedule, if Cp C1 + pd then Mp M1 . The depth d of the computation is the length of the critical path of dependences. This gives the perhaps surprising result that for sufficiently parallel computations the shared cache need only be an additive size larger than the singleprocessor cache, and gives some theoretical justification for designing machines with shared caches. We model
A Parallel, Multithreaded Decision Tree Builder
, 1998
"... Parallelization has become a popular mechanism to speed up data classification tasks that deal with large amounts of data. This paper describes a highlevel, finegrained parallel formulation of a decision treebased classifier for memoryresident datasets on SMPs. We exploit two levels of dividean ..."
Abstract

Cited by 7 (0 self)
 Add to MetaCart
Parallelization has become a popular mechanism to speed up data classification tasks that deal with large amounts of data. This paper describes a highlevel, finegrained parallel formulation of a decision treebased classifier for memoryresident datasets on SMPs. We exploit two levels of divideandconquer parallelism in the tree builder: at the outer level across the tree nodes, and at the inner level within each tree node. Lightweight Pthreads are used to express this highly irregular and dynamic parallelism in a natural manner. The task of scheduling the threads and balancing the load is left to a spaceefficient Pthreads scheduler. Experimental results on large datasets indicate that the space and time performance of the tree builder scales well with both the data size and number of processors. This research is supported by ARPA Contract No. DABT6396C0071. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes, notwithstanding any copyri...
Case studies: Memory behavior of multithreaded multimedia and AI applications
 Proc. of 7th Workshop on Computer Architecture Evaluation using Commercial Workloads
, 2004
"... Memory performance becomes a dominant factor for today’s microprocessor applications. In this paper, we study memory reference behavior of emerging multimedia and AI applications. We compare memory performance for sequential and multithreaded versions of the applications on multithreaded processors. ..."
Abstract

Cited by 4 (0 self)
 Add to MetaCart
Memory performance becomes a dominant factor for today’s microprocessor applications. In this paper, we study memory reference behavior of emerging multimedia and AI applications. We compare memory performance for sequential and multithreaded versions of the applications on multithreaded processors. The methodology we used including workload selection and parallelization, benchmarking and measurement, memory trace collection and verification, and tracedriven memory performance simulations. The results from the case studies show that opposite reference behavior, either constructive or disruptive, could be a result for different programs. Care must be taken to make sure the disruptive memory references will not outweigh the benefit of parallelization. 1.
LowContention DepthFirst Scheduling of Parallel Computations with WriteOnce Synchronization Variables
 In Proc. 13th ACM Symp. on Parallel Algorithms and Architectures (SPAA
, 2001
"... We present an efficient, randomized, online, scheduling algorithm for a large class of programs with writeonce synchronization variables. The algorithm combines the workstealing paradigm with the depthfirst scheduling technique, resulting in high space efficiency and good time complexity. By auto ..."
Abstract

Cited by 3 (0 self)
 Add to MetaCart
We present an efficient, randomized, online, scheduling algorithm for a large class of programs with writeonce synchronization variables. The algorithm combines the workstealing paradigm with the depthfirst scheduling technique, resulting in high space efficiency and good time complexity. By automatically increasing the granularity of the work scheduled on each processor, our algorithm achieves good locality, low contention and low scheduling overhead, improving upon a previous depthfirst scheduling algorithm [6] published in SPAA'97. Moreover, it is provably efficient for the general class of multithreaded computations with writeonce synchronization variables (as studied in [6]), improving upon algorithm DFDeques (published in SPAA'99 [24]), which is only for the more restricted class of nested parallel computations. More specifically, consider such a computation with work T1 , depth T1 and oe synchronizations, and suppose that space S1 suffices to execute the computation on a singleprocessor computer. Then, on a Pprocessor sharedmemory parallel machine, the expected space complexity of our algorithm is at most S1 +O(PT1 log(PT1 )), and its expected time complexity is O(T1=P+oe log(PT1)=P+T1 log(PT1 )). Moreover, for any ffl ? 0, the space complexity of our algorithm is S1 + O(P (T1 + ln(1=ffl)) log(P (T1 + ln(P (T1 + ln(1=ffl))=ffl)))) with probability at least 1 \Gamma ffl. Thus, even for values of ffl as small as e \GammaT 1 , the space complexity of our algorithm is at most S1 +O(PT1 log(PT1 )) with probability at least 1 \Gamma e \GammaT 1 . These bounds include all time and space costs for both the computation and the scheduler. 1
Efficient Scheduling of Strict Multithreaded Computations
, 1999
"... In this paper we study the problem of eciently scheduling a wide class of multithreaded computations, called strict; that is, computations in which all dependencies from a thread go to the thread's ancestors in the computation tree. Strict multithreaded computations allow the limited use of synch ..."
Abstract

Cited by 2 (1 self)
 Add to MetaCart
In this paper we study the problem of eciently scheduling a wide class of multithreaded computations, called strict; that is, computations in which all dependencies from a thread go to the thread's ancestors in the computation tree. Strict multithreaded computations allow the limited use of synchronization primitives. We present the rst fully distributed scheduling algorithm which applies to any strict multithreaded computation. The algorithm is asynchronous, online and follows the workstealing paradigm. We prove that our algorithm is ecient not only in terms of its memory requirements and its execution time, but also in terms of its communication complexity. Our analysis applies to both shared and distributed memory machines. More specically, the expected execution time of our algorithm is O(T 1 =P +hT1 ), where T 1 is the minimum serial execution time, T1 is the minimum execution time with an innite number of processors, P is the number of processors and h is the maxi...
Randomized Receiver Initiated Load Balancing Algorithms for Tree Shaped Computations
 The Computer Journal
"... Many applications in parallel processing have to traverse large, implicitly defined trees with irregular shape. The receiver initiated load balancing algorithm asynchronous random polling has long been known to be very efficient for these problems in practice. Tight bounds for the parallel execut ..."
Abstract

Cited by 2 (0 self)
 Add to MetaCart
Many applications in parallel processing have to traverse large, implicitly defined trees with irregular shape. The receiver initiated load balancing algorithm asynchronous random polling has long been known to be very efficient for these problems in practice. Tight bounds for the parallel execution time in the LogP model are derived based on the parameters of a problem model called tree shaped computations.