Results 1–10 of 14
The data locality of work stealing
 Theory of Computing Systems
, 2000
Abstract

Cited by 113 (17 self)
This paper studies the data locality of the work-stealing scheduling algorithm on hardware-controlled shared-memory machines. We present lower and upper bounds on the number of cache misses using work stealing, and introduce a locality-guided work-stealing algorithm along with experimental validation. As a lower bound, we show that there is a family of multithreaded computations Gn, each member of which requires Θ(n) total instructions (work), for which when using work stealing the number of cache misses on one processor is constant, while even on two processors the total number of cache misses is Ω(n). This implies that for general computations there is no useful bound relating multiprocessor to uniprocessor cache misses. For nested-parallel computations, however, we show that on P processors the expected additional number of cache misses beyond those on a single processor is bounded by O(C⌈m/s⌉PT∞), where m is the execution time of an instruction incurring a cache miss, s is the steal time, C is the size of the cache, and T∞ is the number of nodes on the longest chain of dependences. Based on this we give strong bounds on the total running time of nested-parallel computations using work stealing. For the second part of our results, we present a locality-guided work-stealing algorithm that improves the data locality of multithreaded computations by allowing a thread to have an affinity for a processor. Our initial experiments on iterative data-parallel applications show that the algorithm matches the performance of static partitioning under traditional workloads but improves performance by up to 50% over static partitioning under multiprogrammed workloads. Furthermore, locality-guided work stealing improves the performance of work stealing by up to 80%.
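The locality-guided policy described in the abstract can be sketched as a single scheduling step per worker: an affinity mailbox is checked first, then the local deque, and only then does the worker steal. `Worker`, `spawn`, and `next_task` are illustrative names for this sketch, not the paper's implementation:

```python
import random
from collections import deque

class Worker:
    """One processor's scheduling state: a local task deque plus a
    mailbox holding tasks that have an affinity for this processor."""
    def __init__(self, wid):
        self.wid = wid
        self.tasks = deque()    # owner pushes/pops at the right (bottom)
        self.mailbox = deque()  # tasks mailed here by other workers

def spawn(workers, owner, task, affinity=None):
    """Push a task; a task with an affinity for another worker is mailed
    there so that, under light load, it runs where its data is cached."""
    if affinity is not None and affinity != owner:
        workers[affinity].mailbox.append(task)
    else:
        workers[owner].tasks.append(task)

def next_task(workers, wid, rng=random):
    """Locality-guided step: check the mailbox first, then the local
    deque, then steal from the top of a random non-empty victim's deque."""
    w = workers[wid]
    if w.mailbox:
        return w.mailbox.popleft(), "mailbox"
    if w.tasks:
        return w.tasks.pop(), "local"
    victims = [v for v in workers if v.wid != wid and v.tasks]
    if victims:
        return rng.choice(victims).tasks.popleft(), "steal"
    return None, "idle"
```

Plain work stealing is the same loop without the mailbox check; the mailbox is what lets a thread's affinity override the random choice of victim.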
Space-Efficient Scheduling of Parallelism with Synchronization Variables
Abstract

Cited by 34 (11 self)
Recent work on scheduling algorithms has resulted in provable bounds on the space taken by parallel computations in relation to the space taken by sequential computations. The results for online versions of these algorithms, however, have been limited to computations in which threads can only synchronize with ancestor or sibling threads. Such computations do not include languages with futures or user-specified synchronization constraints. Here we extend the results to languages with synchronization variables. Such languages include languages with futures, such as Multilisp and Cool, as well as other languages such as Id. The main result is an online scheduling algorithm which, given a computation with w work (total operations), σ synchronizations, d depth (critical path) and s1 sequential space, will run in O(w/p + σ log(pd)/p + d log(pd)) time and s1 + O(pd log(pd)) space on a p-processor CRCW PRAM with a fetch-and-add primitive. This includes all time and space costs for both the computation and the scheduler. The scheduler is non-preemptive in the sense that it will only move a thread if the thread suspends on a synchronization, forks a new thread, or exceeds a threshold when allocating space. For the special case where the computation is a planar graph with left-to-right synchronization edges, the scheduling algorithm can be implemented in O(w/p + d log p) time and s1 + O(pd log p) space. These are the first nontrivial space bounds described for such languages.
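A synchronization variable of the kind this abstract refers to can be sketched as a write-once cell on which readers suspend until a single writer supplies the value. `SyncVar` is an illustrative name, and blocking an OS thread here stands in for the paper's scheduler suspending a lightweight thread:

```python
import threading

class SyncVar:
    """A write-once synchronization variable: get() suspends the caller
    until exactly one put() has supplied the value (an I-structure cell,
    as in Id, or the result slot of a future)."""
    def __init__(self):
        self._ready = threading.Event()
        self._value = None

    def put(self, value):
        # A second write is an error: the variable is single-assignment.
        if self._ready.is_set():
            raise ValueError("SyncVar already written")
        self._value = value
        self._ready.set()  # wake all suspended readers

    def get(self):
        self._ready.wait()  # suspend until the value exists
        return self._value
```

A thread that calls `get()` before the `put()` is exactly a thread the scheduler must "move" when it suspends on a synchronization, in the abstract's terms.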
A Provably Time-Efficient Parallel Implementation of Full Speculation
 In Proceedings of the 23rd ACM Symposium on Principles of Programming Languages
, 1996
Abstract

Cited by 17 (5 self)
Speculative evaluation, including leniency and futures, is often used to produce high degrees of parallelism. Existing speculative implementations, however, may serialize computation because of their implementation of queues of suspended threads. We give a provably efficient parallel implementation of a speculative functional language on various machine models. The implementation includes proper parallelization of the necessary queuing operations on suspended threads. Our target machine models are a butterfly network, hypercube, and PRAM. To prove the efficiency of our implementation, we provide a cost model using a profiling semantics and relate the cost model to implementations on the parallel machine models. Futures, lenient languages, and several implementations of graph reduction for lazy languages all use speculative evaluation (call-by-speculation [15]) to expose parallelism. The basic idea of speculative evaluation, in this context, is that the evaluation of a...
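Call-by-speculation as described above can be sketched with ordinary futures: a value starts evaluating before it is demanded, and the toucher suspends only if it is not yet ready. Python's thread pool merely stands in for the butterfly, hypercube, and PRAM targets, and `speculate`/`touch` are illustrative names:

```python
from concurrent.futures import ThreadPoolExecutor

def speculate(pool, thunk):
    # Begin evaluating the thunk before its value is demanded.
    return pool.submit(thunk)

def touch(future):
    # Demand the value; the toucher suspends until evaluation finishes.
    return future.result()

with ThreadPoolExecutor(max_workers=2) as pool:
    f = speculate(pool, lambda: sum(range(1000)))  # runs in parallel...
    unrelated = len("other work the caller does")  # ...with other work
    total = touch(f)
```

The queue of threads suspended in `touch` is precisely the structure whose naive serial implementation the paper identifies as a bottleneck and parallelizes.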
Beyond Nested Parallelism: Tight Bounds on Work-Stealing Overheads for Parallel Futures
Abstract

Cited by 10 (0 self)
Work stealing is a popular method of scheduling fine-grained parallel tasks. The performance of work stealing has been extensively studied, both theoretically and empirically, but primarily for the restricted class of nested-parallel (or fully strict) computations. We extend this prior work by considering a broader class of programs that also supports pipelined parallelism through the use of parallel futures. Though the overhead of work-stealing schedulers is often quantified in terms of the number of steals, we show that a broader metric, the number of deviations, is a better way to quantify work-stealing overhead for less restrictive forms of parallelism, including parallel futures. For such parallelism, we prove bounds on work-stealing overheads (scheduler time and cache misses) as a function of the number of deviations. Deviations can occur, for example, when work is stolen or when a future is touched. We also show instances where deviations can occur independently of steals and touches. Next, we prove that, under work stealing, the expected number of deviations is O(Pd + td) in a P-processor execution of a computation with span d and t touches of futures. Moreover, this bound is existentially tight for any work-stealing scheduler that is parsimonious (one where processors steal only when their queues are empty); this class includes all prior work-stealing schedulers. We also present empirical measurements of the number of deviations incurred by a classic application of futures, Halstead’s quicksort, using our parallel implementation of ML. Finally, we identify a family of applications that use futures and, in contrast to quicksort, incur significantly smaller overheads.
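The deviation metric can be illustrated with a small counter over per-processor execution traces: count a node whenever its predecessor in the sequential (depth-first) order is not the node the same processor executed just before it. The trace encoding and `count_deviations` are assumptions of this sketch, not the paper's instrumentation:

```python
def count_deviations(seq_order, traces):
    """seq_order: nodes in sequential (depth-first) execution order.
    traces: one list per processor of the nodes it executed, in order.
    A deviation is counted when a processor executes a node v but did not
    just execute v's predecessor in the sequential order (e.g. after a
    steal, or after jumping to a touched future's continuation)."""
    pred = {v: u for u, v in zip(seq_order, seq_order[1:])}
    deviations = 0
    for trace in traces:
        prev = None
        for v in trace:
            if pred.get(v) != prev:
                deviations += 1
            prev = v
    return deviations
```

A one-processor run in sequential order incurs zero deviations; splitting the same order across two processors incurs one, at the point where the second processor's trace begins.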
Scheduling Deterministic Parallel Programs
, 2009
Abstract

Cited by 5 (0 self)
Deterministic parallel programs yield the same results regardless of how parallel tasks are interleaved or assigned to processors. This drastically simplifies reasoning about the correctness of these programs. However, the performance of parallel programs still depends upon this assignment of tasks, as determined by a part of the language implementation called the scheduling policy. In this thesis, I define a novel cost semantics for a parallel language that enables programmers to reason formally about different scheduling policies. This cost semantics forms a basis for a suite of prototype profiling tools. These tools allow programmers to simulate and visualize program execution under different scheduling policies and understand how the choice of policy affects application memory use. My cost semantics also provides a specification for implementations of the language. As an example of such an implementation, I have extended MLton, a compiler
MetaFork: A Metalanguage for Concurrency Platforms Targeting Multicores
Abstract

Cited by 1 (1 self)
In the past decade the pervasive ubiquity of multicore processors has stimulated a constantly increasing effort in the development of concurrency platforms (such as CilkPlus, OpenMP, Intel TBB) targeting those architectures. While those programming languages are all based on the fork-join parallelism model, they largely differ in their way of expressing parallel algorithms
MetaFork: A Framework for Concurrency Platforms Targeting Multicores
Abstract

Cited by 1 (1 self)
We present MetaFork, a metalanguage for multithreaded algorithms based on the fork-join concurrency model and targeting multicore architectures. MetaFork is implemented as a source-to-source compilation framework allowing automatic translation of programs from one concurrency platform to another. The current version of this framework supports CilkPlus and OpenMP. We evaluate the benefits of the MetaFork framework through a series of experiments, such as narrowing performance bottlenecks in multithreaded programs. Our experiments also show that, if a native program, written either in CilkPlus or OpenMP, has little parallelism overhead, then the same property holds for its OpenMP or CilkPlus counterpart translated by MetaFork.
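The fork-join model shared by CilkPlus and OpenMP, which MetaFork abstracts over, can be sketched in a few lines; the thread pool and `parallel_sum` are illustrative stand-ins, not MetaFork's translation output:

```python
from concurrent.futures import ThreadPoolExecutor

# 16 workers comfortably exceed the handful of tasks forked below,
# so no forked task is ever left waiting for a free worker.
pool = ThreadPoolExecutor(max_workers=16)

def parallel_sum(xs, grain=8):
    """Divide-and-conquer sum in the fork-join style: fork one half,
    keep computing the other, then join (cf. cilk_spawn/cilk_sync in
    CilkPlus, or an OpenMP task followed by taskwait)."""
    if len(xs) <= grain:
        return sum(xs)  # serial base case below the grain size
    mid = len(xs) // 2
    left = pool.submit(parallel_sum, xs[:mid], grain)  # fork
    right = parallel_sum(xs[mid:], grain)              # parent continues
    return left.result() + right                       # join

total = parallel_sum(list(range(64)))
```

Because every platform MetaFork supports expresses exactly this fork/continue/join shape, a source-to-source translator only has to map the spawn and sync constructs, not restructure the algorithm.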
Race Detection in Two Dimensions
Abstract
Dynamic data race detection is a program analysis technique for detecting errors provoked by undesired interleavings of concurrent threads. A primary challenge when designing efficient race detection algorithms is to achieve manageable space requirements. State-of-the-art algorithms for unstructured parallelism require Θ(n) space per monitored memory location, where n is the total number of tasks. This is a serious drawback when analyzing programs with many tasks. In contrast, algorithms for programs with a series-parallel (SP) structure require only Θ(1) space. Unfortunately, it is currently poorly understood whether there are classes of parallelism beyond SP that can also benefit from and be analyzed with Θ(1) space complexity. In the present work, we show that structures richer than SP graphs, namely two-dimensional (2D) lattices, can be analyzed in Θ(1) space: a) we extend Tarjan’s algorithm for finding lowest common ancestors to handle 2D lattices; b) from that extension we derive a serial algorithm for race detection that can analyze arbitrary task graphs having a 2D lattice structure; c) we present a restriction of fork-join that admits precisely the 2D lattices as task graphs (e.g., it can express pipeline parallelism). Our work generalizes prior work on race detection and aims to provide a deeper understanding of the interplay between structured parallelism and program analysis efficiency.
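The tree case of Tarjan's offline LCA algorithm, which item (a) above generalizes to 2D lattices, can be sketched with union-find; the encoding of the tree as a children map is an assumption of this sketch:

```python
from collections import defaultdict

def tarjan_offline_lca(children, root, queries):
    """Tarjan's offline lowest-common-ancestor algorithm on a tree.
    One DFS answers all (u, v) queries: a query is resolved when its
    second endpoint finishes, at which point the first endpoint's
    union-find representative carries the LCA."""
    parent, ancestor, visited, answers = {}, {}, set(), {}
    pending = defaultdict(list)
    for u, v in queries:
        pending[u].append(v)
        pending[v].append(u)

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def dfs(u):
        parent[u], ancestor[u] = u, u
        for c in children.get(u, ()):
            dfs(c)
            parent[find(c)] = u  # union the finished subtree into u
            ancestor[find(u)] = u
        visited.add(u)
        for v in pending[u]:
            if v in visited:     # both endpoints done: LCA is known
                answers[(u, v)] = answers[(v, u)] = ancestor[find(v)]

    dfs(root)
    return [answers[q] for q in queries]
```

In SP-style race detection, the LCA of two accesses determines whether they are ordered or logically parallel; extending this machinery from trees to 2D lattices is what lets the paper keep Θ(1) space per monitored location for richer task graphs.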
unknown title
Abstract
This paper studies the data locality of the work-stealing scheduling algorithm on hardware-controlled shared-memory machines. We present lower and upper bounds on the number of cache misses using work stealing, and introduce a locality-guided work-stealing algorithm along with experimental validation. As a lower bound, we show that there is a family of multithreaded computations Gn, each member of which requires Θ(n) total operations (work), for which when using work stealing the total number of cache misses on one processor is constant, while even on two processors the total number of cache misses is Ω(n). For nested-parallel computations, however, we show that on P processors the expected additional number of cache misses beyond those on a single processor is bounded by O(C⌈m/s⌉PT∞), where m is the execution time of an instruction incurring a cache miss, s is the steal time, C is the size of the cache, and T∞ is the number of nodes on the longest chain of dependences. Based on this we give strong bounds on the total running time of nested-parallel computations using work stealing. For the second part of our results, we present a locality-guided work-stealing algorithm that improves the data locality of multithreaded computations by allowing a thread to have an affinity for a processor. Our initial experiments on iterative data-parallel applications show that the algorithm matches the performance of static partitioning under traditional workloads but improves performance by up to 50% over static partitioning under multiprogrammed workloads. Furthermore, locality-guided work stealing improves the performance of work stealing by up to 80%.
Multiscale Scheduling: Integrating Competitive and Cooperative Scheduling in Theory and in Practice Overview
Abstract
A chief characteristic of next-generation computing systems is the prevalence of parallelism at multiple levels of granularity. From the instruction level to the chip level to the server level to the grid level, parallelism is the dominant method of improving performance relative to cost. While the characteristics of the fabric, such as the granularity or the interconnect, differ at each level, the common theme is parallel computing. Building applications that take full advantage of parallelism remains a significant challenge, even when exclusive access to the computing fabric is assumed. But a chief characteristic of next-generation computing is simultaneous access to shared computing resources by millions of users. For example, an Internet search engine must serve multiple simultaneous queries, each of which is decomposed into multiple parallel tasks whose results are combined to produce a response to the query. The need for parallelism at the level of the individual query arises not only to improve response time, but also to permit decomposition of the search space into fragments that can be managed by individual processors. With the emergence of increasingly sophisticated Web services, one can readily envision greater demand for multiscale parallelism. For example, electronic commerce applications may service many purchase transactions simultaneously, each of which can be broken up into subtransactions that can be executed