Results 1  10
of
45
SpaceEfficient Scheduling of Multithreaded Computations
 SIAM Journal on Computing
, 1993
"... . This paper considers the problem of scheduling dynamic parallel computations to achieve linear speedup without using significantly more space per processor than that required for a singleprocessor execution. Utilizing a new graphtheoretic model of multithreaded computation, execution efficiency ..."
Abstract

Cited by 81 (14 self)
 Add to MetaCart
. This paper considers the problem of scheduling dynamic parallel computations to achieve linear speedup without using significantly more space per processor than that required for a singleprocessor execution. Utilizing a new graphtheoretic model of multithreaded computation, execution efficiency is quantified by three important measures: T 1 is the time required for executing the computation on 1 processor, T1 is the time required by an infinite number of processors, and S 1 is the space required to execute the computation on 1 processor. A computation executed on P processors is timeefficient if the time is O(T 1 =P + T1 ), that is, it achieves linear speedup when P = O(T 1 =T1 ), and it is spaceefficient if it uses O(S 1 P ) total space, that is, the space per processor is within a constant factor of that required for a 1processor execution. The first result derived from this model shows that there exist multithreaded computations such that no execution schedule can simultan...
The data locality of work stealing
 Theory of Computing Systems
, 2000
"... This paper studies the data locality of the workstealing scheduling algorithm on hardwarecontrolled sharedmemory machines. We present lower and upper bounds on the number of cache misses using work stealing, and introduce a localityguided workstealing algorithm along with experimental validatio ..."
Abstract

Cited by 65 (12 self)
 Add to MetaCart
This paper studies the data locality of the workstealing scheduling algorithm on hardwarecontrolled sharedmemory machines. We present lower and upper bounds on the number of cache misses using work stealing, and introduce a localityguided workstealing algorithm along with experimental validation. As a lower bound, we show that there is a family of multithreaded computations Gn each member of which requires (n) total instructions (work), for which when using workstealing the number of cache misses on one processor is constant, while even on two processors the total number of cache misses is (n). This implies that for general computations there is no useful bound relating multiprocessor to uninprocessor cache misses. For nestedparallel computations, however, we show that on P processors the expected additional number of cache misses beyond those on a single processor is bounded by O(Cd m e PT1), where m is the execution time s of an instruction incurring a cache miss, s is the steal time, C is the size of cache, and T1 is the number of nodes on the longest chain of dependences. Based on this we give strong bounds on the total running time of nestedparallel computations using work stealing. For the second part of our results, we present a localityguided work stealing algorithm that improves the data locality of multithreaded computations by allowing a thread to have an affinity for a processor. Our initial experiments on iterative dataparallel applications show that the algorithm matches the performance of staticpartitioning under traditional work loads but improves the performance up to 50 % over static partitioning under multiprogrammed work loads. Furthermore, the localityguided work stealing improves the performance of workstealing up to 80%. 1
TIGHT ANALYSES OF TWO LOCAL LOAD BALANCING ALGORITHMS
 SIAM J. COMPUT.
, 1999
"... This paper presents an analysis of the following load balancing algorithm. At each step, each node in a network examines the number of tokens at each of its neighbors and sends a token to each neighbor with at least 2d + 1 fewer tokens, where d is the maximum degree of any node in the network. We ..."
Abstract

Cited by 51 (5 self)
 Add to MetaCart
This paper presents an analysis of the following load balancing algorithm. At each step, each node in a network examines the number of tokens at each of its neighbors and sends a token to each neighbor with at least 2d + 1 fewer tokens, where d is the maximum degree of any node in the network. We show that within O(∆/α) steps, the algorithm reduces the maximum difference in tokens between any two nodes to at most O((d 2 log n)/α), where ∆ is the global imbalance in tokens (i.e., the maximum difference between the number of tokens at any node initially and the average number of tokens), n is the number of nodes in the network, and α is the edge expansion of the network. The time bound is tight in the sense that for any graph with edge expansion α, and for any value ∆, there exists an initial distribution of tokens with imbalance ∆ for which the time to reduce the imbalance to even ∆/2 is at least Ω(∆/α). The bound on the final imbalance is tight in the sense that there exists a class of networks that can be locally balanced everywhere (i.e., the maximum difference in tokens between any two neighbors is at most 2d), while the global imbalance remains Ω((d 2 log n)/α). Furthermore, we show that upon reaching a state with a global imbalance of O((d 2 log n)/α), the time for this algorithm to locally balance the network can be as large as Ω(n 1/2). We extend our analysis to a variant of this algorithm for dynamic and asynchronous
Parallel Programming using Functional Languages
, 1991
"... I am greatly indebted to Simon Peyton Jones, my supervisor, for his encouragement and technical assistance. His overwhelming enthusiasm was of great support to me. I particularly want to thank Simon and Geoff Burn for commenting on earlier drafts of this thesis. Through his excellent lecturing Cohn ..."
Abstract

Cited by 48 (3 self)
 Add to MetaCart
I am greatly indebted to Simon Peyton Jones, my supervisor, for his encouragement and technical assistance. His overwhelming enthusiasm was of great support to me. I particularly want to thank Simon and Geoff Burn for commenting on earlier drafts of this thesis. Through his excellent lecturing Cohn Runciman initiated my interest in functional programming. I am grateful to Phil Trinder for his simulator, on which mine is based, and Will Partain for his help with LaTex and graphs. I would like to thank the Science and Engineering Research Council of Great Britain for their financial support. Finally, I would like to thank Michelle, whose culinary skills supported me whilst I was writingup.The Imagination the only nation worth defending a nation without alienation a nation whose flag is invisible and whose borders are forever beyond the horizon a nation whose motto is why have one or the other when you can have one the other and both
Randomized Load Balancing for Treestructured Computation
 In Scalable High Performance Computing Conference
, 1994
"... In this paper, we study the performance of a randomized algorithm for balancing load across a multiprocessor executing a dynamic irregular task tree. Specifically, we show that the time taken to explore a task tree is likely to be within a small constant factor of an inherent lower bound for the tre ..."
Abstract

Cited by 30 (7 self)
 Add to MetaCart
In this paper, we study the performance of a randomized algorithm for balancing load across a multiprocessor executing a dynamic irregular task tree. Specifically, we show that the time taken to explore a task tree is likely to be within a small constant factor of an inherent lower bound for the tree instance. Our model permits arbitrary task times and overlap between computation and load balance, and thus extends earlier work which assumed fixed cost tasks and used a bulk synchronous style in which the system alternated between distinct computing and load balancing steps. Our analysis is supported by experiments with application codes, demonstrating that the efficiency is high enough to make this method practical. 1 Introduction In this paper we study a popular randomized strategy for load balancing dynamic treestructured task graphs on large scale message passing multiprocessors. First, we show analytically that with high probability, the randomized strategy results in parallel run...
An Atomic Model for MessagePassing
, 1993
"... This paper presents a simple atomic model of messagepassing network systems. Within one synchronous time step each processor can receive one atomic message, perform local computation, and send one message. When several messages are destined to the same processor then one is transmitted and the rest ..."
Abstract

Cited by 27 (0 self)
 Add to MetaCart
This paper presents a simple atomic model of messagepassing network systems. Within one synchronous time step each processor can receive one atomic message, perform local computation, and send one message. When several messages are destined to the same processor then one is transmitted and the rest are blocked. Blocked messages cannot be retrieved by their sending processors; each processor must wait for its blocked message to clear before sending more messages into the network. Depending on the traffic pattern, messages can remain blocked for arbitrarily long periods. The model is conservative when compared with exisiting messagepassing systems. Nonetheless, we prove linear speedup for backtrack and branchand bound searches using simple randomized algorithms. 1 Introduction Many parallel computers support the messagepassing programming model. Send and receive primitives hide lowlevel architectural details related to the network. Such abstractions are ideal for programming many l...
Automated Parallelization of Discrete Statespace Generation
 Journal of Parallel and Distributed Computing
, 1997
"... We consider the problem of generating a large statespace in a distributed fashion. Unlike previously proposed solutions that partition the set of reachable states according to a hashing function provided by the user, we explore heuristic methods that completely automate the process. The first step ..."
Abstract

Cited by 21 (2 self)
 Add to MetaCart
We consider the problem of generating a large statespace in a distributed fashion. Unlike previously proposed solutions that partition the set of reachable states according to a hashing function provided by the user, we explore heuristic methods that completely automate the process. The first step is an initial random walk through the state space to initialize a search tree, duplicated in each processor. Then, the reachability graph is built in a distributed way, using the search tree to assign each newly found state to classes assigned to the available processors. Furthermore, we explore two remapping criteria that attempt to balance memory usage or future workload, respectively. We show how the cost of computing the global snapshot required for remapping will scale up for system sizes in the foreseeable future. An extensive set of results is presented to support our conclusions that remapping is extremely beneficial. 1 Introduction Discrete systems are frequently analyzed by genera...
On Runtime Parallel Scheduling for Processor Load Balancing
 IEEE Trans. Parallel and Distributed Systems
, 1997
"... Parallel scheduling is a new approach for load balancing. In parallel scheduling, all processors cooperate to schedule work. Parallel scheduling is able to accurately balance the load by using global load information at compiletime or runtime. It provides highquality load balancing. This paper pre ..."
Abstract

Cited by 21 (0 self)
 Add to MetaCart
Parallel scheduling is a new approach for load balancing. In parallel scheduling, all processors cooperate to schedule work. Parallel scheduling is able to accurately balance the load by using global load information at compiletime or runtime. It provides highquality load balancing. This paper presents an overview of the parallel scheduling technique. Scheduling algorithms for tree, hypercube, and mesh networks are presented. These algorithms can fully balance the load and maximize locality 1. Introduction Static scheduling balances the workload before runtime and can be applied to problems with a predictable structure, which are called static problems. Dynamic scheduling performs scheduling activities concurrently at runtime, which applies to problems with an unpredictable structure, which are called dynamic problems. Static scheduling utilizes the knowledge of problem characteristics to reach a wellbalanced load [1, 2, 3, 4]. However, it is not able to balance the load for dynami...
BranchandBound and Backtrack Search on MeshConnected Arrays of Processors
 Mathematical Systems Theory
, 1992
"... In this paper we investigate the parallel complexity of backtrack and branchandbound search on the meshconnected array. We present an \Omega\Gamma p dN= p log N) lower bound for the time needed by a randomized algorithm to perform backtrack and branchandbound search of a tree of depth d on ..."
Abstract

Cited by 20 (0 self)
 Add to MetaCart
In this paper we investigate the parallel complexity of backtrack and branchandbound search on the meshconnected array. We present an \Omega\Gamma p dN= p log N) lower bound for the time needed by a randomized algorithm to perform backtrack and branchandbound search of a tree of depth d on the p N \Theta p N mesh, even when the depth of the tree is known in advance. The lower bound holds also for algorithms that are allowed to move treenodes and create multiple copies of the same treenode. For the upper bounds we give deterministic algorithms that are within a factor of O(log 3 2 N) from our lower bound. Our algorithms do not make any assumption on the shape of the tree to be searched, do not know the depth of the tree in advance and do not move treenodes nor create multiple copies of the same node. The best previously known algorithm for backtrack search on the mesh was randomized and required \Theta(d p N= log N) time. Our algorithm for branchandbound is the fir...
Selection on the BulkSynchronous Parallel Model with Applications to Priority Queues
 IN PROCEEDINGS OF THE 1996 INTERNATIONAL CONFERENCE ON PARALLEL AND DISTRIBUTED PROCESSING TECHNIQUES AND APPLICATIONS
, 1996
"... In this paper we present a new randomized selection algorithm on the BulkSynchronous Parallel (BSP) model of computation, along with an application of this algorithm to dynamic data structures, namely Parallel Priority Queues (PPQs). We show that our methods improve previous results upon both the c ..."
Abstract

Cited by 19 (7 self)
 Add to MetaCart
In this paper we present a new randomized selection algorithm on the BulkSynchronous Parallel (BSP) model of computation, along with an application of this algorithm to dynamic data structures, namely Parallel Priority Queues (PPQs). We show that our methods improve previous results upon both the communication requirements and the amount of parallel slack required to achieve optimal performance. We also establish that optimality to within small multiplicative constant factors can be achieved for a wide range of parallel machines. While these algorithms are fairly simple themselves, descriptions of their performance in terms of the BSP parameters is somewhat involved. The main reward of quantifying these complications is that it allows transportable software to be written for parallel machines that fit the model. We also present experimental results for the selection algorithm that reinforce our claims. 1 Introduction and the BSP Model The main technical contribution of this work is ...