Results 1–10 of 13
BALANCED ALLOCATIONS: THE HEAVILY LOADED CASE
, 2006
Abstract

Cited by 58 (7 self)
We investigate balls-into-bins processes allocating m balls into n bins based on the multiple-choice paradigm. In the classical single-choice variant each ball is placed into a bin selected uniformly at random. In a multiple-choice process each ball can be placed into one of d ≥ 2 randomly selected bins. It is known that in many scenarios having more than one choice for each ball can improve the load balance significantly. Formal analyses of this phenomenon prior to this work considered mostly the lightly loaded case, that is, when m ≈ n. In this paper we present the first tight analysis in the heavily loaded case, that is, when m ≫ n rather than m ≈ n. The best previously known results for multiple-choice processes in the heavily loaded case were obtained using majorization by the single-choice process. This yields an upper bound on the maximum bin load of m/n + O(√(m ln n / n)) with high probability. We show, however, that multiple-choice processes are fundamentally different from the single-choice variant in that they have "short memory." The great consequence of this property is that the deviation of multiple-choice processes from the optimal allocation (that is, the allocation in which each bin has either ⌊m/n⌋ or ⌈m/n⌉ balls) does not increase with the number of balls, as it does for the single-choice process. In particular, we investigate the allocation obtained by two different multiple-choice allocation schemes,
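The multiple-choice process is easy to simulate. The following sketch (illustrative code, not from the paper) compares the maximum load of the single-choice and d-choice variants in the heavily loaded regime:

```python
import random

def max_load(m, n, d, seed=0):
    """Throw m balls into n bins; each ball inspects d bins chosen
    uniformly at random and joins the least loaded one (d = 1 is the
    classical single-choice process)."""
    rng = random.Random(seed)
    bins = [0] * n
    for _ in range(m):
        candidates = [rng.randrange(n) for _ in range(d)]
        bins[min(candidates, key=lambda i: bins[i])] += 1
    return max(bins)

# Heavily loaded case (m >> n, average load m/n = 1000): single choice
# deviates from the average by roughly sqrt(m ln n / n), while two
# choices stay within an additive term independent of m.
print(max_load(100_000, 100, 1))   # noticeably above 1000
print(max_load(100_000, 100, 2))   # very close to 1000
```

The "short memory" property from the abstract is visible experimentally: increasing m widens the gap for d = 1 but not for d = 2.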
Network Processor Load Balancing for High-Speed Links
, 2002
Abstract

Cited by 24 (0 self)
While transmission rates already achieve speeds beyond 40 Gb/s, today's network processors are only slowly approaching 10 Gb/s. In this paper we present a load-balancing scheme that enables system designers to bridge the performance gap using multiple slower NPs in parallel to serve high-speed links. The proposed scheme works in a flow-preserving manner to ensure in-sequence packet delivery as well as local validity of connection state information, while avoiding inter-processor communication. The effectiveness of the algorithms is evaluated by simulation with extrapolated workloads, and the impact of specific parameters on system performance is the subject of a factor-relevance analysis.
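A minimal sketch of the flow-preserving idea (hypothetical code, not the paper's actual scheme): hashing a connection identifier so that every packet of a flow lands on the same NP preserves packet order and keeps connection state local, with no inter-processor communication.

```python
import zlib

def dispatch(flow_id: str, num_nps: int) -> int:
    """Map a packet's flow identifier (e.g. an encoding of the 5-tuple)
    to one of num_nps network processors. All packets of the same flow
    hash to the same NP, so ordering and per-connection state stay local."""
    return zlib.crc32(flow_id.encode()) % num_nps

flow = "10.0.0.1:4321->192.168.1.2:80/TCP"     # hypothetical 5-tuple encoding
assert dispatch(flow, 4) == dispatch(flow, 4)  # stable per flow
```

A plain hash alone cannot react to skewed traffic, which is why the paper's scheme has to do more than this; the sketch only illustrates the flow-preserving constraint itself.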
The natural work-stealing algorithm is stable
 In Proceedings of the 42nd IEEE Symposium on Foundations of Computer Science (FOCS)
, 2001
Abstract

Cited by 24 (1 self)
In this paper we analyse a very simple dynamic work-stealing algorithm. In the work-generation model, there are n (work) generators. A generator-allocation function is simply a function from the n generators to the n processors. We consider a fixed, but arbitrary, distribution D over generator-allocation functions. During each time step of our process, a generator-allocation function h is chosen from D, and the generators are allocated to the processors according to h. Each generator may then generate a unit-time task which it inserts into the queue of its host processor. It generates such a task independently with probability λ. After the new tasks are generated, each processor removes one task from its queue and services it. For many choices of D, the work-generation model allows the load to become arbitrarily imbalanced, even when λ < 1. For example, D could be the point distribution containing a single function h which allocates all of the generators to just one processor. For this choice of D, the chosen processor receives around λn units of work at each step and services one. The natural work-stealing algorithm that we analyse is widely used in practical applications and works as follows. During each time step, each empty
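One time step of the work-generation model can be sketched as follows (illustrative code; the function and variable names are ours):

```python
import random

def work_generation_step(queues, h, lam, rng):
    """One step of the work-generation model: each generator g
    (allocated to processor h[g]) creates a unit-time task with
    probability lam; then every processor services one task from
    its queue, if it has any."""
    for g in range(len(h)):
        if rng.random() < lam:
            queues[h[g]] += 1
    for p in range(len(queues)):
        if queues[p] > 0:
            queues[p] -= 1
    return queues

# The bad point distribution from the abstract: every generator sits on
# processor 0, which then gains about lam*n - 1 tasks per step.
q = work_generation_step([0, 0, 0, 0], h=[0, 0, 0, 0], lam=1.0,
                         rng=random.Random(0))
print(q)  # [3, 0, 0, 0]
```

Without stealing, processor 0's queue grows linearly in time here, which is exactly the instability the natural work-stealing algorithm is shown to prevent.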
Scalable Work Stealing
Abstract

Cited by 14 (1 self)
Irregular and dynamic parallel applications pose significant challenges to achieving scalable performance on large-scale multicore clusters. These applications often require ongoing, dynamic load balancing in order to maintain efficiency. Scalable dynamic load balancing on large clusters is a challenging problem which can be addressed with distributed dynamic load balancing systems. Work stealing is a popular approach to distributed dynamic load balancing; however, its performance on large-scale clusters is not well understood. Prior work on work stealing has largely focused on shared-memory machines. In this work we investigate the design and scalability of work stealing on modern distributed-memory systems. We demonstrate high efficiency and low overhead when scaling to 8,192 processors for three benchmark codes: a producer-consumer benchmark, the unbalanced tree search benchmark, and a multiresolution analysis kernel.
Load Balancing in Arbitrary Network Topologies with Stochastic Adversarial Input
 SIAM Journal on Computing
, 2005
Abstract

Cited by 9 (2 self)
We study the long-term (steady-state) performance of a simple, randomized, local load balancing technique under a broad range of input conditions. We assume a system of n processors connected by an arbitrary network topology. Jobs are placed in the processors by a deterministic or randomized adversary. The adversary knows the current and past load distribution in the network and can use this information to place the new tasks in the processors. A node can execute one job per step, and can also participate in one load balancing operation in which it can move tasks to a direct neighbor in the network. In the protocol we analyze here, a node equalizes its load with a random neighbor in the graph.
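A rough sequential sketch of the protocol (our simplification: nodes act one after another rather than in a synchronous matching, and names are illustrative):

```python
import random

def balance_round(load, adj, rng):
    """One round of the randomized local protocol: every node equalizes
    its load with one uniformly random neighbor, then executes one job
    if it has any. adj[u] lists the neighbors of node u in the
    arbitrary network topology."""
    n = len(load)
    for u in rng.sample(range(n), n):          # nodes in random order
        v = rng.choice(adj[u])
        total = load[u] + load[v]
        load[u], load[v] = total // 2, total - total // 2
    for u in range(n):                         # each node runs one job
        if load[u] > 0:
            load[u] -= 1
    return load

# All load starts at one end of a 4-node path; equalization spreads it.
load = balance_round([8, 0, 0, 0], [[1], [0, 2], [1, 3], [2]],
                     random.Random(1))
```

Equalizing never increases the maximum load, and each round also drains up to one job per busy node, which is the behavior the steady-state analysis quantifies against the adversarial arrivals.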
Enhancing the effective utilisation of Grid clusters by exploiting On-Line Performability Analysis
Abstract

Cited by 9 (6 self)
In Grid applications, the heterogeneity and potential failures of the computing infrastructure pose significant challenges to efficient scheduling. Performance models have been shown to be useful in providing predictions on which schedules can be based [1, 2], and most such techniques can also take account of failures and degraded service. However, when several alternative schedules are to be compared, it is vital that the analysis of the models does not become so costly as to outweigh the potential gain of choosing the best schedule. Moreover, it is vital that the modelling approach can scale to match the size and complexity of realistic applications. In this
Stability and Efficiency of a Random Local Load Balancing Protocol
 In Proceedings FOCS
, 2003
Abstract

Cited by 6 (2 self)
We study the long-term (steady-state) performance of a simple, randomized, local load balancing technique. We assume a system of n processors connected by an arbitrary network topology. Jobs are placed in the processors by a deterministic or randomized adversary. The adversary knows the current and past load distribution in the network and can use this information to place the new tasks in the processors. The adversary can put a number of new jobs in each processor, in each step, as long as the (expected) total number of new jobs arriving at a given step is bounded by λn. A node can execute one job per step, and can also participate in one load balancing operation in which it can move tasks to a direct neighbor in the network. In the protocol we analyze here, a node equalizes its load with a random neighbor in the graph.
Dynamic Load Balancing Issues in the EARTH Runtime System
, 1999
Abstract

Cited by 2 (0 self)
Multithreading is a promising approach to address the problems inherent in multiprocessor systems, such as network and synchronization latencies. Moreover, the benefits of multithreading are not limited to loop-based algorithms but apply also to irregular parallelism. EARTH (Efficient Architecture for Running THreads) is a multithreaded model supporting fine-grain, non-preemptive threads. This model is supported by a C-based runtime system which provides the multithreaded environment for the execution of concurrent programs. This thesis describes the design and implementation of a set of dynamic load balancing algorithms, and an in-depth study of their behavior with divide-and-conquer, regular, and irregular classes of applications. The results described in this thesis are based on EARTH-SP2, an implementation of the EARTH program execution model on the IBM SP2, a distributed-memory multiprocessor system. The main results of this study are as follows: • A randomizing load balance...
Asynchronous Random Polling Dynamic Load Balancing
 In Proceedings of ISAAC’99
, 1999
Abstract

Cited by 1 (0 self)
Many applications in parallel processing have to traverse large, implicitly defined trees with irregular shape. The receiver-initiated load balancing algorithm random polling has long been known to be very efficient for these problems in practice. For any ε > 0, we prove that its parallel execution time is at most (1 + ε)·T_seq/P + O(T_atomic + h·(1/ε + T_rout + T_split)) with high probability, where T_rout, T_split, and T_atomic bound the time for sending a message, splitting a subproblem, and finishing a small unsplittable subproblem, respectively. The maximum splitting depth h is related to the depth of the computation tree. Previous work did not prove efficiency close to one and used less accurate models. In particular, our machine model allows asynchronous communication with non-constant message delays and does not assume that communication takes place in rounds. This model is compatible with the LogP model.
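The receiver-initiated scheme can be caricatured in a few lines (our sketch; the split function and the integer representation of subproblems are assumptions, not the paper's model):

```python
import random

def random_polling_step(queues, rng, split):
    """One synchronized snapshot of receiver-initiated random polling:
    each idle processor polls one uniformly random victim; a victim
    that holds work splits off part of a subproblem for the thief.
    split(w) returns (kept, given) halves of subproblem w."""
    n = len(queues)
    for p in range(n):
        if not queues[p]:                      # idle: poll a random PE
            victim = rng.randrange(n)
            if victim != p and queues[victim]:
                work = queues[victim].pop()
                kept, given = split(work)
                queues[victim].append(kept)
                queues[p].append(given)
    return queues

halve = lambda w: (w - w // 2, w // 2)         # toy splitter on integers
qs = random_polling_step([[16], [], [], []], random.Random(3), halve)
assert sum(sum(q) for q in qs) == 16           # work is conserved
```

The cost terms in the bound map onto this sketch: each poll pays T_rout, each split pays T_split, and the recursion depth of `split` corresponds to the maximum splitting depth h.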
Scheduling parallel programs by work stealing with private deques
, 2013
Abstract

Cited by 1 (0 self)
Work stealing has proven to be an effective method for scheduling fine-grained parallel programs on multicore computers. To achieve high performance, work stealing distributes tasks between concurrent queues, called deques, assigned to each processor. Each processor operates on its deque locally except when performing load balancing via steals. Unfortunately, concurrent deques suffer from two limitations: 1) local deque operations require expensive memory fences on modern weak-memory architectures, and 2) they can be very difficult to extend to support various optimizations and the flexible task distribution strategies needed by many applications, e.g., those that do not fit nicely into the divide-and-conquer, nested data-parallel paradigm. For these reasons, there has been a lot of recent interest in implementations of work stealing with non-concurrent deques, where deques
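The private-deque idea might look like this sketch (hypothetical class and method names; a real implementation would transfer tasks via messages between workers rather than direct appends):

```python
from collections import deque

class Worker:
    """Work stealing with a private deque: the deque is touched only by
    its owner, so local pushes and pops need no memory fences; thieves
    instead leave requests in a mailbox the owner polls between tasks."""
    def __init__(self, wid):
        self.wid = wid
        self.deque = deque()     # private: owner-only access
        self.mailbox = []        # pending steal requests (thief ids)

    def push(self, task):
        self.deque.append(task)

    def pop(self):
        return self.deque.pop() if self.deque else None

    def serve_steals(self, workers):
        # The owner explicitly hands work to each requester; in a real
        # system this transfer would be a message, not a direct append.
        while self.mailbox and len(self.deque) > 1:
            thief = workers[self.mailbox.pop()]
            thief.deque.append(self.deque.popleft())

ws = [Worker(0), Worker(1)]
for t in ("a", "b", "c"):
    ws[0].push(t)
ws[0].mailbox.append(1)       # worker 1 asks to steal
ws[0].serve_steals(ws)
assert ws[1].pop() == "a"     # oldest task migrated to the thief
```

Because only the owner ever mutates its deque, push and pop are plain sequential operations; the price is that steal requests are serviced with some delay, which is the trade-off this line of work analyzes.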