Improving the reliability of Internet paths with one-hop source routing
In OSDI, 2004
Cited by 136 (8 self)
Recent work has focused on increasing availability in the face of Internet path failures. To date, proposed solutions have relied on complex routing and path-monitoring schemes, trading scalability for availability among a relatively small set of hosts. This paper proposes a simple, scalable approach to recover from Internet path failures. Our contributions are threefold. First, we conduct a broad measurement study of Internet path failures on a collection of 3,153 Internet destinations consisting of popular Web servers, broadband hosts, and randomly selected nodes. We monitored these destinations from 67 PlanetLab vantage points over a period of seven days, and found availabilities ranging from 99.6% for servers to 94.4% for broadband hosts. When failures do occur, many appear too close to the destination (e.g., last-hop and end-host failures) to be mitigated through alternative routing techniques of any kind. Second, we show that for the failures that can be addressed through routing, a simple, scalable technique, called one-hop source routing, can achieve close to the maximum benefit available with very low overhead. When a path failure occurs, our scheme attempts to recover from it by routing indirectly through a small set of randomly chosen intermediaries. Third, we implemented and deployed a prototype one-hop source routing infrastructure on PlanetLab. Over a three-day period, we repeatedly fetched documents from 982 popular Internet Web servers and used one-hop source routing to attempt to route around the failures we observed. Our results show that our prototype successfully recovered from 56% of network failures. However, we also found a large number of server failures that cannot be addressed through alternative routing.
Our research demonstrates that one-hop source routing is easy to implement, adds negligible overhead, and achieves close to the maximum benefit available to indirect routing schemes, without the need for path monitoring, history, or a priori knowledge of any kind.
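To illustrate the recovery step this abstract describes, here is a minimal Python sketch; it is not the authors' implementation. The function name `fetch_with_one_hop`, the `reachable` probe, the `"src"` label, and the parameter `k` are all hypothetical. Trying the intermediaries one after another is also an assumption, since the abstract does not say how they are contacted.

```python
import random


def fetch_with_one_hop(dest, intermediaries, reachable, k=4, rng=None):
    """Try the direct path; on failure, try up to k random intermediaries.

    `reachable(a, b)` is a hypothetical predicate standing in for a real
    network probe between hosts a and b. Returns the list of hops used,
    or None if neither the direct path nor any one-hop detour works.
    """
    rng = rng or random.Random()
    if reachable("src", dest):
        return ["src", dest]
    # Assumption: choose a small random subset, with no path monitoring,
    # history, or prior knowledge about which intermediary is likely to work.
    candidates = rng.sample(intermediaries, min(k, len(intermediaries)))
    for hop in candidates:
        if reachable("src", hop) and reachable(hop, dest):
            return ["src", hop, dest]
    return None
```

A failure at the destination itself makes every `reachable(hop, dest)` call fail too, which mirrors the abstract's point that last-hop and end-host failures cannot be routed around.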
How Useful Is Old Information?
IEEE Transactions on Parallel and Distributed Systems, 2000
Cited by 82 (10 self)
We consider the problem of load balancing in dynamic distributed systems in cases where new incoming tasks can make use of old information. For example, consider a multiprocessor system where incoming tasks with exponentially distributed service requirements arrive as a Poisson process, the tasks must choose a processor for service, and a task knows when making this choice the processor queue lengths from T seconds ago. What is a good strategy for choosing a processor in order for tasks to minimize their expected time in the system? Such models can also be used to describe settings where there is a transfer delay between the time a task enters a system and the time it reaches a processor for service. Our models are based on considering the behavior of limiting systems where the number of processors goes to infinity. The limiting systems can be shown to accurately describe the behavior of sufficiently large systems and simulations demonstrate that they are reasonably accurate even for systems with a small number of processors. Our studies of specific models demonstrate the importance of using randomness to break symmetry in these systems and yield important rules of thumb for system design. The most significant result is that only small amounts of queue length information can be extremely useful in these settings; for example, having incoming tasks choose the least loaded of two randomly chosen processors is extremely effective over a large range of possible system parameters. In contrast, using global information can actually degrade performance unless used carefully; for example, unlike most settings where the load information is current, having tasks go to the apparently least loaded server can significantly hurt performance. Index Terms: load balancing, stale information, old information, queuing theory, large deviations.
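The contrast between the two strategies in this abstract can be explored with a toy simulation. The sketch below is a simplified discrete-time stand-in, not the paper's model, which uses Poisson arrivals, exponential service times, and a limiting system. All names and parameters (`simulate`, `refresh_every`, `two_choices`, `global_least_loaded`) are hypothetical.

```python
import random


def simulate(n, steps, arrival_prob, refresh_every, policy, seed=0):
    """Toy model: tasks see a load snapshot refreshed every `refresh_every` steps.

    Returns the average queue length per processor over the whole run.
    """
    rng = random.Random(seed)
    queues = [0] * n
    board = list(queues)              # stale snapshot visible to arriving tasks
    total = 0
    for t in range(steps):
        if t % refresh_every == 0:
            board = list(queues)      # the snapshot is only updated periodically
        for _ in range(n):            # up to n arrivals per step
            if rng.random() < arrival_prob:
                queues[policy(board, rng)] += 1
        for i in range(n):            # each processor serves one task per step
            if queues[i] > 0:
                queues[i] -= 1
        total += sum(queues)
    return total / (steps * n)


def two_choices(board, rng):
    """Pick the apparently less loaded of two randomly chosen processors."""
    a, b = rng.randrange(len(board)), rng.randrange(len(board))
    return a if board[a] <= board[b] else b


def global_least_loaded(board, rng):
    """Pick the apparently least loaded processor according to the snapshot."""
    return min(range(len(board)), key=board.__getitem__)
```

In this toy setting, every task arriving between two refreshes sends `global_least_loaded` to the same processor, which is the herding effect the abstract warns about; `two_choices` spreads the arrivals because its candidates are random.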
The natural work-stealing algorithm is stable
In Proceedings of the 42nd IEEE Symposium on Foundations of Computer Science (FOCS), 2001
Cited by 27 (1 self)
In this paper we analyse a very simple dynamic work-stealing algorithm. In the work-generation model, there are n (work) generators. A generator-allocation function is simply a function from the n generators to the n processors. We consider a fixed, but arbitrary, distribution D over generator-allocation functions. During each time step of our process, a generator-allocation function h is chosen from D, and the generators are allocated to the processors according to h. Each generator may then generate a unit-time task which it inserts into the queue of its host processor. It generates such a task independently with probability λ. After the new tasks are generated, each processor removes one task from its queue and services it. For many choices of D, the work-generation model allows the load to become arbitrarily imbalanced, even when λ < 1. For example, D could be the point distribution containing a single function h which allocates all of the generators to just one processor. For this choice of D, the chosen processor receives around λn units of work at each step and services one. The natural work-stealing algorithm that we analyse is widely used in practical applications and works as follows. During each time step, each empty
Cluster Load Balancing for Fine-grain Network Services
In Proc. of International Parallel & Distributed Processing Symposium, Fort Lauderdale, FL, 2002
Cited by 26 (7 self)
This paper studies cluster load balancing policies and system support for fine-grain network services. Load balancing on a cluster of machines has been studied extensively in the literature, mainly focusing on coarse-grain distributed computation. Fine-grain services introduce additional challenges because system states fluctuate rapidly for those services and system performance is highly sensitive to various overhead. The main contribution of our work is to identify effective load balancing schemes for fine-grain services through simulations and empirical evaluations on synthetic workload and real traces. Another contribution is the design and implementation of a load balancing system in a Linux cluster that strikes a balance between acquiring enough load information and minimizing system overhead. Our study concludes that: 1) Random polling based load-balancing policies are well-suited for fine-grain network services; 2) A small poll size provides sufficient information for load balancing, while an excessively large poll size may in fact degrade the performance due to polling overhead; 3) Discarding slow-responding polls can further improve system performance.
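The three conclusions above (random polling, a small poll size, discarding slow replies) can be combined in one short Python sketch. This is an illustration under stated assumptions, not the system the paper describes: `pick_server`, the `poll` callback returning `(load, latency_ms)`, the `timeout_ms` cutoff, and the random fallback when every reply is slow are all hypothetical. The server list is assumed to be non-empty.

```python
import random


def pick_server(servers, poll_size, timeout_ms, poll, rng=None):
    """Poll a few random servers and pick the least loaded timely responder.

    `poll(server)` is a hypothetical probe returning (load, latency_ms).
    """
    rng = rng or random.Random()
    # Random polling with a small poll size: only a subset is contacted.
    polled = rng.sample(servers, min(poll_size, len(servers)))
    replies = []
    for server in polled:
        load, latency_ms = poll(server)
        if latency_ms <= timeout_ms:      # discard slow-responding polls
            replies.append((load, server))
    if not replies:
        # Assumption: with no timely reply, fall back to a random polled server.
        return rng.choice(polled)
    return min(replies, key=lambda reply: reply[0])[1]
```

A real dispatcher would issue the polls concurrently and stop waiting at the timeout; the sequential loop here only keeps the sketch short.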
Analyses of Load Stealing Models Based on Differential Equations
In Proceedings of the 10th Annual ACM Symposium on Parallel Algorithms and Architectures, 1998
Cited by 20 (0 self)
In this paper we develop models for and analyze several randomized work stealing algorithms in a dynamic setting. Our models represent the limiting behavior of systems as the number of processors grows to infinity using differential equations. The advantages of this approach include the ability to model a large variety of systems and to provide accurate numerical approximations of system behavior even when the number of processors is relatively small. We show how this approach can yield significant intuition about the behavior of work stealing algorithms in realistic settings.
Stability of load balancing algorithms in dynamic adversarial systems
In Proc. of the 34th ACM Symp. on Theory of Computing (STOC), 2002
Cited by 20 (2 self)
In the dynamic load balancing problem, we seek to keep the job load roughly evenly distributed among the processors of a given network. The arrival and departure of jobs is modeled by an adversary restricted in its power. Muthukrishnan and Rajaraman (1998) gave a clean characterization of a restriction on the adversary that can be considered the natural analogue of a cut condition. They proved that a simple local balancing algorithm proposed by Aiello et al. (1993) is stable against such an adversary if the insertion rate is restricted to a (1 − ε) fraction of the cut size. They left as an open question whether the algorithm is stable at rate 1. In this paper, we resolve this question positively, by proving stability of the local algorithm at rate 1. Our proof techniques are very different from the ones used by Muthukrishnan and Rajaraman, and yield a simpler proof and tighter bounds on the difference in loads. In addition, we introduce a multi-commodity version of this load balancing model, and show how to extend the result to the case of balancing two different kinds of loads at once (obtaining as a corollary a new proof of the 2-commodity Max-Flow Min-Cut Theorem). We also show how to apply the proof techniques to the problem of routing packets in adversarial systems. Awerbuch et al. (2001) showed that the same load balancing algorithm is stable against an adversary inserting
On Balls and Bins with Deletions
In Proc. of the RANDOM'98, 1998
Cited by 19 (1 self)
Microsystems. The views and conclusions contained here are those of the authors and should not be interpreted as necessarily representing the official policies or
Allocating Weighted Jobs in Parallel
1997
Cited by 13 (4 self)
It is well known that after placing $m \ge n$ balls independently and uniformly at random (i.u.r.) into $n$ bins, the fullest bin contains $\Theta(\log n / \log\log n + m/n)$ balls, with high probability. It is also known (see [Ste96]) that a maximum load of $O(m/n)$ can be obtained for all $m \ge n$ if a ball is allocated in one (suitably chosen) of two (i.u.r.) bins. Stemann ([Ste96]) shows that $r$ communication rounds suffice to guarantee a maximum load of $\max\{\sqrt[r]{\log n},\, O(m/n)\}$, with high probability. Adler et al. have shown in [ACMR95] that Stemann's protocol is optimal for constant $r$. In this paper we extend the above results in two directions: We generalize the lower bound to arbitrary $r \le \log\log n$. This implies that the result of Stemann's protocol is optimal for all $r$. Our main result is a generalization of Stemann's upper bound to weighted jobs: Let $W_A$ ($W_M$) denote the average (maximum) weight of the balls. Further let $\Delta = W_A / W_M$. Note that...
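The gap between one random bin and the better of two random bins can be checked empirically. The Python sketch below uses the sequential greedy rule as a simpler stand-in; it does not implement the parallel, round-based protocols of [Ste96] or of this paper, and it ignores job weights. The function name `max_load` and its parameters are hypothetical.

```python
import random


def max_load(m, n, choices, seed=0):
    """Throw m unit-weight balls into n bins and return the fullest bin's load.

    Each ball samples `choices` bins independently and uniformly at random
    and goes to the least loaded of them (sequential greedy, an assumption).
    """
    rng = random.Random(seed)
    bins = [0] * n
    for _ in range(m):
        candidates = [rng.randrange(n) for _ in range(choices)]
        target = min(candidates, key=bins.__getitem__)
        bins[target] += 1
    return max(bins)
```

Calling `max_load(m, n, 1)` and `max_load(m, n, 2)` for the same `m` and `n` shows how far each rule ends up above the average load `m / n`.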
Recovery time of dynamic allocation processes
In Proceedings of the 10th Annual ACM Symposium on Parallel Algorithms and Architectures, Puerto Vallarta, Mexico, 28 June–2, 1998
Cited by 12 (3 self)
Many distributed protocols arising in applications in online load balancing and dynamic resource allocation can be modeled by dynamic allocation processes related to the “balls into bins” problems. Traditionally the main focus of the research on dynamic allocation processes is on verifying whether a given process is stable, and if so, on analyzing its behavior in the limit (i.e., after sufficiently many steps). Once we know that the process is stable and we know its behavior in the limit, it is natural to analyze its recovery time, which is the time needed by the process to recover from any arbitrarily bad situation and to arrive very close to a stable (i.e., a typical) state. This investigation is important to provide assurance that even if at some stage the process has reached a highly undesirable state, we can predict with high confidence its behavior after the estimated recovery time. In this paper we present a general framework to study the recovery time of discrete-time dynamic allocation processes. We model allocation processes by suitably chosen ergodic Markov chains. For a given Markov chain we apply path coupling arguments to bound its convergence rates to the stationary distribution, which directly yields the estimation of the recovery time of the corresponding allocation process. Our coupling approach provides in a relatively simple way an accurate prediction of the recovery time. In particular, we show that our method can be applied to significantly improve estimations of the recovery time for various allocation processes related to allocations of balls into bins, and for the edge orientation problem studied before by Ajtai et al.
Analyzing an Infinite Parallel Job Allocation Process
Cited by 12 (7 self)
In recent years the task of allocating jobs to servers has been studied with the "balls and bins" abstraction. Results in this area exploit the large decrease in maximum load that can be achieved by allowing each job (ball) a very small amount of choice in choosing its destination server (bin). The scenarios considered can be divided into two categories: sequential, where each job can be placed at a server before the next job arrives, and parallel, where the jobs arrive in large batches that must be dealt with simultaneously. Another, orthogonal, classification of load balancing scenarios is into fixed time and infinite. Fixed time processes are only analyzed for an interval of time that is known in advance, and for all such results thus far either the number of rounds or the total expected number of arrivals at each server is a constant. In the infinite case, there is an arrival process and a deletion process that are both defined over an infinite time line. In this pape...