Stable Leader Election
In DISC, 2001
Cited by 46 (4 self)
We introduce the notion of stable leader election and derive several algorithms for this problem. Roughly speaking, a leader election algorithm is stable if it ensures that once a leader is elected, it remains the leader for as long as it does not crash and its links have been behaving well, irrespective of the behavior of other processes and links. In addition to being stable, our leader election algorithms have several desirable properties. In particular, they are all communication-efficient, i.e., they eventually use only n links to carry messages, and they are robust, i.e., they work in systems where only the links to/from some correct process are required to be eventually timely. Moreover, our best leader election algorithm tolerates message losses, and it ensures that a leader is elected in constant time when the system is stable. We conclude the paper by applying the above ideas to derive a robust and efficient algorithm for the eventually perfect failure detector ♦P.
Resolving the message complexity of Byzantine agreement and beyond
Proceedings of the 37th IEEE Symposium on Foundations of Computer Science (FOCS), 1995
Time-optimal message-efficient work performance in the presence of faults
In Proceedings of the 13th ACM Symposium on Principles of Distributed Computing (PODC), 1994
Dynamic Load Balancing with Group Communication
6th International Colloquium on Structural Information and Communication Complexity, 1996
Cited by 21 (9 self)
This work considers the problem of efficiently performing a set of tasks using a network of processors in the setting where the network is subject to dynamic reconfigurations, including partitions and merges. A key challenge for this setting is the implementation of dynamic load balancing that reduces the number of tasks that are performed redundantly because of the reconfigurations. We explore new approaches for load balancing in dynamic networks that can be employed by applications using a group communication service. The group communication services that we consider include a membership service (establishing new groups to reflect dynamic changes) but do not include maintenance of a primary component. For the n-processor, n-task load balancing problem defined in this work, the following specific results are obtained. For the case of fully dynamic changes including fragmentation and merges we show that the termination time of any online task assignment algorithm is greater than the termination time of an offline task assignment algorithm by a factor greater than n/12. We present a load balancing algorithm that guarantees completion of all tasks in all fragments
Reliably executing tasks in the presence of untrusted entities
Proceedings of the 25th IEEE Symposium on Reliable Distributed Systems, 2006
Cited by 18 (5 self)
In this work we consider a distributed system formed by a master processor and a collection of n processors (workers) that can execute tasks; worker processors are untrusted and might act maliciously. The master assigns tasks to workers to be executed. Each task returns a binary value, and we want the master to accept only correct values with high probability. Furthermore, we assume that the service provided by the workers is not free; for each task that a worker is assigned, the master is charged one work unit. Therefore, considering a single task assigned to several workers, our goal is to have the master accept the correct value of the task with high probability, with the smallest possible amount of work (the number of workers to which the master assigns the task). We explore two ways of bounding the number of faulty processors: (a) a fixed bound f < n/2 on the maximum number of workers that may fail, and (b) a probability p < 1/2 of any processor being faulty (each processor is faulty with probability p, independently of the other processors). Our work demonstrates that it is possible to obtain high probability of correct acceptance with low work. In particular, by considering both mechanisms of bounding the number of malicious workers, we first show lower bounds on the minimum amount of (expected) work required, so that any algorithm accepts the correct value with probability of success 1 − ε, where ε ≪ 1 (e.g., 1/n). Then we develop and analyze two algorithms, each using a different decision strategy, and show that both algorithms obtain the same probability of success 1 − ε, and in doing so, they require similar upper bounds on the (expected) work. Furthermore, under certain conditions, these upper bounds are asymptotically optimal with respect to our lower bounds.
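To make the trade-off between work and acceptance probability in the abstract above concrete, here is a minimal sketch under model (b). It is not either of the paper's algorithms: it assumes a plain majority-vote decision rule, assumes that all faulty workers collude on the wrong binary value, and the function names `majority_error` and `min_workers` are hypothetical.

```python
from math import comb


def majority_error(k: int, p: float) -> float:
    """Probability that a majority of k workers is faulty.

    Assumes k is odd, each worker is faulty independently with
    probability p, and faulty workers always return the wrong value
    (a worst-case assumption made for this sketch).
    """
    need = k // 2 + 1  # faulty workers needed to flip the vote
    return sum(comb(k, i) * p**i * (1 - p) ** (k - i)
               for i in range(need, k + 1))


def min_workers(p: float, eps: float, k_max: int = 10001):
    """Smallest odd k with majority_error(k, p) <= eps, or None."""
    for k in range(1, k_max + 1, 2):
        if majority_error(k, p) <= eps:
            return k
    return None


if __name__ == "__main__":
    # Work (number of workers charged) needed for error at most 5%
    # when each worker is faulty with probability 0.1.
    print(min_workers(0.1, 0.05))
```

The value returned by `min_workers` is the work charged to the master under this simple rule; the abstract's lower and upper bounds concern how small that work can be made in general.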
Work-competitive scheduling for cooperative computing with dynamic groups
SIAM Journal on Computing, 2005
Cited by 18 (5 self)
The problem of cooperatively performing a set of t tasks in a decentralized computing environment subject to failures is one of the fundamental problems in distributed computing. The setting with partitionable networks is especially challenging, as algorithmic solutions must accommodate the possibility that groups of processors become disconnected (and, perhaps, reconnected) during the computation. The efficiency of task-performing algorithms is often assessed in terms of work: the total number of tasks, counting multiplicities, performed by all of the processors during the computation. In general, the scenario where the processors are partitioned into g disconnected components causes any task-performing algorithm to have work Ω(t · g) even if each group of processors performs no more than the optimal number of Θ(t) tasks. Given that such pessimistic lower bounds apply to any scheduling algorithm, we pursue a competitive analysis. Specifically, this paper studies a simple randomized scheduling algorithm for p asynchronous processors, connected by a dynamically changing communication medium, to complete t known tasks. The performance of this algorithm is compared against that of an omniscient offline algorithm with full knowledge of the future changes in the communication medium. The paper describes a notion of computation width, which associates a natural number with a history of changes in the communication medium, and shows both upper and lower bounds on work-competitiveness in terms of this quantity. Specifically, it is shown that the simple randomized algorithm obtains the competitive ratio (1 + cw/e), where cw is the computation width and e is the base of the natural logarithm (e = 2.7182...); this competitive ratio is then shown to be tight.
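The two quantities named in the abstract above can be evaluated directly. The sketch below only restates the formulas; the function names are hypothetical, and `partitioned_work` assumes the simplest case in which each of the g disconnected groups performs all t tasks on its own.

```python
import math


def partitioned_work(t: int, g: int) -> int:
    """Total work when each of g isolated groups performs all t tasks.

    This is the simplest scenario behind the Omega(t * g) bound:
    no group can learn what another group has already done.
    """
    return t * g


def competitive_ratio(cw: int) -> float:
    """Competitive ratio 1 + cw/e stated in the abstract,
    where cw is the computation width."""
    return 1 + cw / math.e


if __name__ == "__main__":
    print(partitioned_work(100, 4))
    print(competitive_ratio(3))
```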
Distributed Cooperation during the Absence of Communication
2001
Cited by 17 (9 self)
This paper presents a study of a distributed cooperation problem under the assumption that processors may not be able to communicate for a prolonged time. The problem for n processors is defined in terms of t tasks that need to be performed efficiently and that are known to all processors. The results of this study characterize the ability of the processors to schedule their work so that when some processors establish communication, the wasted (redundant) work these processors have collectively performed prior to that time is controlled. The lower bound for wasted work presented here shows that for any set of schedules there are two processors such that when they complete t1 and t2 tasks respectively the number of redundant tasks is Ω(t1 t2 / t). For n = t and for schedules longer than √n, the number of redundant tasks for two or more processors must be at least 2. The upper bound on pairwise waste for schedules of length √n is shown to be 1. Our efficient deterministic schedule construction is motivated by design theory. To obtain linear length schedules, a novel deterministic and efficient construction is given. This construction has the property that pairwise wasted work increases gracefully as processors progress through their schedules. Finally our analysis of a random scheduling solution shows that with high probability pairwise waste is well behaved at all times: specifically, two processors having completed t1 and t2 tasks, respectively, are guaranteed to have no more than t1 t2 / t + Δ redundant tasks, where Δ = O(log n + √(t1 t2 / t) · √(log n)).
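The t1 t2 / t term in the abstract above is the expected overlap of two independent uniformly random task subsets, which a short simulation can illustrate. This is a sketch under stated assumptions, not the paper's construction: each processor is assumed to follow an independent uniformly random schedule, and the function names and the fixed seed are hypothetical choices for reproducibility.

```python
import random


def expected_waste(t1: int, t2: int, t: int) -> float:
    """Expected number of tasks done by both processors: t1 * t2 / t."""
    return t1 * t2 / t


def simulate_pairwise_waste(t: int, t1: int, t2: int,
                            trials: int = 2000, seed: int = 0) -> float:
    """Average overlap between two random schedule prefixes.

    Each trial draws the first t1 tasks of one random schedule and
    the first t2 tasks of another, then counts tasks done twice.
    """
    rng = random.Random(seed)
    tasks = range(t)
    total = 0
    for _ in range(trials):
        done_a = set(rng.sample(tasks, t1))
        done_b = set(rng.sample(tasks, t2))
        total += len(done_a & done_b)
    return total / trials


if __name__ == "__main__":
    print(expected_waste(20, 30, 100))
    print(simulate_pairwise_waste(100, 20, 30))
```

The simulated average should sit close to the closed-form value; the Δ term in the abstract bounds how far a single run can deviate from it with high probability.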
Cooperative Computing with Fragmentable and Mergeable Groups
J. Discrete Algorithms, 2000
Cited by 16 (6 self)
This work considers the problem of performing a set of N tasks on a set of P cooperating message-passing processors (P ≤ N). The processors use a group communication service (GCS) to coordinate their activity in the setting where dynamic changes in the underlying network topology cause the processor groups to change over time. GCSs have been recognized as effective building blocks for fault-tolerant applications in such settings. Our results explore the efficiency of fault-tolerant cooperative computation using GCSs. Prior investigation of this area by Dolev et al. [8] focused on competitive lower bounds, non-redundant task allocation schemes and work-efficient algorithms in the presence of fragmentation regroupings. In this work we investigate work-efficient and message-efficient algorithms for fragmentation and merge regroupings. We present an algorithm that uses GCSs and implements a coordinator-based strategy. This algorithm is motivated by the results in [8]. It achieves similar work complexity of O(N · f + N) for fragmentations, where f is the number of new groups created by dynamic fragmentations.
Performing Tasks on Synchronous Restartable Message-Passing Processors
Distributed Computing, 2000
Cited by 15 (4 self)
We consider the problem of performing t tasks in a distributed system of p fault-prone processors. This problem, called Do-All herein, was introduced by Dwork, Halpern and Waarts. Our work deals with a synchronous message-passing distributed system with processor stop-failures and restarts. We present two new algorithms based on a new aggressive coordination paradigm by which multiple coordinators may be active as the result of failures. The first algorithm is tolerant of f < p stop-failures and it does not allow restarts. It has available processor steps (work) complexity S = O((t + p log p / log log p) log f) and message complexity M = O(t + p log p / log log p + fp). Unlike prior solutions, our algorithm uses redundant broadcasts when encountering failures and, for p = t and large f, it achieves better work complexity. This algorithm is used as the basis for another algorithm that tolerates stop-failures and restarts. This new algorithm is the first solution for the Do-All problem that efficiently deals with processor restarts. Its available processor steps complexity is S = O((t + p log p + f) min{log p, log f}), and its message complexity is M = O(t + p log p + fp), where f is the total number of failures.
The Complexity of Synchronous Iterative Do-All with Crashes
2001
Cited by 14 (5 self)
Do-All is the problem of performing N tasks in a distributed system of P failure-prone processors [9]. Many distributed and parallel algorithms have been developed for this basic problem and several algorithm simulations have been developed by iterating Do-All algorithms. The efficiency of the solutions for Do-All is measured in terms of work complexity where all processing steps taken by the processors are counted. Work is ideally expressed as a function of N, P, and f, the number of processor crashes. However the known lower bounds and the upper bounds for extant algorithms do not adequately show how work depends on f. We present the first non-trivial lower bounds for Do-All that capture the dependence of work on N, P and f. For the model of computation where processors are able to make perfect load-balancing decisions locally, we also present matching upper bounds. Thus we give the first complete analysis of Do-All for this model. We define the r-iterative Do-All problem that abstracts the repeated use of Do-All such as found in algorithm simulations. Our f-sensitive analysis enables us to derive a tight bound for r-iterative Do-All work (that is stronger than the r-fold work complexity of a single Do-All). Our approach that models perfect load-balancing allows for the analysis of specific algorithms to be divided into two parts: (i) the analysis of the cost of tolerating failures while performing work, and (ii) the analysis of the cost of implementing load-balancing. We demonstrate the utility and generality of this approach by improving the analysis of two known efficient algorithms. We give an improved analysis of an efficient message-passing algorithm (algorithm AN [5]). We also derive a new and complete analysis of the best known Do-All algorithm for...