Results 1–10 of 30
Time-Optimal Message-Efficient Work Performance in the Presence of Faults
, 1994
Abstract

Cited by 36 (5 self)
Performing work in parallel by a multitude of processes in a distributed environment is currently a fast-growing area of computer applications (due to its cost-effectiveness). Adaptation of such applications to changes in the system's parallelism (i.e., the availability of processes) is essential for improved performance and reliability. In this work we consider one aspect of coping with dynamic process failures in such a setting, namely the following scenario formulated by Dwork, Halpern and Waarts [DHW92]: a system of n synchronous processes that communicate only by sending messages to one another must perform m independent units of work. Processes may fail by crashing, and wait-freeness is required, i.e., whenever at least one process survives, all m units of work must be performed. We consider the notion of fast algorithms in this setting, yet we are not willing to trade improved time for a high cost in communication. Thus, we require message efficiency as well. ...
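The wait-freeness condition above is trivially met by the naive scheme in which every surviving process performs every task; the interesting part of the paper is doing so time- and message-efficiently. A minimal sketch of that baseline (names and the crash model here are illustrative, not taken from the paper):

```python
# Naive baseline for performing m units of work with crash failures:
# every non-crashed process independently performs all m tasks, so
# wait-freeness holds trivially -- if even one process survives, all
# m tasks get done. Total work is O(n * m), which is exactly the
# redundancy that time- and message-efficient algorithms reduce.

def naive_do_all(n_processes, m_tasks, crashed):
    """Return (completed tasks, total task executions) when each
    non-crashed process performs all m tasks; `crashed` is a set of ids."""
    done = set()
    work = 0  # total executions, counting redundant ones
    for p in range(n_processes):
        if p in crashed:
            continue
        for task in range(m_tasks):
            done.add(task)
            work += 1
    return done, work

# With one survivor out of n = 4, all m = 6 tasks still complete:
done, work = naive_do_all(4, 6, crashed={0, 1, 2})
```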
Resolving Message Complexity of Byzantine Agreement and Beyond
 in Proc. 36th IEEE Symposium on Foundations of Computer Science
, 1995
Abstract

Cited by 36 (3 self)
Byzantine Agreement among processors is a basic primitive in distributed computing. It comes in a number of basic fault models: "Crash", "Omission" and "Malicious" adversarial behaviors. The message complexity of the primitive has been known for the strong failure models of Malicious and Omission adversaries since the early 80's, while the question for the more benign Crash failure model has been open. In this paper we show how to solve agreement in the presence of crash failures using O(n) messages, which is optimal, thus settling a thirteen-year-old open problem. Our solution has almost linear time, and our new algorithmic techniques have further implications: a family of "early stopping" agreement protocols with improved message complexity, and a new solution to "Checkpoint" yielding a substantial improvement of the protocol for distributed work performance under adaptive parallelism in a network of workstations. Columbia University and Tel-Aviv University. galil@cs.columbia.edu ...
Stable Leader Election
 In DISC
, 2001
Abstract

Cited by 34 (3 self)
We introduce the notion of stable leader election and derive several algorithms for this problem. Roughly speaking, a leader election algorithm is stable if it ensures that once a leader is elected, it remains the leader for as long as it does not crash and its links have been behaving well, irrespective of the behavior of other processes and links. In addition to being stable, our leader election algorithms have several desirable properties. In particular, they are all communication-efficient, i.e., they eventually use only n links to carry messages, and they are robust, i.e., they work in systems where only the links to/from some correct process are required to be eventually timely. Moreover, our best leader election algorithm tolerates message losses, and it ensures that a leader is elected in constant time when the system is stable. We conclude the paper by applying the above ideas to derive a robust and efficient algorithm for the eventually perfect failure detector ♦P.
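The stability property can be illustrated with a hedged sketch (this is not the paper's algorithm, just the common smallest-trusted-id building block plus the extra rule that a trusted incumbent is never demoted):

```python
# Sketch of stable leader election at a single process. A plain election
# picks the smallest-id process it does not suspect of having crashed;
# stability adds the rule that a currently trusted leader keeps its role
# even if a smaller id later becomes trusted. `suspected` would come from
# a failure detector; here it is just a set, for illustration.

def stable_elect(process_ids, suspected, current_leader):
    """Keep the current leader while it is trusted (stability);
    otherwise fall back to the smallest trusted id, or None."""
    if current_leader is not None and current_leader not in suspected:
        return current_leader  # do not demote a well-behaved leader
    trusted = [p for p in process_ids if p not in suspected]
    return min(trusted) if trusted else None
```

Without the incumbent rule, process 1 recovering would steal leadership from a perfectly healthy leader 3; the rule is what prevents such spurious leader changes.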
Dynamic Load Balancing with Group Communication
 6TH INTERNATIONAL COLLOQUIUM ON STRUCTURAL INFORMATION AND COMMUNICATION COMPLEXITY
, 1996
Abstract

Cited by 18 (8 self)
This work considers the problem of efficiently performing a set of tasks using a network of processors in the setting where the network is subject to dynamic reconfigurations, including partitions and merges. A key challenge for this setting is the implementation of dynamic load balancing that reduces the number of tasks that are performed redundantly because of the reconfigurations. We explore new approaches for load balancing in dynamic networks that can be employed by applications using a group communication service. The group communication services that we consider include a membership service (establishing new groups to reflect dynamic changes) but do not include maintenance of a primary component. For the n-processor, n-task load balancing problem defined in this work, the following specific results are obtained. For the case of fully dynamic changes including fragmentations and merges, we show that the termination time of any online task assignment algorithm is greater than the termination time of an offline task assignment algorithm by a factor greater than n/12. We present a load balancing algorithm that guarantees completion of all tasks in all fragments
Reliably executing tasks in the presence of untrusted entities
 Proceedings of the 25th IEEE Symposium on Reliable Distributed Systems
, 2006
Abstract

Cited by 15 (5 self)
In this work we consider a distributed system formed by a master processor and a collection of n processors (workers) that can execute tasks; worker processors are untrusted and might act maliciously. The master assigns tasks to workers to be executed. Each task returns a binary value, and we want the master to accept only correct values with high probability. Furthermore, we assume that the service provided by the workers is not free; for each task that a worker is assigned, the master is charged one work unit. Therefore, considering a single task assigned to several workers, our goal is to have the master accept the correct value of the task with high probability, with the smallest possible amount of work (the number of workers the master assigns the task to). We explore two ways of bounding the number of faulty processors: (a) we consider a fixed bound f < n/2 on the maximum number of workers that may fail, and (b) a probability p < 1/2 of a processor being faulty (each processor is faulty with probability p, independently of the rest). Our work demonstrates that it is possible to obtain a high probability of correct acceptance with low work. In particular, by considering both mechanisms of bounding the number of malicious workers, we first show lower bounds on the minimum amount of (expected) work required so that any algorithm accepts the correct value with probability of success 1 − ε, where ε ≪ 1 (e.g., 1/n). Then we develop and analyze two algorithms, each using a different decision strategy, and show that both algorithms obtain the same probability of success 1 − ε, and in doing so, they require similar upper bounds on the (expected) work. Furthermore, under certain conditions, these upper bounds are asymptotically optimal with respect to our lower bounds.
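For the fixed-bound model (a), the classic voting idea behind such master-worker schemes can be sketched as follows (a hedged illustration, not the paper's specific decision strategies; the worker behaviors below are stand-ins):

```python
from collections import Counter

# With at most f < n/2 malicious workers, assigning a binary-valued task
# to 2f + 1 workers and taking the majority of the returned bits yields
# the correct value: at least f + 1 honest replies outvote the at most
# f adversarial ones. Work here is the number of queried workers, 2f + 1.

def majority_decision(replies):
    """Accept the bit returned by a strict majority of the replies."""
    counts = Counter(replies)
    value, votes = counts.most_common(1)[0]
    assert votes > len(replies) // 2, "no strict majority"
    return value

# Worst case for f = 2: all f malicious workers lie in unison.
f = 2
correct = 1
honest = [correct] * (f + 1)       # f + 1 honest replies
malicious = [1 - correct] * f      # f adversarial replies
accepted = majority_decision(honest + malicious)
```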
Work-competitive scheduling for cooperative computing with dynamic groups
 SIAM JOURNAL ON COMPUTING
, 2005
Abstract

Cited by 15 (4 self)
The problem of cooperatively performing a set of t tasks in a decentralized computing environment subject to failures is one of the fundamental problems in distributed computing. The setting with partitionable networks is especially challenging, as algorithmic solutions must accommodate the possibility that groups of processors become disconnected (and, perhaps, reconnected) during the computation. The efficiency of task-performing algorithms is often assessed in terms of work: the total number of tasks, counting multiplicities, performed by all of the processors during the computation. In general, the scenario where the processors are partitioned into g disconnected components causes any task-performing algorithm to have work Ω(t · g) even if each group of processors performs no more than the optimal number of Θ(t) tasks. Given that such pessimistic lower bounds apply to any scheduling algorithm, we pursue a competitive analysis. Specifically, this paper studies a simple randomized scheduling algorithm for p asynchronous processors, connected by a dynamically changing communication medium, to complete t known tasks. The performance of this algorithm is compared against that of an omniscient offline algorithm with full knowledge of the future changes in the communication medium. The paper describes a notion of computation width, which associates a natural number with a history of changes in the communication medium, and shows both upper and lower bounds on work-competitiveness in terms of this quantity. Specifically, it is shown that the simple randomized algorithm obtains the competitive ratio (1 + cw/e), where cw is the computation width and e is the base of the natural logarithm (e = 2.7182...); this competitive ratio is then shown to be tight.
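The "simple randomized scheduling algorithm" in such settings typically has each processor draw an independent random permutation of the t tasks and work through it, skipping tasks it has learned are done. A hedged single-group simulation (illustrative only; the paper's model and analysis are more general):

```python
import random

# Each processor follows its own random permutation of the tasks.
# Within one connected group, completed-task information is shared, so
# the group performs each task once; the random orders are what keep
# the expected overlap between *disconnected* groups low.

def random_schedule(tasks, rng):
    order = list(tasks)
    rng.shuffle(order)
    return order

def run_group(schedules, t):
    """Simulate one connected group: processors take turns performing
    the next not-yet-done task in their own permutation. Returns the
    total work (task executions) performed by the group."""
    done = set()
    work = 0
    positions = [0] * len(schedules)
    while len(done) < t:
        for i, sched in enumerate(schedules):
            while positions[i] < t and sched[positions[i]] in done:
                positions[i] += 1  # skip tasks already completed
            if positions[i] < t:
                done.add(sched[positions[i]])
                work += 1
                positions[i] += 1
    return work

rng = random.Random(7)
schedules = [random_schedule(range(12), rng) for _ in range(4)]
work = run_group(schedules, 12)
```

Because completed tasks are shared within the group, the simulated work equals t exactly; redundancy only arises when groups are disconnected and cannot share that information.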
Distributed Cooperation during the Absence of Communication
, 2001
Abstract

Cited by 14 (7 self)
This paper presents a study of a distributed cooperation problem under the assumption that processors may not be able to communicate for a prolonged time. The problem for n processors is defined in terms of t tasks that need to be performed efficiently and that are known to all processors. The results of this study characterize the ability of the processors to schedule their work so that when some processors establish communication, the wasted (redundant) work these processors have collectively performed prior to that time is controlled. The lower bound for wasted work presented here shows that for any set of schedules there are two processors such that when they complete t1 and t2 tasks respectively, the number of redundant tasks is Ω(t1 t2 / t). For n = t and for schedules longer than √n, the number of redundant tasks for two or more processors must be at least 2. The upper bound on pairwise waste for schedules of length √n is shown to be 1. Our efficient deterministic schedule construction is motivated by design theory. To obtain linear-length schedules, a novel deterministic and efficient construction is given. This construction has the property that pairwise wasted work increases gracefully as processors progress through their schedules. Finally, our analysis of a random scheduling solution shows that with high probability pairwise waste is well behaved at all times: specifically, two processors having completed t1 and t2 tasks, respectively, are guaranteed to have no more than t1 t2 / t + δ redundant tasks, where δ = O(log n + √(t1 t2 / t · log n)).
Cooperative Computing with Fragmentable and Mergeable Groups
 J. Discrete Algorithms
, 2000
Abstract

Cited by 13 (5 self)
This work considers the problem of performing a set of N tasks on a set of P cooperating message-passing processors (P ≤ N). The processors use a group communication service (GCS) to coordinate their activity in the setting where dynamic changes in the underlying network topology cause the processor groups to change over time. GCSs have been recognized as effective building blocks for fault-tolerant applications in such settings. Our results explore the efficiency of fault-tolerant cooperative computation using GCSs. Prior investigation of this area by Dolev et al. [8] focused on competitive lower bounds, non-redundant task allocation schemes and work-efficient algorithms in the presence of fragmentation regroupings. In this work we investigate work-efficient and message-efficient algorithms for fragmentation and merge regroupings. We present an algorithm that uses GCSs and implements a coordinator-based strategy. This algorithm is motivated by the results in [8]. It achieves similar work complexity of O(N · f + N) for fragmentations, where f is the number of new groups created by dynamic fragmentations.
Performing Tasks on Synchronous Restartable Message-Passing Processors
 Distributed Computing
, 2000
Abstract

Cited by 12 (3 self)
We consider the problem of performing t tasks in a distributed system of p fault-prone processors. This problem, called Do-All herein, was introduced by Dwork, Halpern and Waarts. Our work deals with a synchronous message-passing distributed system with processor stop-failures and restarts. We present two new algorithms based on a new aggressive coordination paradigm by which multiple coordinators may be active as the result of failures. The first algorithm is tolerant of f < p stop-failures and does not allow restarts. It has available processor steps (work) complexity S = O((t + p log p / log log p) log f) and message complexity M = O(t + p log p / log log p + fp). Unlike prior solutions, our algorithm uses redundant broadcasts when encountering failures and, for p = t and large f, it achieves better work complexity. This algorithm is used as the basis for another algorithm that tolerates stop-failures and restarts. This new algorithm is the first solution for the Do-All problem that efficiently deals with processor restarts. Its available processor steps complexity is S = O((t + p log p + f) · min{log p, log f}), and its message complexity is M = O(t + p log p + fp), where f is the total number of failures.
Robust gossiping with an application to consensus
 Journal of Computer and System Sciences
Abstract

Cited by 9 (5 self)
We study deterministic gossiping in synchronous systems with dynamic crash failures. Each processor is initialized with an input value called a rumor. In the standard gossip problem, the goal of every processor is to learn all the rumors. When processors may crash, this goal needs to be revised, since it is possible, at a point in an execution, that certain rumors are known only to processors that have already crashed. We define gossiping to be completed, for a system with crashes, when every processor knows either the rumor of processor v or that v has already crashed, for any processor v. We design gossiping algorithms that are efficient with respect to both time and communication. Let t < n be the number of failures, where n is the number of processors. If n − t = Ω(n/polylog n), then one of our algorithms completes gossiping in O(log² t) time and with O(n polylog n) messages. We develop an algorithm that performs gossiping with O(n^1.77) messages and in O(log² n) time, in any execution in which at least one processor remains non-faulty. We show a tradeoff between time and communication in gossiping algorithms: if the number of messages is at most O(n polylog n), then the time has to be at least Ω(log n / (log(n log n) − log t)). By way of application, we show that if n − t = Ω(n), then consensus can be solved in O(t) time and with O(n log² t) messages.
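The revised completion condition (know v's rumor or know that v crashed) is trivially met by naive flooding in a synchronous system, at the cost of Θ(n²) messages per round; the paper's point is to achieve it with far fewer messages. A hedged sketch of the baseline (the crash-detection-by-silence rule relies on round synchrony and is illustrative):

```python
# One synchronous all-to-all gossip round: every live processor sends its
# rumor to everyone. A processor that stays silent in the round must have
# crashed (round synchrony), so receivers record "crashed" for it. After
# this single flooding round, every live processor knows, for each v,
# either v's rumor or that v crashed -- the revised completion condition.

def gossip_round(rumors, crashed):
    """`rumors` maps live processor ids to their rumor; `crashed` is the
    set of crashed ids. Returns, per live processor, its view of everyone."""
    everyone = set(rumors) | set(crashed)
    knowledge = {}
    for p in rumors:  # each live processor receives all messages sent
        view = {}
        for v in everyone:
            if v in rumors:
                view[v] = ("rumor", rumors[v])
            else:
                view[v] = ("crashed", None)  # silence implies a crash
        knowledge[p] = view
    return knowledge

def gossip_complete(knowledge, everyone):
    """Check the paper's completion condition for every live processor."""
    return all(set(view) == everyone for view in knowledge.values())

rumors = {0: "a", 1: "b", 2: "c"}
crashed = {3}
knowledge = gossip_round(rumors, crashed)
```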