Results 1–10 of 37
Performing work efficiently in the presence of faults
In the Proceedings of the 11th ACM Symposium on Principles of Distributed Computing (PODC), 1998
Cited by 46 (0 self)
Abstract. We consider a system of t synchronous processes that communicate only by sending messages to one another, and that together must perform n independent units of work. Processes may fail by crashing; we want to guarantee that in every execution of the protocol in which at least one process survives, all n units of work will be performed. We consider three parameters: the number of messages sent, the total number of units of work performed (including multiplicities), and time. We present three protocols for solving the problem. All three are work-optimal, doing O(n + t) work. The first has moderate costs in the remaining two parameters, sending O(t√t) messages and taking O(n + t) time. This protocol can be easily modified to run in any completely asynchronous system equipped with a failure detection mechanism. The second sends only O(t log t) messages, but its running time is large (O(t^2 (n+t) 2^(n+t))). The third is essentially time-optimal in the (usual) case in which there are no failures, and its time complexity degrades gracefully as the number of failures increases.
Dynamic Load Balancing with Group Communication
In the 6th International Colloquium on Structural Information and Communication Complexity, 1996
Cited by 19 (8 self)
This work considers the problem of efficiently performing a set of tasks using a network of processors in the setting where the network is subject to dynamic reconfigurations, including partitions and merges. A key challenge for this setting is the implementation of dynamic load balancing that reduces the number of tasks that are performed redundantly because of the reconfigurations. We explore new approaches for load balancing in dynamic networks that can be employed by applications using a group communication service. The group communication services that we consider include a membership service (establishing new groups to reflect dynamic changes) but do not include maintenance of a primary component. For the n-processor, n-task load balancing problem defined in this work, the following specific results are obtained. For the case of fully dynamic changes, including fragmentation and merges, we show that the termination time of any on-line task assignment algorithm is greater than the termination time of an off-line task assignment algorithm by a factor greater than n/12. We present a load balancing algorithm that guarantees completion of all tasks in all fragments
Performing tasks on restartable message-passing processors
In Proc. of the 11th Int'l Workshop on Distributed Algorithms (WDAG'97), 1997
Cited by 14 (8 self)
Abstract. This work presents new algorithms for the "Do-All" problem that consists of performing t tasks reliably in a message-passing synchronous system of p fault-prone processors. The algorithms are based on an aggressive coordination paradigm in which multiple coordinators may be active as the result of failures. The first algorithm is tolerant of f < p stop-failures and it does not allow restarts. It has the available processor steps complexity S = O((t + p log p / log log p) · log f) and the message complexity M = O(t + p log p / log log p + f·p). Unlike prior solutions, our algorithm uses redundant broadcasts when encountering failures and, for large f, it has better S complexity. This algorithm is used as the basis for another algorithm which tolerates any pattern of stop-failures and restarts. This new algorithm is the first solution for the Do-All problem that efficiently deals with processor restarts. Its available processor steps complexity is S = O((t + p log p + f) · min{log p, log f}), and its message complexity is M = O(t + p log p + f·p), where f is the number of failures.
Performing Tasks on Synchronous Restartable Message-Passing Processors
Distributed Computing, 2000
Cited by 13 (3 self)
We consider the problem of performing t tasks in a distributed system of p fault-prone processors. This problem, called Do-All herein, was introduced by Dwork, Halpern and Waarts. Our work deals with a synchronous message-passing distributed system with processor stop-failures and restarts. We present two new algorithms based on a new aggressive coordination paradigm by which multiple coordinators may be active as the result of failures. The first algorithm is tolerant of f < p stop-failures and it does not allow restarts. It has available processor steps (work) complexity S = O((t + p log p / log log p) log f) and message complexity M = O(t + p log p / log log p + fp). Unlike prior solutions, our algorithm uses redundant broadcasts when encountering failures and, for p = t and large f, it achieves better work complexity. This algorithm is used as the basis for another algorithm that tolerates stop-failures and restarts. This new algorithm is the first solution for the Do-All problem that efficiently deals with processor restarts. Its available processor steps complexity is S = O((t + p log p + f) min{log p, log f}), and its message complexity is M = O(t + p log p + fp), where f is the total number of failures.
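The "available processor steps" measure used in this and the preceding entry can be illustrated with a toy round-based simulation. This is a hypothetical sketch under simplified assumptions, not the coordinator-based algorithm of the papers: S accrues one unit per live processor per synchronous round, each live processor performs at most one task per round under idealized load balancing, and at least one processor is kept alive, as the Do-All model requires.

```python
import random

def doall_simulation(t_tasks, p_procs, crash_prob=0.1, seed=0):
    """Toy synchronous Do-All run: each round, remaining tasks are spread
    over the live processors (idealized load balancing), and crashes may
    strike between rounds.  Returns (available processor steps S, rounds)."""
    rng = random.Random(seed)
    live = set(range(p_procs))
    remaining = t_tasks
    steps = rounds = 0
    while remaining > 0:
        rounds += 1
        steps += len(live)                     # one step per live processor
        remaining = max(0, remaining - len(live))  # one task each, at most
        survivors = {q for q in live if rng.random() > crash_prob}
        live = survivors or {min(live)}        # the model keeps one survivor
    return steps, rounds
```

With no crashes the simulation performs exactly t_tasks steps in ceil(t/p) rounds; with crashes, both S and the round count grow, which is the dependence on f that the entries above quantify.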
Cooperative Computing with Fragmentable and Mergeable Groups
J. Discrete Algorithms, 2000
Cited by 13 (5 self)
This work considers the problem of performing a set of N tasks on a set of P cooperating message-passing processors (P ≤ N). The processors use a group communication service (GCS) to coordinate their activity in the setting where dynamic changes in the underlying network topology cause the processor groups to change over time. GCSs have been recognized as effective building blocks for fault-tolerant applications in such settings. Our results explore the efficiency of fault-tolerant cooperative computation using GCSs. Prior investigation of this area by Dolev et al. [8] focused on competitive lower bounds, non-redundant task allocation schemes and work-efficient algorithms in the presence of fragmentation regroupings. In this work we investigate work-efficient and message-efficient algorithms for fragmentation and merge regroupings. We present an algorithm that uses GCSs and implements a coordinator-based strategy. This algorithm is motivated by the results in [8]. It achieves similar work complexity of O(N·f + N) for fragmentations, where f is the number of new groups created by dynamic fragmentations.
The Complexity of Synchronous Iterative Do-All with Crashes
2001
Cited by 11 (3 self)
Do-All is the problem of performing N tasks in a distributed system of P failure-prone processors [9]. Many distributed and parallel algorithms have been developed for this basic problem and several algorithm simulations have been developed by iterating Do-All algorithms. The efficiency of the solutions for Do-All is measured in terms of work complexity, where all processing steps taken by the processors are counted. Work is ideally expressed as a function of N, P, and f, the number of processor crashes. However, the known lower bounds and the upper bounds for extant algorithms do not adequately show how work depends on f. We present the first non-trivial lower bounds for Do-All that capture the dependence of work on N, P and f. For the model of computation where processors are able to make perfect load-balancing decisions locally, we also present matching upper bounds. Thus we give the first complete analysis of Do-All for this model. We define the r-iterative Do-All problem that abstracts the repeated use of Do-All such as found in algorithm simulations. Our f-sensitive analysis enables us to derive a tight bound for r-iterative Do-All work (that is stronger than the r-fold work complexity of a single Do-All). Our approach that models perfect load-balancing allows for the analysis of specific algorithms to be divided into two parts: (i) the analysis of the cost of tolerating failures while performing work, and (ii) the analysis of the cost of implementing load-balancing. We demonstrate the utility and generality of this approach by improving the analysis of two known efficient algorithms. We give an improved analysis of an efficient message-passing algorithm (algorithm AN [5]). We also derive a new and complete analysis of the best known Do-All algorithm for...
Robust gossiping with an application to consensus
 Journal of Computer and System Sciences
Cited by 9 (5 self)
We study deterministic gossiping in synchronous systems with dynamic crash failures. Each processor is initialized with an input value called a rumor. In the standard gossip problem, the goal of every processor is to learn all the rumors. When processors may crash, this goal needs to be revised, since it is possible, at a point in an execution, that certain rumors are known only to processors that have already crashed. We define gossiping to be completed, for a system with crashes, when every processor knows either the rumor of processor v or that v has already crashed, for any processor v. We design gossiping algorithms that are efficient with respect to both time and communication. Let t < n be the number of failures, where n is the number of processors. If n − t = Ω(n/polylog n), then one of our algorithms completes gossiping in O(log² t) time and with O(n polylog n) messages. We develop an algorithm that performs gossiping with O(n^1.77) messages and in O(log² n) time, in any execution in which at least one processor remains non-faulty. We show a trade-off between time and communication in gossiping algorithms: if the number of messages is at most O(n polylog n), then the time has to be at least Ω(log n / (log(n log n) − log t)). By way of application, we show that if n − t = Ω(n), then consensus can be solved in O(t) time and with O(n log² t) messages.
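The completion condition this entry defines (every processor knows v's rumor or that v crashed) is mechanical enough to check in code. A hypothetical sketch, not the paper's message-efficient protocol: "knows v crashed" is simplified to membership in a global crash set, and a single all-to-all exchange stands in for the actual gossip rounds.

```python
def gossip_complete(knowledge, crashed):
    """The termination condition of the entry above: every live processor
    u, for every processor v, either holds v's rumor or knows v crashed
    (modeled here as membership in a shared crash set)."""
    live = set(knowledge) - crashed
    return all(v in knowledge[u] or v in crashed
               for u in live for v in knowledge)

def all_to_all_round(knowledge, crashed):
    """One synchronous round where every live processor sends everything it
    knows to everyone else.  This costs Theta(n^2) messages -- far more than
    the paper's O(n polylog n) -- but it illustrates the goal state."""
    live = [u for u in knowledge if u not in crashed]
    pooled = set().union(*(knowledge[u] for u in live)) if live else set()
    for u in live:
        knowledge[u] = set(pooled)
```

Starting from knowledge[u] = {u}, one such round completes gossiping for any crash set, since every surviving rumor is pooled; the algorithms in the paper achieve the same condition with far fewer messages.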
Optimal Scheduling for Disconnected Cooperation
2001
Cited by 8 (3 self)
We consider a distributed environment consisting of n processors that need to perform t tasks. We assume that communication is initially unavailable and that processors begin work in isolation. At some unknown point of time an unknown collection of processors may establish communication. Before processors begin communication they execute tasks in the order given by their schedules. Our goal is to schedule the work of isolated processors so that when communication is established for the first time, the number of redundantly executed tasks is controlled. We quantify worst-case redundancy as a function of processor advancements through their schedules. In this work we refine and simplify an extant deterministic construction for schedules with n ≤ t, and we develop a new analysis of its waste. The new analysis shows that for any pair of schedules, the number of redundant tasks can be controlled for the entire range of t tasks. Our new result is asymptotically optimal: the tails of these schedules are within a 1 + O(n^(-1/4)) factor of the lower bound. We also present two new deterministic constructions, one for t ≥ n and the other for t ≥ n^(3/2), which substantially improve pairwise waste for all prefixes of length t/√n, and offer near-optimal waste for the tails of the schedules. Finally, we present bounds for the waste of any collection of k ≥ 2 processors, for both deterministic and randomized constructions.
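The pairwise waste measure analyzed in this entry can be stated concretely: if two processors following schedules s1 and s2 have completed a and b tasks respectively when they first connect, the waste is the overlap of the two executed prefixes. A minimal sketch (the helper name is hypothetical; the paper's contribution is constructing permutations that keep this quantity small for every prefix pair):

```python
def pairwise_waste(s1, s2, a, b):
    """Tasks executed by both processors before first contact: the overlap
    of the length-a prefix of schedule s1 with the length-b prefix of s2.
    Each schedule is a permutation of the t task identifiers."""
    return len(set(s1[:a]) & set(s2[:b]))
```

Identical schedules give the worst case, min(a, b) redundant tasks, while well-chosen permutations can keep the overlap close to roughly a·b/t (the expected overlap of random prefixes), which is the flavor of lower bound the schedules above approach.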
The Do-All Problem with Byzantine Processor Failures
Theoretical Computer Science, 2003
Cited by 7 (3 self)
Do-All is the abstract problem of using n processors to cooperatively perform m independent tasks in the presence of failures. This problem can be used as the cornerstone in identifying aspects of the trade-off between efficiency and fault-tolerance in cooperative computing and in developing efficient and fault-tolerant algorithms for distributed cooperative applications. Many algorithms have been developed for Do-All in various models of computation, including message-passing, partitionable networks, and shared-memory models, and under various failure models. However, to the best of our knowledge, Do-All has not been studied under Byzantine processor failures, where a faulty processor may exhibit completely unconstrained behavior. Byzantine failures model any arbitrary type of processor malfunction, including, for example, failures of individual components within the processors.
Randomization helps to perform independent tasks reliably
Random Structures and Algorithms
Cited by 6 (1 self)
This paper is about algorithms that schedule tasks to be performed in a distributed failure-prone environment, when processors communicate by message-passing, and when tasks are independent and of unit length. The processors work under synchrony and may fail by crashing. Failure patterns are imposed by adversaries. The question of how the power of adversaries affects the optimality of randomized algorithmic solutions is among the problems studied. Linearly-bounded adversaries may fail up to a constant fraction of the processors. Weakly-adaptive adversaries have to select, prior to the start of an execution, a subset of processors to be failure-prone, and then may fail only the selected processors, at arbitrary steps, in the course of the execution. Strongly-adaptive adversaries have a total number of failures as the only restriction on failure patterns. The measures of complexity are work, measured as the available processor steps, and communication, measured as the number of point-to-point messages. A randomized algorithm is developed that attains both O(n log* n) expected work and O(n log* n) expected communication against weakly-adaptive linearly-bounded adversaries, in the case when the numbers of tasks and processors are both equal to n. This is in contrast with the performance of algorithms against strongly-adaptive linearly-bounded adversaries, which has to be Ω(n log n / log log n) in terms of work. Key words: distributed algorithm, randomized algorithm, message passing, crash failures, adaptive adversary, independent tasks, load balancing, lower bound.