A Survey of Rollback-Recovery Protocols in Message-Passing Systems
1996
Cited by 474 (24 self)
this paper, we use the terms event logging and message logging interchangeably
Dome: Parallel programming in a heterogeneous multi-user environment
1995
Cited by 76 (4 self)
Writing parallel programs for distributed multi-user computing environments is a difficult task. The Distributed object migration environment (Dome) addresses three major issues of parallel computing in an architecture independent manner: ease of programming, dynamic load balancing, and fault tolerance. Dome programmers, with modest effort, can write parallel programs that are automatically distributed over a heterogeneous network, dynamically load balanced as the program runs, and able to survive compute node and network failures. This paper provides the motivation for and an overview of Dome, including a preliminary performance evaluation of dynamic load balancing for distributed vectors. Dome programs are shorter and easier to write than the equivalent programs written with message passing primitives. The performance overhead of Dome is characterized, and it is shown that this overhead can be recouped by dynamic load balancing in imbalanced systems. Finally, we show that a parallel ...
Adaptive Recovery for Mobile Environments
Communications of the ACM, 1997
Cited by 54 (6 self)
Mobile computing allows ubiquitous and continuous access to computing resources while the users travel or work at a client’s site. The flexibility introduced by mobile computing brings new challenges to the area of fault tolerance. Failures that were rare with fixed hosts become common, and host disconnection makes fault detection and message coordination difficult. This paper describes a new checkpoint protocol that is well adapted to mobile environments. The protocol uses time to indirectly coordinate the creation of new global states, avoiding all message exchanges. The protocol uses two different types of checkpoints to adapt to the current network characteristics, and to trade off performance with recovery time.
On the Use and Implementation of Message Logging
In 24th International Symposium on Fault-Tolerant Computing, 1994
Cited by 51 (2 self)
Message logging has long been advocated as offering better failure-free performance than coordinated checkpointing. On the contrary, we present a number of experiments showing that for compute-intensive applications executing in parallel on clusters of workstations, message logging has higher failure-free overhead than coordinated checkpointing. Message logging protocols, however, result in much shorter output latency than coordinated checkpointing. Therefore, message logging should be used for applications involving substantial interactions with the outside world, while coordinated checkpointing should be used otherwise. We also present an unorthodox message logging design that uses coordinated checkpointing with message logging, departing from the conventional approaches that use independent checkpointing. This combination of message logging and coordinated checkpointing offers several advantages, including improved failure-free performance, bounded recovery time, simplified garbage collection and reduced complexity. Meanwhile, the new protocols retain the advantages of the conventional message logging protocols with respect to output commit. Finally, we discuss three "lessons learned" from an implementation of various message logging protocols. First, during output commit, only the dependency information for the messages in the log needs to be written to stable storage. It is not necessary to write the message data to stable storage, leading to faster output commit. Second, the use of copy-on-write in the implementation of message logging substantially reduces the logging overhead for communication-intensive programs. Finally, we provide quantitative evidence supporting previous qualitative claims about the superiority of sender-based message logging over receiver-based logging.
Application Level Fault Tolerance in Heterogeneous Networks of Workstations
Journal of Parallel and Distributed Computing, 1997
Cited by 48 (0 self)
We have explored methods for checkpointing and restarting processes within the Distributed object migration environment (Dome), a C++ library of data parallel objects that are automatically distributed over heterogeneous networks of workstations (NOWs). System level checkpointing methods, although transparent to the user, were rejected because they lack support for heterogeneity. We have implemented application level checkpointing which places the checkpoint and restart mechanisms within Dome's C++ objects. Application level checkpointing has been implemented with a library-based technique for the programmer and a more transparent preprocessor-based technique. Dome's implementation of checkpointing successfully checkpoints and restarts processes on different numbers of machines and different architectures. Results from executing Dome programs across a NOW with realistic failure rates have been experimentally determined and are compared with theoretical results. The overhead of checkpoi...
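One way to picture application-level checkpointing as this abstract describes it: the object saves its logical contents in a machine-neutral form, so it can be rebuilt on a different number of workers or a different architecture. The Python sketch below is illustrative only. `DistributedVector`, `checkpoint`, and `restart` are hypothetical names, not Dome's C++ API, and JSON is an assumed stand-in for whatever portable format the authors used.

```python
import json


class DistributedVector:
    """Hypothetical data-parallel object; NOT Dome's actual API."""

    def __init__(self, values, num_workers):
        self.num_workers = num_workers
        self.partitions = self._split(list(values), num_workers)

    @staticmethod
    def _split(values, num_workers):
        # Cut the values into contiguous, nearly equal partitions,
        # one per (simulated) worker.
        base, extra = divmod(len(values), num_workers)
        parts, start = [], 0
        for w in range(num_workers):
            size = base + (1 if w < extra else 0)
            parts.append(values[start:start + size])
            start += size
        return parts

    def checkpoint(self):
        # Save the logical contents rather than a memory image, so the
        # checkpoint does not depend on word size or byte order.
        flat = [x for part in self.partitions for x in part]
        return json.dumps({"values": flat})

    @classmethod
    def restart(cls, blob, num_workers):
        # Rebuild the object, possibly over a different number of workers.
        state = json.loads(blob)
        return cls(state["values"], num_workers)


vec = DistributedVector(range(10), num_workers=4)
saved = vec.checkpoint()
restored = DistributedVector.restart(saved, num_workers=3)
print(restored.partitions)
```

Because the checkpoint records only the vector's values and not how they were partitioned, restarting on three workers instead of four is just a fresh split of the same data.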
On Coordinated Checkpointing in Distributed Systems
IEEE Transactions on Parallel and Distributed Systems, 1994
Cited by 29 (3 self)
Coordinated checkpointing simplifies failure recovery and eliminates domino effects in case of failures by preserving a consistent global checkpoint on stable storage. However, the approach suffers from high overhead associated with the checkpointing process. Two approaches are used to reduce the overhead: the first is to minimize the number of synchronization messages and the number of checkpoints; the other is to make the checkpointing process nonblocking. These two approaches were orthogonal in previous years until the Prakash-Singhal algorithm [18] combined them. In other words, the Prakash-Singhal algorithm forces only a minimum number of processes to take checkpoints and it does not block the underlying computation. However, we found two problems in this algorithm. In this paper, we identify these problems and prove a more general result: There does not exist a nonblocking algorithm that forces only a minimum number of processes to take their checkpoints. Based on this general result, we propose an efficient algorithm that neither forces all processes to take checkpoints nor blocks the underlying computation during checkpointing. Also, we point out future research directions in designing coordinated checkpointing algorithms for distributed computing systems.
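To make "a minimum number of processes" concrete, the sketch below computes which processes must checkpoint along with an initiator, given who has received messages from whom since the last checkpoint. It is a simplified illustration, not the algorithm from the paper. The function name and the `received_from` map are assumptions, and the dependencies are treated as frozen while the set is computed. That assumption is what a blocking protocol enforces, and it is the part that is hard to preserve without blocking.

```python
from collections import deque


def min_checkpoint_set(initiator, received_from):
    """Return the processes that must checkpoint with `initiator`.

    received_from[p] is the set of processes from which p has received
    messages since p's last checkpoint (hypothetical bookkeeping).
    If p checkpoints after receiving from q, q must checkpoint too,
    otherwise the saved state records a receive with no matching send.
    """
    must_checkpoint = {initiator}
    queue = deque([initiator])
    while queue:
        p = queue.popleft()
        for sender in received_from.get(p, ()):
            if sender not in must_checkpoint:
                must_checkpoint.add(sender)
                queue.append(sender)
    return must_checkpoint


# P1 received from P2, P2 received from P3, P4 received from P1.
deps = {"P1": {"P2"}, "P2": {"P3"}, "P3": set(), "P4": {"P1"}}
forced = min_checkpoint_set("P1", deps)
print(sorted(forced))
```

In this example P4 is not forced to checkpoint: it received a message from P1, but P1 does not depend on anything P4 sent.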
Message Logging in Mobile Computing
1999
Cited by 24 (6 self)
Dependable mobile computing is enhanced by independent recovery, low power consumption and no dependence on stable storage at the mobile host. Existing recovery protocols proposed for mobile environments typically create consistent global checkpoints that do not guarantee independent recovery and low power consumption. This paper demonstrates the advantages of message logging by describing a receiver-based logging protocol. Checkpointing is utilized to limit log size and recovery latency. We compare the performance of our approach with that of existing mobile checkpointing and recovery algorithms in terms of failure-free overhead and recovery time. We also describe a stable storage management scheme for mobile support stations. Garbage collection is achieved without direct participation of mobile hosts.
On the Impossibility of Min-Process Non-Blocking Checkpointing and an Efficient Checkpointing Algorithm for Mobile Computing Systems
1998
Cited by 22 (3 self)
Mobile computing raises many new issues, such as lack of stable storage, low bandwidth of wireless channel, high mobility, and limited battery life. These new issues make traditional checkpointing algorithms unsuitable. Prakash and Singhal [14] proposed the first coordinated checkpointing algorithm for mobile computing systems. However, we showed that their algorithm may result in an inconsistency [3]. In this paper, we prove a more general result about coordinated checkpointing: there does not exist a non-blocking algorithm that forces only a minimum number of processes to take their checkpoints. Based on the proof, we propose an efficient algorithm for mobile computing systems, which forces only a minimum number of processes to take checkpoints and dramatically reduces the blocking time during the checkpointing process. Correctness proofs and performance analysis of the algorithm are provided.
Checkpointing With Mutable Checkpoints
Theoretical Computer Science 290 (2003) 1127-1148, 2003
Cited by 22 (2 self)
There are two approaches to reduce the overhead associated with coordinated checkpointing: first is to minimize the number of synchronization messages and the number of checkpoints; the other is to make the checkpointing process non-blocking. In our previous work (IEEE Parallel Distributed Systems 9 (12) (1998) 1213), we proved that there does not exist a non-blocking algorithm which forces only a minimum number of processes to take their checkpoints. In this paper, we present a min-process algorithm which relaxes the non-blocking condition while trying to minimize the blocking time, and a non-blocking algorithm which relaxes the min-process condition while minimizing the number of checkpoints saved on the stable storage. The proposed non-blocking algorithm is based on the concept of "mutable checkpoint", which is neither a tentative checkpoint nor a permanent checkpoint. Based on mutable checkpoints, our non-blocking algorithm avoids the avalanche effect and forces only a minimum number of processes to take their checkpoints on the stable storage.
High-Level Fault Tolerance in Distributed Programs
1994
Cited by 21 (3 self)
We have been developing high-level checkpoint and restart methods for Dome (Distributed Object Migration Environment), a C++ library of data-parallel objects that are automatically distributed using PVM. There are several levels of programming abstraction at which fault tolerance mechanisms can be designed: high-level, where the checkpoint and restart are built into our C++ objects, but the program structure is severely constrained; high-level with preprocessing, where a preprocessor inserts extra C++ statements into the code to facilitate checkpoint and restart; and low-level, where periodically an interrupt causes a memory image to be written out. Because we consider portability (both of our libraries and of the checkpoints they produce) to be an important goal, we focus on the higher-level checkpointing methods. In addition, we describe an implementation of high-level checkpointing, demonstrate it on multiple architectures, and show that it is efficient enough to provide good expect...

