Results 1 - 10 of 521
Checkpointing and Rollback-Recovery for Distributed Systems
- IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL
, 1987
"... We consider the problem of bringing a distributed system to a consistent state after transient failures. We address the two components of this problem by describing a distributed algorithm to create consistent checkpoints, as well as a rollback-recovery algorithm to recover the system to a consiste ..."
Cited by 366 (0 self)
consistent state. In contrast to previous algorithms, they tolerate failures that occur during their executions. Furthermore, when a process takes a checkpoint, a minimal number of additional processes are forced to take checkpoints. Similarly, when a process rolls back and restarts after a failure, a
ARIES: A transaction recovery method supporting fine-granularity locking and partial rollbacks using write-ahead logging
- ACM Transactions on Database Systems
, 1992
"... In this paper we present a simple and efficient method, called ARIES (Algorithm for Recovery and Isolation Exploiting Semantics), which supports partial rollbacks of transactions, fine-granularity (e.g., record) locking and recovery using write-ahead logging (WAL). We introduce the paradigm of repea ..."
Cited by 388 (10 self)
of features that are very important in building and operating an industrial-strength transaction processing system. ARIES supports fuzzy checkpoints, selective and deferred restart, fuzzy image copies, media recovery, and high concurrency lock modes (e.g., increment/decrement) which exploit the semantics
The LAM/MPI checkpoint/restart framework: System-initiated checkpointing
- in Proceedings, LACSI Symposium, Santa Fe
, 2003
"... As high-performance clusters continue to grow in size and popularity, issues of fault tolerance and reliability are becoming limiting factors on application scalability. To address these issues, we present the design and implementation of a system for providing coordinated checkpointing and rollback ..."
Cited by 109 (10 self)
and rollback recovery for MPI-based parallel applications. Our approach integrates the Berkeley Lab BLCR kernel-level process checkpoint system with the LAM implementation of MPI through a defined checkpoint/restart interface. Checkpointing is transparent to the application, allowing the system to be used
A higher order estimate of the optimum checkpoint interval for restart dumps
- Future Generation Computer Systems
, 2006
"... This paper examines methods of approximating the optimum checkpoint restart strategy for minimizing application run time on a system exhibiting Poisson single component failures. Two different models will be developed and compared. We will begin with a simplified cost function that yields a first-or ..."
Cited by 123 (5 self)
The design and implementation of Berkeley Lab's Linux Checkpoint/Restart
, 2003
"... Clusters of commodity computers running Linux are becoming an increasingly popular platform for high-performance ..."
Cited by 126 (4 self)
Berkeley Lab Checkpoint/Restart (BLCR) for Linux clusters
- In Proceedings of SciDAC 2006
, 2006
"... Abstract. This article describes the motivation, design and implementation of Berkeley Lab Checkpoint/Restart (BLCR), a system-level checkpoint/restart implementation for Linux clusters that targets the space of typical High Performance Computing applications, including MPI. Application-level soluti ..."
Cited by 77 (0 self)
Affinity-Aware Checkpoint Restart
"... Current checkpointing techniques employed to overcome faults for HPC applications result in inferior application performance after restart from a checkpoint for a number of applications. This is due to a lack of page and core affinity awareness of the checkpoint/restart (C/R) mechanism, i.e., appl ..."
Cited by 1 (0 self)
A Survey of Checkpoint/Restart Implementations
- Lawrence Berkeley National Laboratory, Tech
, 2002
"... In this paper we evaluate candidates for a checkpoint/restart implementation against a common set of requirements. Overall characteristics of the two main classes of checkpoint systems, library and system, are discussed followed by specific examples from existing systems. A detailed description of t ..."
Cited by 30 (1 self)
CRAK: Linux Checkpoint/Restart As a Kernel Module
, 2001
"... Process checkpoint/restart is a very useful technology for process migration, load balancing, crash recovery, rollback transaction, job controlling and many other purposes. Although process migration has not yet been widely used and is not widely available in commercial systems, the growing shift of co ..."
Cited by 47 (1 self)
The design and implementation of Zap: A system for migrating computing environments
- In Proceedings of the Fifth Symposium on Operating Systems Design and Implementation (OSDI 2002)
, 2002
"... We have created Zap, a novel system for transparent migration of legacy and networked applications. Zap provides a thin virtualization layer on top of the operating system that introduces pods, which are groups of processes that are provided a consistent, virtualized view of the system. This decoupl ..."
Cited by 233 (26 self)
This decouples processes in pods from dependencies to the host operating system and other processes on the system. By integrating Zap virtualization with a checkpoint-restart mechanism, Zap can migrate a pod of processes as a unit among machines running independent operating systems without leaving behind any