Results 1 - 10 of 521

Checkpointing and Rollback-Recovery for Distributed Systems

by Richard Koo, Sam Toueg - IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL , 1987
"... We consider the problem of bringing a distributed system to a consistent state after transient failures. We address the two components of this problem by describing a distributed algorithm to create consistent checkpoints, as well as a rollback-recovery algorithm to recover the system to a consiste ..."
Cited by 366 (0 self)
consistent state. In contrast to previous algorithms, they tolerate failures that occur during their executions. Furthermore, when a process takes a checkpoint, a minimal number of additional processes are forced to take checkpoints. Similarly, when a process rolls back and restarts after a failure, a

ARIES: A transaction recovery method supporting fine-granularity locking and partial rollbacks using write-ahead logging

by C. Mohan, Don Haderle, Bruce Lindsay, Hamid Pirahesh, Peter Schwarz - ACM Transactions on Database Systems , 1992
"... In this paper we present a simple and efficient method, called ARIES (Algorithm for Recovery and Isolation Exploiting Semantics), which supports partial rollbacks of transactions, fine-granularity (e.g., record) locking and recovery using write-ahead logging (WAL). We introduce the paradigm of repea ..."
Cited by 388 (10 self)
of features that are very important in building and operating an industrial-strength transaction processing system. ARIES supports fuzzy checkpoints, selective and deferred restart, fuzzy image copies, media recovery, and high concurrency lock modes (e.g., increment/decrement) which exploit the semantics

The LAM/MPI checkpoint/restart framework: System-initiated checkpointing

by Sriram Sankaran, Jeffrey M. Squyres, Brian Barrett, Andrew Lumsdaine - in Proceedings, LACSI Symposium, Santa Fe, 2003
"... As high-performance clusters continue to grow in size and popularity, issues of fault tolerance and reliability are becoming limiting factors on application scalability. To address these issues, we present the design and implementation of a system for providing coordinated checkpointing and rollback ..."
Cited by 109 (10 self)
and rollback recovery for MPI-based parallel applications. Our approach integrates the Berkeley Lab BLCR kernel-level process checkpoint system with the LAM implementation of MPI through a defined checkpoint/restart interface. Checkpointing is transparent to the application, allowing the system to be used

A higher order estimate of the optimum checkpoint interval for restart dumps

by J. T. Daly - Future Generation Computer Systems , 2006
"... This paper examines methods of approximating the optimum checkpoint restart strategy for minimizing application run time on a system exhibiting Poisson single component failures. Two different models will be developed and compared. We will begin with a simplified cost function that yields a first-or ..."
Cited by 123 (5 self)

The design and implementation of Berkeley Lab’s Linux Checkpoint/Restart

by Jason Duell , 2003
"... Clusters of commodity computers running Linux are becoming an increasingly popular platform for high-performance ..."
Cited by 126 (4 self)

Berkeley Lab Checkpoint/Restart (BLCR) for Linux Clusters

by Paul H Hargrove, Jason C Duell - in Proceedings of SciDAC 2006, 2006
"... This article describes the motivation, design and implementation of Berkeley Lab Checkpoint/Restart (BLCR), a system-level checkpoint/restart implementation for Linux clusters that targets the space of typical High Performance Computing applications, including MPI. Application-level soluti ..."
Cited by 77 (0 self)

Affinity-Aware Checkpoint Restart

by Ajay Saini, Arash Rezaei, Frank Mueller, Paul Hargrove, Eric Roman
"... Current checkpointing techniques employed to overcome faults for HPC applications result in inferior application performance after restart from a checkpoint for a number of applications. This is due to a lack of page and core affinity awareness of the checkpoint/restart (C/R) mechanism, i.e., appl ..."
Cited by 1 (0 self)

A Survey of Checkpoint/Restart Implementations

by Eric Roman - Lawrence Berkeley National Laboratory, Tech , 2002
"... In this paper we evaluate candidates for a checkpoint/restart implementation against a common set of requirements. Overall characteristics of the two main classes of checkpoint systems, library and system, are discussed followed by specific examples from existing systems. A detailed description of t ..."
Cited by 30 (1 self)

CRAK: Linux Checkpoint/Restart As a Kernel Module

by Hua Zhong, Jason Nieh , 2001
"... Process checkpoint/restart is a very useful technology for process migration, load balancing, crash recovery, rollback transaction, job controlling and many other purposes. Although process migration has not yet been widely used and is not widely available in commercial systems, the growing shift of co ..."
Cited by 47 (1 self)

The design and implementation of Zap: A system for migrating computing environments

by Steven Osman, Dinesh Subhraveti, Gong Su, Jason Nieh - In Proceedings of the Fifth Symposium on Operating Systems Design and Implementation (OSDI 2002), 2002
"... We have created Zap, a novel system for transparent migration of legacy and networked applications. Zap provides a thin virtualization layer on top of the operating system that introduces pods, which are groups of processes that are provided a consistent, virtualized view of the system. This decoupl ..."
Cited by 233 (26 self)
This decouples processes in pods from dependencies to the host operating system and other processes on the system. By integrating Zap virtualization with a checkpoint-restart mechanism, Zap can migrate a pod of processes as a unit among machines running independent operating systems without leaving behind any