The LAM/MPI checkpoint/restart framework: System-initiated checkpointing (2003)

by Sriram Sankaran, Jeffrey M. Squyres, Brian Barrett, Andrew Lumsdaine
Venue: in Proceedings, LACSI Symposium, Santa Fe
Citations: 84 (8 self)

Active Bibliography

1 Parallel Checkpoint/Restart for MPI Applications – Sriram Sankaran, Jeffrey M. Squyres, Brian Barrett, Andrew Lumsdaine
3 Time-based coordinated checkpointing – Nuno F. Neves - 1998
14 Blocking vs. non-blocking coordinated checkpointing for large-scale fault tolerant MPI – Camille Coti, Thomas Herault, Pierre Lemarinier, Ala Rezmerita, Eric Rodriguez - 2006
1 Egida: A Toolkit for Low-overhead Fault-tolerance – Sriram S. Rao (supervisors: Lorenzo Alvisi, Harrick M. Vin) - 1999
A Preemption-Based Meta-Scheduling System for Distributed Computing – Sathish Vadhiyar - 2003
6 Interconnect agnostic checkpoint/restart in Open MPI – Joshua Hursey, Timothy I. Mattox, Andrew Lumsdaine - 2009
7 MPICH-V Project: a Multiprotocol Automatic Fault Tolerant MPI – Aurelien Bouteiller, Franck Cappello, Thomas Herault, Geraud Krawezik, Pierre Lemarinier, Frederic Magniette
21 Network Multicomputing Using Recoverable Distributed Shared Memory – John B. Carter, Alan L. Cox, Sandhya Dwarkadas, Elmootazbellah N. Elnozahy, Pete Keleher, David B. Johnson, Steven Rodrigues, Weimin Yu, Willy Zwaenepoel - 1993
18 RENEW: A tool for fast and efficient implementation of checkpoint protocols – Nuno Neves, W. Kent Fuchs - 1998
47 Checkpointing for peta-scale systems: A look into the future of practical rollback-recovery – Elmootazbellah N. Elnozahy, James S. Plank - 2004
1 Fault Manager for Distributed Operating Environments Design, Implementation, and Performance – Pierre Sens, Bertil Folliot - 1998
System Support for Checkpoint and Restart of Charm++ and AMPI Applications – Chao Huang - 2004
5 Recent Advances in Checkpoint/Recovery Systems – Greg Bronevetsky, Rohit Fern, Daniel Marques, Keshav Pingali, Paul Stodghill
Untitled entry (adviser: Dr. D.K. Panda) – Karthik Gopalakrishnan
23 Application-transparent checkpoint/restart for MPI programs over InfiniBand – Qi Gao, Weikuan Yu, Wei Huang, Dhabaleswar K. Panda - 2006
25 The design and implementation of checkpoint/restart process fault tolerance for Open MPI – Joshua Hursey, Jeffrey M. Squyres, Timothy I. Mattox, Andrew Lumsdaine - 2007
40 An Overview of Checkpointing in Uniprocessor and Distributed Systems, Focusing on Implementation and Performance – James Plank - 1997
33 Processor Allocation and Checkpoint Interval Selection in Cluster Computing Systems – James S. Plank, Michael G. Thomason - 2001
A Checkpointing Protocol Based on a Minimal Characterization of the "No-Z-Cycle" Property – Francesco Quaglia, Roberto Baldoni, Bruno Ciciani - 1999