The LAM/MPI checkpoint/restart framework: System-initiated checkpointing (2003)

by Sriram Sankaran, Jeffrey M. Squyres, Brian Barrett, Andrew Lumsdaine
Venue: in Proceedings, LACSI Symposium, Santa Fe
Citations: 67 - 7 self

Active Bibliography

1 Parallel Checkpoint/Restart for MPI Applications – Sriram Sankaran, Jeffrey M. Squyres, Brian Barrett, Andrew Lumsdaine
8 Blocking vs. non-blocking coordinated checkpointing for large-scale fault tolerant MPI – Camille Coti, Thomas Herault, Pierre Lemarinier, Ala Rezmerita, Eric Rodriguez - 2006
2 Time-based coordinated checkpointing – Nuno F. Neves - 1998
1 Egida: A Toolkit for Low-overhead Fault-tolerance – Sriram S. Rao (Ph.D.; supervisors Lorenzo Alvisi, Harrick M. Vin) - 1999
A Preemption-Based Meta-Scheduling System for Distributed Computing – Sathish Vadhiyar - 2003
3 Interconnect agnostic checkpoint/restart in Open MPI – Joshua Hursey, Timothy I. Mattox, Andrew Lumsdaine - 2009
5 MPICH-V Project: a Multiprotocol Automatic Fault Tolerant MPI – Aurelien Bouteiller, Franck Cappello, Thomas Herault, Geraud Krawezik, Pierre Lemarinier, Frederic Magniette
20 Network Multicomputing Using Recoverable Distributed Shared Memory – John B. Carter, Alan L. Cox, Sandhya Dwarkadas, Elmootazbellah N. Elnozahy, Pete Keleher, David B. Johnson, Steven Rodrigues, Weimin Yu, Willy Zwaenepoel - 1993
17 RENEW: A tool for fast and efficient implementation of checkpoint protocols – Nuno Neves, W. Kent Fuchs - 1998
28 Checkpointing for peta-scale systems: A look into the future of practical rollback-recovery – Elmootazbellah N. Elnozahy, James S. Plank - 2004
2 Recent Advances in Checkpoint/Recovery Systems – Greg Bronevetsky, Rohit Fern, Daniel Marques, Keshav Pingali, Paul Stodghill
1 Fault Manager for Distributed Operating Environments Design, Implementation, and Performance – Pierre Sens, Bertil Folliot - 1998
Dr. D.K. Panda, Adviser – Karthik Gopalakrishnan B. E, Karthik Gopalakrishnan
18 Application-transparent checkpoint/restart for MPI programs over InfiniBand – Qi Gao, Weikuan Yu, Wei Huang, Dhabaleswar K. Panda - 2006
14 The design and implementation of checkpoint/restart process fault tolerance for Open MPI – Joshua Hursey, Jeffrey M. Squyres, Timothy I. Mattox, Andrew Lumsdaine - 2007
38 An Overview of Checkpointing in Uniprocessor and Distributed Systems, Focusing on Implementation and Performance – James Plank - 1997
22 Processor Allocation and Checkpoint Interval Selection in Cluster Computing Systems – James S. Plank, Michael G. Thomason - 2001
A Framework for High Availability Based on a Single System Image – Geoffroy Vallée, Christine Morin, Stephen L. Scott, Projet Paris