Parallel Checkpoint/Restart for MPI Applications

by Sriram Sankaran, Jeffrey M. Squyres, Brian Barrett, Andrew Lumsdaine
Citations: 1 - 0 self

Active Bibliography

84 The LAM/MPI checkpoint/restart framework: System-initiated checkpointing – Sriram Sankaran, Jeffrey M. Squyres, Brian Barrett, Andrew Lumsdaine - 2003
3 Time-based coordinated checkpointing – Nuno F. Neves - 1998
1 Egida: A Toolkit for Low-overhead Fault-tolerance – Sriram S. Rao (supervisors: Lorenzo Alvisi, Harrick M. Vin) - 1999
A Preemption-Based Meta-Scheduling System for Distributed Computing – Sathish Vadhiyar - 2003
14 Blocking vs. non-blocking coordinated checkpointing for large-scale fault tolerant MPI – Camille Coti, Thomas Herault, Pierre Lemarinier, Ala Rezmerita, Eric Rodriguez - 2006
23 Application-transparent checkpoint/restart for MPI programs over InfiniBand – Qi Gao, Weikuan Yu, Wei Huang, Dhabaleswar K. Panda - 2006
7 MPICH-V Project: a Multiprotocol Automatic Fault Tolerant MPI – Aurelien Bouteiller, Franck Cappello, Thomas Herault, Geraud Krawezik, Pierre Lemarinier, Frederic Magniette
21 Network Multicomputing Using Recoverable Distributed Shared Memory – John B. Carter, Alan L. Cox, Sandhya Dwarkadas, Elmootazbellah N. Elnozahy, Pete Keleher, David B. Johnson, Steven Rodrigues, Weimin Yu, Willy Zwaenepoel - 1993
6 Interconnect agnostic checkpoint/restart in Open MPI – Joshua Hursey, Timothy I. Mattox, Andrew Lumsdaine - 2009
18 RENEW: A tool for fast and efficient implementation of checkpoint protocols – Nuno Neves, W. Kent Fuchs - 1998
47 Checkpointing for peta-scale systems: A look into the future of practical rollback-recovery – Elmootazbellah N. Elnozahy, James S. Plank - 2004
197 The Performance of Consistent Checkpointing – Elmootazbellah Nabil Elnozahy, David B. Johnson, Willy Zwaenepoel - 1992
1 Fault Manager for Distributed Operating Environments Design, Implementation, and Performance – Pierre Sens, Bertil Folliot - 1998
52 On the Use and Implementation of Message Logging – Elmootazbellah Elnozahy, Willy Zwaenepoel - 1994
61 Lazy Checkpoint Coordination for Bounding Rollback Propagation – Yi-min Wang, W. Kent Fuchs - 1993
A Review of Fault Tolerant Checkpointing Protocols for Mobile Computing Systems – Rachit Garg, Praveen Kumar
A Review of Checkpointing Fault Tolerance Techniques in Distributed Mobile Systems – Rachit Garg, Praveen Kumar
Backward error recovery . . . – Sunil Kumar Gupta, R. K. Chauhan, Parveen Kumar
System Support for Checkpoint and Restart of Charm++ and AMPI Applications – Chao Huang - 2004