The LAM/MPI checkpoint/restart framework: System-initiated checkpointing (2003)

by Sriram Sankaran, Jeffrey M. Squyres, Brian Barrett, Andrew Lumsdaine
Venue: in Proceedings, LACSI Symposium, Santa Fe
Citations: 67 - 7 self

Documents Related by Co-Citation

85 FT-MPI: Fault Tolerant MPI, supporting dynamic applications in a dynamic world – Graham E. Fagg, Jack J. Dongarra - 2000
175 CoCheck: Checkpointing and Process Migration for MPI – Georg Stellner - 1996
94 MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes – George Bosilca, Aurelien Bouteiller, Franck Cappello, Samir Djilali, Gilles Fedak, Cecile Germain, Thomas Herault, Pierre Lemarinier, Oleg Lodygensky, Frederic Magniette, Vincent Neri, Anton Selikhov - 2002
474 A Survey of Rollback-Recovery Protocols in Message-Passing Systems – E. N. (Mootaz) Elnozahy, Lorenzo Alvisi, Yi-Min Wang, David B. Johnson - 1996
105 Checkpoint and migration of UNIX processes in the Condor distributed processing system – M Litzkow, T Tannenbaum, J Basney, M Livny - 1997
62 The design and implementation of Berkeley Lab’s Linux Checkpoint/Restart – Jason Duell - 2003
929 Distributed Snapshots: Determining Global States of Distributed Systems – K. Mani Chandy, Leslie Lamport - 1985
251 Libckpt: Transparent Checkpointing under Unix – James S. Plank, Micah Beck, Gerry Kingsley, Kai Li - 1995
119 Open MPI: Goals, concept, and design of a next generation MPI implementation – Edgar Gabriel, Graham E. Fagg, George Bosilca, Thara Angskun, Jack J. Dongarra, Jeffrey M. Squyres, Vishal Sahay, Prabhanjan Kambadur, Brian Barrett, Andrew Lumsdaine, Ralph H. Castain, David J. Daniel, Richard L. Graham, Timothy S. Woodall - 2004
651 A high-performance, portable implementation of the MPI message passing interface standard – William Gropp, Ewing Lusk, Nathan Doss, Anthony Skjellum - 1996
57 A Network-Failure-tolerant Message-Passing system for Terascale Clusters – Richard L. Graham, Sung-eun Choi, David J. Daniel, Nehal N. Desai, Ronald G. Minnich, Craig E. Rasmussen, L. Dean Risinger, Mitchel W. Sukalski - 2003
84 Starfish: Fault-Tolerant Dynamic MPI Programs on Clusters of Workstations (Extended Abstract) – Adnan M. Agbaria, et al.
185 LAM: an open cluster environment for MPI – G Burns, R Daoud, J Vaigl - 1994
63 A Component Architecture for LAM/MPI – Jeffrey M. Squyres, Andrew Lumsdaine - 2003
181 Manetho: Transparent Rollback-Recovery with Low Overhead, Limited Rollback and Fast Output Commit – Elmootazbellah N. Elnozahy, Willy Zwaenepoel - 1992
67 Automated Application-level Checkpointing of MPI Programs – Greg Bronevetsky, Daniel Marques, Keshav Pingali, Paul Stodghill - 2003
18 Architecture of LA-MPI, a network-fault-tolerant MPI – Rob T. Aulwes, David J. Daniel, Nehal N. Desai, Richard L. Graham, L. Dean Risinger, Mark A. Taylor, Timothy S. Woodall - 2004
30 HARNESS and fault tolerant MPI – G E Fagg, A Bukovsky, J J Dongarra - 2001
34 Egida: An extensible toolkit for low-overhead fault-tolerance – Sriram Rao, Lorenzo Alvisi, Harrick M. Vin - 1999