Starfish: Fault-Tolerant Dynamic MPI Programs on Clusters of Workstations (Extended Abstract)

by Adnan M. Agbaria, et al.
Citations: 87 - 6 self

Active Bibliography

3 Time-based coordinated checkpointing – Nuno F. Neves - 1998
24 Transparent Adaptive Parallelism on NOWs using OpenMP – Alex Scherer, Honghui Lu, Thomas Gross, Willy Zwaenepoel - 1999
47 Checkpointing for peta-scale systems: A look into the future of practical rollback-recovery – Elmootazbellah N. Elnozahy, James S. Plank - 2004
1 Parallel Checkpoint/Restart for MPI Applications – Sriram Sankaran, Jeffrey M. Squyres, Brian Barrett, Andrew Lumsdaine
86 The LAM/MPI checkpoint/restart framework: System-initiated checkpointing – Sriram Sankaran, Jeffrey M. Squyres, Brian Barrett, Andrew Lumsdaine - 2003
2 Overcoming Byzantine Failures Using Checkpointing – Adnan Agbaria, Roy Friedman
52 On the Use and Implementation of Message Logging – Elmootazbellah Elnozahy, Willy Zwaenepoel - 1994
7 Declarative failure recovery for sensor networks – Ramakrishna Gummadi, Nupur Kothari
5 Distributed Snapshots for Mobile Computing Systems – Adnan Agbaria, William H. Sanders
3 Application-Driven Coordination-Free Distributed Checkpointing – Adnan Agbaria, William H. Sanders
15 Middleware support for distributed multimedia and collaborative computing – Kenneth P. Birman - 1998
11 Transparent Fault-Tolerant Java Virtual Machine – Roy Friedman, Alon Kama - 2003
4 Transparent Process Rollback Recovery: Some New Techniques And A Portable Implementation – Ernest Lloyd Ellenberger, Nitin H. Vaidya, Jennifer L. Welch, Richard A. Volz - 1995
4 Virtual Machine Based Heterogeneous Checkpointing – Adnan Agbaria, Roy Friedman - 2002
A Preemption-Based Meta-Scheduling System for Distributed Computing – Sathish Vadhiyar - 2003
12 Survey of Fault-Tolerance and Fault-Recovery Techniques in Parallel Systems – Michael Treaster - 2005
25 Improving the Performance of Coordinated Checkpointers on Networks of Workstations using RAID Techniques – James S. Plank - 1996
38 An Overview of Checkpointing in Uniprocessor and Distributed Systems, Focusing on Implementation and Performance – James Plank - 1997
A Checkpointing Protocol Based on a Minimal Characterization of the "No-Z-Cycle" Property – Francesco Quaglia, Roberto Baldoni, Bruno Ciciani - 1999