Starfish: Fault-Tolerant Dynamic MPI Programs on Clusters of Workstations (Extended Abstract)

by Adnan M. Agbaria, et al.
Citations: 81 - 6 self

Documents Related by Co-Citation

175 CoCheck: Checkpointing and Process Migration for MPI – Georg Stellner - 1996
474 A Survey of Rollback-Recovery Protocols in Message-Passing Systems – E. N. (Mootaz) Elnozahy, Lorenzo Alvisi, Yi-min Wang, David B. Johnson - 1996
60 CLIP: A Checkpointing Tool for Message-Passing Parallel Programs – James S. Plank, Yuqun Chen, Kai Li - 1997
105 Checkpoint and migration of UNIX processes in the condor distributed processing system – M Litzkow, T Tannenbaum, J Basney, M Livny - 1997
34 Egida: An extensible toolkit for low-overhead fault-tolerance – Sriram Rao, Lorenzo Alvisi, Harrick M. Vin - 1999
85 FT-MPI: Fault Tolerant MPI, supporting dynamic applications in a dynamic world – Graham E. Fagg, Jack J. Dongarra - 2000
651 A high-performance, portable implementation of the MPI message passing interface standard – William Gropp, Ewing Lusk, Nathan Doss, Anthony Skjellum - 1996
181 Manetho: Transparent Rollback-Recovery with Low Overhead, Limited Rollback and Fast Output Commit – Elmootazbellah N. Elnozahy, Willy Zwaenepoel - 1992
251 Libckpt: Transparent Checkpointing under Unix – James S. Plank, Micah Beck, Gerry Kingsley, Kai Li - 1995
57 A Network-Failure-tolerant Message-Passing system for Terascale Clusters – Richard L. Graham, Sung-eun Choi, David J. Daniel, Nehal N. Desai, Ronald G. Minnich, Craig E. Rasmussen, L. Dean Risinger, Mitchel W. Sukalski - 2003
94 MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes – George Bosilca, Aurelien Bouteiller, Franck Cappello, Samir Djilali, Gilles Fedak, Cecile Germain, Thomas Herault, Pierre Lemarinier, Oleg Lodygensky, Frederic Magniette, Vincent Neri, Anton Selikhov - 2002
929 Distributed Snapshots: Determining Global States of Distributed Systems – K. Mani Chandy, Leslie Lamport - 1985
18 MPI/FT™: Architecture and taxonomies for fault-tolerant, message-passing middleware for performance-portable parallel computing – Rajanikanth Batchu, Jothi P. Neelamegam, Zhenqian Cui, Murali Beddhu, Anthony Skjellum, Yoginder D - 2001
284 Optimistic recovery in distributed systems – Robert E. Strom, Shaula Yemini - 1985
48 Application Level Fault Tolerance in Heterogeneous Networks of Workstations – Adam Beguelin, Erik Seligman, Peter Stephan - 1997
65 The condor distributed processing system – T Tannenbaum, M Litzkow - 1995
51 Managing Checkpoints for Parallel Programs – Jim Pruyne, Miron Livny
41 HARNESS: A Next Generation Distributed Virtual Machine – Micah Beck, Jack J. Dongarra, Graham E. Fagg, G. Al Geist, Paul Gray, James Kohl, Mauro Migliardi, Keith Moore, Terry Moore, Philip Papadopoulos, Stephen L. Scott, Vaidy Sunderam - 1998
30 HARNESS and fault tolerant MPI – G E Fagg, A Bukovsky, J J Dongarra - 2001