|
175
|
CoCheck: Checkpointing and Process Migration for MPI
– Georg Stellner
- 1996
|
|
474
|
A Survey of Rollback-Recovery Protocols in Message-Passing Systems
– E. N. (Mootaz) Elnozahy, Lorenzo Alvisi, Yi-min Wang, David B. Johnson
- 1996
|
|
60
|
CLIP: A Checkpointing Tool for Message-Passing Parallel Programs
– James S. Plank, Yuqun Chen, Kai Li
- 1997
|
|
105
|
Checkpoint and migration of UNIX processes in the Condor distributed processing system
– M Litzkow, T Tannenbaum, J Basney, M Livny
- 1997
|
|
34
|
Egida: An extensible toolkit for low-overhead fault-tolerance
– Sriram Rao, Lorenzo Alvisi, Harrick M. Vin
- 1999
|
|
85
|
FT-MPI: Fault Tolerant MPI, supporting dynamic applications in a dynamic world
– Graham E. Fagg, Jack J. Dongarra
- 2000
|
|
651
|
A high-performance, portable implementation of the MPI message passing interface standard
– William Gropp, Ewing Lusk, Nathan Doss, Anthony Skjellum
- 1996
|
|
181
|
Manetho: Transparent Rollback-Recovery with Low Overhead, Limited Rollback and Fast Output Commit
– Elmootazbellah N. Elnozahy, Willy Zwaenepoel
- 1992
|
|
251
|
Libckpt: Transparent Checkpointing under Unix
– James S. Plank, Micah Beck, Gerry Kingsley, Kai Li
- 1995
|
|
57
|
A Network-Failure-tolerant Message-Passing system for Terascale Clusters
– Richard L. Graham, Sung-eun Choi, David J. Daniel, Nehal N. Desai, Ronald G. Minnich, Craig E. Rasmussen, L. Dean Risinger, Mitchel W. Sukalski
- 2003
|
|
94
|
MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes
– George Bosilca, Aurelien Bouteiller, Franck Cappello, Samir Djailali, Gilles Fedak, Cecile Germain, Thomas Herault, Pierre Lemarinier, Oleg Lodygensky, Frederic Magniette, Vincent Neri, Anton Selikhov
- 2002
|
|
929
|
Distributed Snapshots: Determining Global States of Distributed Systems
– K. Mani Chandy, Leslie Lamport
- 1985
|
|
18
|
MPI/FT™: Architecture and taxonomies for fault-tolerant, message-passing middleware for performance-portable parallel computing
– Rajanikanth Batchu, Jothi P. Neelamegam, Zhenqian Cui, Murali Beddhu, Anthony Skjellum, Yoginder D
- 2001
|
|
284
|
Optimistic recovery in distributed systems
– Robert E. Strom, Shaula Yemini
- 1985
|
|
48
|
Application Level Fault Tolerance in Heterogeneous Networks of Workstations
– Adam Beguelin, Erik Seligman, Peter Stephan
- 1997
|
|
65
|
The Condor distributed processing system
– T Tannenbaum, M Litzkow
- 1995
|
|
51
|
Managing Checkpoints for Parallel Programs
– Jim Pruyne, Miron Livny
|
|
41
|
HARNESS: A Next Generation Distributed Virtual Machine
– Micah Beck, Jack J. Dongarra, Graham E. Fagg, G. Al Geist, Paul Gray, James Kohl, Mauro Migliardi, Keith Moore, Terry Moore, Philip Papadopoulos, Stephen L. Scott, Vaidy Sunderam
- 1998
|
|
30
|
HARNESS and fault tolerant MPI
– G E Fagg, A Bukovsky, J J Dongarra
- 2001
|