The design and implementation of checkpoint/restart process fault tolerance for Open MPI (2007)

by J Hursey, J M Squyres, T I Mattox, A Lumsdaine
Venue: In Proc. of Intl. Parallel and Distributed Processing Symposium