Starfish: Fault-Tolerant Dynamic MPI Programs on Clusters of Workstations (Extended Abstract)
Citations: 81 (6 self)
BibTeX
@MISC{Agbaria_starfish:fault-tolerant,
author = {Adnan M. Agbaria and others},
title = {Starfish: Fault-Tolerant Dynamic MPI Programs on Clusters of Workstations (Extended Abstract)},
year = {}
}
Abstract
This paper reports on the architecture and design of Starfish, an environment for executing dynamic (and static) MPI-2 programs on a cluster of workstations. Starfish is unique in being efficient, fault-tolerant, highly available, and dynamic as a system internally, while also supporting fault tolerance and dynamicity for its application programs. Starfish achieves these goals by combining group communication technology with checkpoint/restart, and it uses a novel architecture that is both flexible and portable and that keeps group communication outside the critical data path for maximum performance.







