Efficient and Flexible Fault Tolerance and Migration of Scientific Simulations Using CUMULVS (1998)
| Venue: | 2nd SIGMETRICS Symposium on Parallel and Distributed Tools |
| Citations: | 12 - 2 self |
BibTeX
@INPROCEEDINGS{Kohl98efficientand,
author = {James Arthur Kohl and Philip M. Papadopoulos},
title = {Efficient and Flexible Fault Tolerance and Migration of Scientific Simulations Using {CUMULVS}},
booktitle = {2nd SIGMETRICS Symposium on Parallel and Distributed Tools},
year = {1998},
pages = {60--71},
publisher = {ACM Press}
}
Abstract
Many practical scientific computer applications would benefit from a simple checkpointing mechanism that provides automatic restart or recovery in response to faults and failures, and enables dynamic load balancing and improved resource utilization using task migration. However, developing applications with such capabilities, especially in distributed, heterogeneous operating environments, is very challenging. CUMULVS is a middleware infrastructure for interacting with parallel scientific simulation programs and supports online visualization and computational steering. Using semantic information provided by user-level specifications of selected program variables, CUMULVS interprets distributed data decompositions across heterogeneous collections of computing resources. It extracts and assembles subsets of local decomposed application data to form global views of the data. The base CUMULVS system has been extended to provide a user-level mechanism that assists in the collection of check...