Transparent Checkpoint-Restart of Distributed Applications on Commodity Clusters (2005)
Citations: 4 (1 self)
BibTeX
@MISC{Laadan05transparentcheckpoint-restart,
author = {Oren Laadan and others},
title = {Transparent Checkpoint-Restart of Distributed Applications on Commodity Clusters},
year = {2005}
}
Abstract
We have created ZapC, a novel system for transparent coordinated checkpoint-restart of distributed network applications on commodity clusters. ZapC provides a thin virtualization layer on top of the operating system that decouples a distributed application from dependencies on the cluster nodes on which it is executing. This decoupling enables ZapC to checkpoint an entire distributed application across all nodes in a coordinated manner such that it can be restarted from the checkpoint on a different set of cluster nodes at a later time. ZapC checkpoint-restart operations execute in parallel across different cluster nodes, providing faster checkpoint-restart performance. ZapC uniquely supports network state in a transport-protocol-independent manner, including correctly saving and restoring socket and protocol state for both TCP and UDP connections. We have implemented a ZapC Linux prototype and demonstrate that it provides low virtualization overhead and fast checkpoint-restart times for distributed network applications without any application, library, kernel, or network protocol modifications.
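The idea of a coordinated checkpoint that runs in parallel across nodes and can be restarted on a different set of nodes can be illustrated with a small toy model. This is only a sketch of the general technique, not ZapC's actual protocol: the `Node` class, the `network_blocked` flag, the two barrier waits, and the function names are all hypothetical, and threads in one process stand in for cluster nodes.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class Node:
    """Hypothetical stand-in for one cluster node running part of the application."""

    def __init__(self, name, state):
        self.name = name
        self.state = dict(state)
        self.network_blocked = False

    def checkpoint(self, barrier):
        # Assumption: each node pauses its network activity before saving.
        self.network_blocked = True
        # Wait until every node has paused, so all images describe the same moment.
        barrier.wait()
        # Save a node-independent copy of the local state.
        image = {"state": dict(self.state)}
        # Wait until every node has saved, then resume together.
        barrier.wait()
        self.network_blocked = False
        return image


def coordinated_checkpoint(nodes):
    """Checkpoint all nodes in parallel; returns one image per node, in order."""
    barrier = threading.Barrier(len(nodes))
    # One worker per node, so every node can reach the barrier at the same time.
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        return list(pool.map(lambda n: n.checkpoint(barrier), nodes))


def restart(images, new_names):
    """Recreate the application on a (possibly different) set of nodes."""
    return [Node(name, img["state"]) for img, name in zip(images, new_names)]
```

For example, checkpointing three nodes named `a`, `b`, `c` and calling `restart(images, ["d", "e", "f"])` yields three new `Node` objects that carry the saved state under the new names. The images deliberately omit the node name, mirroring the abstract's point that the application is decoupled from the nodes it runs on.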







