@MISC{Ziarek05stabilizers:a, author = {Lukasz Ziarek and Philip Schatz and Suresh Jagannathan}, title = {Stabilizers: A Safe Lightweight Checkpointing Abstraction for Concurrent Programs}, year = {2005} }
Abstract
A checkpoint is a mechanism that allows program execution to be restarted from a previously saved state. Checkpoints can be used in conjunction with exception handling abstractions to recover from exceptional or erroneous events, to support debugging or replay mechanisms, or to facilitate algorithms that rely on speculative evaluation. While checkpoints are relatively straightforward to describe in a sequential setting, for example through the capture and application of continuations, it is less clear how to ascribe a meaningful semantics to safe checkpoints in the presence of concurrency. For a thread to correctly resume execution from a saved checkpoint, it must ensure that all other threads that have witnessed its unwanted effects after the establishment of the checkpoint are also reverted to a meaningful earlier state. If this is not done, data inconsistencies and other undesirable behavior may result. However, automatically determining what constitutes a consistent global state is not straightforward, since thread interactions are a dynamic property of the program, and requiring applications to specify such states explicitly is not pragmatic. In this paper, we present a safe and efficient on-the-fly checkpointing mechanism for concurrent programs. We introduce a new linguistic abstraction called stabilizers that permits the specification of per-thread checkpoints and the restoration of globally consistent checkpoints. Global checkpoints are computed through lightweight monitoring of communication events among threads (e.g., message-passing operations or updates to shared variables). Our implementation results show that the memory and computation overheads for using stabilizers average roughly 4 to 6% on our benchmark suite, leading us to conclude that stabilizers are a viable mechanism for defining restorable state in concurrent programs.
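The idea of deriving a global checkpoint from monitored communication events can be illustrated with a small sketch. This is not the paper's implementation: the `Monitor` class, its method names, the logical clock, and the one-directional "receiver witnessed the sender's effect" rule are all assumptions made for illustration. Given per-thread checkpoints and a log of communications, the sketch computes which threads must be reverted, and to which checkpoint, when one thread rolls back.

```python
from collections import defaultdict


class Monitor:
    """Hypothetical toy monitor: logs checkpoints and communication events."""

    def __init__(self):
        self.clock = 0                         # global logical time (assumption)
        self.checkpoints = defaultdict(list)   # thread -> checkpoint times
        self.sends = defaultdict(list)         # sender -> [(time, receiver)]

    def _tick(self):
        self.clock += 1
        return self.clock

    def checkpoint(self, thread):
        """Record a per-thread checkpoint at the current logical time."""
        self.checkpoints[thread].append(self._tick())

    def communicate(self, sender, receiver):
        """Record that `receiver` witnessed an effect produced by `sender`."""
        self.sends[sender].append((self._tick(), receiver))

    def _latest_checkpoint_before(self, thread, time):
        earlier = [c for c in self.checkpoints[thread] if c < time]
        return max(earlier, default=0)         # 0 stands for "thread start"

    def rollback_plan(self, thread):
        """Map each thread that must be reverted to its target checkpoint time."""
        plan = {thread: self._latest_checkpoint_before(thread, self.clock + 1)}
        worklist = [thread]
        while worklist:
            u = worklist.pop()
            for time, receiver in self.sends[u]:
                if time <= plan[u]:
                    continue                   # effect predates u's checkpoint
                target = self._latest_checkpoint_before(receiver, time)
                if receiver not in plan or target < plan[receiver]:
                    plan[receiver] = target    # receiver must go back further
                    worklist.append(receiver)
        return plan


m = Monitor()
m.checkpoint("A")
m.checkpoint("B")
m.communicate("A", "B")   # B witnesses an effect of A after A's checkpoint
m.checkpoint("C")
m.communicate("B", "C")   # C transitively depends on A
m.checkpoint("D")         # D never communicates, so it is left alone
plan = m.rollback_plan("A")
```

Under these assumptions, rolling back A pulls in B and C but not D, which mirrors the requirement that every thread that witnessed unwanted effects is also reverted. A realistic mechanism would also have to restore thread state and handle synchronous communication, where both parties depend on each other.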