@MISC{Chandra_checkpointingand,
    author = {Subhachandra Chandra and Peter M. Chen},
    title = {Checkpointing and the Fail-Stop Model},
    year = {}
}
Abstract
This paper explores the relationship between checkpointing and the fail-stop model of process crashes. Checkpointing affects the fail-stop nature of an application for two reasons. First, checkpointing saves the complete state of the process to stable storage, making it more likely that stable storage will contain corrupted data. Second, checkpointing saves data to stable storage more often to provide better failure semantics. This frequent saving of data limits the time that programs have available to detect errors and halt. We measure experimentally the fail-stop behavior of three large applications and find that a significant number of crashes violate the fail-stop model. We also measure how effective progressive retry is at recovering from fail-stop violations. We find that 30-80% of fail-stop violations need to roll back more than 10 checkpoints to recover successfully.

1. Introduction

As an ever-increasing number of applications depend on computers, the reliability of compute...