Results 1 - 8 of 8
Non-Blocking Checkpointing for Optimistic Parallel Simulation Description And . . .
- IEEE Transactions on Parallel and Distributed Systems, 2003
Cited by 11 (3 self)
This paper describes a non-blocking checkpointing mode in support of optimistic parallel discrete event simulation. This mode allows real concurrency between state saving and other simulation-specific operations (e.g. event list update, event execution), with the aim of removing the cost of recording state information from the completion time of the parallel simulation application. We present an implementation of a C library supporting non-blocking checkpointing on a Myrinet-based cluster, which demonstrates the practical viability of this checkpointing mode on standard off-the-shelf hardware. Through an empirical study on classical parameterized synthetic benchmarks, we show that, except for applications with minimal state granularity, non-blocking checkpointing improves the speed of parallel execution compared with commonly adopted, optimized checkpointing methods based on the classical blocking mode. A performance study of a Personal Communication System (PCS) simulation is also reported to point out the benefits of non-blocking checkpointing for a real-world application.
Fast-Software-Checkpointing in Optimistic Simulation: Embedding State Saving into the Event Routine Instructions
- Proc. 13th Workshop on Parallel and Distributed Simulation (PADS’99), 1999
Cited by 6 (2 self)
In this paper we present a software approach, namely Fast-Software-Checkpointing (FSC), to reduce the running time of the state saving protocol in optimistic parallel discrete event simulation. The idea behind FSC is to use the instructions performed during the execution of an event as part of the state saving protocol, so that the total number of instructions due to state saving is reduced. Under FSC, saving the state of a logical process prior to the execution of an event e takes time proportional to the number of state variables not updated by e's execution, as only these variables must be copied. This means that FSC is in some sense dual to incremental state saving. We show, however, that there is a basic difference between the two solutions: in FSC some of the state saving instructions are actually event routine instructions, while in incremental state saving they are only added to and interleaved with the latter. We also present a simple sof...
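The copy-only-what-the-event-leaves-untouched idea in the abstract above can be illustrated with a minimal sketch. This is one reading of the abstract, not the paper's implementation: the function name `execute_event_fsc`, the two-buffer layout, and the convention that an event returns a dictionary of the variables it updates are all assumptions made for illustration.

```python
def execute_event_fsc(old_state, event):
    """Run one event and build the new state buffer.

    Assumption: `old_state` is left untouched and therefore serves as the
    checkpoint; `event` reads it and returns only the variables it updates.
    """
    # The event's own writes populate the new buffer directly, so these
    # variables need no separate state-saving copy.
    updates = event(old_state)
    new_state = dict(updates)

    # Only variables NOT updated by the event have to be copied explicitly.
    copied = 0
    for name, value in old_state.items():
        if name not in new_state:
            new_state[name] = value
            copied += 1
    return new_state, copied


# Hypothetical logical-process state with three variables.
checkpoint = {"queue_len": 3, "busy": True, "served": 10}
state, copied = execute_event_fsc(
    checkpoint, lambda s: {"served": s["served"] + 1}
)
```

In this sketch the number of explicit copies equals the number of state variables the event did not update, which matches the proportionality stated in the abstract.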
Optimistic Distributed Simulation Based on Transitive Dependency Tracking
- Proc. 11th Workshop on Parallel and Distributed Simulation, 1997
Cited by 5 (1 self)
In traditional optimistic distributed simulation protocols, a logical process (LP) receiving a straggler rolls back and sends out anti-messages. The receiver of an anti-message may also roll back and send out more anti-messages. So a single straggler may result in a large number of anti-messages and multiple rollbacks of some LPs. In our protocol, an LP receiving a straggler broadcasts its rollback. On receiving this announcement, other LPs may roll back, but they do not announce their rollbacks. So each LP rolls back at most once in response to each straggler. Anti-messages are not used. This eliminates the need for output queues and results in simple memory management. It also eliminates the problem of cascading rollbacks and echoing, and results in faster simulation. All this is achieved by a scheme for maintaining transitive dependency information. The cost incurred includes the tagging of each message with extra dependency information and the increased processing time upon receiving a me...
Transparent state management for optimistic synchronization in the High Level Architecture
- In Proceedings of the 19th Workshop on Principles of Advanced and Distributed Simulation, 2005
Probabilistic Checkpointing in Time Warp Parallel Simulation
- Proceedings of the 8th International Symposium on Modelling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS 2000)
Cited by 2 (1 self)
In the Time Warp (TW) protocol, the system state must be checkpointed to facilitate the rollback operation. While increasing the checkpointing frequency increases the state saving cost, infrequent checkpointing increases the coast forward effort, because a large number of executed events must be redone. This paper proposes a probabilistic approach to checkpointing. We derive the rollback probability and compute the expected coast forward effort if a state is not saved. To reduce implementation overheads, the rollback probability and coast forward cost are predetermined and made available at runtime as a lookup table. Based on the derived expectation, a state vector is saved only if the expected coast forward effort is larger than the state saving cost. Our experiments show that the cost model reduces the simulation elapsed time by close to 30% compared with both saving the system state after each event execution and saving the system state at a predefined interval.
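The decision rule in the abstract above (save the state only if the expected coast forward effort exceeds the state saving cost) can be sketched as follows. All names and numbers are hypothetical. In particular, the expected effort is modelled here simply as rollback probability times the cost of re-executing the events since the last checkpoint, and the lookup table values are invented; the paper derives its own quantities.

```python
# Hypothetical precomputed lookup table: events executed since the last
# checkpoint -> assumed probability that a rollback will require them.
ROLLBACK_PROB = {1: 0.05, 2: 0.10, 3: 0.20, 4: 0.35}


def should_checkpoint(events_since_checkpoint, event_cost, state_saving_cost):
    """Return True if saving the state now is expected to pay off."""
    # Fall back to the largest tabulated probability beyond the table range
    # (an assumption of this sketch).
    p = ROLLBACK_PROB.get(events_since_checkpoint, max(ROLLBACK_PROB.values()))

    # Assumed model: on rollback, every event since the last checkpoint
    # must be re-executed during coast forward.
    expected_coast_forward = p * events_since_checkpoint * event_cost

    # Rule from the abstract: save only if the expected coast forward
    # effort is larger than the state saving cost.
    return expected_coast_forward > state_saving_cost


early = should_checkpoint(1, event_cost=10.0, state_saving_cost=5.0)
late = should_checkpoint(4, event_cost=10.0, state_saving_cost=5.0)
```

Keeping the probabilities in a table means the runtime decision is a lookup and one comparison, which is in line with the abstract's stated goal of reducing implementation overheads.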
Modeling and Optimization of Non-Blocking Checkpointing for Optimistic Simulation on Myrinet Clusters
- Journal of Parallel and Distributed Computing, 2005
Cited by 1 (1 self)
Checkpointing-and-Communication Library (CCL) is a recently developed software library which implements CPU-offloaded, non-blocking checkpointing functionalities in support of optimistic parallel simulation on Myrinet clusters. This is achieved by exploiting data transfer capabilities provided by a programmable DMA engine on board Myrinet network cards. Re-synchronization between CPU and DMA activities must sometimes be employed for several reasons, such as the maintenance of data consistency, thus adding overhead to (otherwise CPU cost-free) non-blocking checkpoint operations. In this paper we present a detailed cost model for non-blocking checkpointing and derive a performance-effective re-synchronization semantic which we call minimum cost re-synchronization. With this semantic, an occurrence of re-synchronization either commits an ongoing DMA-based checkpoint operation (causing suspension of CPU activities) or aborts the operation (with a possible increase in the expected rollback cost due to a reduced number of committed checkpoints), on the basis of a minimum overhead expectation evaluated through the cost model.
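The commit-or-abort choice described in the abstract above can be illustrated with a minimal sketch. This is not the CCL API and not the paper's cost model: the function name `resynchronize` and both cost inputs are hypothetical, and the two expected overheads are assumed to be already available as plain numbers.

```python
def resynchronize(remaining_dma_time, expected_rollback_penalty):
    """Pick the cheaper action for an in-flight DMA checkpoint.

    Assumptions of this sketch:
    - committing suspends the CPU for `remaining_dma_time`;
    - aborting costs `expected_rollback_penalty`, the expected increase in
      rollback cost caused by having one fewer committed checkpoint.
    """
    commit_cost = remaining_dma_time
    abort_cost = expected_rollback_penalty
    # Minimum cost rule: choose the action with the lower expected overhead.
    return "commit" if commit_cost <= abort_cost else "abort"


nearly_done = resynchronize(remaining_dma_time=2.0, expected_rollback_penalty=5.0)
just_started = resynchronize(remaining_dma_time=8.0, expected_rollback_penalty=5.0)
```

Ties are resolved in favour of committing here; that tie-break is a choice of this sketch, not something the abstract specifies.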
Parallel Discrete-Event Simulation
- 2009
Parallel discrete-event simulation (PDES), simply referred to as parallel simulation, is concerned with the execution of discrete-event simulation on parallel computers. PDES has been recognized as a challenging research field bridging modeling and simulation and high-performance computing. By exploiting the potential parallelism in a simulation model, PDES can overcome the limitations imposed by sequential simulation in both execution time and memory space, and has therefore proven to be a viable technique for solving large-scale complex models. In this article, we provide a brief overview of the current state of PDES, identify its fundamental challenges, and discuss the principal existing solutions resulting from three decades of intensive research in this field. Further, we report specific research advances in high-performance modeling and simulation of large-scale computer networks as an exemplar of typical PDES applications.

