A job pause service under LAM/MPI+BLCR for transparent fault tolerance
- In International Parallel and Distributed Processing Symposium, 2007
Cited by 15 (5 self)
Checkpoint/restart (C/R) has become a requirement for long-running jobs in large-scale clusters due to a mean-time-to-failure (MTTF) on the order of hours. After a failure, C/R mechanisms generally require a complete restart of an MPI job from the last checkpoint. A complete restart, however, is unnecessary since all but one node are typically still alive. Furthermore, a restart may result in lengthy job requeuing even though the original job had not exceeded its time quantum. In this paper, we overcome these shortcomings. Instead of job restart, we have developed a transparent mechanism for job pause within LAM/MPI+BLCR. This mechanism allows live nodes to remain active and roll back to the last checkpoint while failed nodes are dynamically replaced by spares before resuming from the last checkpoint. Our methodology includes LAM/MPI enhancements in support of scalable group communication with a fluctuating number of nodes, reuse of network connections, transparent coordinated checkpoint scheduling and a BLCR enhancement for job pause. Experiments in a cluster with the NAS Parallel Benchmark suite show that our overhead for job pause is comparable to that of a complete job restart. A minimal overhead of 5.6% is incurred only when migration takes place, while the regular checkpoint overhead remains unchanged. Yet our approach alleviates the need to reboot the LAM run-time environment, which accounts for considerable overhead, resulting in net savings for our scheme in the experiments. Our solution further provides full transparency and automation with the additional benefit of reusing existing resources. Execution continues after failures within the scheduled job, i.e., the application staging overhead is not incurred again, in contrast to a restart. Our scheme offers additional potential for savings through incremental checkpointing and proactive diskless live migration, which we are currently working on.
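The job pause idea in the abstract above can be illustrated with a toy model. The sketch below is not the LAM/MPI+BLCR implementation; the `Job` class, its fields, and the node names are all hypothetical and only mirror the described behavior: live nodes stay in the job, the failed node is swapped for a spare, and every rank rolls back to the last coordinated checkpoint.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Toy model of a running job (hypothetical, for illustration only)."""
    ranks: dict               # rank -> node name currently hosting it
    spares: list              # idle spare nodes available as replacements
    step: int = 0             # current progress of the job
    last_checkpoint: int = 0  # step recorded at the last coordinated checkpoint

    def checkpoint(self):
        # Coordinated checkpoint: all ranks record the same step.
        self.last_checkpoint = self.step

    def pause_and_recover(self, failed_node):
        """Replace the failed node with a spare and roll back all ranks."""
        if not self.spares:
            raise RuntimeError("no spare node available; a full restart is needed")
        spare = self.spares.pop(0)
        for rank, node in self.ranks.items():
            if node == failed_node:
                self.ranks[rank] = spare  # only the failed node's ranks move
        self.step = self.last_checkpoint  # live nodes roll back in place
        return spare


job = Job(ranks={0: "n0", 1: "n1", 2: "n2"}, spares=["s0"])
job.step = 10
job.checkpoint()
job.step = 17
replacement = job.pause_and_recover("n1")
```

In this model the job never leaves the scheduler: the rank-to-node mapping changes for one rank only, which corresponds to the abstract's point that requeuing and re-staging are avoided.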
Fault Tolerant Network Routing through Software Overlays for Intelligent Power Grids
Control decisions of intelligent devices in critical infrastructure can have a significant impact on human life and the environment. Ensuring that the appropriate data is available is crucial in making informed decisions. Such considerations are becoming increasingly important in today’s cyber-physical systems that combine computational decision making on the cyber side with physical control on the device side. In an intelligent power system, power management of energy is provided in a highly distributed and scalable manner. The system has to ensure that intelligent devices have the appropriate data to make control decisions for microgrids and with respect to microgrid connectivity to an upstream utility power grid. The job of ensuring the timely arrival of the data falls onto the network designed to support these intelligent devices. This network needs to be fault tolerant. When nodes, devices or communication links fail along a default route of a message from A to B, the underlying hardware and software layers should ensure that this message will actually be delivered as long as alternative routes exist. Ensuring multi-route pathways and discovery of these pathways is critical in ensuring delivery of critical data. In this work, we propose methods of developing network topologies of smart devices that will enable multi-route discovery in an intelligent power grid. This will be accomplished through the utilization of software overlays (1) that maintain a digital representation of the physical network and (2) allow new route discovery in the case of a fault. Our vision is that the application of this approach in an intelligent power grid will enable intelligent power devices to make automated, decentralized decisions and to maintain state of lower-level devices.
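The two roles of the overlay named in the abstract above, holding a digital representation of the physical network and discovering a new route after a fault, can be sketched as follows. This is a minimal illustration and not the authors' method: the adjacency-dictionary representation, the `find_route` function, the breadth-first search, and the node names are assumptions made for the example.

```python
from collections import deque


def find_route(overlay, src, dst, failed=frozenset()):
    """Breadth-first search over the overlay, skipping failed nodes.

    overlay: dict mapping each node to a list of directly reachable neighbors
             (a hypothetical digital representation of the physical network).
    Returns the list of nodes on a shortest surviving route, or None.
    """
    if src in failed or dst in failed:
        return None
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:  # walk parents back to the source
                path.append(node)
                node = parent[node]
            return path[::-1]
        for neighbor in overlay.get(node, ()):
            if neighbor not in parent and neighbor not in failed:
                parent[neighbor] = node
                queue.append(neighbor)
    return None


# Hypothetical topology: A reaches B by default via C, with a backup via D and E.
overlay = {"A": ["C", "D"], "C": ["B"], "D": ["E"], "E": ["B"], "B": []}
default_route = find_route(overlay, "A", "B")
backup_route = find_route(overlay, "A", "B", failed={"C"})
no_route = find_route(overlay, "A", "B", failed={"C", "E"})
```

The message from A to B is delivered as long as some alternative route exists in the overlay; once every route is cut, the search returns `None`.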
WANG, CHAO. Transparent Fault Tolerance for Job Healing in HPC Environments.
(Under the direction of Associate Professor Frank Mueller). As the number of nodes in high-performance computing environments keeps increasing, faults are becoming commonplace, causing losses in intermediate results of HPC jobs. Furthermore, storage systems providing job input data have been shown to consistently rank as the primary source of system failures, leading to data unavailability and job resubmissions. This dissertation presents a combination of multiple fault tolerance techniques that realize significant advances in fault resilience of HPC jobs. The efforts encompass two broad areas. First, at the job level, novel, scalable mechanisms are built in support of proactive FT and to significantly enhance reactive FT. The contributions of this dissertation in this area are (1) a transparent job pause mechanism, which allows a job to pause when a process fails and prevents it from having to re-enter the job queue; (2) a proactive fault-tolerant approach that combines process-level live migration with health monitoring to complement reactive with proactive FT and to reduce the number of checkpoints when a majority of the
The FREEDM Architecture of Fault Tolerant Network Routing through Software Overlays
Control decisions of intelligent devices in critical infrastructure can have a significant impact on human life and the environment. Ensuring that the appropriate data is available is crucial in making informed decisions. Such considerations are becoming increasingly important in today’s cyber-physical systems that combine computational decision making on the cyber side with physical control on the device side. In the FREEDM system, power management of green energy is provided in a highly distributed and scalable manner. The system has to ensure that Intelligent Energy Management (IEM) and Intelligent Fault Management (IFM) devices have the appropriate data to make control decisions for microgrids and with respect to microgrid connectivity to an upstream utility power grid. The job of ensuring the timely arrival of the data falls onto the network designed to support these intelligent devices. This network needs to be fault tolerant. When nodes, devices or communication links fail along a default route of a message from A to B, the underlying hardware and software layers should ensure that this message will actually be delivered as long as alternative routes exist. Ensuring multi-route pathways and discovery of these pathways is critical in ensuring delivery of critical data. In this work, we present methods of developing network topologies of smart devices that will enable multi-route discovery in an intelligent power grid. This will be accomplished through the utilization of software overlays (1) that maintain a digital representation of the physical network and (2) allow new route discovery in the case of a fault. We also aim to present a visualization of the connection states and pathways through the network to help external entities understand the state of the network.
Our vision is that the application of this approach in an intelligent power grid will enable IEM and IFM devices to make automated, decentralized decisions and to maintain state of lower-level devices.

