Results 1 - 10
of
16
HydEE: Failure Containment without Event Logging for Large Scale Send-Deterministic MPI Applications
- In 26th IEEE International Parallel & Distributed Processing Symposium (IPDPS2012
, 2012
"... Abstract—High performance computing will probably reach exascale in this decade. At this scale, mean time between failures is expected to be a few hours. Existing fault tolerant protocols for message passing applications will not be efficient anymore since they either require a global restart after ..."
Abstract
-
Cited by 19 (9 self)
- Add to MetaCart
(Show Context)
Abstract—High performance computing will probably reach exascale in this decade. At this scale, mean time between failures is expected to be a few hours. Existing fault tolerant protocols for message passing applications will not be efficient anymore since they either require a global restart after a failure (checkpointing protocols) or result in huge memory occupation (message logging). Hybrid fault tolerant protocols overcome these limits by dividing applications processes into clusters and applying a different protocol within and between clusters. Combining coordinated checkpointing inside the clusters and message logging for the inter-cluster messages allows confining the consequences of a failure to a single cluster, while logging only a subset of the messages. However, in existing hybrid protocols, event logging is required for all application messages to ensure a correct execution after a failure. This can significantly impair failure free performance. In this paper, we propose HydEE, a hybrid rollback-recovery protocol for send-deterministic message passing applications, that provides failure containment without logging any event, and only a subset of the application messages. We prove that HydEE can handle multiple concurrent failures by relying on the senddeterministic execution model. Experimental evaluations of our implementation of HydEE in the MPICH2 library show that it introduces almost no overhead on failure free execution. Keywords-High performance computing, MPI, fault tolerance, send-determinism, failure containment I.
Unified model for assessing checkpointing protocols at extreme-scale
, 2012
"... In this article, we present a unified model for several well-known checkpoint/restart protocols. The proposed model is generic enough to encompass both extremes of the checkpoint/restart space: on one side the coordinated checkpoint, and on the other extreme, a variety of uncoordinated checkpoint st ..."
Abstract
-
Cited by 17 (10 self)
- Add to MetaCart
(Show Context)
In this article, we present a unified model for several well-known checkpoint/restart protocols. The proposed model is generic enough to encompass both extremes of the checkpoint/restart space: on one side the coordinated checkpoint, and on the other extreme, a variety of uncoordinated checkpoint strategies (with message logging). We identify a set of parameters that are crucial to instantiate and compare the expected efficiency of the fault tolerant protocols, for a given application/platform pair. We then propose a detailed analysis of several scenarios, including some of the most powerful currently available HPC platforms, as well as anticipated Exascale designs. This comparison outlines the comparative behaviors of checkpoint strategies at scale, thereby providing insight that is hardly accessible to direct experimentation.
Correlated set coordination in fault tolerant message logging protocols for many-core clusters
, 2013
"... ..."
(Show Context)
F.: On the use of clusterbased partial message logging to improve fault tolerance for mpi hpc applications
- In: Euro-Par
, 2011
"... Abstract. Fault tolerance is becoming a major concern in HPC sys-tems. The two traditional approaches for message passing applications, coordinated checkpointing and message logging, have severe scalability issues. Coordinated checkpointing protocols make all processes roll back after a failure. Mes ..."
Abstract
-
Cited by 12 (5 self)
- Add to MetaCart
(Show Context)
Abstract. Fault tolerance is becoming a major concern in HPC sys-tems. The two traditional approaches for message passing applications, coordinated checkpointing and message logging, have severe scalability issues. Coordinated checkpointing protocols make all processes roll back after a failure. Message logging protocols log a huge amount of data and can induce an overhead on communication performance. Hierarchi-cal rollback-recovery protocols based on the combination of coordinated checkpointing and message logging are an alternative. These partial mes-sage logging protocols are based on process clustering: only messages between clusters are logged to limit the consequence of a failure to one cluster. These protocols would work efficiently only if one can find clus-ters of processes in the applications such that the ratio of logged messages is very low. We study the communication patterns of message passing HPC applications to show that partial message logging is suitable in most cases. We propose a partitioning algorithm to find suitable clusters of processes given the communication pattern of an application. Finally, we evaluate the efficiency of partial message logging using two state of the art protocols on a set of representative applications. 1
Evaluation of simple causal message logging for large-scale fault tolerant hpc systems
- in 16th IEEE Workshop on Dependable Parallel, Distributed and Network-Centric Systems in 25th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2011
, 2011
"... Abstract—The era of petascale computing brought machines with hundreds of thousands of processors. The next generation of exascale supercomputers will make available clusters with millions of processors. In those machines, mean time between failures will range from a few minutes to few tens of minut ..."
Abstract
-
Cited by 9 (8 self)
- Add to MetaCart
(Show Context)
Abstract—The era of petascale computing brought machines with hundreds of thousands of processors. The next generation of exascale supercomputers will make available clusters with millions of processors. In those machines, mean time between failures will range from a few minutes to few tens of minutes, making the crash of a processor the common case, instead of a rarity. Parallel applications running on those large machines will need to simultaneously survive crashes and maintain high productivity. To achieve that, fault tolerance techniques will have to go beyond checkpoint/restart, which requires all processors to roll back in case of a failure. Incorporating some form of message logging will provide a framework where only a subset of processors are rolled back after a crash. In this paper, we discuss why a simple causal message logging protocol seems a promising alternative to provide fault tolerance in large supercomputers. As opposed to pessimistic message logging, it has low latency overhead, especially in collective communication operations. Besides, it saves messages when more than one thread is running per processor. Finally, we demonstrate that a simple causal message logging protocol has a faster recovery and a low performance penalty when compared to checkpoint/restart. Running NAS Parallel Benchmarks (CG, MG, BT and DT) on 1024 processors, simple causal message logging has a latency overhead below 5%. Keywords-causal message logging; pessimistic message logging; migratable objects; parallel applications. I.
Dynamic load balance for optimized message logging in fault tolerant hpc applications
- in IEEE International Conference on Cluster Computing (Cluster) 2011
, 2011
"... Abstract—Computing systems will grow significantly larger in the near future to satisfy the needs of computational scien-tists in areas like climate modeling, biophysics and cosmology. Supercomputers being installed in the next few years will comprise millions of cores, hundreds of thousands of proc ..."
Abstract
-
Cited by 4 (1 self)
- Add to MetaCart
(Show Context)
Abstract—Computing systems will grow significantly larger in the near future to satisfy the needs of computational scien-tists in areas like climate modeling, biophysics and cosmology. Supercomputers being installed in the next few years will comprise millions of cores, hundreds of thousands of processor chips and millions of physical components. However, it is expected that failures become more prevalent in those machines to the point where 10 % of an Exascale system will be wasted just recovering from failures. Further, with such large numbers of cores, fine-grained and dynamic load balance will become increasingly critical for maintaining good system utilization. This paper addresses both fault tolerance and load balancing by presenting a novel extension of traditional message logging protocols based on team checkpointing. Message logging makes it possible to recover from localized failures by rolling back just the failed processing elements. Since this comes at a high memory overhead from logging all communication, we reduce this cost by organizing processing elements into teams and only logging messages between teams. Further, we show how to dynamically partition the applica-tion into teams to simultaneously minimize the cost of fault tolerance and to balance application load. We experimentally show that this scheme has low overhead and can dramatically reduce the memory cost of message logging. Keywords-load balancing, causal message logging, fault tol-erance. I.
SPBC: Leveraging the Characteristics of MPI HPC Applications for Scalable Checkpointing
"... The high failure rate expected for future supercomputers requires the design of new fault tolerant solutions. Most checkpointing protocols are designed to work with any messagepassing application but suffer from scalability issues at extreme scale. We take a different approach: We identify a propert ..."
Abstract
-
Cited by 4 (0 self)
- Add to MetaCart
(Show Context)
The high failure rate expected for future supercomputers requires the design of new fault tolerant solutions. Most checkpointing protocols are designed to work with any messagepassing application but suffer from scalability issues at extreme scale. We take a different approach: We identify a property common to many HPC applications, namely channel-determinism, and introduce a new partial order relation, called always-happens-before relation, between events of such applications. Leveraging these two concepts, we design a protocol that combines an unprecedented set of features. Our protocol called SPBC combines in a hierarchical way coordinated checkpointing and message logging. It is the first protocol that provides failure containment without logging any information reliably apart from process checkpoints, and this, without penalizing recovery performance. Experiments run with a representative set of HPC workloads demonstrate a good performance of our protocol during both, failure-free execution and recovery.
Hierarchical Clustering Strategies for Fault Tolerance in Large Scale HPC Systems
"... to use novel techniques to allow scientific applications to progress despite frequent failures. Checkpoint-Restart is currently the most popular way to mitigate the impact of failures during longrunning executions. Different techniques try to reduce the cost of Checkpoint-Restart, some of them such ..."
Abstract
-
Cited by 1 (1 self)
- Add to MetaCart
(Show Context)
to use novel techniques to allow scientific applications to progress despite frequent failures. Checkpoint-Restart is currently the most popular way to mitigate the impact of failures during longrunning executions. Different techniques try to reduce the cost of Checkpoint-Restart, some of them such as local checkpointing and erasure codes aim to reduce the time to checkpoint while others such as uncoordinated checkpoint and message-logging aim to decrease the cost of recovery. In this paper, we study how to combine all these techniques together in order to optimize both: checkpointing and recovery. We present several clustering and topology challenges that lead us to an optimization problem in a four-dimensional space: reliability level, recovery cost, encoding time and message logging overhead. We propose a novel clustering method inspired from brain topology studies in neuroscience and evaluate it with a Tsunami simulation application in TSUBAME2. Our evaluation with 1024 processes shows that our novel clustering method can guarantee good performance for all of the four mentioned dimensions of our optimization problem. I.
at Extreme-Scale
, 2012
"... All in-text references underlined in blue are linked to publications on ResearchGate, letting you access and read them immediately. ..."
Abstract
- Add to MetaCart
(Show Context)
All in-text references underlined in blue are linked to publications on ResearchGate, letting you access and read them immediately.
for Assessing Checkpointing Protocols at Extreme-Scale
"... Abstract: In this article, we present a unified model for several well-known checkpoint/restart protocols. The proposed model is generic enough to encompass both extremes of the checkpoint/restart space: on one side the coordinated checkpoint, and on the other extreme, a variety of uncoordinated che ..."
Abstract
- Add to MetaCart
(Show Context)
Abstract: In this article, we present a unified model for several well-known checkpoint/restart protocols. The proposed model is generic enough to encompass both extremes of the checkpoint/restart space: on one side the coordinated checkpoint, and on the other extreme, a variety of uncoordinated checkpoint strategies (with message logging). We identify a set of parameters that are crucial to instantiate and compare the expected efficiency of the fault tolerant protocols, for a given application/platform pair. We then propose a detailed analysis of several scenarios, including some of the most powerful currently available HPC platforms, as well as anticipated Exascale designs. This comparison outlines the comparative behaviors of checkpoint strategies at scale, thereby providing insight that is hardly accessible to direct experimentation. Key-words: Fault-tolerance, checkpointing, coordinated, hierarchical, model, exascale