Results 1 - 10
of
15
MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes
- In Supercomputing
, 2002
"... Global Computing platforms, large scale clusters and future TeraGRID systems gather thousands of nodes for computing parallel scientific applications. At this scale, node failures or disconnections are frequent events. This Volatility reduces the MTBF of the whole system in the range of hours or min ..."
Abstract
-
Cited by 94 (10 self)
- Add to MetaCart
Global Computing platforms, large scale clusters and future TeraGRID systems gather thousands of nodes for computing parallel scientific applications. At this scale, node failures or disconnections are frequent events. This Volatility reduces the MTBF of the whole system in the range of hours or minutes.
MPICH-V2: a fault tolerant MPI for volatile nodes based on pessimistic sender based message logging
- In SuperComputing 2003
, 2003
"... Execution of MPI applications on clusters and Grid deployments suffering from node and network failures motivates the use of fault tolerant MPI implementations. We present MPICH-V2 (the second protocol of MPICH-V project), an automatic fault tolerant MPI implementation using an innovative protocol t ..."
Abstract
-
Cited by 46 (4 self)
- Add to MetaCart
Execution of MPI applications on clusters and Grid deployments suffering from node and network failures motivates the use of fault tolerant MPI implementations. We present MPICH-V2 (the second protocol of MPICH-V project), an automatic fault tolerant MPI implementation using an innovative protocol that removes the most limiting factor of the pessimistic message logging approach: reliable logging of in transit messages. MPICH-V2 relies on uncoordinated checkpointing, sender based message logging and remote reliable logging of message logical clocks. This paper presents the architecture of MPICH-V2, its theoretical foundation and the performance of the implementation. We compare MPICH-V2 to MPICH-V1 and MPICH-P4 evaluating a) its point-to-point performance, b) the performance for the NAS benchmarks, c) the application performance when many faults occur during the execution. Experimental results demonstrate that MPICH-V2 provides performance close to MPICH-P4 for applications using large messages while reducing dramatically the number of reliable nodes compared to MPICH-V1. 1
Fault Tolerance in MPI Programs
- Special issue of the Journal High Performance Computing Applications (IJHPCA
, 2002
"... This paper examines the topic of writing fault-tolerant MPI applications. We discuss the meaning of fault tolerance in general and what the MPI Standard has to say about it. We survey several approaches to this problem, namely checkpointing, restructuring a class of standard MPI programs, modify ..."
Abstract
-
Cited by 18 (0 self)
- Add to MetaCart
This paper examines the topic of writing fault-tolerant MPI applications. We discuss the meaning of fault tolerance in general and what the MPI Standard has to say about it. We survey several approaches to this problem, namely checkpointing, restructuring a class of standard MPI programs, modifying MPI semantics, and extending the MPI speci cation. We conclude that within certain constraints, MPI can provide a useful context for writing application programs that exhibit signi cant degrees of fault tolerance.
Fault Tolerant Communication Library and Applications for High Performance Computing
- In Los Alamos Computer Science Institute Symposium
, 2003
"... With increasing numbers of processors on todays machines, the probability for node or link failures is also increasing. Therefore, application level fault-tolerance is becomin more of an important issue for both end-users and the institutions running the machines. This paper presents the semantics o ..."
Abstract
-
Cited by 16 (5 self)
- Add to MetaCart
With increasing numbers of processors on todays machines, the probability for node or link failures is also increasing. Therefore, application level fault-tolerance is becomin more of an important issue for both end-users and the institutions running the machines. This paper presents the semantics of a fault tolerant version of the Message Passing Interface, the de-facto standard for communication in scientific applications, which gives applications the possibility to recover from a node or link error and continue execution in a well defined way. The architecture of FT-MPI, an implementation of MPI using the semantics presented above as well as benchmark results with various applications are presented. An example of a fault-tolerant parallel equation solver, performance results as well as the time for recovering from a process failure are furthermore detailed.
Proactive fault tolerance in mpi applications via task migration
- In International Conference on High Performance Computing
, 2006
"... Abstract. Failures are likely to be more frequent in systems with thousands of processors. Therefore, schemes for dealing with faults become increasingly important. In this paper, we present a fault tolerance solution for parallel applications that proactively migrates execution from processors wher ..."
Abstract
-
Cited by 11 (0 self)
- Add to MetaCart
Abstract. Failures are likely to be more frequent in systems with thousands of processors. Therefore, schemes for dealing with faults become increasingly important. In this paper, we present a fault tolerance solution for parallel applications that proactively migrates execution from processors where failure is imminent. Our approach assumes that some failures are predictable, and leverages the features in current hardware devices supporting early indication of faults. We use the concepts of processor virtualization and dynamic task migration, provided by Charm++ and Adaptive MPI (AMPI), to implement a mechanism that migrates tasks away from processors which are expected to fail. To demonstrate the feasibility of our approach, we present performance data from experiments with existing MPI applications. Our results show that proactive task migration is an effective technique to tolerate faults in MPI applications. 1
Process fault tolerance: semantics, design and applications for high performance computing
- International Journal for High Performance Applications and Supercomputing
, 2004
"... With increasing numbers of processors on current machines, the probability for node or link failures is also increasing. Therefore, application-level fault tolerance is becoming more of an important issue for both end-users and the institutions running the machines. In this paper we present the sema ..."
Abstract
-
Cited by 8 (5 self)
- Add to MetaCart
With increasing numbers of processors on current machines, the probability for node or link failures is also increasing. Therefore, application-level fault tolerance is becoming more of an important issue for both end-users and the institutions running the machines. In this paper we present the semantics of a fault-tolerant version of the message passing interface (MPI), the de-facto standard for communication in scientific applications, which gives applications the possibility to recover from a node or link error and continue execution in a well-defined way. We present the architecture of fault-tolerant MPI, an implementation of MPI using the semantics presented above as well as benchmark results with various applications. An example of a fault-tolerant parallel equation solver, performance results as well as the time for recovering from a process failure are furthermore detailed.
MPI/FT: A Model-based Approach to Low-overhead Fault Tolerant Message-passing Middleware
"... Fault tolerance in parallel systems has traditionally been achieved through a combination of redundancy and checkpointing methods. This notion has also been extended to message-passing systems with user-transparent process checkpointing and message logging. Furthermore, studies of multiple types of ..."
Abstract
-
Cited by 6 (0 self)
- Add to MetaCart
Fault tolerance in parallel systems has traditionally been achieved through a combination of redundancy and checkpointing methods. This notion has also been extended to message-passing systems with user-transparent process checkpointing and message logging. Furthermore, studies of multiple types of rollback and recovery have been reported in literature, ranging from communication-induced checkpointing to pessimistic and synchronous solutions. However, many of these solutions incorporate high overhead because of their inability to utilize application level information.
Scalable fault tolerant MPI: extending the recovery algorithm," Euro PVM/MPI
- Proc. 12th Euro PVM/MPI
, 2005
"... Abstract. Fault Tolerant MPI (FT-MPI)[6] was designed as a solution to allow applications different methods to handle process failures beyond simple check-point restart schemes. The initial implementation of FT-MPI included a robust heavy weight system state recovery algorithm that was designed to m ..."
Abstract
-
Cited by 6 (2 self)
- Add to MetaCart
Abstract. Fault Tolerant MPI (FT-MPI)[6] was designed as a solution to allow applications different methods to handle process failures beyond simple check-point restart schemes. The initial implementation of FT-MPI included a robust heavy weight system state recovery algorithm that was designed to manage the membership of MPI communicators during multiple failures. The algorithm and its implementation although robust, was very conservative and this effected its scalability on both very large clusters as well as on distributed systems. This paper details the FT-MPI recovery algorithm and our initial experiments with new recovery algorithms that are aimed at being both scalable and latency tolerant. Our conclusions shows that the use of both topology aware collective communication and distributed consensus algorithms together produce the best results. 2
VolpexMPI: an MPI Library for Execution of Parallel Applications on Volatile Nodes
"... Abstract. The objective of this research is to convert ordinary idle PCs into virtual clusters for executing parallel applications. The paper introduces VolpexMPI that is designed to enable seamless forward application progress in the presence of frequent node failures as well as dynamically changin ..."
Abstract
-
Cited by 6 (5 self)
- Add to MetaCart
Abstract. The objective of this research is to convert ordinary idle PCs into virtual clusters for executing parallel applications. The paper introduces VolpexMPI that is designed to enable seamless forward application progress in the presence of frequent node failures as well as dynamically changing networks speeds and node execution speeds. Process replication is employed to provide robustness in such volatile environments. The central challenge in VolpexMPI design is to efficiently and automatically manage dynamically varying number of process replicas in different states of execution progress. The key fault tolerance technique employed is fully distributed sender based logging. The paper presents the design and a prototype implementation of VolpexMPI. Preliminary results validate that the overhead of providing robustness is modest for applications having a favorable ratio of communication to computation and a low degree of communication. 1
MPICH-V Project: a Multiprotocol Automatic Fault Tolerant MPI
"... High performance computing platforms like Clusters, Grid and Desktop Grids are becoming larger and subject to more frequent failures. MPI is one of the most used message passing library in HPC applications. These two trends raise the need for fault tolerant MPI. The MPICH-V project focuses on design ..."
Abstract
-
Cited by 5 (3 self)
- Add to MetaCart
High performance computing platforms like Clusters, Grid and Desktop Grids are becoming larger and subject to more frequent failures. MPI is one of the most used message passing library in HPC applications. These two trends raise the need for fault tolerant MPI. The MPICH-V project focuses on designing, implementing and comparing several automatic fault tolerance protocols for MPI applications. We present an extensive related work section highlighting the originality of our approach and the proposed protocols. We present then four fault tolerant protocols implemented in a new generic framework for fault tolerant protocol comparison, covering a large spectrum of known approaches from coordinated checkpoint, to uncoordinated checkpoint associated with causal message logging. We measure the performance of these protocols on a microbenchmark and compare them for the NAS benchmark, using an original fault tolerance test. Finally, we outline the lessons learned from this in depth fault tolerant protocol comparison for MPI applications.

