Results 1 - 10 of 215
Impact of event logger on causal message logging protocols for fault tolerant MPI
- In IPDPS ’05: Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS’05) - Papers
, 2005
"... Abstract — Fault tolerance in MPI becomes a main issue in the HPC community. Several approaches are envisioned from user or programmer controlled fault tolerance to fully automatic fault detection and handling. For this last approach, several protocols have been proposed in the literature. In a rece ..."
Cited by 9 (0 self)
recent paper, we have demonstrated that uncoordinated checkpointing tolerates higher fault frequency than coordinated checkpointing. Moreover causal message logging protocols have been proved the most efficient message logging technique. These protocols consist in piggybacking non deterministic events
MPICH-V2: a fault tolerant MPI for volatile nodes based on pessimistic sender based message logging
- In SuperComputing 2003
, 2003
"... Execution of MPI applications on clusters and Grid deployments suffering from node and network failures motivates the use of fault tolerant MPI implementations. We present MPICH-V2 (the second protocol of MPICH-V project), an automatic fault tolerant MPI implementation using an innovative protocol t ..."
Cited by 84 (4 self)
Coordinated checkpoint versus message log for fault tolerant MPI
"... MPI is one of the most adopted programming models for Large Clusters and Grid deployments. However, these systems often suffer from network or node failures. This raises the issue of selecting a fault tolerance approach for MPI. Automatic and transparent ones are based on either coordinated checkpoi ..."
Cited by 37 (8 self)
checkpointing or message logging associated with uncoordinated checkpoint. There are many protocols, implementations and optimizations for these approaches but few results about their comparison. Coordinated checkpoint has the advantage of a very low overhead on fault free executions. In contrast, a message
Improved message logging versus improved coordinated checkpointing for fault tolerant MPI
- in IEEE International Conference on Cluster Computing (Cluster 2004). IEEE CS
, 2004
"... Fault tolerance is a very important concern for critical high performance applications using the MPI library. Several protocols provide automatic and transparent fault detection and recovery for message passing systems with different impact on application performance and the capacity to tolerate a h ..."
Cited by 36 (11 self)
The Peer Sampling Service: Experimental Evaluation of Unstructured Gossip-Based Implementations
- In Middleware ’04: Proceedings of the 5th ACM/IFIP/USENIX international conference on Middleware
, 2004
"... Abstract. In recent years, the gossip-based communication model in large-scale distributed systems has become a general paradigm with important applications which include information dissemination, aggregation, overlay topology management and synchronization. At the heart of all of these protocols l ..."
Cited by 187 (41 self)
that the peers to send gossip messages to are selected uniformly at random from the set of all nodes. In practice—instead of requiring all nodes to know all the peer nodes so that a random sample could be drawn—a scalable and efficient way to implement the peer sampling service is by constructing and maintaining
HydEE: Failure Containment without Event Logging for Large Scale Send-Deterministic MPI Applications
- In 26th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2012)
, 2012
"... Abstract—High performance computing will probably reach exascale in this decade. At this scale, mean time between failures is expected to be a few hours. Existing fault tolerant protocols for message passing applications will not be efficient anymore since they either require a global restart after ..."
Cited by 19 (9 self)
On the use of cluster-based partial message logging to improve fault tolerance for MPI HPC applications
- In: Euro-Par
, 2011
"... Abstract. Fault tolerance is becoming a major concern in HPC sys-tems. The two traditional approaches for message passing applications, coordinated checkpointing and message logging, have severe scalability issues. Coordinated checkpointing protocols make all processes roll back after a failure. Mes ..."
Cited by 12 (5 self)
clustering: only messages between clusters are logged to limit the consequence of a failure to one cluster. These protocols would work efficiently only if one can find clusters of processes in the applications such that the ratio of logged messages is very low. We study the communication patterns of message
Intercommunicator extensions to MPI in the MPIX (MPI eXtension) Library
, 1994
"... MPI is the new standard for multicomputer and cluster message passing introduced by the Message-Passing Interface Forum (MPIF) in April 1994. This paper describes the current inter-communicator interface found in MPI and the reasons for its current design. We also motivate the need for additional in ..."
Cited by 6 (0 self)
inter-communicator operations and introduce the extensions we have included in MPIX (MPI eXtension Library), a library of extensions to MPI that we are currently developing. Inter-communicators may be used for a variety of purposes such as in client/server applications (i.e., I/O and graphics servers
Efficient MPI for Virtual Interface (VI) Architecture
, 1999
"... Efficient Message Passing Interface implementations for emerging cluster interconnects are an important requirement for useful parallel processing on cost-effective clusters of NT workstations. This paper reports on a new implementation of MPI for VI Architecture networks. Support for high bandwidth ..."
Cited by 2 (0 self)
RENEW: A Tool for Fast and Efficient Implementation of Checkpoint Protocols
- IN PROCEEDINGS OF THE 28TH IEEE FAULT-TOLERANT COMPUTING SYMPOSIUM (FTCS
, 1998
"... This paper describes the design, implementation, and evaluation of a run-time system for clusters of workstations that allows the rapid testing of checkpoint protocols with standard benchmarks. To achieve this goal, RENEW provides a flexible set of operations that facilitates the integration of a pr ..."
Cited by 19 (7 self)
protocol in the system with reduced programming effort. To support a broad range of applications, RENEW exports, as its external interface, the industry endorsed Message Passing Interface (MPI). Three distinct classes of protocols were evaluated using the RENEW environment with SPEC and NAS benchmarks on a