Results 1 - 10
of
61
MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes
- In Supercomputing
, 2002
"... Global Computing platforms, large scale clusters and future TeraGRID systems gather thousands of nodes for computing parallel scientific applications. At this scale, node failures or disconnections are frequent events. This Volatility reduces the MTBF of the whole system in the range of hours or min ..."
Abstract
-
Cited by 94 (10 self)
- Add to MetaCart
Global Computing platforms, large scale clusters and future TeraGRID systems gather thousands of nodes for computing parallel scientific applications. At this scale, node failures or disconnections are frequent events. This Volatility reduces the MTBF of the whole system in the range of hours or minutes.
A Network-Failure-tolerant Message-Passing system for Terascale Clusters
- International Journal of Parallel Programming
, 2003
"... The Los Alamos Message Passing Interface (LA-MPI) is an end-to-end networkfailure-tolerant message-passing system designed for terascale clusters. LA-MPI is a standard-compliant implementation of MPI designed to tolerate networkrelated failures including I/O bus errors, network card errors, and wire ..."
Abstract
-
Cited by 57 (16 self)
- Add to MetaCart
The Los Alamos Message Passing Interface (LA-MPI) is an end-to-end networkfailure-tolerant message-passing system designed for terascale clusters. LA-MPI is a standard-compliant implementation of MPI designed to tolerate networkrelated failures including I/O bus errors, network card errors, and wire-transmission
MPICH-V2: a fault tolerant MPI for volatile nodes based on pessimistic sender based message logging
- In SuperComputing 2003
, 2003
"... Execution of MPI applications on clusters and Grid deployments suffering from node and network failures motivates the use of fault tolerant MPI implementations. We present MPICH-V2 (the second protocol of MPICH-V project), an automatic fault tolerant MPI implementation using an innovative protocol t ..."
Abstract
-
Cited by 46 (4 self)
- Add to MetaCart
Execution of MPI applications on clusters and Grid deployments suffering from node and network failures motivates the use of fault tolerant MPI implementations. We present MPICH-V2 (the second protocol of MPICH-V project), an automatic fault tolerant MPI implementation using an innovative protocol that removes the most limiting factor of the pessimistic message logging approach: reliable logging of in transit messages. MPICH-V2 relies on uncoordinated checkpointing, sender based message logging and remote reliable logging of message logical clocks. This paper presents the architecture of MPICH-V2, its theoretical foundation and the performance of the implementation. We compare MPICH-V2 to MPICH-V1 and MPICH-P4 evaluating a) its point-to-point performance, b) the performance for the NAS benchmarks, c) the application performance when many faults occur during the execution. Experimental results demonstrate that MPICH-V2 provides performance close to MPICH-P4 for applications using large messages while reducing dramatically the number of reliable nodes compared to MPICH-V1. 1
Proactive fault tolerance for hpc with xen virtualization,” inICS ’07
- Proceedings of the 21st Annual International Conference on Supercomputing
, 2007
"... Large-scale parallel computing is relying increasingly on clusters with thousands of processors. At such large counts of compute nodes, faults are becoming common place. Current techniques to tolerate faults focus on reactive schemes to recover from faults and generally rely on a checkpoint/restart ..."
Abstract
-
Cited by 36 (6 self)
- Add to MetaCart
Large-scale parallel computing is relying increasingly on clusters with thousands of processors. At such large counts of compute nodes, faults are becoming common place. Current techniques to tolerate faults focus on reactive schemes to recover from faults and generally rely on a checkpoint/restart mechanism. Yet, in today’s systems, node failures can often be anticipated by detecting a deteriorating health status. Instead of a reactive scheme for fault tolerance (FT), we are promoting a proactive one where processes automatically migrate from “unhealthy ” nodes to healthy ones. Our approach relies on operating system virtualization techniques exemplified by but not limited to Xen. This paper contributes an automatic and transparent mechanism for proactive FT for arbitrary MPI applications. It leverages virtualization techniques combined with health monitoring
Conceptual and Implementation Models for the Grid
- In Proceedings of the IEEE, Special Issue on Grid Computing
, 2005
"... The Grid is rapidly emerging as the dominant paradigm for wide area distributed application systems. As a result, there is a need for modeling and analyzing the characteristics and requirements of Grid systems and programming models. This paper adopts the well-established body of models for distribu ..."
Abstract
-
Cited by 21 (11 self)
- Add to MetaCart
The Grid is rapidly emerging as the dominant paradigm for wide area distributed application systems. As a result, there is a need for modeling and analyzing the characteristics and requirements of Grid systems and programming models. This paper adopts the well-established body of models for distributed computing systems, which are based upon carefully stated assumptions or axioms, as a basis for defining and characterizing Grids and their programming models and systems. The requirements of programming Grid applications and the resulting requirements on the underlying virtual organizations and virtual machines are investigated. The assumptions underlying some of the programming models and systems currently used for Grid applications are identified and their validity in Grid environments is discussed. A more in-depth analysis of two programming systems, the Imperial College E-Science Networked Infrastructure (ICENI) and Accord, using the proposed definitions’ structure is presented. Keywords—Distributed systems, Grid programming models, Grid programming systems, Grid system definition. I.
A Simple MPI Process Swapping Architecture for Iterative
- Applications, The International Journal of High Performance Computing Applications
, 2004
"... Parallel computing is now popular and mainstream, but performance and ease-of-use remain elusive to many endusers. There exists a need for performance improvements that can be easily retrofitted to existing parallel applications. In this paper we present MPI process swapping, a simple performance en ..."
Abstract
-
Cited by 20 (5 self)
- Add to MetaCart
Parallel computing is now popular and mainstream, but performance and ease-of-use remain elusive to many endusers. There exists a need for performance improvements that can be easily retrofitted to existing parallel applications. In this paper we present MPI process swapping, a simple performance enhancing add-on to the MPI programming paradigm. MPI process swapping improves performance by dynamically choosing the best available resources throughout application execution, using MPI process over-allocation and real-time performance measurement. Swapping provides fully automated performance monitoring and process management, and a rich set of primitives to control execution behavior manually or through an external tool. Swapping, as defined in this implementation, can be added to iterative MPI applications and requires as few as three lines of source code change. We verify our design for a particle dynamics application on desktop resources within a production commercial environment. 1.
MPI/FT TM : Architecture and taxonomies for fault-tolerant, message-passing middleware for performance-portable parallel computing
- In Proceedings of the 1st IEEE International Symposium of Cluster Computing and the Grid
, 2001
"... MPI has proven effective for parallel applications in situations with neither QoS nor fault handling. Emerging environments motivate fault-tolerant MPI middleware. Environments include space-based, wide-area/web/meta computing, and scalable clusters. MPI/FT, the system described here, trades off suf ..."
Abstract
-
Cited by 18 (1 self)
- Add to MetaCart
MPI has proven effective for parallel applications in situations with neither QoS nor fault handling. Emerging environments motivate fault-tolerant MPI middleware. Environments include space-based, wide-area/web/meta computing, and scalable clusters. MPI/FT, the system described here, trades off sufficient MPI fault coverage against acceptable parallel performance, based on mission requirements and constraints. MPI codes are evolved to use MPI/FT features. Non-portable code for event handlers and recovery management is isolated. User-coordinated recovery, checkpointing, transparency and event handling, as well as evolvability of legacy MPI codes form key design criteria. Parallel self-checking threads address four levels of MPI implementation robustness, three of which are portable to any multithreaded MPI. A taxonomy of application types provides six initial fault-relevant models; user-transparent parallel nMR computation is thereby considered. Key concepts from MPI/RT – real-time MPI – are also incorporated into MPI/FT, with further overt support for MPI/RT and MPI/FT in applications possible in future. 1.
Architecture of LA-MPI, a network-fault-tolerant MPI
- In International Parallel and Distributed Processing Symposium
, 2004
"... We discuss the unique architectural elements of the Los Alamos Message Passing Interface (LA-MPI), a high-performance, network-fault-tolerant, thread-safe MPI library. LA-MPI is designed for use on terascale clusters which are inherently unreliable due to their sheer number of system components and ..."
Abstract
-
Cited by 18 (0 self)
- Add to MetaCart
We discuss the unique architectural elements of the Los Alamos Message Passing Interface (LA-MPI), a high-performance, network-fault-tolerant, thread-safe MPI library. LA-MPI is designed for use on terascale clusters which are inherently unreliable due to their sheer number of system components and tradeoffs between cost and performance. We examine in detail the design concepts used to implement LA-MPI. These include reliability features, such as applicationlevel checksumming, message retransmission, and automatic message re-routing. Other key performance enhancing features, such as concurrent message routing over multiple, diverse network adapters and protocols, and communication-specific optimizations (e.g., shared memory) are examined. 1
Fault Tolerance in MPI Programs
- Special issue of the Journal High Performance Computing Applications (IJHPCA
, 2002
"... This paper examines the topic of writing fault-tolerant MPI applications. We discuss the meaning of fault tolerance in general and what the MPI Standard has to say about it. We survey several approaches to this problem, namely checkpointing, restructuring a class of standard MPI programs, modify ..."
Abstract
-
Cited by 18 (0 self)
- Add to MetaCart
This paper examines the topic of writing fault-tolerant MPI applications. We discuss the meaning of fault tolerance in general and what the MPI Standard has to say about it. We survey several approaches to this problem, namely checkpointing, restructuring a class of standard MPI programs, modifying MPI semantics, and extending the MPI speci cation. We conclude that within certain constraints, MPI can provide a useful context for writing application programs that exhibit signi cant degrees of fault tolerance.
Improved message logging versus improved coordinated checkpointing for fault tolerant MPI
- in IEEE International Conference on Cluster Computing (Cluster 2004). IEEE CS
, 2004
"... Fault tolerance is a very important concern for critical high performance applications using the MPI library. Several protocols provide automatic and transparent fault detection and recovery for message passing systems with different impact on application performance and the capacity to tolerate a h ..."
Abstract
-
Cited by 16 (8 self)
- Add to MetaCart
Fault tolerance is a very important concern for critical high performance applications using the MPI library. Several protocols provide automatic and transparent fault detection and recovery for message passing systems with different impact on application performance and the capacity to tolerate a high fault rate. In a recent paper, we have demonstrated that the main differences between pessimistic sender based message logging and coordinated checkpointing are 1) the communication latency and 2) the performance penalty in case of faults. Pessimistic message logging increases the latency, due to additional blocking control messages. When faults occur at a high rate, coordinated checkpointing implies a higher performance penalty than message logging due to a higher stress on the checkpoint server. In this paper we extend this study to improved versions of message logging and coordinated checkpoint protocols which respectively reduces the latency overhead of pessimistic message logging and the server stress of coordinated checkpoint. We detail the protocols and their implementation into the new MPICH-V fault tolerant framework. We compare their performance against the previous versions and we compare the novel message logging protocols against the improved coordinated checkpointing one using the NAS benchmark on a typical high performance cluster equipped with a high speed network. The contribution of this paper is two folds: a) an original message logging protocol and an improved coordinated checkpointing protocol and b) the comparison between them.

