Proactive fault tolerance for HPC with Xen virtualization, in ICS ’07
- Proceedings of the 21st Annual International Conference on Supercomputing
, 2007
Cited by 36 (6 self)
Large-scale parallel computing is relying increasingly on clusters with thousands of processors. At such large counts of compute nodes, faults are becoming commonplace. Current techniques to tolerate faults focus on reactive schemes to recover from faults and generally rely on a checkpoint/restart mechanism. Yet, in today’s systems, node failures can often be anticipated by detecting a deteriorating health status. Instead of a reactive scheme for fault tolerance (FT), we are promoting a proactive one where processes automatically migrate from “unhealthy” nodes to healthy ones. Our approach relies on operating system virtualization techniques exemplified by but not limited to Xen. This paper contributes an automatic and transparent mechanism for proactive FT for arbitrary MPI applications. It leverages virtualization techniques combined with health monitoring
Group-based coordinated checkpointing for MPI: A case study on InfiniBand
- Parallel Processing, 2007. ICPP 2007. International Conference on
, 2007
Cited by 8 (2 self)
As more and more clusters with thousands of nodes are being deployed for high performance computing (HPC), fault tolerance in cluster environments has become a critical requirement. Checkpointing and rollback recovery is a common approach to achieve fault tolerance. Although widely adopted in practice, coordinated checkpointing has a known limitation on scalability. Severe contention for bandwidth to the storage system can occur as a large number of processes take a checkpoint at the same time, resulting in an extremely long checkpointing delay for large parallel applications. In this paper, we propose a novel group-based checkpointing design to alleviate this scalability limitation. By carefully scheduling the MPI processes to take checkpoints in smaller groups, our design reduces the number of processes simultaneously taking checkpoints, while allowing those processes not taking checkpoints to proceed with computation. We implement our design and carry out a detailed evaluation with micro-benchmarks, HPL, and the parallel version of a data mining toolkit, MotifMiner. Experimental results show our group-based checkpointing design can reduce the effective delay for checkpointing significantly, up to 78% for HPL and up to 70% for MotifMiner.
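The group-based idea in the abstract above can be illustrated with a small sketch. It is not the paper's implementation: the consecutive-rank grouping policy, the function names, and the assumption that storage bandwidth is shared equally among simultaneous writers are all hypothetical, and the sketch ignores the communication dependencies between groups that a real MPI design has to handle.

```python
def make_groups(ranks, group_size):
    """Split MPI ranks into consecutive groups (hypothetical grouping policy)."""
    return [ranks[i:i + group_size] for i in range(0, len(ranks), group_size)]


def blocked_seconds(writers, image_bytes, storage_bytes_per_s):
    """Stall time per writer when `writers` processes share the storage
    bandwidth equally (a simplifying assumption, not a measured model)."""
    return writers * image_bytes / storage_bytes_per_s


ranks = list(range(64))
groups = make_groups(ranks, 16)

# Illustrative numbers only: 100-byte images, 50 bytes/s of shared bandwidth.
all_at_once = blocked_seconds(len(ranks), 100, 50)     # every rank stalls together
per_group = blocked_seconds(len(groups[0]), 100, 50)   # one group stalls at a time
```

Under these toy assumptions each process is stalled only while its own group writes, which is the sense in which the "effective delay" shrinks even though the total volume written is unchanged.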
Proactive Process-Level Live Migration in HPC Environments
Cited by 8 (4 self)
As the number of nodes in high-performance computing environments keeps increasing, faults are becoming commonplace. Reactive fault tolerance (FT) often does not scale due to massive I/O requirements and relies on manual job resubmission. This work complements reactive with proactive FT at the process level. Through health monitoring, a subset of node failures can be anticipated when a node’s health deteriorates. A novel process-level live migration mechanism supports continued execution of applications during much of process migration. This scheme is integrated into an MPI execution environment to transparently sustain health-inflicted node failures, which eradicates the need to restart and requeue MPI jobs. Experiments indicate that 1-6.5 seconds of prior warning are required to successfully trigger live process migration while similar operating system virtualization mechanisms require 13-24 seconds. This self-healing approach complements reactive FT by nearly cutting the number of checkpoints in half when 70% of the faults are handled proactively.
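The abstract above does not say which checkpoint-interval model lies behind the "nearly in half" figure. One way to make the number plausible, assuming Young's first-order approximation for the checkpoint interval (interval = sqrt(2 x checkpoint cost x MTBF)), is sketched below. The function names are hypothetical, and the model is an assumption of this sketch rather than a statement about the paper's analysis.

```python
import math


def young_interval(checkpoint_cost, mtbf):
    """Young's first-order approximation of the checkpoint interval."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def remaining_checkpoint_fraction(proactive_fraction):
    """Fraction of checkpoints still needed when `proactive_fraction` of faults
    is handled by migration. Only the remaining faults count against the MTBF,
    so the effective MTBF grows by 1 / (1 - p), the interval by the square root
    of that, and the checkpoint count shrinks by sqrt(1 - p)."""
    return math.sqrt(1.0 - proactive_fraction)


fraction_left = remaining_checkpoint_fraction(0.7)  # roughly 0.55
```

With 70% of faults handled proactively, about 55% of the checkpoints remain under this model, which is consistent with "nearly cutting the number of checkpoints in half".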
Accelerating checkpoint operation by node-level write aggregation on multicore systems
, 2009
Cited by 2 (2 self)
Clusters and applications continue to grow in size while their mean time between failure (MTBF) is getting smaller. Checkpoint/Restart is becoming increasingly important for large scale parallel jobs. However, the performance of the Checkpoint/Restart mechanism does not scale well with increasing job size due to constraints within the file system. Furthermore, with the advent of multi-core architecture, the situation is aggravated due to the larger number of processes running on the same node, trying to checkpoint simultaneously. This results in an increased number of file writes at the time of checkpointing, which leads to performance degradation. As a result, deployment of Checkpoint/Restart mechanisms for large scale parallel applications is limited. In this work, we explore the Checkpoint/Restart mechanism in MVAPICH2, which uses BLCR as the checkpointing library. Our profiling of the checkpoints for the NAS parallel benchmarks revealed a large number of small file writes interspersed with large writes. Based on these observations, we propose to optimize checkpoint creation by classifying checkpoint file writes into small writes, medium writes and large writes based on the size of the data to write, and use write aggregation to optimize the small and medium writes. At the aggregation threshold of 512KB, the implementation of our design in BLCR shows improvements from 27% to 32% over the original BLCR in terms of time cost to checkpoint an MPI application.
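The write-aggregation idea in the abstract above can be sketched as a small buffering layer. This is an illustration only, not BLCR or MVAPICH2 code: the class and method names are hypothetical, and because the abstract does not give the boundary between small and medium writes, the sketch treats every write below the 512KB threshold as one aggregated class.

```python
import io


class WriteAggregator:
    """Coalesce writes below `threshold` bytes; pass larger writes straight through."""

    def __init__(self, sink, threshold=512 * 1024):
        self.sink = sink              # any object with a write(bytes) method
        self.threshold = threshold
        self.buffer = bytearray()
        self.sink_writes = 0          # number of writes that reached the sink

    def _emit(self, data):
        self.sink.write(data)
        self.sink_writes += 1

    def flush(self):
        """Write out whatever has been aggregated so far."""
        if self.buffer:
            self._emit(bytes(self.buffer))
            self.buffer.clear()

    def write(self, data):
        if len(data) >= self.threshold:
            # Large write: flush first so the byte order in the file is preserved.
            self.flush()
            self._emit(data)
        else:
            # Small or medium write: aggregate until the threshold is reached.
            self.buffer += data
            if len(self.buffer) >= self.threshold:
                self.flush()


# Toy run with a 16-byte threshold so the behaviour is easy to follow.
sink = io.BytesIO()
agg = WriteAggregator(sink, threshold=16)
for chunk in (b"a" * 4, b"b" * 4, b"c" * 32, b"d" * 4):
    agg.write(chunk)
agg.flush()
```

In the toy run, four application writes reach the sink as three writes, with the two leading small chunks merged into one.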
Fast Checkpointing by Write Aggregation with Dynamic Buffer and Interleaving on Multicore Architecture
Cited by 1 (1 self)
Large scale compute clusters continue to grow to ever-increasing proportions. However, as clusters and applications continue to grow, the Mean Time Between Failures (MTBF) has reduced from days to hours. As a result, fault tolerance within the cluster has become imperative. MPI, the de-facto standard for parallel programming, is widely used on such large clusters. Many MPI implementations use Checkpoint/Restart schemes using the Berkeley Lab Checkpoint Restart (BLCR) Library to achieve some level of fault tolerance. However, the performance of the Checkpoint/Restart mechanism does not scale well with increasing job size. As a result, the deployment of Checkpoint/Restart mechanisms for large scale parallel applications is compromised. In our previous work, we proposed a technique to aggregate certain categories of checkpoint writes to reduce the checkpointing overhead. However, an application still experiences slow checkpoint writing because it is blocked waiting for its checkpoint file writes to complete. In this paper, we propose the Write Aggregation with Dynamic Buffer and Interleaving scheme to reduce the overhead related to checkpoint creation. By aggregating all checkpoint writes into a dynamic buffer pool and overlapping the application progress with the file writes, our algorithm is able to significantly reduce checkpoint creation overhead. In the experiments using 64 processor cores, our design demonstrates a speedup of 2.62 times in terms of checkpoint creation time when compared to the original BLCR design. Our scheme also reduces the impact of checkpointing on the application execution time from 20% to 6% when 3 checkpoints are taken during an application run.
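The overlap of application progress with file writes described in the abstract above can be sketched with a background writer thread. This is a minimal illustration, not the paper's scheme: the class name is hypothetical, and an unbounded in-memory queue stands in for the dynamic buffer pool, whose sizing policy the abstract does not describe.

```python
import io
import queue
import threading


class AsyncCheckpointWriter:
    """Queue checkpoint data in memory and write it from a background thread."""

    def __init__(self, sink):
        self.sink = sink                      # any object with a write(bytes) method
        self.pending = queue.Queue()          # stand-in for the dynamic buffer pool
        self.worker = threading.Thread(target=self._drain, daemon=True)
        self.worker.start()

    def _drain(self):
        # A single worker drains the queue in FIFO order, so byte order is preserved.
        while True:
            chunk = self.pending.get()
            if chunk is None:                 # sentinel: no more data
                break
            self.sink.write(chunk)

    def write(self, data):
        """Return immediately; the caller is not blocked on the file write."""
        self.pending.put(bytes(data))

    def close(self):
        """Wait until everything queued so far has been written."""
        self.pending.put(None)
        self.worker.join()


sink = io.BytesIO()
writer = AsyncCheckpointWriter(sink)
writer.write(b"ab")
writer.write(b"cd")
writer.close()
```

The application thread only pays for copying data into the queue; the cost of the actual write is moved to the worker, at the price of extra memory while data is pending.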
A Tunable Holistic Resiliency Approach for High-Performance Computing Systems
Cited by 1 (0 self)
In order to address anticipated high failure rates, resiliency characteristics have become an urgent priority for next-generation extreme-scale high-performance computing (HPC) systems. This poster describes our past and ongoing efforts in novel fault resilience technologies for HPC. Presented work includes proactive fault resilience techniques, system and application reliability models and analyses, failure prediction, transparent process- and virtual-machine-level migration, and trade-off models for combining preemptive migration with checkpoint/restart. This poster summarizes our work and puts all individual technologies into context with a proposed holistic fault resilience framework.
RDMA-Based Job Migration Framework for MPI over InfiniBand
Cited by 1 (0 self)
Coordinated checkpoint and recovery is a common approach to achieve fault tolerance on large-scale systems. The traditional mechanism dumps the process image to a local disk or a central storage area of all the processes involved in the parallel job. When a failure occurs, the processes are restarted and restored to the latest checkpoint image. However, this kind of approach is unable to provide the scalability required by increasingly large-sized jobs, since it puts a heavy I/O burden on the storage subsystem, and resubmitting a job during the restart phase incurs a lengthy queuing delay. In this paper, we enhance the fault tolerance of MVAPICH2 [1], an open-source high performance MPI-2 implementation, by using a proactive job migration scheme. Instead of checkpointing all the processes of the job and saving their process images to stable storage, we transfer the processes running on a health-deteriorating node to a healthy spare node, and resume these processes from the spare node. RDMA-based process image transmission is designed to take advantage of high performance communication in InfiniBand. Experimental results show that the Job Migration scheme can achieve a speedup of 4.49 times over the Checkpoint/Restart scheme to handle a node failure for a 64-process application running on 8 compute nodes. To the best of our knowledge, this is the first such job migration design for InfiniBand-based clusters.
A Study of Dynamic Meta-Learning for Failure Prediction in Large-Scale Systems
Despite years of study on failure prediction, it remains an open problem, especially in large-scale systems composed of a vast number of components. In this paper, we present a dynamic meta-learning framework for failure prediction. It intends to not only provide reasonable prediction accuracy, but also be of practical use in realistic environments. Two key techniques are developed to address technical challenges of failure prediction. One is meta-learning to boost prediction accuracy by combining the benefits of multiple predictive techniques. The other is a dynamic approach to obtain failure patterns from a changing training set and to extract effective rules by actively monitoring prediction accuracy at runtime. We demonstrate the effectiveness and practical use of this framework by means of real system logs collected from the production Blue Gene/L systems at Argonne National Laboratory and San Diego Supercomputer Center. Our case studies indicate that the proposed mechanism can provide reasonable prediction accuracy by forecasting up to 82% of the failures, with a runtime overhead of less than 1.0 minute.
WANG, CHAO. Transparent Fault Tolerance for Job Healing in HPC Environments.
(Under the direction of Associate Professor Frank Mueller). As the number of nodes in high-performance computing environments keeps increasing, faults are becoming commonplace, causing losses in intermediate results of HPC jobs. Furthermore, storage systems providing job input data have been shown to consistently rank as the primary source of system failures leading to data unavailability and job resubmissions. This dissertation presents a combination of multiple fault tolerance techniques that realize significant advances in fault resilience of HPC jobs. The efforts encompass two broad areas. First, at the job level, novel, scalable mechanisms are built in support of proactive FT and to significantly enhance reactive FT. The contributions of this dissertation in this area are (1) a transparent job pause mechanism, which allows a job to pause when a process fails and prevents it from having to re-enter the job queue; (2) a proactive fault-tolerant approach that combines process-level live migration with health monitoring to complement reactive with proactive FT and to reduce the number of checkpoints when a majority of the
Dr. D.K. Panda, Adviser
There has been an unprecedented increase in the number of large scale computing clusters in the recent past. The advent of multi-core processors and high speed interconnects such as InfiniBand, which provide excellent performance at a reasonable cost, has contributed to this growth. However, the failure rate on these clusters has increased due to the increase in the number and scale of components. Thus, it has become vital for such systems to have fault tolerance capabilities. Checkpoint/Restart and Job Migration are commonly used techniques for fault tolerance through failure recovery in computing clusters. Since MPI is the de-facto standard for parallel programming, it is an excellent candidate where these fault tolerance features can be implemented without exposing the complexity of the implementation to end user applications. Furthermore, failure detection and propagation of the fault information is an equally important topic of research in large peta-scale clusters. In this thesis, we propose a design for a Checkpoint/Restart framework for

