Results 1 - 9 of 9
Proactive Process-Level Live Migration and Back Migration in HPC Environments
"... As the number of nodes in high-performance computing environments keeps increasing, faults are becoming common place. Reactive fault tolerance (FT) often does not scale due to massive I/O requirements and relies on manual job resubmission. This work complements reactive with proactive FT at the proc ..."
Abstract
-
Cited by 37 (11 self)
- Add to MetaCart
(Show Context)
As the number of nodes in high-performance computing environments keeps increasing, faults are becoming commonplace. Reactive fault tolerance (FT) often does not scale due to massive I/O requirements and relies on manual job resubmission. This work complements reactive with proactive FT at the process level. Through health monitoring, a subset of node failures can be anticipated when a node's health deteriorates. A novel process-level live migration mechanism supports continued execution of applications during much of the migration of their processes. This scheme is integrated into an MPI execution environment to transparently sustain health-related node failures, which eradicates the need to restart and requeue MPI jobs. Experiments indicate that 1-6.5 seconds of prior warning are required to successfully trigger live process migration, while similar operating-system virtualization mechanisms require 13-24 seconds. This self-healing approach complements reactive FT by nearly cutting the number of checkpoints in half when 70% of the faults are handled proactively. The work also provides a novel back-migration approach to eliminate load imbalance or bottlenecks caused by migrated tasks. Experiments indicate that the larger the amount of outstanding execution, the greater the benefit of back migration.
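To make the trigger mechanism concrete, here is a minimal sketch of a health-monitoring loop that initiates live migration once a metric crosses a warning threshold. The metrics, thresholds, and the `migrate_processes` placeholder are illustrative assumptions, not the actual monitoring or migration code used in this work.

```python
# Minimal sketch of a health-triggered proactive migration policy.
# Thresholds and the simulated sensor are illustrative only.

import random
import time

# Hypothetical warning thresholds; a real deployment would read IPMI/lm_sensors data.
TEMP_THRESHOLD_C = 85.0
ECC_ERRORS_THRESHOLD = 10

def read_health():
    """Stand-in for a real health monitor."""
    return {
        "cpu_temp_c": 70.0 + random.uniform(0.0, 20.0),
        "ecc_errors": random.randint(0, 12),
    }

def failing(health):
    return (health["cpu_temp_c"] > TEMP_THRESHOLD_C or
            health["ecc_errors"] > ECC_ERRORS_THRESHOLD)

def migrate_processes(node):
    # Placeholder for the actual live-migration mechanism.
    print(f"[{node}] health deteriorating: live-migrating MPI processes to a spare node")

def monitor(node="node42", poll_interval_s=1.0, rounds=10):
    for _ in range(rounds):
        if failing(read_health()):
            # Trigger while the node is still healthy enough to complete
            # the migration within the few seconds of warning it needs.
            migrate_processes(node)
            return
        time.sleep(poll_interval_s)
    print(f"[{node}] healthy for the whole monitoring window")

if __name__ == "__main__":
    monitor()
```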
A job pause service under LAM/MPI+BLCR for transparent fault tolerance
- In International Parallel and Distributed Processing Symposium, 2007
"... Checkpoint/restart (C/R) has become a requirement for long-running jobs in large-scale clusters due to a meantime-to-failure (MTTF) in the order of hours. After a failure, C/R mechanisms generally require a complete restart of an MPI job from the last checkpoint. A complete restart, however, is unne ..."
Abstract
-
Cited by 34 (9 self)
- Add to MetaCart
(Show Context)
Checkpoint/restart (C/R) has become a requirement for long-running jobs in large-scale clusters due to a mean-time-to-failure (MTTF) on the order of hours. After a failure, C/R mechanisms generally require a complete restart of an MPI job from the last checkpoint. A complete restart, however, is unnecessary since all but one node are typically still alive. Furthermore, a restart may result in lengthy job requeuing even though the original job had not exceeded its time quantum. In this paper, we overcome these shortcomings. Instead of job restart, we have developed a transparent mechanism for job pause within LAM/MPI+BLCR. This mechanism allows live nodes to remain active and roll back to the last checkpoint while failed nodes are dynamically replaced by spares before resuming from the last checkpoint. Our methodology includes LAM/MPI enhancements in support of scalable group communication with a fluctuating number of nodes, reuse of network connections, transparent coordinated checkpoint scheduling, and a BLCR enhancement for job pause. Experiments in a cluster with the NAS Parallel Benchmark suite show that our overhead for job pause is comparable to that of a complete job restart. A minimal overhead of 5.6% is incurred only when migration takes place, while the regular checkpoint overhead remains unchanged. Yet, our approach alleviates the need to reboot the LAM run-time environment, which accounts for considerable overhead and results in net savings for our scheme in the experiments. Our solution further provides full transparency and automation with the additional benefit of reusing existing resources. Execution continues after failures within the scheduled job, i.e., the application staging overhead is not incurred again, in contrast to a restart. Our scheme offers additional potential for savings through incremental checkpointing and proactive diskless live migration, which we are currently working on.
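The cost argument in this abstract can be illustrated with a simple, purely hypothetical model of the two recovery paths: both pay for rolling back to the last checkpoint, but only a full restart also pays for requeuing, rebooting the LAM runtime, and re-staging the application. The numbers below are placeholders, not measurements from the paper.

```python
# Back-of-the-envelope comparison of a full job restart vs. the job-pause
# approach. All costs are illustrative placeholders.

def recovery_cost(rollback_s, restart_path=True,
                  requeue_s=0.0, runtime_reboot_s=0.0, staging_s=0.0,
                  pause_migration_s=0.0):
    """Both paths roll back to the last checkpoint; only a full restart also
    pays for requeuing, rebooting the runtime, and re-staging the job."""
    if restart_path:
        return rollback_s + requeue_s + runtime_reboot_s + staging_s
    # Job pause: live nodes stay up; a spare replaces the failed node.
    return rollback_s + pause_migration_s

full_restart = recovery_cost(rollback_s=120, restart_path=True,
                             requeue_s=3600, runtime_reboot_s=60, staging_s=300)
job_pause = recovery_cost(rollback_s=120, restart_path=False,
                          pause_migration_s=10)
print(f"full restart: {full_restart:.0f} s, job pause: {job_pause:.0f} s")
```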
Load balancing in the bulk-synchronous-parallel setting using process migrations
- In IEEE International Parallel and Distributed Processing Symposium, 2007
"... The Paderborn University BSP (PUB) library is a powerful C library that supports the development of bulk synchronous parallel programs for various parallel machines. To utilize idle times on workstations for parallel computations, we implement virtual processors using processes. These processes can ..."
Abstract
-
Cited by 2 (0 self)
- Add to MetaCart
(Show Context)
The Paderborn University BSP (PUB) library is a powerful C library that supports the development of bulk synchronous parallel programs for various parallel machines. To utilize idle times on workstations for parallel computations, we implement virtual processors using processes. These processes can be migrated to other hosts when the load of the machines changes. In this paper we describe the implementation for a Linux workstation cluster. We focus on process migration and present initial benchmarking results.
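For illustration, the sketch below shows one way a migration decision of this kind could be made at a superstep barrier: virtual processors leave hosts whose external load has risen and move to the least-loaded host. The policy, thresholds, and load figures are assumptions for the example, not PUB's actual algorithm.

```python
# Sketch of a migration policy applied at a BSP superstep barrier.

def plan_migrations(vp_to_host, host_load, threshold=1.0):
    """Return a list of (virtual_processor, source_host, target_host) moves."""
    overloaded = [h for h, load in host_load.items() if load > threshold]
    migrations = []
    for vp, host in vp_to_host.items():
        if host in overloaded:
            # Pick the currently least-loaded host as the target.
            target = min(host_load, key=host_load.get)
            if target != host:
                migrations.append((vp, host, target))
                # Account for the moved work so later choices stay balanced.
                host_load[host] -= 0.25
                host_load[target] += 0.25
    return migrations

vp_to_host = {"vp0": "ws1", "vp1": "ws1", "vp2": "ws2", "vp3": "ws3"}
host_load = {"ws1": 1.8, "ws2": 0.2, "ws3": 0.4}  # e.g., 1-minute load averages
for vp, src, dst in plan_migrations(vp_to_host, host_load):
    print(f"migrate {vp}: {src} -> {dst}")
```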
A comprehensive user-level checkpointing strategy . . .
"... As computational clusters increase in size, their mean-time-to-failure reduces drastically. After a failure, most MPI checkpointing solutions require a restart with the same number of nodes. This necessitates the availability of multiple spare nodes, leading to poor resource utilization. Moreover, m ..."
Abstract
-
Cited by 2 (2 self)
- Add to MetaCart
As computational clusters increase in size, their mean-time-to-failure reduces drastically. After a failure, most MPI checkpointing solutions require a restart with the same number of nodes. This necessitates the availability of multiple spare nodes, leading to poor resource utilization. Moreover, most techniques require central storage for checkpoints. This results in a bottleneck and severely limits the scalability of checkpointing. We propose a scalable fault-tolerant MPI based on LAM/MPI that supports user-level checkpointing, migration, and replication. Our contributions extend the existing state of fault-tolerant MPI with asynchronous replication, eliminating the need for central or network storage. We evaluate both centralized storage and SAN-based solutions and show that they are not scalable, particularly beyond 64 CPUs. Our migration strategy is the first to make no assumptions on restart topologies, eliminating the need for spare nodes. We demonstrate the low overhead of our checkpointing and replication scheme with the NAS Parallel Benchmarks and the High Performance LINPACK benchmark with tests up to 256 nodes. We show that checkpointing and replication can be achieved with much lower overhead than current techniques and near transparency to the end user while still providing fault resilience.
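The core idea of replicating checkpoints to peers instead of central storage can be sketched as follows. Here the peer is just another local directory and the partner policy is a simple ring; a real implementation would push the file over the network from inside the MPI runtime. All paths and names are illustrative.

```python
# Sketch of asynchronous checkpoint replication to a peer node instead of
# central storage. The "peer" here is a local directory standing in for a
# remote node.

import shutil
import threading
from pathlib import Path

def replicate_async(checkpoint: Path, peer_dir: Path) -> threading.Thread:
    """Copy the checkpoint to the peer in the background so the application
    can resume computing as soon as the local write completes."""
    peer_dir.mkdir(parents=True, exist_ok=True)

    def _copy():
        shutil.copy2(checkpoint, peer_dir / checkpoint.name)

    t = threading.Thread(target=_copy, daemon=True)
    t.start()
    return t

# Example: rank i replicates to rank (i + 1) % n, so no central server is
# involved and no node holds only its own checkpoint.
local_ckpt = Path("/tmp/rank3.ckpt")
local_ckpt.write_bytes(b"\0" * 1024)          # stand-in for checkpoint output
transfer = replicate_async(local_ckpt, Path("/tmp/peer_of_rank3"))
# ... application continues computing here ...
transfer.join()                               # only needed before the next checkpoint
```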
A scalable asynchronous replication-based strategy for fault tolerant MPI applications
- In Lecture Notes in Computer Science, 2007
"... Abstract. As computational clusters increase in size, their mean-time-to-failure reduces. Typically checkpointing is used to minimize the loss of computation. Most checkpointing techniques, however, require a central storage for storing checkpoints. This severely limits the scalability of checkpoint ..."
Abstract
-
Cited by 2 (1 self)
- Add to MetaCart
(Show Context)
As computational clusters increase in size, their mean-time-to-failure reduces. Typically checkpointing is used to minimize the loss of computation. Most checkpointing techniques, however, require central storage for checkpoints. This severely limits the scalability of checkpointing. We propose a scalable replication-based MPI checkpointing facility that is based on LAM/MPI. We extend the existing state of fault-tolerant MPI with asynchronous replication, eliminating the need for central or network storage. We evaluate centralized storage, SAN-based solutions, and a commercial parallel file system, and show that they are not scalable, particularly beyond 64 CPUs. We demonstrate the low overhead of our replication scheme with the NAS Parallel Benchmarks and the High Performance LINPACK benchmark with tests up to 256 nodes while demonstrating that checkpointing and replication can be achieved with much lower overhead than that provided by current techniques.
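As a side note on the trade-off this abstract motivates, a standard rule of thumb (Young's approximation, not a result from this paper) relates the checkpoint interval to the checkpoint cost C and the system MTTF M as t_opt ≈ sqrt(2CM); a shrinking MTTF therefore forces more frequent checkpoints, which is what makes checkpoint storage scalability critical.

```python
# Young's approximation for the checkpoint interval: t_opt = sqrt(2 * C * M).
# A general rule of thumb, not a result from the paper above.

import math

def young_interval(checkpoint_cost_s: float, mttf_s: float) -> float:
    return math.sqrt(2.0 * checkpoint_cost_s * mttf_s)

# Example: a 60 s checkpoint on a system with a 6 h MTTF.
print(f"optimal interval ~ {young_interval(60, 6 * 3600) / 60:.1f} min")
```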
Dynamic Load Balancing for I/O-Intensive Applications on Clusters
- 2009
"... Load balancing for clusters has been investigated extensively, mainly focusing on the effective usage of global CPU and memory resources. However, previous CPUor memory-centric load balancing schemes suffer significant performance drop under I/O-intensive workloads due to the imbalance of I/O load. ..."
Abstract
-
Cited by 2 (0 self)
- Add to MetaCart
(Show Context)
Load balancing for clusters has been investigated extensively, mainly focusing on the effective usage of global CPU and memory resources. However, previous CPU- or memory-centric load-balancing schemes suffer a significant performance drop under I/O-intensive workloads due to the imbalance of I/O load. To solve this problem, we propose two simple yet effective I/O-aware load-balancing schemes for two types of clusters: (1) homogeneous clusters, where nodes are identical, and (2) heterogeneous clusters, which are composed of a variety of nodes with different performance characteristics in computing power, memory capacity, and disk speed. In addition to assigning I/O-intensive sequential and parallel jobs to nodes with light I/O loads, the proposed schemes judiciously take into account both CPU and memory load sharing in the system. Therefore, our schemes are able to maintain high performance.
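An I/O-aware placement rule of the kind described here can be sketched as picking the node that minimizes a weighted combination of CPU, memory, and disk-I/O utilization. The weights and node statistics below are illustrative assumptions, not the schemes actually proposed in the paper.

```python
# Sketch of an I/O-aware placement rule: a new job goes to the node with the
# lowest weighted combination of CPU, memory, and disk-I/O load.

def composite_load(node, w_cpu=1.0, w_mem=1.0, w_io=2.0):
    # A heavier weight on I/O reflects the I/O-intensive workloads targeted here.
    return (w_cpu * node["cpu_util"] +
            w_mem * node["mem_util"] +
            w_io * node["io_util"])

def place_job(nodes):
    """Return the name of the node that should receive the next job."""
    return min(nodes, key=lambda name: composite_load(nodes[name]))

nodes = {
    "n1": {"cpu_util": 0.9, "mem_util": 0.5, "io_util": 0.1},
    "n2": {"cpu_util": 0.3, "mem_util": 0.4, "io_util": 0.8},
    "n3": {"cpu_util": 0.5, "mem_util": 0.6, "io_util": 0.2},
}
print("assign job to", place_job(nodes))
```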
Transparent Fault Tolerance for Job Healing in HPC Environments.
"... As the number of nodes in high-performance computing environments keeps increasing, faults are becoming common place causing losses in intermediate results of HPC jobs. Furthermore, storage systems providing job input data have been shown to consistently rank as the primary source of system failure ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
As the number of nodes in high-performance computing environments keeps increasing, faults are becoming commonplace, causing losses in intermediate results of HPC jobs. Furthermore, storage systems providing job input data have been shown to consistently rank as the primary source of system failures, leading to data unavailability and job resubmissions. This dissertation presents a combination of multiple fault-tolerance techniques that realize significant advances in fault resilience of HPC jobs. The efforts encompass two broad areas. First, at the job level, novel, scalable mechanisms are built in support of proactive FT and to significantly enhance reactive FT. The contributions of this dissertation in this area are (1) a transparent job pause mechanism, which allows a job to pause when a process fails and prevents it from having to re-enter the job queue; (2) a proactive fault-tolerant approach that combines process-level live migration with health monitoring to complement reactive with proactive FT and to reduce the number of checkpoints when a majority of the faults are handled proactively.
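A rough model (not the dissertation's exact analysis) shows why absorbing a fraction p of failures proactively lets the reactive layer checkpoint less often: the MTTF seen by checkpointing grows by 1/(1 - p), and under Young's rule the number of checkpoints then shrinks by a factor of sqrt(1 - p).

```python
# Back-of-the-envelope model of checkpoint reduction from proactive FT.
# With Young's rule t_opt = sqrt(2*C*M), scaling the MTTF by 1/(1 - p)
# scales the checkpoint count by sqrt(1 - p).

import math

def checkpoint_count_ratio(p_proactive: float) -> float:
    """Checkpoints needed relative to the purely reactive baseline."""
    return math.sqrt(1.0 - p_proactive)

# With 70% of faults handled proactively, roughly 45% fewer checkpoints are
# needed, consistent with "nearly cutting the number of checkpoints in half".
print(f"{checkpoint_count_ratio(0.7):.2f} x the baseline checkpoint count")
```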
Replication-Based Fault-Tolerance for MPI Applications
"... Abstract—As computational clusters increase in size, their mean-time-to-failure reduces drastically. Typically, checkpointing is used to minimize the loss of computation. Most checkpointing techniques, however, require central storage for storing checkpoints. This results in a bottleneck and severel ..."
Abstract
- Add to MetaCart
(Show Context)
As computational clusters increase in size, their mean-time-to-failure reduces drastically. Typically, checkpointing is used to minimize the loss of computation. Most checkpointing techniques, however, require central storage for checkpoints. This results in a bottleneck and severely limits the scalability of checkpointing, while dedicated checkpointing networks and storage systems prove too expensive. We propose a scalable replication-based MPI checkpointing facility. Our reference implementation is based on LAM/MPI; however, it is directly applicable to any MPI implementation. We extend the existing state of fault-tolerant MPI with asynchronous replication, eliminating the need for central or network storage. We evaluate centralized storage, a Sun X4500-based solution, an EMC SAN, and the Ibrix commercial parallel file system and show that they are not scalable, particularly beyond 64 CPUs. We demonstrate the low overhead of our checkpointing and replication scheme with the NAS Parallel Benchmarks and the High Performance LINPACK benchmark with tests up to 256 nodes while demonstrating that checkpointing and replication can be achieved with much lower overhead than that provided by current techniques. Finally, we show that the monetary cost of our solution is as low as 25% of that of a typical SAN/parallel file system-equipped storage system.
Index Terms: fault tolerance, checkpointing, MPI, file systems
Component
- 2009
"... Grid applications have been prone to encountering problems such as failures or malicious attacks during execution in recent years, due to their distributed and large-scale features. The application itself, however, has limited power to address these problems. This paper presents the design, implemen ..."
Abstract
- Add to MetaCart
(Show Context)
Grid applications have been prone to encountering problems such as failures or malicious attacks during execution in recent years, due to their distributed and large-scale features. The application itself, however, has limited power to address these problems. This paper presents the design, implementation, and evaluation of an adaptive framework, Dynasa, which strives to handle security problems using adaptive fault tolerance (i.e., checkpointing and replication) during the execution of applications according to the status of the Grid environment. We evaluate our adaptive framework experimentally using the Grid5000 testbed, and the results demonstrate that Dynasa enables the application itself to handle security problems efficiently. Starting the adaptive component takes less than 1 s, and an adaptive action takes less than 0.1 s with a checkpoint interval of 20 s. Compared with a non-adaptive method, experimental results demonstrate that Dynasa achieves better performance in terms of execution time, network bandwidth consumed, and CPU load, resulting in up to 50% lower overhead.
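An adaptive selection policy of the kind described here can be sketched as a simple mapping from the observed environment status to a protection action. The states, thresholds, and actions below are illustrative assumptions, not Dynasa's actual decision rules.

```python
# Sketch of an adaptive fault-tolerance policy: pick a protection action
# from the current environment status. Thresholds and actions are illustrative.

def choose_action(status):
    """status: observed failure rate (per hour) and an attack indicator."""
    if status.get("under_attack", False):
        # Replicate to an unaffected site so a compromised replica can be dropped.
        return {"action": "replicate", "copies": 2}
    if status.get("failure_rate_per_hour", 0.0) > 0.5:
        # Unstable environment: checkpoint aggressively (short interval).
        return {"action": "checkpoint", "interval_s": 20}
    return {"action": "none"}

print(choose_action({"under_attack": True}))
print(choose_action({"failure_rate_per_hour": 0.8}))
print(choose_action({"failure_rate_per_hour": 0.05}))
```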