Practical Model Checking Method for Verifying Correctness of MPI Programs
- EuroPVM/MPI, 2007
Cited by 12 (6 self)
Formal verification of programs often requires creating a model of the program and running it through a model-checking tool. However, this model-creation step is itself error-prone, tedious, and difficult to do for someone not familiar with formal verification. In this paper, we describe a tool for verifying correctness of MPI programs that does not require the creation of a model and instead works directly on the MPI program. Such a tool is useful in the hands of average MPI programmers. Our tool uses the MPI profiling interface, PMPI, to trap MPI calls and hand over control of the MPI function execution to a scheduler. The scheduler verifies correctness of the program by executing all “relevant” interleavings of the program. The scheduler records an initial trace and replays its interleaving variants by using dynamic partial-order reduction. We describe the design and implementation of the tool and compare it with our previous work based on model checking.
Group-based coordinated checkpointing for MPI: A case study on InfiniBand
- International Conference on Parallel Processing (ICPP 2007), 2007
Cited by 8 (2 self)
As more and more clusters with thousands of nodes are being deployed for high performance computing (HPC), fault tolerance in cluster environments has become a critical requirement. Checkpointing and rollback recovery is a common approach to achieve fault tolerance. Although widely adopted in practice, coordinated checkpointing has a known limitation on scalability. Severe contention for bandwidth to the storage system can occur as a large number of processes take a checkpoint at the same time, resulting in an extremely long checkpointing delay for large parallel applications. In this paper, we propose a novel group-based checkpointing design to alleviate this scalability limitation. By carefully scheduling the MPI processes to take checkpoints in smaller groups, our design reduces the number of processes simultaneously taking checkpoints, while allowing those processes not taking checkpoints to proceed with computation. We implement our design and carry out a detailed evaluation with micro-benchmarks, HPL, and the parallel version of a data mining toolkit, MotifMiner. Experimental results show our group-based checkpointing design can reduce the effective delay for checkpointing significantly, up to 78% for HPL and up to 70% for MotifMiner.
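The group scheduling idea in this abstract can be illustrated with a small sketch. It splits MPI ranks into fixed-size groups and lists, for each round, which ranks write a checkpoint and which keep computing. The function names and the contiguous, fixed-size grouping are assumptions made for illustration; the abstract does not say how the paper forms or orders its groups.

```python
def make_groups(num_ranks, group_size):
    """Split ranks 0..num_ranks-1 into contiguous groups of at most group_size.

    Contiguous grouping is an illustrative assumption, not the paper's policy.
    """
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    return [
        list(range(start, min(start + group_size, num_ranks)))
        for start in range(0, num_ranks, group_size)
    ]


def checkpoint_schedule(num_ranks, group_size):
    """Yield (round, checkpointing_ranks, computing_ranks), one group per round."""
    all_ranks = set(range(num_ranks))
    for step, group in enumerate(make_groups(num_ranks, group_size)):
        # Only this group writes to storage; every other rank keeps computing.
        yield step, group, sorted(all_ranks - set(group))


if __name__ == "__main__":
    for step, writers, computing in checkpoint_schedule(8, 3):
        print(f"round {step}: checkpoint {writers}, compute {computing}")
```

With 8 ranks and a group size of 3, at most 3 ranks contend for storage bandwidth in any round, instead of all 8 at once.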
Interconnect agnostic checkpoint/restart in Open MPI
- Proceedings of the 18th ACM International Symposium on High Performance Distributed Computing (HPDC), 2009
Cited by 3 (0 self)
Long-running High Performance Computing (HPC) applications at scale must be able to tolerate inevitable faults if they are to harness current and future HPC systems. Message Passing Interface (MPI) level transparent checkpoint/restart fault tolerance is an appealing option to HPC application developers who do not wish to restructure their code. Historically, MPI implementations that provided this option have struggled to provide a full range of interconnect support, especially shared memory support. This paper presents a new approach for implementing checkpoint/restart coordination algorithms that allows the MPI implementation of checkpoint/restart to be interconnect agnostic. This approach allows an application to be checkpointed on one set of interconnects (e.g., InfiniBand and shared memory) and be restarted with a different set of interconnects (e.g., Myrinet and shared memory or Ethernet). By separating the network interconnect details from the checkpoint/restart coordination algorithm, we allow the HPC application to respond to changes in the cluster environment such as interconnect unavailability due to switch failure, re-load balance on an existing machine, or migrate to a different machine with a different set of interconnects. We present results characterizing the performance impact of this approach on HPC applications.
Accelerating checkpoint operation by node-level write aggregation on multicore systems
- 2009
Cited by 2 (2 self)
Clusters and applications continue to grow in size while their mean time between failures (MTBF) is getting smaller. Checkpoint/Restart is becoming increasingly important for large scale parallel jobs. However, the performance of the Checkpoint/Restart mechanism does not scale well with increasing job size due to constraints within the file system. Furthermore, with the advent of multi-core architecture, the situation is aggravated by the larger number of processes running on the same node and trying to checkpoint simultaneously. This results in an increased number of file writes at the time of checkpointing, which leads to performance degradation. As a result, deployment of Checkpoint/Restart mechanisms for large scale parallel applications is limited. In this work, we explore the Checkpoint/Restart mechanism in MVAPICH2, which uses BLCR as the checkpointing library. Our profiling of the checkpoints for the NAS parallel benchmarks revealed a large number of small file writes interspersed with large writes. Based on these observations, we propose to optimize checkpoint creation by classifying checkpoint file writes into small, medium, and large writes according to the size of the data to write, and to use write aggregation to optimize the small and medium writes. At the aggregation threshold of 512KB, the implementation of our design in BLCR shows improvements from 27% to 32% over the original BLCR in terms of time cost to checkpoint an MPI application.
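The classification and aggregation step described in this abstract might look roughly like the sketch below. Only the 512KB aggregation threshold comes from the abstract. The small/medium cutoffs (`SMALL_MAX`, `MEDIUM_MAX`), the `AggregatingWriter` class, and the choice to buffer small and medium writes in the same way are assumptions for illustration, and none of this reflects BLCR's actual interfaces.

```python
import io

SMALL_MAX = 4 * 1024                 # assumed cutoff for "small" writes
MEDIUM_MAX = 64 * 1024               # assumed cutoff for "medium" writes
AGGREGATION_THRESHOLD = 512 * 1024   # threshold quoted in the abstract


def classify(size):
    """Label a write by its size in bytes (cutoffs are illustrative)."""
    if size <= SMALL_MAX:
        return "small"
    if size <= MEDIUM_MAX:
        return "medium"
    return "large"


class AggregatingWriter:
    """Buffer small and medium writes; pass large writes straight through."""

    def __init__(self, sink, threshold=AGGREGATION_THRESHOLD):
        self.sink = sink
        self.threshold = threshold
        self.buffer = bytearray()
        self.sink_writes = 0  # number of writes that actually reach the sink

    def write(self, data):
        if classify(len(data)) == "large":
            self.flush()          # keep the byte order of the checkpoint file
            self._emit(data)
        else:
            self.buffer += data
            if len(self.buffer) >= self.threshold:
                self.flush()

    def flush(self):
        if self.buffer:
            self._emit(bytes(self.buffer))
            self.buffer.clear()

    def _emit(self, data):
        self.sink.write(data)
        self.sink_writes += 1


if __name__ == "__main__":
    sink = io.BytesIO()
    writer = AggregatingWriter(sink)
    for chunk in [b"a" * 100] * 50 + [b"b" * (128 * 1024)] + [b"c" * 2000] * 10:
        writer.write(chunk)
    writer.flush()
    print(writer.sink_writes, "sink writes for 61 application writes")
```

Flushing the buffer before each large write keeps the output byte-for-byte identical to the unaggregated stream, at the cost of some extra flushes.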
Application-Level Checkpointing Techniques for Parallel Programs
Cited by 1 (0 self)
In its simplest form, checkpointing is the act of saving a program’s computation state in a form external to the running program, e.g., the computation state is saved to a filesystem. The checkpoint files can then be used to resume computation upon failure of the original process(es), hopefully with minimal loss of computing work. A checkpoint can be taken using a variety of techniques at every level of the system, from utilizing special hardware/architectural checkpointing features through modification of the user’s source code. This survey will discuss the various techniques used in application-level checkpointing, with special attention being paid to techniques for checkpointing parallel and distributed applications.
Fast Checkpointing by Write Aggregation with Dynamic Buffer and Interleaving on Multicore Architecture
Cited by 1 (1 self)
Large scale compute clusters continue to grow to ever-increasing proportions. However, as clusters and applications continue to grow, the Mean Time Between Failures (MTBF) has reduced from days to hours. As a result, fault tolerance within the cluster has become imperative. MPI, the de facto standard for parallel programming, is widely used on such large clusters. Many MPI implementations use Checkpoint/Restart schemes using the Berkeley Lab Checkpoint Restart (BLCR) Library to achieve some level of fault tolerance. However, the performance of the Checkpoint/Restart mechanism does not scale well with increasing job size. As a result, the deployment of Checkpoint/Restart mechanisms for large scale parallel applications is compromised. In our previous work, we proposed a technique to aggregate certain categories of checkpoint writes to reduce the checkpointing overhead. However, an application still experiences slow checkpoint writing because it is blocked waiting for its checkpoint file writes to complete. In this paper, we propose the Write Aggregation with Dynamic Buffer and Interleaving scheme to reduce the overhead related to checkpoint creation. By aggregating all checkpoint writes into a dynamic buffer pool and overlapping the application progress with the file writes, our algorithm is able to significantly reduce checkpoint creation overhead. In the experiments using 64 processor cores, our design demonstrates a speedup of 2.62 times in terms of checkpoint creation time when compared to the original BLCR design. Our scheme also reduces the impact of checkpointing on the application execution time from 20% to 6% when 3 checkpoints are taken during an application run.
RDMA-Based Job Migration Framework for MPI over InfiniBand
Cited by 1 (0 self)
Coordinated checkpoint and recovery is a common approach to achieve fault tolerance on large-scale systems. The traditional mechanism dumps the process image of every process in the parallel job to a local disk or a central storage area. When a failure occurs, the processes are restarted and restored to the latest checkpoint image. However, this kind of approach is unable to provide the scalability required by increasingly large-sized jobs, since it puts a heavy I/O burden on the storage subsystem, and resubmitting a job during the restart phase incurs a lengthy queuing delay. In this paper, we enhance the fault tolerance of MVAPICH2 [1], an open-source high performance MPI-2 implementation, by using a proactive job migration scheme. Instead of checkpointing all the processes of the job and saving their process images to stable storage, we transfer the processes running on a health-deteriorating node to a healthy spare node, and resume these processes from the spare node. RDMA-based process image transmission is designed to take advantage of high performance communication in InfiniBand. Experimental results show that the Job Migration scheme can achieve a speedup of 4.49 times over the Checkpoint/Restart scheme to handle a node failure for a 64-process application running on 8 compute nodes. To the best of our knowledge, this is the first such job migration design for InfiniBand-based clusters.
Selective Recovery From Failures In A Task Parallel Programming Model
We present a fault tolerant task pool execution environment that is capable of performing fine-grain selective restart using a lightweight, distributed task completion tracking mechanism. Compared with conventional checkpoint/restart techniques, this system offers a recovery penalty that is proportional to the degree of failure rather than the system size. We evaluate this system using the Self Consistent Field (SCF) kernel which forms an important component in ab initio methods for computational chemistry. Experimental results indicate that fault tolerant task pools are robust in the presence of an arbitrary number of failures and that they offer low overhead in the absence of faults.
Keywords: Parallel processing, fault tolerance, task parallelism, Global Arrays, PGAS, selective recovery
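The selective-restart idea in this abstract can be shown with a toy, single-process simulation. It tracks which worker holds each task's result and, after a failure, re-executes only the tasks whose results were lost. The round-robin assignment, the function names, and the in-memory "tracker" dictionary are assumptions for illustration. They are not the paper's distributed tracking mechanism and do not use Global Arrays.

```python
def assign_round_robin(task_ids, workers):
    """Map each task id to a worker in round-robin order (illustrative policy)."""
    return {tid: workers[i % len(workers)] for i, tid in enumerate(task_ids)}


def run_with_selective_recovery(tasks, workers, failed):
    """Run all tasks, simulate the loss of `failed` workers, redo only lost tasks.

    tasks:   dict mapping task id -> zero-argument callable
    workers: list of worker names
    failed:  set of worker names whose results are lost
    Returns (results_by_task_id, ids_of_reexecuted_tasks).
    """
    survivors = [w for w in workers if w not in failed]
    if not survivors:
        raise RuntimeError("no surviving workers to recover on")

    # First pass: each task runs on its owner; the tracker records
    # (worker holding the result, result value) per task.
    owner = assign_round_robin(sorted(tasks), workers)
    tracker = {tid: (owner[tid], tasks[tid]()) for tid in tasks}

    # Failure: results held by failed workers are gone.
    lost = sorted(tid for tid, (w, _) in tracker.items() if w in failed)

    # Selective restart: re-execute only the lost tasks on surviving workers.
    new_owner = assign_round_robin(lost, survivors)
    for tid in lost:
        tracker[tid] = (new_owner[tid], tasks[tid]())

    return {tid: value for tid, (_, value) in tracker.items()}, lost


if __name__ == "__main__":
    tasks = {i: (lambda i=i: i * i) for i in range(6)}
    values, redone = run_with_selective_recovery(tasks, ["w0", "w1", "w2"], {"w1"})
    print("re-executed:", redone)
```

In this toy setting the number of re-executed tasks depends on how many workers failed, not on the total number of workers, which is the property the abstract highlights.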
Enhancing Checkpoint Performance with Staging IO and SSD
- 2010 International Workshop on Storage Network Architecture and Parallel I/Os
With the ever-growing size of computer clusters and applications, system failures are becoming inevitable. Checkpointing, a strategy to ensure fault tolerance, has become imperative in such an environment. However, the existing mechanism of writing checkpoints to parallel file systems doesn’t perform well with increasing job size. Solid State Disk (SSD) is attracting more and more attention due to its technical merits such as good random access performance, low power consumption and shock resistance. However, how to apply SSDs in a parallel storage system to improve checkpoint writing still remains an open question. In this paper we propose a new strategy to enhance checkpoint writing performance by aggregating checkpoint writing at the client side, and utilizing staging IO on data servers. We also explore the potential of substituting traditional hard disks with SSDs on the data servers to achieve better write bandwidth. Our strategy achieves up to 6.3 times higher write bandwidth than a popular parallel file system, PVFS2 [6], with 8 client nodes and 4 data servers. In experiments with real applications using 64 application processes and 4 data servers, our strategy can accelerate checkpoint writing by up to 9.9 times compared to PVFS2.

