Results 1 - 10 of 82
Parallel I/O Performance for Application-Level Checkpointing on the Blue Gene/P System
- Workshop on Interfaces and Architectures for Scientific Data Storage (IASDS), in conjunction with IEEE International Conference on Cluster Computing (Cluster), 26–30
, 2011
"... Abstract—As the number of processors increases to hundreds of thousands in parallel computer architectures, the failure probability rises correspondingly, making fault tolerance a highly important and challenging task. Application-level checkpointing is one of the most popular techniques to proactiv ..."
Cited by 3 (1 self)
massively parallel system. In this paper, we examine application-level checkpointing for a massively parallel electromagnetic solver system called NekCEM on the IBM Blue Gene/P at Argonne National Laboratory. We discuss an application-level, two-phase I/O approach, called “reduced-blocking I/O” (rb
Dynamic Malleability in MPI Applications
- In Seventh IEEE International Symposium on Cluster Computing and the Grid (CCGrid 2007)
, 2007
"... Malleability enables a parallel application’s execution system to split or merge processes modifying the parallel application’s granularity. While process migration is widely used to adapt applications to dynamic execution environments, it is limited by the granularity of the application’s processes ..."
Cited by 3 (3 self)
processes. Malleability empowers process migration by allowing the application’s processes to expand or shrink following the availability of resources. We have implemented malleability as an extension to the PCM (Process Checkpointing and Migration) library, a user-level library for iterative MPI
Dynamic Malleability in Iterative MPI Applications
"... Malleability enables a parallel application’s execution system to split or merge processes modifying granularity. While process migration is widely used to adapt applications to dynamic execution environments, it is limited by the granularity of the application’s processes. Malleability empowers pro ..."
Cited by 9 (1 self)
process migration by allowing the application’s processes to expand or shrink following the availability of resources. We have implemented malleability as an extension to the PCM (Process Checkpointing and Migration) library, a user-level library for iterative MPI applications. PCM is integrated
A Middleware Framework for Dynamically Reconfigurable MPI Applications
"... Computational grids are characterized by their dynamic, non-dedicated, and heterogeneous nature. Novel application-level and middleware-level techniques are needed to allow applications to reconfigure themselves and adapt automatically to their underlying execution environments to be able to benefit ..."
Cited by 1 (0 self)
to benefit from computational grids’ resources. In this paper, we introduce a new software framework that enhances the Message Passing Interface (MPI) performance through process checkpointing, migration, and an adaptive middleware for load balancing. Fields as diverse as fluid dynamics, material science
DMTCP: Transparent Checkpointing for Cluster Computations and the Desktop
"... DMTCP (Distributed MultiThreaded CheckPointing) is a transparent user-level checkpointing package for distributed applications. Checkpointing and restart is demonstrated for a wide range of over 20 well-known applications, including MATLAB, Python, TightVNC, MPICH2, OpenMPI, and runCMS. RunCMS runs ..."
Cited by 44 (8 self)
Improving MPI Independent Write Performance Using A Two-Stage Write-Behind Buffering Method
"... Many large-scale production applications often have very long execution times and require periodic data checkpoints in order to save the state of the computation for program restart and/or tracing application progress. These write-only operations often dominate the overall application runtime, whic ..."
Cited by 1 (0 self)
, which makes them a good optimization target. Existing approaches for write-behind data buffering at the MPI I/O level have been proposed, but challenges still exist for addressing system-level I/O issues. We propose a two-stage write-behind buffering scheme for handling checkpoint operations. The first
Proactive Process-Level Live Migration and Back Migration in HPC Environments
"... As the number of nodes in high-performance computing environments keeps increasing, faults are becoming commonplace. Reactive fault tolerance (FT) often does not scale due to massive I/O requirements and relies on manual job resubmission. This work complements reactive with proactive FT at the proc ..."
Cited by 37 (11 self)
at the process level. Through health monitoring, a subset of node failures can be anticipated when a node’s health deteriorates. A novel process-level live migration mechanism supports continued execution of applications during much of process migration. This scheme is integrated into an MPI execution environment
User-Level Socket-Based Checkpointing for Distributed and Parallel Computation
, 2009
"... We present a preliminary description of a user-level checkpointing package, DMTCP, for Linux. The socket-based approach presents a novel method for checkpointing distributed processes. This includes checkpointing of any dynamically created POSIX threads and forked child processes. It also includes c ..."
checkpointing of remotely spawned processes via ssh and other mechanisms. As with all user-level checkpointing, no modification of the kernel is needed, and the application code is not modified. The package also checkpoints signal handlers, ordinary file descriptors, socket descriptors, and certain other types
DMTCP: Scalable User-Level Transparent Checkpointing for Cluster Computations
, 2008
"... As the size of clusters increases, failures are becoming increasingly frequent. Applications must become fault tolerant if they are to run for extended periods of time. We present DMTCP (Distributed MultiThreaded CheckPointing), the first user-level distributed checkpointing package not dependent on ..."
Cited by 5 (2 self)
on a specific message passing library. This contrasts with existing approaches either specific to libraries such as MPI or requiring kernel modification. DMTCP provides fault tolerance through checkpointing. DMTCP transparently checkpoints general cluster computations consisting of many nodes
BlobCR: Efficient Checkpoint-Restart for HPC Applications on IaaS Clouds Using Virtual Disk Image Snapshots
- in SC ’11: 24th International Conference for High Performance Computing, Networking, Storage and Analysis
, 2011
"... Infrastructure-as-a-Service (IaaS) cloud computing is gaining significant interest in industry and academia as an alternative platform for running scientific applications. Given the dynamic nature of IaaS clouds and the long runtime and resource utilization of such applications, an efficient checkpo ..."
Cited by 24 (7 self)
-snapshotting and multi-deployment inside checkpoint-restart protocols running at guest level in order to efficiently capture and potentially roll back the complete state of the application, including file system modifications. Experiments on the G5K testbed show substantial improvement for MPI applications over existing