Results 1 - 10 of 82

Parallel I/O Performance for Application–Level Checkpointing on the Blue Gene/P

by Jing Fu, Misun Min, Robert Latham, Christopher D. Carothers - System, Workshop on Interfaces and Architectures for Scientific Data Storage (IASDS), in conjunction with IEEE International Conference on Cluster Computing (Cluster), 26–30, 2011
"... Abstract—As the number of processors increases to hundreds of thousands in parallel computer architectures, the failure probability rises correspondingly, making fault tolerance a highly important and challenging task. Application-level checkpointing is one of the most popular techniques to proactiv ..."
Cited by 3 (1 self)
massively parallel system. In this paper, we examine application-level checkpointing for a massively parallel electromagnetic solver system called NekCEM on the IBM Blue Gene/P at Argonne National Laboratory. We discuss an application-level, two-phase I/O approach, called “reduced-blocking I/O” (rb

Dynamic Malleability in MPI Applications

by Kaoutar El Maghraoui, Travis J. Desell, Boleslaw K. Szymanski, Carlos A. Varela - In Seventh IEEE International Symposium on Cluster Computing and the Grid (CCGrid 2007), 2007
"... Malleability enables a parallel application’s execution system to split or merge processes modifying the parallel application’s granularity. While process migration is widely used to adapt applications to dynamic execution environments, it is limited by the granularity of the application’s processes ..."
Cited by 3 (3 self)
processes. Malleability empowers process migration by allowing the application’s processes to expand or shrink following the availability of resources. We have implemented malleability as an extension to the PCM (Process Checkpointing and Migration) library, a user-level library for iterative MPI

Dynamic Malleability in Iterative MPI Applications

by K. El Maghraoui, Travis J. Desell, Boleslaw K. Szymanski, Carlos A. Varela
"... Malleability enables a parallel application’s execution system to split or merge processes modifying granularity. While process migration is widely used to adapt applications to dynamic execution environments, it is limited by the granularity of the application’s processes. Malleability empowers pro ..."
Cited by 9 (1 self)
process migration by allowing the application’s processes to expand or shrink following the availability of resources. We have implemented malleability as an extension to the PCM (Process Checkpointing and Migration) library, a user-level library for iterative MPI applications. PCM is integrated

A Middleware Framework for Dynamically Reconfigurable MPI Applications

by Kaoutar Elmaghraoui, Carlos A. Varela, Boleslaw K. Szymanski, Joseph E. Flaherty, James D. Teresco
"... Computational grids are characterized by their dynamic, non-dedicated, and heterogeneous nature. Novel application-level and middleware-level techniques are needed to allow applications to reconfigure themselves and adapt automatically to their underlying execution environments to be able to benefit ..."
Cited by 1 (0 self)
to benefit from computational grids’ resources. In this paper, we introduce a new software framework that enhances the Message Passing Interface (MPI) performance through process checkpointing, migration, and an adaptive middleware for load balancing. Fields as diverse as fluid dynamics, material science

DMTCP: Transparent Checkpointing for Cluster Computations and the Desktop

by Jason Ansel, Kapil Arya, Gene Cooperman
"... DMTCP (Distributed MultiThreaded CheckPointing) is a transparent user-level checkpointing package for distributed applications. Checkpointing and restart is demonstrated for a wide range of over 20 well known applications, including MATLAB, Python, TightVNC, MPICH2, OpenMPI, and runCMS. RunCMS runs ..."
Cited by 44 (8 self)

Improving MPI Independent Write Performance Using A Two-Stage Write-Behind Buffering Method

by Wei-keng Liao, Avery Ching, Kenin Coloma, Alok Choudhary, Mahmut K
"... Many large-scale production applications often have very long execution times and require periodic data checkpoints in order to save the state of the computation for program restart and/or tracing application progress. These write-only operations often dominate the overall application runtime, whic ..."
Cited by 1 (0 self)
, which makes them a good optimization target. Existing approaches for write-behind data buffering at the MPI I/O level have been proposed, but challenges still exist for addressing system-level I/O issues. We propose a two-stage write-behind buffering scheme for handling checkpoint operations. The first

Proactive Process-Level Live Migration and Back Migration in HPC Environments

by Chao Wang, Frank Mueller, Christian Engelmann, Stephen L. Scott
"... As the number of nodes in high-performance computing environments keeps increasing, faults are becoming commonplace. Reactive fault tolerance (FT) often does not scale due to massive I/O requirements and relies on manual job resubmission. This work complements reactive with proactive FT at the proc ..."
Cited by 37 (11 self)
at the process level. Through health monitoring, a subset of node failures can be anticipated when one’s health deteriorates. A novel process-level live migration mechanism supports continued execution of applications during much of process migration. This scheme is integrated into an MPI execution environment

User-Level Socket-Based Checkpointing for Distributed and Parallel Computation

by Jason Ansel, Michael Rieker, Gene Cooperman, 2009
"... We present a preliminary description of a user-level checkpointing package, DMTCP, for Linux. The socket-based approach presents a novel method for checkpointing distributed processes. This includes checkpointing of any dynamically created POSIX threads and forked child processes. It also includes c ..."
checkpointing of remotely spawned processes via ssh and other mechanisms. As with all user-level checkpointing, no modification of the kernel is needed, and the application code is not modified. The package also checkpoints signal handlers, ordinary file descriptors, socket descriptors, and certain other types

DMTCP: Scalable User-Level Transparent Checkpointing for Cluster Computations

by Jason Ansel, Kapil Arya, Gene Cooperman, 2008
"... As the size of clusters increases, failures are becoming increasingly frequent. Applications must become fault tolerant if they are to run for extended periods of time. We present DMTCP (Distributed MultiThreaded CheckPointing), the first user-level distributed checkpointing package not dependent on ..."
Cited by 5 (2 self)
on a specific message passing library. This contrasts with existing approaches either specific to libraries such as MPI or requiring kernel modification. DMTCP provides fault tolerance through checkpointing. DMTCP transparently checkpoints general cluster computations consisting of many nodes

BlobCR: Efficient Checkpoint-Restart for HPC Applications on IaaS Clouds Using Virtual Disk Image Snapshots

by Bogdan Nicolae - In SC ’11: 24th International Conference for High Performance Computing, Networking, Storage and Analysis, 2011
"... Infrastructure-as-a-Service (IaaS) cloud computing is gaining significant interest in industry and academia as an alternative platform for running scientific applications. Given the dynamic nature of IaaS clouds and the long runtime and resource utilization of such applications, an efficient checkpo ..."
Cited by 24 (7 self)
-snapshotting and multi-deployment inside checkpoint-restart protocols running at guest level in order to efficiently capture and potentially roll back the complete state of the application, including file system modifications. Experiments on the G5K testbed show substantial improvement for MPI applications over existing

Developed at and hosted by The College of Information Sciences and Technology

© 2007-2019 The Pennsylvania State University