The design and implementation of Zap: A system for migrating computing environments
- In Proceedings of the Fifth Symposium on Operating Systems Design and Implementation (OSDI 2002)
, 2002
Cited by 138 (22 self)
We have created Zap, a novel system for transparent migration of legacy and networked applications. Zap provides a thin virtualization layer on top of the operating system that introduces pods, which are groups of processes that are provided a consistent, virtualized view of the system. This decouples processes in pods from dependencies on the host operating system and other processes on the system. By integrating Zap virtualization with a checkpoint-restart mechanism, Zap can migrate a pod of processes as a unit among machines running independent operating systems without leaving behind any residual state after migration. We have implemented a Zap prototype in Linux that supports transparent migration of unmodified applications without any kernel modifications. We demonstrate that our Linux Zap prototype can provide general-purpose process migration functionality with low overhead. Our experimental results for migrating pods used for running a standard user’s X windows desktop computing environment and for running an Apache web server show that these kinds of pods can be migrated with subsecond checkpoint and restart latencies.
The LAM/MPI checkpoint/restart framework: System-initiated checkpointing
- in Proceedings, LACSI Symposium, Santa Fe
, 2003
Cited by 67 (7 self)
As high-performance clusters continue to grow in size and popularity, issues of fault tolerance and reliability are becoming limiting factors on application scalability. To address these issues, we present the design and implementation of a system for providing coordinated checkpointing and rollback recovery for MPI-based parallel applications. Our approach integrates the Berkeley Lab BLCR kernel-level process checkpoint system with the LAM implementation of MPI through a defined checkpoint/restart interface. Checkpointing is transparent to the application, allowing the system to be used for cluster maintenance and scheduling reasons as well as for fault tolerance. Experimental results show negligible communication performance impact due to the incorporation of the checkpoint support capabilities into LAM/MPI.
The design and implementation of Berkeley Lab’s Linux Checkpoint/Restart
, 2003
Cited by 62 (2 self)
Clusters of commodity computers running Linux are becoming an increasingly popular platform for high-performance
The Internet Backplane Protocol: Storage in the Network
, 1999
Cited by 47 (9 self)
For distributed and network applications, efficient management of program state is critical to performance and functionality. To support domain- and application-specific optimization of data movement, we have developed the Internet Backplane Protocol (IBP) for controlling storage that is implemented as part of the network fabric itself. IBP allows an application to control intermediate data staging operations explicitly as data is communicated between processes. As such, the application can exploit locality and manage scarce buffer resources effectively. In this paper, we discuss the development of IBP, the implementation of a prototype system for managing network storage, and a preliminary deployment as part of the Internet-2 Distributed Storage Initiative.
1 Introduction: The proliferation of applications that are performance limited by network speeds leads us to explore new ways to exploit data locality in distributed settings. Currently, standard networking protocols (such as TCP/IP)...
Scalable Networked Information Processing Environment (SNIPE)
- in Proceedings of SuperComputing '97
, 1997
Cited by 32 (10 self)
SNIPE is a metacomputing system that aims to provide a reliable, secure, fault-tolerant environment for long-term distributed computing applications and data stores across the global Internet. This system combines global naming and replication of both processing and data to support large-scale information processing applications, leading to better availability and reliability than is currently available with typical cluster computing and/or distributed computer environments.
Keywords: SNIPE, RCDS, MetaComputing, scalable, secure, reliable
Acknowledgements: This work was supported in part by the Office of Scientific Computing, U.S. Department of Energy, under Contract DE-AC05-96OR22464, by DARPA under Contract DAAH 04-95-1-0595, and by the National Science Foundation's Center for Research on Parallel Computation, Science and Technology Center Cooperative Agreement No. CCR-8809615.
1. Introduction: The beginning of the 21st century will present new challenges for large-scale applications i...
Adaptive incremental checkpointing for massively parallel systems
- In ICS ’04: Proceedings of the 18th annual international conference on Supercomputing
, 2004
Cited by 27 (5 self)
Given the scale of massively parallel systems, the occurrence of faults is no longer an exception but a regular event. Periodic checkpointing is becoming increasingly important in these systems. However, the huge memory footprints of parallel applications place severe limitations on the scalability of normal checkpointing techniques. Incremental checkpointing is a well-researched technique that addresses scalability concerns, but most implementations require paging support from the hardware and the underlying operating system, which may not always be available. In this paper, we propose a software-based adaptive incremental checkpoint technique which uses a secure hash function to uniquely identify changed blocks in memory. Our algorithm is the first self-optimizing algorithm that dynamically computes the optimal block boundaries, based on the history of changed blocks. This provides better opportunities for minimizing checkpoint file size. Since the hash is computed in software, we do not need any system support for this. We have implemented and tested this mechanism on the BlueGene/L system. Our results on several well-known benchmarks are encouraging, both in terms of reduction in average checkpoint file size and adaptivity towards the application’s memory access patterns.
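The core idea in this abstract, detecting changed memory blocks by comparing hashes rather than relying on page-protection support, can be illustrated with a minimal sketch. This is not the paper's algorithm: it assumes a fixed block size (the paper computes block boundaries adaptively), uses SHA-256 as a stand-in secure hash, and assumes the memory image does not shrink between checkpoints. The names `BLOCK_SIZE`, `block_hashes`, and `incremental_checkpoint` are hypothetical.

```python
import hashlib
from typing import Dict, List, Tuple

# Hypothetical fixed block size; the paper adapts block boundaries dynamically.
BLOCK_SIZE = 4096


def block_hashes(memory: bytes, block_size: int = BLOCK_SIZE) -> List[bytes]:
    """Return one SHA-256 digest per fixed-size block of the memory image."""
    return [
        hashlib.sha256(memory[off:off + block_size]).digest()
        for off in range(0, len(memory), block_size)
    ]


def incremental_checkpoint(
    memory: bytes,
    previous_hashes: List[bytes],
    block_size: int = BLOCK_SIZE,
) -> Tuple[Dict[int, bytes], List[bytes]]:
    """Return (changed_blocks, new_hashes).

    changed_blocks maps a block index to the block's bytes for every block
    whose digest differs from the previous checkpoint, or that has no
    previous digest at all. Only these blocks need to be written out.
    """
    new_hashes = block_hashes(memory, block_size)
    changed: Dict[int, bytes] = {}
    for index, digest in enumerate(new_hashes):
        if index >= len(previous_hashes) or previous_hashes[index] != digest:
            start = index * block_size
            changed[index] = memory[start:start + block_size]
    return changed, new_hashes
```

Calling `incremental_checkpoint(image, [])` yields a full checkpoint, since no block has a previous digest; passing the returned hash list into the next call yields only the blocks modified in between. The trade-off is that hashing touches all of memory at every checkpoint, in exchange for needing no operating system or hardware support.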
MobiDesk: Mobile Virtual Desktop Computing
Cited by 26 (8 self)
We present MobiDesk, a mobile virtual desktop computing hosting infrastructure that leverages continued improvements in network speed, cost, and ubiquity to address the complexity, cost, and mobility limitations of today’s personal computing infrastructure. MobiDesk transparently virtualizes a user’s computing session by abstracting underlying system resources in three key areas: display, operating system and network. MobiDesk provides a thin virtualization layer that decouples a user’s computing session from any particular end user device and moves all application logic from end user devices to hosting providers. MobiDesk virtualization decouples a user’s computing session from the underlying operating system and server instance, enabling high availability service by transparently migrating sessions from one server to another during server maintenance or upgrades. We have implemented a MobiDesk prototype in Linux that works with existing unmodified applications and operating system kernels. Our experimental results demonstrate that MobiDesk has very low virtualization overhead, can provide a full-featured desktop experience including full-motion video support, and is able to migrate users’ sessions efficiently and reliably for high availability, while maintaining existing network connections.
CRAK: Linux Checkpoint/Restart As a Kernel Module
, 2001
Cited by 24 (1 self)
Process checkpoint/restart is a very useful technology for process migration, load balancing, crash recovery, rollback transaction, job controlling and many other purposes. Although process migration has not yet been widely used and is not widely available in commercial systems, the growing shift of computing facilities from supercomputers to networked workstations and distributed systems is increasing the importance of and demand for migration technologies. In this paper, we describe the design and implementation of CRAK, an innovative transparent checkpoint/restart package for Linux. CRAK provides transparent migration of Linux networked applications and computing environments without modifying, recompiling, or relinking applications or the operating system. CRAK is the first system for Unix/Linux that provides transparent checkpoint/restart with the following properties: (1) it does not require any modifications of existing operating system or application code and (2) it supports migrating network sockets. Prototype implementations are available for Linux 2.2 and Linux 2.4 kernels.
Processor Allocation and Checkpoint Interval Selection in Cluster Computing Systems
- Journal of Parallel and Distributed Computing
, 2001
Cited by 22 (0 self)
Performance prediction of checkpointing systems in the presence of failures is a well-studied research area. While the literature abounds with performance models of checkpointing systems, none address the issue of selecting runtime parameters other than the optimal checkpointing interval. In particular, the issue of processor allocation is typically ignored. In this paper, we present a performance model for long-running parallel computations that execute with checkpointing enabled. We then discuss how it is relevant to today's parallel computing environments and software, and present case studies of using the model to select runtime parameters.
Keywords: Checkpointing, performance prediction, parameter selection, parallel computation, Markov chain, exponential failure and repair distributions.
Market-based Cluster Resource Management
, 2001
Cited by 17 (0 self)
Resource management in high-performance cluster computer systems is a challenging problem. Resources must be allocated amongst competing applications of varying levels of importance, and aggregate resource demand needs to be controlled to keep the system in a comfortable regime of operation. Effectively performing these tasks requires knowledge of user valuations of the resources being allocated and a feedback signal that causes users to back off the system when it is overloaded. Unfortunately, current approaches to cluster resource management provide little, if any, means for users to express resource valuations and to influence their resource allocations. In addition, while feedback signals are provided, there are no associated incentives for users to pay attention to and respond to them. As a result, traditional systems are incapable of delivering the maximum possible value to users. The thesis of this work is that...

