Results 1 - 10 of 46
Practical Dynamic Software Updating
2008
"... This dissertation makes the case that programs can be updated while they run, with modest programmer effort, while providing certain update safety guarantees, and without imposing a significant performance overhead. Few systems are designed with on-the-fly updating in mind. Those systems that permit ..."
Abstract
-
Cited by 108 (33 self)
- Add to MetaCart
(Show Context)
This dissertation makes the case that programs can be updated while they run, with modest programmer effort, with certain update safety guarantees, and without imposing a significant performance overhead. Few systems are designed with on-the-fly updating in mind. Those systems that permit it support only a very limited class of updates, and generally provide no guarantees that, following the update, the system will behave as intended. We tackle the on-the-fly updating problem using a compiler-based approach called dynamic software updating (DSU), in which a program is patched with new code and data while it runs. The challenge is in making DSU practical: it should support changes to programs as they occur in practice, yet be safe, easy to use, and not impose a large overhead. This dissertation makes both theoretical contributions (formalisms for reasoning about, and ensuring, update safety) and practical contributions (Ginseng, a DSU implementation for C). Ginseng supports a broad range of changes to C programs and performs a suite of safety analyses to ensure update safety.
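To make the mechanism concrete, here is a minimal C sketch of the function-indirection idea that compiler-based DSU systems rely on: calls to updatable functions go through a pointer that an update routine can rebind to code loaded from a patch object. This is an illustration only, not Ginseng's actual transformation; the function names and the patch file path are hypothetical.

    #include <dlfcn.h>
    #include <stdio.h>

    /* Calls to updatable functions are routed through a pointer the updater can swap. */
    static int handle_request_v1(int id) { printf("v1 handled request %d\n", id); return 0; }
    static int (*handle_request)(int) = handle_request_v1;

    /* At a quiescent update point, try to load a patch object and rebind the pointer. */
    static void try_update(const char *patch_so)
    {
        void *h = dlopen(patch_so, RTLD_NOW);
        if (!h) return;                                /* no patch available yet */
        void *sym = dlsym(h, "handle_request_v2");     /* hypothetical new version */
        if (sym) handle_request = (int (*)(int))sym;
    }

    int main(void)
    {
        for (int i = 0; i < 3; i++) {
            handle_request(i);               /* every call dispatches through the pointer */
            try_update("./patch_v2.so");     /* hypothetical patch file dropped by an updater */
        }
        return 0;
    }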
Transparent, incremental checkpointing at kernel level: a foundation for fault tolerance for parallel computers
In Supercomputing, 2005
"... We describe the software architecture, technical features, and performance of TICK (Transparent Incremental Checkpointer at Kernel level), a system-level checkpointer implemented as a kernel thread, specifically designed to provide fault tolerance in Linux clusters. This implementation, based on the ..."
Abstract
-
Cited by 56 (2 self)
- Add to MetaCart
(Show Context)
We describe the software architecture, technical features, and performance of TICK (Transparent Incremental Checkpointer at Kernel level), a system-level checkpointer implemented as a kernel thread, specifically designed to provide fault tolerance in Linux clusters. This implementation, based on the 2.6.11 Linux kernel, provides the essential functionality for transparent, highly responsive, and efficient fault tolerance based on full or incremental checkpointing at system level. TICK is completely user-transparent and does not require any changes to user code or system libraries; it is highly responsive: an interrupt, such as a timer interrupt, can trigger a checkpoint in as little as 2.5 µs; and it supports incremental and full checkpoints with minimal overhead, less than 6% with full checkpointing to disk performed as frequently as once per minute.
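TICK itself runs as a kernel thread, but the incremental part of the idea can be illustrated at user level: write-protect the tracked memory, record which pages fault, and save only those pages at the next checkpoint. The sketch below is that user-level analogue under assumed sizes and file names, not TICK's implementation.

    #include <signal.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define NPAGES 64
    static long pagesz;
    static char *region;                    /* tracked application data */
    static unsigned char dirty[NPAGES];

    /* A write to a protected page lands here: note the page, then allow the write. */
    static void on_write(int sig, siginfo_t *si, void *ctx)
    {
        (void)sig; (void)ctx;
        char *page = (char *)((uintptr_t)si->si_addr & ~(uintptr_t)(pagesz - 1));
        dirty[(page - region) / pagesz] = 1;
        mprotect(page, pagesz, PROT_READ | PROT_WRITE);
    }

    /* Save only the pages written since the previous checkpoint, then re-arm tracking. */
    static void incremental_checkpoint(FILE *out)
    {
        for (int i = 0; i < NPAGES; i++)
            if (dirty[i]) {
                fwrite(region + i * pagesz, 1, (size_t)pagesz, out);
                dirty[i] = 0;
            }
        mprotect(region, NPAGES * pagesz, PROT_READ);
    }

    int main(void)
    {
        pagesz = sysconf(_SC_PAGESIZE);
        region = mmap(NULL, NPAGES * pagesz, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_flags = SA_SIGINFO;
        sa.sa_sigaction = on_write;
        sigaction(SIGSEGV, &sa, NULL);
        mprotect(region, NPAGES * pagesz, PROT_READ);   /* start tracking writes */

        region[0] = 1;                      /* dirties page 0 */
        region[5 * pagesz] = 2;             /* dirties page 5 */
        FILE *out = fopen("ckpt.img", "wb");
        incremental_checkpoint(out);        /* writes exactly two pages */
        fclose(out);
        return 0;
    }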
Automatic logging of operating system effects to guide application-level architecture simulation
In Proceedings of SIGMETRICS/Performance, 2006
"... Modern architecture research relies heavily on applicationlevel detailed pipeline simulation. A time consuming part of building a simulator is correctly emulating the operating system effects, which is required even if the goal is to simulate just the application code, in order to achieve functional ..."
Abstract
-
Cited by 39 (8 self)
- Add to MetaCart
(Show Context)
Modern architecture research relies heavily on application-level detailed pipeline simulation. A time-consuming part of building a simulator is correctly emulating the operating system effects, which is required even if the goal is to simulate just the application code, in order to achieve functional correctness of the application’s execution. Existing application-level simulators require manually hand-coding the emulation of each and every possible system effect (e.g., system call, interrupt, DMA transfer) that can impact the application’s execution. Developing such an emulator for a given operating system is a tedious exercise, and it can also be costly to maintain it to support newer versions of that operating system. Furthermore, porting the emulator to a completely different operating system might involve building it from scratch all over again. In this paper, we describe a tool that can automatically log operating system effects to guide architecture simulation of application code. The benefits of our approach are: (a) we do not have to build or maintain any infrastructure for emulating the operating system effects, (b) we can support simulation of more complex applications on our application-level simulator, including those applications that use asynchronous interrupts, DMA transfers, etc., and (c) using the system effects logs collected by our tool, we can deterministically re-execute the application to guide architecture simulation that has reproducible results.
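A rough C sketch of the log-and-replay idea: at native run time each OS effect on the application (a system call's return value plus any memory the kernel wrote) is recorded, and during simulation the simulator injects the logged effects instead of emulating the OS. The record layout, file handling, and names below are assumptions for illustration, not the paper's format.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* One logged OS effect: when it applies, what the syscall returned, and what
     * memory the kernel wrote into the application's address space. */
    struct effect {
        uint64_t icount;     /* instruction count at which the effect applies */
        uint64_t retval;     /* value returned to the application             */
        uint64_t addr;       /* offset of memory written by the kernel        */
        uint32_t len;        /* number of bytes written (payload follows)     */
    };

    /* During simulation, replay the logged effect instead of emulating the OS. */
    static uint64_t replay_effect(FILE *log, uint8_t *guest_mem)
    {
        struct effect e;
        if (fread(&e, sizeof e, 1, log) != 1) return 0;
        if (e.len) {
            uint8_t *buf = malloc(e.len);
            fread(buf, 1, e.len, log);
            memcpy(guest_mem + e.addr, buf, e.len);   /* reproduce the kernel's write */
            free(buf);
        }
        return e.retval;
    }

    int main(void)
    {
        /* Write one fake record, then replay it, purely to exercise the format. */
        uint8_t guest_mem[64] = {0};
        const char payload[] = "ok";
        struct effect e = { 100, sizeof payload, 8, sizeof payload };
        FILE *log = fopen("effects.log", "w+b");
        fwrite(&e, sizeof e, 1, log);
        fwrite(payload, 1, sizeof payload, log);
        rewind(log);
        uint64_t ret = replay_effect(log, guest_mem);
        printf("retval=%llu mem=%s\n", (unsigned long long)ret, (char *)guest_mem + 8);
        fclose(log);
        return 0;
    }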
DejaVu: Transparent user-level checkpointing, migration, and recovery for distributed systems
In IEEE International Parallel and Distributed Processing Symposium, 2007
"... In this paper, we present a new fault tolerance system called DejaVu for transparent and automatic checkpointing, migration, and recovery of parallel and distributed applications. DejaVu provides a transparent parallel checkpointing and recovery mechanism that recovers from any combination of system ..."
Abstract
-
Cited by 25 (0 self)
- Add to MetaCart
(Show Context)
In this paper, we present a new fault tolerance system called DejaVu for transparent and automatic checkpointing, migration, and recovery of parallel and distributed applications. DejaVu provides a transparent parallel checkpointing and recovery mechanism that recovers from any combination of system failures without any modification to parallel applications or the OS. It uses a new runtime mechanism for transparent incremental checkpointing that captures the least amount of state needed to maintain global consistency, and provides a novel communication architecture that enables transparent migration of existing MPI codes without source-code modifications. Performance results from the production-ready implementation show less than 5% overhead in real-world parallel applications with large memory footprints.
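DejaVu performs checkpointing transparently inside its runtime, but the coordination it must achieve can be illustrated with an application-visible sketch in MPI: all ranks agree on a checkpoint epoch and each saves its local state. The state saved and the file naming below are assumptions for illustration, not DejaVu's mechanism.

    #include <mpi.h>
    #include <stdio.h>

    /* Every rank saves its local state for the same epoch between two barriers. */
    static void coordinated_checkpoint(int rank, const double *state, int n, int epoch)
    {
        char path[64];
        MPI_Barrier(MPI_COMM_WORLD);          /* agree that the epoch is starting  */
        snprintf(path, sizeof path, "ckpt_r%d_e%d.bin", rank, epoch);
        FILE *f = fopen(path, "wb");
        fwrite(state, sizeof *state, (size_t)n, f);
        fclose(f);
        MPI_Barrier(MPI_COMM_WORLD);          /* agree that every rank committed it */
    }

    int main(int argc, char **argv)
    {
        int rank;
        double state[1024] = {0};

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        for (int step = 0; step < 100; step++) {
            state[step % 1024] += rank;       /* stand-in for real computation */
            if (step % 25 == 0)
                coordinated_checkpoint(rank, state, 1024, step / 25);
        }
        MPI_Finalize();
        return 0;
    }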
Compiler-enhanced incremental checkpointing for OpenMP applications
In Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP ’08), 2008
"... As modern supercomputing systems reach the peta-flop performance range, they grow in both size and complexity. This makes them increasingly vulnerable to failures from a variety of causes. Checkpointing is a popular technique for tolerating such failures, enabling applications to periodically save t ..."
Abstract
-
Cited by 20 (1 self)
- Add to MetaCart
(Show Context)
As modern supercomputing systems reach the petaflop performance range, they grow in both size and complexity. This makes them increasingly vulnerable to failures from a variety of causes. Checkpointing is a popular technique for tolerating such failures, enabling applications to periodically save their state and restart computation after a failure. Although a variety of automated system-level checkpointing solutions are currently available to HPC users, manual application-level checkpointing remains more popular due to its superior performance. This paper improves the performance of automated checkpointing via a compiler analysis for incremental checkpointing. This analysis, which works with both sequential and OpenMP applications, significantly reduces checkpoint sizes and enables asynchronous checkpointing.
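The following sketch shows the kind of instrumentation such a compiler pass could emit: every store to checkpointed data is preceded by a call that marks its block dirty, and the checkpointer writes only dirty blocks. The block size, names, and OpenMP loop are assumptions, not the paper's actual analysis or code.

    #include <stdio.h>

    #define N 4096
    #define BLOCK 256                       /* granularity of dirty tracking */
    static double data[N];
    static unsigned char dirty[N / BLOCK];

    /* The (hypothetical) compiler pass inserts this call before each tracked store. */
    static void note_write(int i) { dirty[i / BLOCK] = 1; }   /* racy stores of 1; fine for a sketch */

    /* Write only the blocks touched since the previous checkpoint. */
    static void incremental_checkpoint(FILE *out)
    {
        for (int b = 0; b < N / BLOCK; b++)
            if (dirty[b]) {
                fwrite(&data[b * BLOCK], sizeof(double), BLOCK, out);
                dirty[b] = 0;
            }
    }

    int main(void)
    {
        FILE *out = fopen("ckpt.bin", "wb");

        #pragma omp parallel for
        for (int i = 0; i < N / 2; i++) {   /* only the first half is ever written */
            note_write(i);                  /* instrumentation inserted by the pass */
            data[i] = 2.0 * i;
        }

        incremental_checkpoint(out);        /* saves roughly half the array */
        fclose(out);
        return 0;
    }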
Transparent user-level checkpointing for the Native POSIX Thread Library for Linux
In Proc. of PDPTA-06, 2006
"... Abstract — Checkpointing of single-threaded applications has been long studied [3], [6], [8], [12], [15]. Much less research has been done for user-level checkpointing of multithreaded applications. Dieter and Lumpp studied the issue for LinuxThreads in Linux 2.2. However, that solution does not wor ..."
Abstract
-
Cited by 17 (6 self)
- Add to MetaCart
(Show Context)
Checkpointing of single-threaded applications has long been studied [3], [6], [8], [12], [15]. Much less research has been done on user-level checkpointing of multithreaded applications. Dieter and Lumpp studied the issue for LinuxThreads in Linux 2.2. However, that solution does not work on later versions of Linux. We present an updated solution for Linux 2.6, which uses the more recent NPTL (Native POSIX Thread Library). Unlike the earlier solution, we do not need to patch glibc. Additionally, the new implementation can take advantage of the ELF architecture to eliminate the earlier requirement to patch the user’s main routine. This fills in the missing link for full transparency. As one demonstration of the robustness, we checkpoint the Kaffe Java Virtual Machine, including any of several multithreaded Java programs running on top of it. Index Terms: checkpointing, user-level checkpointing, multithreaded, Linux, NPTL, NAS Parallel Benchmark
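The usual user-level recipe for checkpointing a multithreaded process, sketched below in C, is for a checkpoint thread to signal every other thread so that each one saves its own register context and parks at a barrier while the memory image is written out. This is a generic illustration of the technique, not the paper's implementation, and the synchronization shown is simplified.

    #include <pthread.h>
    #include <signal.h>
    #include <stdio.h>
    #include <ucontext.h>
    #include <unistd.h>

    #define NTHREADS 4
    static pthread_t workers[NTHREADS];
    static ucontext_t saved_ctx[NTHREADS];
    static pthread_barrier_t quiesced, resumed;
    static __thread int my_id;

    /* Each thread saves its own context and parks until the image is written. */
    static void quiesce_handler(int sig)
    {
        (void)sig;
        getcontext(&saved_ctx[my_id]);
        pthread_barrier_wait(&quiesced);
        pthread_barrier_wait(&resumed);
    }

    static void *worker(void *arg)
    {
        my_id = (int)(long)arg;
        for (;;) pause();                   /* stand-in for real work */
        return NULL;
    }

    int main(void)
    {
        signal(SIGUSR1, quiesce_handler);
        pthread_barrier_init(&quiesced, NULL, NTHREADS + 1);
        pthread_barrier_init(&resumed, NULL, NTHREADS + 1);
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&workers[i], NULL, worker, (void *)i);
        sleep(1);

        for (int i = 0; i < NTHREADS; i++)
            pthread_kill(workers[i], SIGUSR1);   /* interrupt every thread */
        pthread_barrier_wait(&quiesced);         /* all threads parked in the handler */
        printf("all threads quiesced; the memory image could be written here\n");
        pthread_barrier_wait(&resumed);          /* let them continue */
        return 0;
    }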
Mementos: System support for long-running computation on RFID-scale devices
In ASPLOS, 2011
"... Abstract Many computing systems include mechanisms designed to defend against sudden catastrophic losses of computational state, but few systems treat such losses as the common case rather than exceptional events. On the other end of the spectrum are transiently powered computing devices such as RF ..."
Abstract
-
Cited by 15 (4 self)
- Add to MetaCart
(Show Context)
Many computing systems include mechanisms designed to defend against sudden catastrophic losses of computational state, but few systems treat such losses as the common case rather than exceptional events. On the other end of the spectrum are transiently powered computing devices such as RFID tags and smart cards; these devices are typically paired with code that must complete its task under tight time constraints before running out of energy. Mementos is a software system that transforms general-purpose programs into interruptible computations that are protected from frequent power losses by automatic, energy-aware state checkpointing. Mementos comprises a collection of optimization passes for the LLVM compiler infrastructure and a linkable library that exercises hardware support for energy measurement while managing state checkpoints stored in nonvolatile memory. We evaluate Mementos against diverse test cases and find that, although it introduces time overhead of up to 60% in our tests versus uninstrumented code executed without power failures, it effectively spreads program execution across zero or more complete losses of power and state.
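A rough C sketch of the trigger that Mementos-style instrumentation inserts at loop or function boundaries: sample the supply voltage and, if it is near the brown-out level, copy the live state into nonvolatile memory before power is lost. The voltage stub, threshold, and state layout are hypothetical; on a real tag the checkpoint would land in flash or FRAM rather than an ordinary variable.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define VOLTAGE_THRESHOLD_MV 2200   /* device-specific assumption */

    struct live_state { uint32_t loop_index; uint32_t accum; };

    /* On a real tag this would sample the ADC; here it is a stub. */
    static uint16_t read_voltage_mv(void) { return 3000; }

    /* On a real device this object would live in FRAM or flash. */
    static struct live_state nonvolatile_ckpt;

    static void maybe_checkpoint(const struct live_state *s)
    {
        if (read_voltage_mv() < VOLTAGE_THRESHOLD_MV)
            memcpy(&nonvolatile_ckpt, s, sizeof *s);   /* state survives the power loss */
    }

    static uint32_t compute(uint32_t start)
    {
        struct live_state s = { start, 0 };
        for (; s.loop_index < 100000; s.loop_index++) {
            s.accum += s.loop_index;
            maybe_checkpoint(&s);        /* trigger inserted at the loop boundary */
        }
        return s.accum;
    }

    int main(void)
    {
        /* Resume from index 0, or from a previously saved index after a power loss. */
        printf("accum = %u\n", (unsigned)compute(nonvolatile_ckpt.loop_index));
        return 0;
    }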
Stabilizers: a modular checkpointing abstraction for concurrent functional programs
In ICFP ’06, 2006
"... Transient faults that arise in large-scale software systems can often be repaired by re-executing the code in which they occur. Ascribing a meaningful semantics for safe re-execution in multi-threaded code is not obvious, however. For a thread to correctly re-execute a region of code, it must ensure ..."
Abstract
-
Cited by 14 (4 self)
- Add to MetaCart
(Show Context)
Transient faults that arise in large-scale software systems can often be repaired by re-executing the code in which they occur. Ascribing a meaningful semantics for safe re-execution in multi-threaded code is not obvious, however. For a thread to correctly re-execute a region of code, it must ensure that all other threads which have witnessed its unwanted effects within that region are also reverted to a meaningful earlier state. If this is not done properly, data inconsistencies and other undesirable behavior may result. However, automatically determining what constitutes a consistent global checkpoint is not straightforward, since thread interactions are a dynamic property of the program. In this paper, we present a safe and efficient checkpointing mechanism for Concurrent ML (CML) that can be used to recover from transient faults. We introduce a new linguistic abstraction called stabilizers that permits the specification of per-thread monitors and the restoration of globally consistent checkpoints. Safe global states are computed through lightweight monitoring of communication events among threads (e.g., message-passing operations or updates to shared variables). Our experimental results on several realistic, multithreaded, server-style CML applications, including a web server and a windowing toolkit, show that the overheads of using stabilizers are small, and lead us to conclude that they are a viable mechanism for defining safe checkpoints in concurrent functional programs.
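The dependency tracking behind this idea can be sketched outside CML as well: every communication event records an edge between the two threads involved, and rolling back one thread reverts the transitive closure of threads that witnessed its effects. The C sketch below is a deliberately simplified illustration of that closure computation, not the paper's semantics or implementation.

    #include <stdio.h>
    #include <string.h>

    #define NTHREADS 4
    static unsigned char depends[NTHREADS][NTHREADS];   /* depends[a][b]: a witnessed b's effects */

    /* Called by the runtime's communication wrappers (simplified: receives only). */
    static void on_communication(int sender, int receiver)
    {
        depends[receiver][sender] = 1;
    }

    /* Which threads must revert to an earlier checkpoint if 'root' is rolled back? */
    static void rollback_set(int root, unsigned char *in_set)
    {
        memset(in_set, 0, NTHREADS);
        in_set[root] = 1;
        for (int changed = 1; changed; ) {               /* transitive closure over edges */
            changed = 0;
            for (int a = 0; a < NTHREADS; a++)
                for (int b = 0; b < NTHREADS; b++)
                    if (!in_set[a] && in_set[b] && depends[a][b]) {
                        in_set[a] = 1;
                        changed = 1;
                    }
        }
    }

    int main(void)
    {
        unsigned char set[NTHREADS];
        on_communication(0, 1);             /* thread 0 sends to thread 1 */
        on_communication(1, 2);             /* thread 1 sends to thread 2 */
        rollback_set(0, set);               /* reverting 0 drags in 1 and 2, but not 3 */
        for (int t = 0; t < NTHREADS; t++)
            printf("thread %d: %s\n", t, set[t] ? "revert" : "keep");
        return 0;
    }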
Implementation and evaluation of application-level checkpoint-recovery scheme for MPI programs
In 2004 Conference on Supercomputing (SC2004), 2004
"... It is becoming important for long-running scientific applications to tolerate hardware faults. The most commonly used approach is checkpoint and restart (CPR)- the computation’s state is saved periodically to disk. Upon failure the computation is restarted from the last saved state. The common CPR m ..."
Abstract
-
Cited by 11 (0 self)
- Add to MetaCart
(Show Context)
It is becoming important for long-running scientific applications to tolerate hardware faults. The most commonly used approach is checkpoint and restart (CPR): the computation’s state is saved periodically to disk, and upon failure the computation is restarted from the last saved state. The common CPR mechanism, called System-level Checkpointing (SLC), requires modifying the operating system and the communication libraries to enable them to save the state of the entire parallel application. This approach is not portable, since a checkpointer for one system rarely works on another. Application-level Checkpointing (ALC) is a portable alternative in which the programmer manually modifies the program to enable CPR, a very labor-intensive task. We are investigating the use of compiler technology to instrument codes to embed the ability to tolerate faults into applications themselves, making them self-checkpointing and self-restarting on any platform. In [9] we described a general approach for checkpointing shared memory APIs at the application level. Since [9] applied only to a toy feature set common to most shared memory APIs, this paper shows the practicality of this approach by extending it to a specific popular shared memory API: OpenMP. We describe the challenges involved in providing automated ALC for OpenMP applications and experimentally validate this approach by showing detailed performance results for our implementation of this technique. Our experiments with the NAS OpenMP benchmarks [1] and the EPCC microbenchmarks [21] show generally low overhead on three different architectures (Linux/IA64, Tru64/Alpha, and Solaris/Sparc) and highlight important lessons about the performance characteristics of this approach.
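For contrast with system-level schemes, the sketch below shows what application-level checkpointing looks like once a compute loop has been instrumented, whether by hand or by a compiler pass of the kind described above: the program saves the variables it needs and, on restart, resumes from them. The file name and the saved state are assumptions for illustration.

    #include <stdio.h>

    #define N 1000000
    static double a[N];

    int main(void)
    {
        long start = 0;
        FILE *f = fopen("app.ckpt", "rb");
        if (f) {                                /* self-restart: resume from saved state */
            fread(&start, sizeof start, 1, f);
            fread(a, sizeof(double), N, f);
            fclose(f);
        }
        for (long i = start; i < N; i++) {
            a[i] = (double)i * 0.5;             /* stand-in for real computation */
            if (i % 100000 == 0) {              /* self-checkpoint: save what is needed */
                f = fopen("app.ckpt", "wb");
                fwrite(&i, sizeof i, 1, f);
                fwrite(a, sizeof(double), N, f);
                fclose(f);
            }
        }
        printf("done\n");
        return 0;
    }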
Fast Checkpoint Recovery Algorithms for Frequently Consistent Applications
"... Advances in hardware have enabled many long-running applications to execute entirely in main memory. As a result, these applications have increasingly turned to database techniques to ensure durability in the event of a crash. However, many of these applications, such as massively multiplayer online ..."
Abstract
-
Cited by 11 (1 self)
- Add to MetaCart
(Show Context)
Advances in hardware have enabled many long-running applications to execute entirely in main memory. As a result, these applications have increasingly turned to database techniques to ensure durability in the event of a crash. However, many of these applications, such as massively multiplayer online games and main-memory OLTP systems, must sustain extremely high update rates, often hundreds of thousands of updates per second. Providing durability for these applications without introducing excessive overhead or latency spikes remains a challenge for application developers. In this paper, we take advantage of frequent points of consistency in many of these applications to develop novel checkpoint recovery algorithms that trade additional space in main memory for significantly lower overhead and latency. Compared to previous work, our new algorithms do not require any locking or bulk copies of the application state. Our experimental evaluation shows that one of our new algorithms attains nearly constant latency and reduces overhead by more than an order of magnitude for low to medium update rates. Additionally, in a heavily loaded main-memory transaction processing system, it still reduces overhead by more than a factor of two.
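As a purely illustrative sketch (not the paper's algorithms, which additionally avoid stalls and bound recovery work), the idea of exploiting points of consistency can be shown as incremental persistence: at each consistency point the application appends only the cells dirtied since the previous point, so the cost is proportional to the update footprint rather than to the full state, and no locks or bulk copies are needed. The cell layout and file handling are assumptions.

    #include <stdio.h>

    #define CELLS 1024
    static int state[CELLS];                /* main-memory application state */
    static unsigned char dirty[CELLS];

    static void apply_update(int cell, int value) { state[cell] = value; dirty[cell] = 1; }

    /* At a point of consistency, persist only what changed since the last point. */
    static void point_of_consistency(FILE *delta)
    {
        for (int c = 0; c < CELLS; c++)
            if (dirty[c]) {
                fwrite(&c, sizeof c, 1, delta);
                fwrite(&state[c], sizeof state[c], 1, delta);
                dirty[c] = 0;
            }
        fflush(delta);                      /* the appended delta is the recovery point */
    }

    int main(void)
    {
        FILE *delta = fopen("deltas.bin", "wb");
        apply_update(3, 42);
        apply_update(7, 9);
        point_of_consistency(delta);        /* writes two (cell, value) pairs */
        fclose(delta);
        return 0;
    }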