Debugging operating systems with time-traveling virtual machines
, 2005
Cited by 114 (7 self)
Operating systems are difficult to debug with traditional cyclic debugging. They are non-deterministic; they run for long periods of time; they interact directly with hardware devices; and their state is easily perturbed by the act of debugging. This paper describes a time-traveling virtual machine that overcomes many of the difficulties associated with debugging operating systems. Time travel enables a programmer to navigate backward and forward arbitrarily through the execution history of a particular run and to replay arbitrary segments of the past execution. We integrate time travel into a general-purpose debugger to enable a programmer to debug an OS in reverse, implementing commands such as reverse breakpoint, reverse watchpoint, and reverse single step. The space and time overheads needed to support time travel are reasonable for debugging, and movements in time are fast enough to support interactive debugging. We demonstrate the value of our time-traveling virtual machine by using it to understand and fix several OS bugs that are difficult to find with standard debugging tools. Reverse debugging is especially helpful in finding bugs that are fragile due to non-determinism, bugs in device drivers, bugs that require long runs to trigger, bugs that corrupt the stack, and bugs that are detected after the relevant stack frame is popped.
Finding and Reproducing Heisenbugs in Concurrent Programs
Cited by 51 (7 self)
Concurrency is pervasive in large systems. Unexpected interference among threads often results in “Heisenbugs” that are extremely difficult to reproduce and eliminate. We have implemented a tool called CHESS for finding and reproducing such bugs. When attached to a program, CHESS takes control of thread scheduling and uses efficient search techniques to drive the program through possible thread interleavings. This systematic exploration of program behavior enables CHESS to quickly uncover bugs that might otherwise have remained hidden for a long time. For each bug, CHESS consistently reproduces an erroneous execution manifesting the bug, thereby making it significantly easier to debug the problem. CHESS scales to large concurrent programs and has found numerous bugs in existing systems that had been tested extensively prior to being tested by CHESS. CHESS has been integrated into the test frameworks of many code bases inside Microsoft and is used by testers on a daily basis.
Configuration Debugging as Search: Finding the Needle in the Haystack
- In OSDI
, 2004
Cited by 49 (1 self)
This work addresses the problem of diagnosing configuration errors that cause a system to function incorrectly. For example, a change to the local firewall policy could cause a network-based application to malfunction. Our approach is based on searching across time for the instant the system transitioned into a failed state. Based on this information, a troubleshooter or administrator can deduce the cause of failure by comparing system state before and after the failure. We present the Chronus tool, which automates the task of searching for a failure-inducing state change. Chronus takes as input a user-provided software probe, which differentiates between working and non-working states. Chronus performs “time travel” by booting a virtual machine off the system’s disk state as it existed at some point in the past. By using binary search, Chronus can find the fault point with effort that grows logarithmically with log size. We demonstrate that Chronus can diagnose a range of common configuration errors for both client-side and server-side applications, and that the performance overhead of the tool is not prohibitive.
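The binary search over past states described in this abstract can be sketched as follows. This is a minimal illustration, not the Chronus implementation: the function `find_fault_point`, the in-memory `history` list standing in for saved disk states, and the `probe` function are all hypothetical. The sketch assumes the oldest state works, the newest fails, and the system never recovers once it has failed.

```python
from typing import Callable, List, Sequence, TypeVar

S = TypeVar("S")


def find_fault_point(snapshots: Sequence[S], probe: Callable[[S], bool]) -> int:
    """Return the index of the first snapshot for which the probe fails.

    Assumes snapshots[0] works, snapshots[-1] fails, and there is a single
    transition from working to failed (the property binary search relies on).
    """
    lo, hi = 0, len(snapshots) - 1  # invariant: lo works, hi fails
    if probe(snapshots[hi]) or not probe(snapshots[lo]):
        raise ValueError("need a working oldest state and a failing newest state")
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if probe(snapshots[mid]):
            lo = mid   # still working: the fault was introduced later
        else:
            hi = mid   # already failing: the fault was introduced at or before mid
    return hi


# Hypothetical history: a firewall change at index 6 blocks the application.
history = [{"firewall_allows_8080": i < 6} for i in range(10)]
probe_calls: List[dict] = []


def probe(state: dict) -> bool:
    # Stand-in for the user-provided probe; a real probe would test a booted system.
    probe_calls.append(state)
    return state["firewall_allows_8080"]


fault = find_fault_point(history, probe)
print(fault, len(probe_calls))
```

Each probe call corresponds to an expensive operation in the real setting (booting a virtual machine from an old disk state), which is why halving the search interval at each step matters.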
Capo: A Software-Hardware Interface for Practical Deterministic Multiprocessor Replay
- In ASPLOS’09
Cited by 23 (1 self)
While deterministic replay of parallel programs is a powerful technique, current proposals have shortcomings. Specifically, software-based replay systems have high overheads on multiprocessors, while hardware-based proposals focus only on basic hardware-level mechanisms, ignoring the overall replay system. To be practical, hardware-based replay systems need to support an environment with multiple parallel jobs running concurrently: some being recorded, others being replayed, and still others running without recording or replay. They also need to manage limited-size log buffers. This paper addresses these shortcomings by introducing, for the first time, a set of abstractions and a software-hardware interface for practical hardware-assisted replay of multiprocessor systems. The approach, called Capo, introduces the novel abstraction of the Replay Sphere to separate the responsibilities of the hardware and software components of the replay system. In this paper, we also design and build CapoOne, a prototype of a deterministic multiprocessor replay system that implements Capo using Linux and simulated DeLorean hardware. Our evaluation of 4-processor executions shows that CapoOne largely records with the efficiency of hardware-based schemes and the flexibility of software-based schemes.
Debugging Parallel Systems: A State of the Art Report
, 2002
Cited by 17 (5 self)
In this State-of-the-Art Report (SotA), we give an introduction to work presented in the area of debugging large software systems with modern hardware architectures. We discuss techniques used for single-, multi-, and distributed systems. In addition, we provide pointers to work by large players in the field and to major conferences of importance.
A Perturbation-Free Replay Platform for Cross-Optimized Multithreaded Applications
, 2001
Cited by 15 (5 self)
Development of multithreaded applications is particularly tricky because of their non-deterministic execution behaviors. Tools that support the debugging and performance tuning of such applications are needed. Key to the construction of such tools is the ability to repeat the nondeterministic execution behavior of a multithreaded application. A clean separation between the application and the system that runs it facilitates supporting that ability. This paper presents a platform for constructing such tools in a context in which any separation between the application and the underlying system (and between both and the platform's own instrumentation code) has been obscured. DejaVu supports deterministic replay of nondeterministic executions of multithreaded Java programs on the Jalapeno virtual machine (running on a uniprocessor). Jalapeno is written in Java and its optimizing compiler regularly integrates application, virtual machine, and DejaVu instrumentation code into unified machi...
Respec: Efficient Online Multiprocessor Replay via Speculation and External Determinism
Cited by 12 (3 self)
Deterministic replay systems record and reproduce the execution of a hardware or software system. While it is well known how to replay uniprocessor systems, replaying shared memory multiprocessor systems at low overhead on commodity hardware is still an open problem. This paper presents Respec, a new way to support deterministic replay of shared memory multithreaded programs on commodity multiprocessor hardware. Respec targets online replay in which the recorded and replayed processes execute concurrently. Respec uses two strategies to reduce overhead while still ensuring correctness: speculative logging and externally deterministic replay. Speculative logging optimistically logs less information about shared memory dependencies than is needed to guarantee deterministic replay, then recovers and retries if the replayed process diverges from the recorded process. Externally deterministic replay relaxes the degree to which the two executions must match by requiring only that their system output and final program states match. We show that the combination of these two techniques results in low recording and replay overhead for the common case of data-race-free execution intervals and still ensures correct replay for execution intervals that have data races. We modified the Linux kernel to implement our techniques. Our software system adds on average about 18% overhead to the execution time for recording and replaying programs with two threads and 55% overhead for programs with four threads.
Practical object-oriented back-in-time debugging
- In 22nd European Conference on Object-Oriented Programming (ECOOP’08), volume 5142 of LNCS
, 2008
Cited by 10 (5 self)
Back-in-time debuggers are extremely useful tools for identifying the causes of bugs. Unfortunately the “omniscient” approaches that try to remember all previous states are impractical because they consume too much space or they are far too slow. Several approaches rely on heuristics to limit these penalties, but they ultimately end up throwing out too much relevant information. In this paper we propose a practical approach that attempts to keep track of only the relevant data. In contrast to other approaches, we keep object history information together with the regular objects in the application memory. Although seemingly counterintuitive, this approach has the effect that data not reachable from current application objects (and hence, no longer relevant) is garbage collected. We describe the technical details of our approach, and we present benchmarks that demonstrate that memory consumption stays within practical bounds. Furthermore, the performance penalty is significantly less than with other approaches.
Pdb: Pervasive debugging with xen
- In Proceedings of the 5th IEEE/ACM International Workshop on Grid Computing (Grid
, 2004
Cited by 9 (2 self)
Building distributed grid applications is notoriously difficult: the complex interactions between concurrently running processes, middleware, operating systems, underlying devices, and interconnecting networks can lead to unpredictable and difficult-to-analyze errors. Yet debugging support for such systems is woefully inadequate; typically a central user interface coordinates a set of conventional debuggers. This structure leads to synchronization problems and is limited to debugging user-mode applications. In this paper we present the design and implementation of PDB, a pervasive debugger which executes in a virtualization layer underneath the entire distributed system. By running each node of a distributed application in a separate virtual environment atop the debugger, PDB can exercise full control over the entire execution environment.
CHESS: A Systematic Testing Tool for Concurrent Software
, 2007
Cited by 5 (0 self)
Concurrency is used pervasively in the development of large systems programs. However, concurrent programming is difficult because of the possibility of unexpected interference among concurrently executing tasks. Such interference often results in “Heisenbugs” that appear rarely and are extremely difficult to reproduce and debug. Stress testing, in which the system is run under heavy load for a long time, is the method commonly employed to flush out such concurrency bugs. This form of testing provides inadequate coverage and has unpredictable results. This paper proposes an alternative called concurrency scenario testing, which relies on systematic and exhaustive testing. We have implemented a tool called CHESS for performing concurrency scenario testing of systems programs. CHESS uses model checking techniques to systematically generate all interleavings of a given scenario. CHESS scales to large concurrent programs and has found numerous previously unknown bugs in systems that had been stress tested for many months prior to being tested by CHESS. For each bug, CHESS is able to consistently reproduce an erroneous execution manifesting the bug, thereby making it significantly easier to debug the problem. CHESS has been integrated into the test frameworks of many code bases inside Microsoft and is being used by testers on a daily basis.
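The idea of systematically generating every interleaving of a scenario can be shown on a toy example. This is not the CHESS tool or its API: CHESS takes control of real thread scheduling, while the sketch below models two threads as two-step programs (read a shared counter, then write back the value plus one) and enumerates all orders in which those steps can run. The names `run_schedule` and `all_schedules` are hypothetical.

```python
from typing import Dict, List, Tuple

# Each modeled thread performs a non-atomic increment in two steps:
# step 0 reads the shared counter, step 1 writes back (value read + 1).
STEPS_PER_THREAD = 2


def run_schedule(schedule: Tuple[int, ...]) -> int:
    """Execute one interleaving; schedule[i] names the thread that runs next."""
    counter = 0
    local: Dict[int, int] = {}   # value each thread read in its step 0
    pc = {0: 0, 1: 0}            # per-thread program counter
    for tid in schedule:
        if pc[tid] == 0:
            local[tid] = counter
        else:
            counter = local[tid] + 1
        pc[tid] += 1
    return counter


def all_schedules(remaining: Tuple[int, int]) -> List[Tuple[int, ...]]:
    """Enumerate every order in which the two threads' remaining steps can run."""
    if remaining == (0, 0):
        return [()]
    result: List[Tuple[int, ...]] = []
    for tid in (0, 1):
        if remaining[tid] > 0:
            rest = list(remaining)
            rest[tid] -= 1
            for tail in all_schedules((rest[0], rest[1])):
                result.append((tid,) + tail)
    return result


schedules = all_schedules((STEPS_PER_THREAD, STEPS_PER_THREAD))
# Two increments should leave the counter at 2; anything else is a lost update.
failing = [s for s in schedules if run_schedule(s) != 2]
print(len(schedules), "interleavings,", len(failing), "lose an update")
```

Because a failing schedule is just a recorded sequence of thread choices, rerunning `run_schedule` on it reproduces the same erroneous result every time, which mirrors the reproducibility property the abstract emphasizes.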

