DMP: Deterministic Shared Memory Multiprocessing
Cited by 39 (6 self)
Current shared memory multicore and multiprocessor systems are nondeterministic. Each time these systems execute a multithreaded application, even if supplied with the same input, they can produce a different output. This frustrates debugging and limits the ability to properly test multithreaded code, becoming a major stumbling block to the much-needed widespread adoption of parallel programming. In this paper we make the case for fully deterministic shared memory multiprocessing (DMP). The behavior of an arbitrary multithreaded program on a DMP system is only a function of its inputs. The core idea is to make inter-thread communication fully deterministic. Previous approaches to coping with nondeterminism in multithreaded programs have focused on replay, a technique useful only for debugging. In contrast, while DMP systems are directly useful for debugging by offering repeatability by default, we argue that parallel programs should execute deterministically in the field as well. This has the potential to make testing more assuring and increase the reliability of deployed multithreaded software. We propose a range of approaches to enforcing determinism and discuss their implementation trade-offs. We show that determinism can be provided with little performance cost using our architecture proposals on future hardware, and that software-only approaches can be utilized on existing systems.
DeLorean: Recording and Deterministically Replaying Shared-Memory Multiprocessor Execution Efficiently
Cited by 38 (12 self)
Support for deterministic replay of multithreaded execution can greatly help in finding concurrency bugs. For highest effectiveness, replay schemes should (i) record at production-run speed, (ii) keep their logging requirements minute, and (iii) replay at a speed similar to that of the initial execution. In this paper, we propose a new substrate for deterministic replay that provides substantial advances along these axes. In our proposal, processors execute blocks of instructions atomically, as in transactional memory or speculative multithreading, and the system only needs to record the commit order of these blocks. We call our scheme DeLorean. Our results show that DeLorean records execution at a speed similar to that of Release Consistency (RC) execution and replays at about 82% of its speed. In contrast, most current schemes only record at the speed of Sequential Consistency (SC) execution. Moreover, DeLorean only needs 7.5% of the log size needed by a state-of-the-art scheme. Finally, DeLorean can be configured to need only 0.6% of the log size of the state-of-the-art scheme at the cost of recording at 86% of RC's execution speed, which is still faster than SC. In this configuration, the log of an 8-processor 5-GHz machine is estimated to be only about 20 GB per day.
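The idea of logging only the commit order of atomic blocks can be illustrated with a small software simulation. This is a minimal sketch under stated assumptions, not DeLorean's hardware design: the names `run_chunks`, `record`, `add`, `scale`, and `PROGRAMS` are hypothetical, each "chunk" is modeled as a Python function applied atomically to a shared dictionary, and the recording run picks an arbitrary interleaving with a seeded random number generator.

```python
import random

def add(n):
    """Build a chunk that atomically adds n to the shared value (illustrative)."""
    def chunk(state):
        state["x"] += n
    return chunk

def scale(n):
    """Build a chunk that atomically multiplies the shared value by n (illustrative)."""
    def chunk(state):
        state["x"] *= n
    return chunk

# Two simulated processors, each with an ordered list of chunks.
PROGRAMS = [[add(1), scale(2)], [add(10), scale(3)]]

def run_chunks(programs, commit_order):
    """Commit chunks one at a time in the given order of processor IDs."""
    state = {"x": 0}
    next_chunk = [0] * len(programs)
    for pid in commit_order:
        programs[pid][next_chunk[pid]](state)
        next_chunk[pid] += 1
    return state

def record(programs, seed):
    """Pick an arbitrary interleaving and log only which processor committed next."""
    rng = random.Random(seed)
    remaining = [len(p) for p in programs]
    log = []
    while any(remaining):
        pid = rng.choice([i for i, r in enumerate(remaining) if r > 0])
        log.append(pid)
        remaining[pid] -= 1
    return log, run_chunks(programs, log)

log, recorded_state = record(PROGRAMS, seed=7)
replayed_state = run_chunks(PROGRAMS, log)  # replay follows the logged order
```

Because the chunks do not commute, different commit orders give different results, yet replaying the logged order reproduces the recorded state. The log holds one processor ID per chunk rather than one entry per memory access, which is the intuition behind the small log sizes reported in the abstract.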
CoreDet: A Compiler and Runtime System for Deterministic Multithreaded Execution
Cited by 35 (5 self)
The behavior of a multithreaded program does not depend only on its inputs. Scheduling, memory reordering, timing, and low-level hardware effects all introduce nondeterminism in the execution of multithreaded programs. This severely complicates many tasks, including debugging, testing, and automatic replication. In this work, we avoid these complications by eliminating their root cause: we develop a compiler and runtime system that runs arbitrary multithreaded C/C++ POSIX Threads programs deterministically. A trivial non-performant approach to providing determinism is simply deterministically serializing execution. Instead, we present a compiler and runtime infrastructure that ensures determinism but resorts to serialization rarely, for handling interthread communication and synchronization. We develop two basic approaches, both of which are largely dynamic with performance improved by some static compiler optimizations. First, an ownership-based approach detects interthread communication via an evolving table that tracks ownership of memory regions by threads. Second, a buffering approach uses versioned memory and employs a deterministic commit protocol to make changes visible to other threads. While buffering has larger single-threaded overhead than ownership, it tends to scale better (serializing less often). A hybrid system sometimes performs and scales better than either approach individually. Our implementation is based on the LLVM compiler infrastructure. It needs neither programmer annotations nor special hardware. Our empirical evaluation uses the PARSEC and SPLASH2 benchmarks and shows that our approach scales comparably to nondeterministic execution.
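The ownership-based approach described above can be sketched as a table that maps memory regions to owning threads and flags accesses that cross thread boundaries. This is an illustrative sketch only: the class `OwnershipTable`, the 64-byte region size, the first-touch policy, and the ownership-transfer rules are all assumptions made for the example and are not taken from the CoreDet implementation.

```python
class OwnershipTable:
    """Tracks which thread owns each memory region (hypothetical policy)."""

    SHARED = object()  # sentinel: region is readable by every thread

    def __init__(self, region_size=64):
        self.region_size = region_size  # assumed granularity in bytes
        self.owner = {}                 # region index -> thread or SHARED

    def access(self, thread, address, is_write):
        """Return True if this access is inter-thread communication.

        A real system would delay such an access until a deterministic,
        serialized phase; this sketch only detects it and updates the table.
        """
        region = address // self.region_size
        owner = self.owner.get(region)
        if owner is None:
            # Assumption: the first thread to touch a region becomes its owner.
            self.owner[region] = thread
            return False
        if owner == thread:
            return False  # private access, can run in parallel
        if owner is self.SHARED and not is_write:
            return False  # reads of shared data need no serialization
        # Communication detected: the table evolves to reflect the new use.
        self.owner[region] = thread if is_write else self.SHARED
        return True
```

For example, after thread "A" writes address 0, a read of address 16 by thread "B" falls in the same 64-byte region and is reported as communication, while later reads of that region by either thread are not.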
PRES: Probabilistic Replay with Execution Sketching on Multiprocessors
Cited by 27 (3 self)
Bug reproduction is critically important for diagnosing a production-run failure. Unfortunately, reproducing a concurrency bug on multi-processors (e.g., multi-core) is challenging. Previous techniques either incur large overhead or require new non-trivial hardware extensions. This paper proposes a novel technique called PRES (probabilistic replay via execution sketching) to help reproduce concurrency bugs on multi-processors. It relaxes the past (perhaps idealistic) objective of “reproducing the bug on the first replay attempt” to significantly lower production-run recording overhead. This is achieved by (1) recording only partial execution information (referred to as “sketches”) during the production run, and (2) relying on an intelligent replayer during diagnosis time (when performance is less critical) to systematically explore the unrecorded non-deterministic
Retrace: Collecting execution trace with virtual machine deterministic replay
In Proceedings of the 3rd Annual Workshop on Modeling, Benchmarking and Simulation, MoBS, 2007
Cited by 26 (1 self)
Execution trace is an important tool in computer architecture research. Unfortunately, existing trace collection techniques are often slow (due to software tracing overheads) or expensive (due to special tracing hardware requirements). Regardless of the method of collection, detailed trace files are generally large and inconvenient to store and share. We present ReTrace, a trace collection tool based on the deterministic replay technology of the VMware hypervisor. ReTrace operates in two stages: capturing and expansion. ReTrace capturing accumulates the minimal amount of information necessary to later recreate a more detailed execution trace. It captures (records) only non-deterministic events, resulting in low time and space overheads (as low as 5% run-time overhead, as low as 0.5 byte per thousand instructions log growth rate) on supported platforms. ReTrace expansion uses the information collected by the capturing stage to generate a complete and accurate execution trace without any data loss or distortion. ReTrace is an experimental feature of VMware Workstation 6.0 currently available in Windows and Linux flavors for commodity IA32 platforms. No special tracing hardware is required. We have three key results. First, we find that trace collection can be done both efficiently and inexpensively. Second, deterministic replay is an effective technique for compressing large trace files. Third, performing the trace collection at the hypervisor layer is minimally invasive to the collected trace while enabling tracing of the entire system (user/supervisor level, CPU, peripheral devices). ReTrace is a rapidly evolving technology. We would like to use this paper to solicit feedback on the applicability of ReTrace in computer architecture research to help us refine our future development plans.
R2: An application-level kernel for record and replay
In OSDI, 2008
Cited by 24 (4 self)
Library-based record and replay tools aim to reproduce an application’s execution by recording the results of selected functions in a log and during replay returning the results from the log rather than executing the functions. These tools must ensure that a replay run is identical to the record run. The challenge in doing so is that only invocations of a function by the application should be recorded, recording the side effects of a function call can be difficult, and not executing function calls during replay, multithreading, and the presence of the tool may change the application’s behavior from recording to replay. These problems have limited the use of such tools. R2 allows developers to choose functions that can be recorded and replayed correctly. Developers annotate the chosen functions with simple keywords so that R2 can handle calls with side effects and multithreading. R2 generates code for record and replay from templates, allowing developers to avoid implementing stubs for hundreds of functions manually. To track whether an invocation is on behalf of the application or the implementation of a selected function, R2 maintains a mode bit, which stubs save and restore. We have implemented R2 on Windows and annotated large parts (1,300 functions) of the Win32 API, and two higher-level interfaces (MPI and SQLite). R2 can replay multithreaded web and database servers that previous library-based tools cannot replay. By allowing developers to choose high-level interfaces, R2 can also keep recording overhead small; experiments show that its recording overhead for Apache is approximately 10%, that recording and replaying at the SQLite interface can reduce the log size up to 99% (compared to doing so at the Win32 API), and that using optimization annotations for BitTorrent and MPI applications achieves log size reduction ranging from 13.7% to 99.4%.
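The record-and-replay stubs and the mode bit described above can be sketched in a few lines. This is a hedged illustration, not R2's generated code or its annotation language: the class `Replayer`, its `wrap` method, and the `read_sensor` function are hypothetical names, side effects through arguments are ignored, and the sketch is single-threaded.

```python
import functools

class Replayer:
    """Minimal library-level record/replay with a mode bit (illustrative)."""

    def __init__(self, mode, log=None):
        self.mode = mode                      # "record" or "replay"
        self.log = [] if log is None else list(log)
        self.pos = 0                          # next log entry to replay
        self.in_library = False               # the "mode bit"

    def wrap(self, fn):
        @functools.wraps(fn)
        def stub(*args, **kwargs):
            if self.in_library:
                # Called from inside another wrapped function, so this is
                # not an application-level invocation: run it, log nothing.
                return fn(*args, **kwargs)
            if self.mode == "replay":
                name, result = self.log[self.pos]
                if name != fn.__name__:
                    raise RuntimeError("replay diverged from the recording")
                self.pos += 1
                return result                 # fn itself is not executed
            saved = self.in_library           # save the mode bit
            self.in_library = True
            try:
                result = fn(*args, **kwargs)
            finally:
                self.in_library = saved       # restore the mode bit
            self.log.append((fn.__name__, result))
            return result
        return stub

values = iter([3, 5])

def read_sensor():
    """Stands in for a nondeterministic call; yields 3, then 5, then fails."""
    return next(values)

rec = Replayer("record")
read = rec.wrap(read_sensor)
recorded_total = read() + read()

rep = Replayer("replay", rec.log)
read_again = rep.wrap(read_sensor)
replayed_total = read_again() + read_again()  # served entirely from the log
```

During replay `read_sensor` is never executed (its value source is already exhausted), yet the application computes the same total as in the recording run, which is the property a library-based tool has to guarantee.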
Capo: A Software-Hardware Interface for Practical Deterministic Multiprocessor Replay
In ASPLOS’09
Cited by 23 (1 self)
While deterministic replay of parallel programs is a powerful technique, current proposals have shortcomings. Specifically, software-based replay systems have high overheads on multiprocessors, while hardware-based proposals focus only on basic hardware-level mechanisms, ignoring the overall replay system. To be practical, hardware-based replay systems need to support an environment with multiple parallel jobs running concurrently: some being recorded, others being replayed, and even others running without recording or replay. They also need to manage limited-size log buffers. This paper addresses these shortcomings by introducing, for the first time, a set of abstractions and a software-hardware interface for practical hardware-assisted replay of multiprocessor systems. The approach, called Capo, introduces the novel abstraction of the Replay Sphere to separate the responsibilities of the hardware and software components of the replay system. In this paper, we also design and build CapoOne, a prototype of a deterministic multiprocessor replay system that implements Capo using Linux and simulated DeLorean hardware. Our evaluation of 4-processor executions shows that CapoOne largely records with the efficiency of hardware-based schemes and the flexibility of software-based schemes.
HARD: Hardware-Assisted Lockset-based Race Detection
Cited by 18 (1 self)
The emergence of multicore architectures will lead to an increase in the use of multithreaded applications that are prone to synchronization bugs, such as data races. Software solutions for detecting data races generally incur large overheads. Hardware support for race detection can significantly reduce that overhead. However, all existing hardware proposals for race detection are based on the happens-before algorithm, which is sensitive to thread interleaving and cannot detect races that are not exposed during the monitored run. The lockset algorithm addresses this limitation. Unfortunately, due to challenging issues such as storing the lockset information and performing complex set operations, so far it has been implemented only in software, with a 10-30 times performance hit. This paper proposes the first hardware implementation (called HARD) of the lockset algorithm to exploit the race detection capability of this algorithm with minimal overhead. HARD efficiently stores lock sets in hardware Bloom filters and converts the expensive set operations into fast bitwise logic operations with negligible overhead. We evaluate HARD using six SPLASH-2 applications with 60 randomly injected bugs. Our results show that HARD can detect 54 out of 60 tested bugs, 20% more than happens-before, with only 0.1-2.6% of execution overhead. We also show our hardware design is cost-effective by comparing with the ideal lockset implementation, which would require a large amount of hardware resources.
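The way a Bloom-filter encoding turns lockset intersection into a bitwise AND can be sketched in software. This is a simplified illustration under stated assumptions, not HARD's hardware design: the 16-bit vector width, the one-bit `signature` mapping, and the class `LocksetDetector` are choices made for the example, lock release is handled by recomputing the vector from an exact set, and the per-variable state machine that practical lockset detectors use to tolerate initialization and read sharing is omitted.

```python
BITS = 16                      # assumed width of the bit vector
ALL_ONES = (1 << BITS) - 1     # "every lock is still a candidate"

def signature(lock_id):
    """Map an integer lock ID to a one-bit signature (illustrative hash)."""
    return 1 << (lock_id % BITS)

class LocksetDetector:
    """Lockset race detection with bit-vector candidate sets (sketch)."""

    def __init__(self):
        self.held = {}         # thread -> set of lock IDs currently held
        self.candidates = {}   # variable -> bit vector of candidate locks

    def acquire(self, thread, lock_id):
        self.held.setdefault(thread, set()).add(lock_id)

    def release(self, thread, lock_id):
        self.held.get(thread, set()).discard(lock_id)

    def _held_vector(self, thread):
        vec = 0
        for lock_id in self.held.get(thread, ()):
            vec |= signature(lock_id)          # set union becomes bitwise OR
        return vec

    def access(self, thread, var):
        """Intersect the candidate set with the held locks via bitwise AND.

        Returns True when no common lock remains, i.e. a potential race.
        """
        cand = self.candidates.get(var, ALL_ONES) & self._held_vector(thread)
        self.candidates[var] = cand
        return cand == 0
```

If thread "t1" accesses `x` while holding lock 1 and thread "t2" later accesses `x` while holding only lock 2, the candidate vector becomes zero and a potential race is reported, even if the two accesses never overlapped in time. Because distinct locks can share a bit (locks 1 and 17 collide in this mapping), the compact encoding can miss a race, which is the usual Bloom-filter trade-off between storage and precision.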
SherLog: Error Diagnosis by Connecting Clues from Run-time Logs
Cited by 18 (5 self)
Computer systems often fail due to many factors such as software bugs or administrator errors. Diagnosing such production run failures is an important but challenging task since it is difficult to reproduce them in house due to various reasons: (1) unavailability of users’ inputs and file content due to privacy concerns; (2) difficulty in building the exact same execution environment; and (3) non-determinism of concurrent executions on multi-processors. Therefore, programmers often have to diagnose a production run failure based on logs collected back from customers and the corresponding source code. Such diagnosis requires expert knowledge, and narrowing down root causes is time-consuming and tedious. To address this problem, we propose a tool, called SherLog, that analyzes source code by leveraging information provided by run-time logs to infer what must or may have happened during the failed production run. It requires neither re-execution of the program nor knowledge of the log’s semantics. It infers both control and data value information regarding the failed execution. We evaluate SherLog with 8 representative real world software failures (6 software bugs and 2 configuration errors) from 7 applications including 3 servers. Information inferred by SherLog is very useful for programmers to diagnose these evaluated failures. Our results also show that SherLog can analyze large server applications such as Apache with thousands of logging messages within only 40 minutes.
Transparent, Lightweight Application Execution Replay on Commodity Multiprocessor Operating Systems
Cited by 12 (4 self)
We present Scribe, the first system to provide transparent, low-overhead application record-replay and the ability to go live from replayed execution. Scribe introduces new lightweight operating system mechanisms, rendezvous and sync points, to efficiently record nondeterministic interactions such as related system calls, signals, and shared memory accesses. Rendezvous points make a partial ordering of execution based on system call dependencies sufficient for replay, avoiding the recording overhead of maintaining an exact execution ordering. Sync points convert asynchronous interactions that can occur at arbitrary times into synchronous events that are much easier to record and replay. We have implemented Scribe without changing, relinking, or recompiling applications, libraries, or operating system kernels, and without any specialized hardware support such as hardware performance counters. It works on commodity Linux operating systems, and commodity multi-core and multiprocessor hardware. Our results show for the first time that an operating system mechanism can correctly and transparently record and replay multi-process and multi-threaded applications on commodity multiprocessors. Scribe recording overhead is less than 2.5% for server applications including Apache and MySQL, and less than 15% for desktop applications including Firefox, Acrobat, OpenOffice, parallel kernel compilation, and movie playback.

