Results 1 - 10 of 89
A Survey of Rollback-Recovery Protocols in Message-Passing Systems
, 1996
"... this paper, we use the terms event logging and message logging interchangeably ..."
Abstract
-
Cited by 716 (22 self)
- Add to MetaCart
this paper, we use the terms event logging and message logging interchangeably
Detecting Causal Relationships in Distributed Computations: In Search of the Holy Grail
- Distributed Computing
, 1994
"... The paper shows that characterizing the causal relationship between significant events is an important but non-trivial aspect for understanding the behavior of distributed programs. An introduction to the notion of causality and its relation to logical time is given; some fundamental results concern ..."
Abstract
-
Cited by 230 (3 self)
- Add to MetaCart
(Show Context)
The paper shows that characterizing the causal relationship between significant events is an important but non-trivial aspect of understanding the behavior of distributed programs. An introduction to the notion of causality and its relation to logical time is given, and some fundamental results concerning the characterization of causality are presented. Recent work on the detection of causal relationships in distributed computations is surveyed. The issue of observing distributed computations in a causally consistent way and the basic problems of detecting global predicates are discussed. To illustrate the major difficulties, some typical monitoring and debugging approaches are assessed, and it is demonstrated how their feasibility is severely limited by the fundamental problem of mastering the complexity of causal relationships.
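As a rough illustration of the causality notions this abstract refers to (not taken from the paper itself), the sketch below compares vector timestamps to decide whether one event causally precedes another or whether the two are concurrent; the event timestamps are invented for the example.

```python
# Minimal sketch (illustration only): comparing vector timestamps to decide
# whether one event causally precedes another or the two are concurrent.
# Assumes each event carries a vector clock with one entry per process.

def happened_before(u, v):
    """True if the event stamped u causally precedes the event stamped v."""
    return all(a <= b for a, b in zip(u, v)) and u != v

def concurrent(u, v):
    """True if neither event causally precedes the other."""
    return not happened_before(u, v) and not happened_before(v, u)

# Example: e1 on process 0 and e2 on process 1, with no message between them.
e1 = [1, 0, 0]
e2 = [0, 1, 0]
print(concurrent(e1, e2))                     # True: causally unrelated events
print(happened_before([1, 0, 0], [2, 1, 0]))  # True: e1 is in the causal past
```

Componentwise comparison of vector clocks is one of the fundamental results alluded to above: it characterizes the causal (happened-before) relation exactly, which is why monitoring and debugging tools lean on it.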
Jockey: A user-space library for record-replay debugging
- In AADEBUG’05: Proceedings of the Sixth International Symposium on Automated Analysis-Driven Debugging
, 2005
"... Jockey is an execution record/replay tool for debugging Linux programs. It records invocations of system calls and CPU instructions with timing-dependent effects and later replays them deterministically. It supports process checkpointing to diagnose long-running programs efficiently. Jockey is imple ..."
Abstract
-
Cited by 77 (0 self)
- Add to MetaCart
(Show Context)
Jockey is an execution record/replay tool for debugging Linux programs. It records invocations of system calls and CPU instructions with timing-dependent effects and later replays them deterministically. It supports process checkpointing to diagnose long-running programs efficiently. Jockey is implemented as a shared-object file that runs as part of the target process. While this design is key to achieving Jockey’s goals of safety and ease of use, it also poses challenges. This paper discusses some of the practical issues we needed to overcome in such environments, including low-overhead system-call interception, techniques for segregating resource usage between Jockey and the target process, and an interface for fine-grain control of Jockey’s behavior.
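The sketch below illustrates only the general record/replay principle the abstract describes, not Jockey's actual in-process system-call interception: nondeterministic results are appended to a log during recording and returned from the log during replay. The `Recorder` class, its file name, and the use of `time.time` are all invented for the example.

```python
# Minimal sketch of the record/replay principle (not Jockey's mechanism).
# Recording: results of nondeterministic calls are appended to a log.
# Replay: the logged results are returned instead, making the run repeatable.
import json, time

class Recorder:
    def __init__(self, mode, path="trace.log"):
        self.mode, self.path = mode, path
        self.log = [] if mode == "record" else json.load(open(path))

    def call(self, fn, *args):
        if self.mode == "record":
            result = fn(*args)      # perform the real, nondeterministic call
            self.log.append(result)
            return result
        return self.log.pop(0)      # replay: return the recorded result

    def save(self):
        if self.mode == "record":
            json.dump(self.log, open(self.path, "w"))

# Recording run: the current time is captured in the log.
rec = Recorder("record")
t = rec.call(time.time)
rec.save()

# Replay run: the same value is returned, so the execution is deterministic.
rep = Recorder("replay")
assert rep.call(time.time) == t
```

A real tool does this transparently at the system-call boundary inside the target process; the point here is only that determinism comes from substituting logged values for live nondeterministic ones.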
Parallel Performance Prediction using Lost Cycles Analysis
- In Proceedings of Supercomputing '94
, 1994
"... Most performance debugging and tuning of parallel programs is based on the "measure-modify" approach, which is heavily dependent on detailed measurements of programs during execution. This approach is extremely time-consuming and does not lend itself to predicting performance under varying ..."
Abstract
-
Cited by 72 (1 self)
- Add to MetaCart
(Show Context)
Most performance debugging and tuning of parallel programs is based on the "measure-modify" approach, which is heavily dependent on detailed measurements of programs during execution. This approach is extremely time-consuming and does not lend itself to predicting performance under varying conditions. Analytic modeling and scalability analysis provide predictive power, but are not widely used in practice, due primarily to their emphasis on asymptotic behavior and the difficulty of developing accurate models that work for real-world programs. In this paper we describe a set of tools for performance tuning of parallel programs that bridges this gap between measurement and modeling. Our approach is based on lost cycles analysis, which involves measurement and modeling of all sources of overhead in a parallel program. We first describe a tool for measuring overheads in parallel programs that we have incorporated into the runtime environment for Fortran programs on the Kendall Square KSR1. We then describe a tool that fits these overhead measurements to analytic forms. We illustrate the use of these tools by analyzing the performance tradeoffs among parallel implementations of 2D FFT. These examples show how our tools enable programmers to develop accurate performance models of parallel applications without requiring extensive performance modeling expertise.
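As an illustrative sketch of the idea behind lost cycles analysis (the numbers and overhead category names below are made up, not from the paper), the processor-time spent beyond the pure computation on p processors can be treated as the total overhead to be attributed to measured categories.

```python
# Illustrative sketch of the lost-cycles idea (all numbers are hypothetical):
# total lost cycles are the processor-time spent beyond the pure computation,
# i.e. p * Tp - T1, which the analysis attributes to overhead categories.

T1 = 100.0            # sequential (pure computation) time, seconds
p = 8                 # number of processors
Tp = 16.0             # measured parallel execution time on p processors

lost = p * Tp - T1    # 28 processor-seconds of overhead in total
overheads = {         # hypothetical measured breakdown of the lost cycles
    "load_imbalance": 12.0,
    "communication": 10.0,
    "synchronization": 6.0,
}

assert abs(sum(overheads.values()) - lost) < 1e-9
speedup = T1 / Tp                  # 6.25
efficiency = speedup / p           # ~0.78
print(f"lost cycles: {lost:.1f} processor-seconds, efficiency: {efficiency:.2f}")
```

Fitting each category to an analytic form in the problem and machine parameters is what turns such measurements into a predictive model.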
Requirements for Data-Parallel Programming Environments
, 1994
"... this paper is to convey an understanding of the tools and strategies that will be needed to adequately support efficient, machineindependent data-parallel programming. To achieve our goal, we will examine the requirements for such tools and describe promising implementation strategies for meeting th ..."
Abstract
-
Cited by 28 (10 self)
- Add to MetaCart
The goal of this paper is to convey an understanding of the tools and strategies that will be needed to adequately support efficient, machine-independent data-parallel programming. To achieve our goal, we will examine the requirements for such tools and describe promising implementation strategies for meeting these requirements.
The Search for Lost Cycles: A New Approach to Parallel Program Performance Evaluation
- In Proceedings of Supercomputing '94
, 1993
"... Traditional performance debugging and tuning of parallel programs is based on the "measuremodify " approach, in which detailed measurements of program executions are used to guide incremental changes to the program that result in better performance. Unfortunately, the performance of a para ..."
Abstract
-
Cited by 26 (3 self)
- Add to MetaCart
Traditional performance debugging and tuning of parallel programs is based on the "measure-modify" approach, in which detailed measurements of program executions are used to guide incremental changes to the program that result in better performance. Unfortunately, the performance of a parallel algorithm is often related to its implementation, input data, and machine characteristics in surprising ways, and the "measure-modify" approach is unsuited to exploring these relationships fully: it is too heavily dependent on experimentation and measurement, which is impractical for studying the large number of variables that can affect parallel program performance. In this paper we argue that the problem of selecting the best implementation of a parallel algorithm requires a new approach to parallel program performance evaluation, one with a greater balance between measurement and modeling. We first present examples that demonstrate that different parallelizations of a program may be necessary ...
Re-execution of Distributed Programs to Detect Bugs Hidden by Racing Messages
- In Proceedings of the International Conference on System Sciences
, 1997
"... Finding errors in non-deterministic programs is complicated by the fact that an anomaly may occur during one program execution, and not the next. Our objective is to provide a practical yet powerful testing environment for distributed systems, using re-execution. We focus on re-executing the program ..."
Abstract
-
Cited by 20 (2 self)
- Add to MetaCart
(Show Context)
Finding errors in non-deterministic programs is complicated by the fact that an anomaly may occur during one program execution and not the next. Our objective is to provide a practical yet powerful testing environment for distributed systems, using re-execution. We focus on re-executing the program under a strictly different message ordering. We show that messages are grouped into waves, such that any two messages from different waves must always be received in the same order. We provide an algorithm that produces a re-execution that maximizes the number of reordered pairs of message delivery events. We also provide an efficient online algorithm for detecting racing messages.
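The sketch below shows one generic way to flag a pair of racing messages using vector timestamps; it is an illustration of the race notion, not the paper's wave-construction or reordering algorithm, and the timestamps are invented for the example.

```python
# Generic sketch (not the paper's algorithm): two messages delivered to the
# same process race when the second message's send is not causally after the
# first message's delivery, so the deliveries could legally occur in the
# other order. Vector timestamps are assumed to be available.

def happened_before(u, v):
    return all(a <= b for a, b in zip(u, v)) and u != v

def races(deliver_first_vc, send_second_vc):
    """m1 delivered before m2; they race unless deliver(m1) -> send(m2)."""
    return not happened_before(deliver_first_vc, send_second_vc)

# m1 and m2 are both delivered to process 2; m2 was sent with no causal
# knowledge of m1's delivery, so their order could flip on re-execution.
deliver_m1 = [1, 0, 1]   # vector clock at process 2 after delivering m1
send_m2    = [0, 1, 0]   # vector clock of m2's send event at process 1
print(races(deliver_m1, send_m2))   # True: the two messages race
```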
//TRACE: Parallel trace replay with approximate causal events
- In Proceedings of the 5th USENIX Symposium on File and Storage Technologies (FAST’07)
, 2007
"... //TRACE 1 is a new approach for extracting and replaying traces of parallel applications to recreate their I/O behavior. Its tracing engine automatically discovers inter-node data dependencies and inter-I/O compute times for each node (process) in an application. This information is reflected in per ..."
Abstract
-
Cited by 19 (1 self)
- Add to MetaCart
(Show Context)
//TRACE is a new approach for extracting and replaying traces of parallel applications to recreate their I/O behavior. Its tracing engine automatically discovers inter-node data dependencies and inter-I/O compute times for each node (process) in an application. This information is reflected in per-node annotated I/O traces. Such annotation allows a parallel replayer to closely mimic the behavior of a traced application across a variety of storage systems. When compared to other replay mechanisms, //TRACE offers significant gains in replay accuracy. Overall, the average replay error for the parallel applications evaluated in this paper is below 6%.
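As a toy sketch of what replaying an annotated I/O trace involves, the code below walks one node's trace, waits for the I/Os each record depends on, sleeps for the recorded inter-I/O compute time, and then issues a stand-in for the I/O. The record format and field names are invented for illustration and are not //TRACE's actual trace format.

```python
# Toy sketch of replaying one node's annotated I/O trace (record format is
# invented for illustration). Each record waits on its dependencies, sleeps
# for the recorded compute time, then issues its I/O.
import time

trace = [
    {"id": "io1", "deps": [],      "compute_s": 0.01, "op": ("read",  "/tmp/a", 4096)},
    {"id": "io2", "deps": ["io1"], "compute_s": 0.02, "op": ("write", "/tmp/b", 4096)},
]

def replay(trace, completed=None):
    completed = completed if completed is not None else set()
    for rec in trace:
        # In a parallel replayer the dependencies may complete on other
        # nodes; here everything is local, so we just check the set.
        assert all(d in completed for d in rec["deps"]), "dependency not replayed yet"
        time.sleep(rec["compute_s"])        # reproduce the inter-I/O compute time
        kind, path, size = rec["op"]
        print(f"replaying {kind} of {size} bytes on {path}")  # stand-in for real I/O
        completed.add(rec["id"])

replay(trace)
```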
Predicate control for active debugging of distributed programs
, 1998
"... Existing approaches to debugging distributed systems involve a cycle of passive observation followed by computation replaying. We propose predicate control as an active approach to debugging such systems. The predicate control approach involves a cycle of observation followed by controlled replaying ..."
Abstract
-
Cited by 18 (9 self)
- Add to MetaCart
(Show Context)
Existing approaches to debugging distributed systems involve a cycle of passive observation followed by computation replay. We propose predicate control as an active approach to debugging such systems. The predicate control approach involves a cycle of observation followed by controlled replaying of computations based on that observation. We formalize the predicate control problem for both off-line and on-line scenarios. We prove that off-line predicate control for general boolean predicates is NP-hard. However, we provide an efficient solution for off-line predicate control for the class of disjunctive predicates. We further solve on-line predicate control for disjunctive predicates under certain restrictions on the system. Lastly, we demonstrate how both off-line and on-line predicate control facilitate distributed debugging by allowing the programmer to control computations to maintain global safety properties.
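To make the class of predicates mentioned above concrete, the sketch below checks a disjunctive global predicate (one that holds whenever at least one process's local predicate holds) over an observed run; it only detects a violation, it does not implement the paper's control algorithm, and the token-passing property and state encoding are hypothetical.

```python
# Small sketch of a disjunctive global safety predicate: it holds in a global
# state when at least one process's local predicate holds. This only checks
# an observed run; the paper's contribution is controlling a re-execution so
# that no reachable global state violates the predicate.

def disjunction_holds(global_state, local_predicate):
    """global_state maps process id -> local state."""
    return any(local_predicate(s) for s in global_state.values())

# Hypothetical example: "at least one process holds the token" as the safety
# property; a state where no process holds it is a violation worth debugging.
holds_token = lambda s: s["has_token"]
run = [
    {0: {"has_token": True},  1: {"has_token": False}},
    {0: {"has_token": False}, 1: {"has_token": False}},   # violation
    {0: {"has_token": False}, 1: {"has_token": True}},
]
violations = [i for i, g in enumerate(run) if not disjunction_holds(g, holds_token)]
print(violations)   # [1]
```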