Results 1 - 10 of 361
Basic concepts and taxonomy of dependable and secure computing
- IEEE TDSC, 2004
Cited by 315 (5 self)
This paper gives the main definitions relating to dependability, a generic concept that includes, as special cases, attributes such as reliability, availability, safety, integrity, and maintainability. Security brings in concerns for confidentiality, in addition to availability and integrity. Basic definitions are given first. They are then commented upon and supplemented by additional definitions, which address the threats to dependability and security (faults, errors, failures), their attributes, and the means for their achievement (fault prevention, fault tolerance, fault removal, fault forecasting). The aim is to explicate a set of general concepts that are relevant across a wide range of situations and therefore help communication and cooperation among a number of scientific and technical communities, including ones that concentrate on particular types of system, of system failures, or of causes of system failures.
Understanding Fault-Tolerant Distributed Systems
- Communications of the ACM, 1993
Cited by 296 (23 self)
We propose a small number of basic concepts that can be used to explain the architecture of fault-tolerant distributed systems, and we discuss a list of architectural issues that we find useful to consider when designing or examining such systems. For each issue we present known solutions and design alternatives, discuss their relative merits, and give examples of systems that adopt one approach or the other. The aim is to introduce some order into the complex discipline of designing and understanding fault-tolerant distributed systems.
1 Introduction
Computing systems consist of a multitude of hardware and software components that are bound to fail eventually. In many systems, such component failures can lead to unanticipated, potentially disruptive failure behavior and to service unavailability. Some systems are designed to be fault-tolerant: they either exhibit a well-defined failure behavior when components fail or mask component failures to users, that is, continue to provid...
Toueg, "Checkpointing and Rollback-Recovery for Distributed Systems"
- IEEE Transactions on Software Engineering, 1987
Cited by 281 (0 self)
We consider the problem of bringing a distributed system to a consistent state after transient failures. We address the two components of this problem by describing a distributed algorithm to create consistent checkpoints, as well as a rollback-recovery algorithm to recover the system to a consistent state. In contrast to previous algorithms, they tolerate failures that occur during their executions. Furthermore, when a process takes a checkpoint, a minimal number of additional processes are forced to take checkpoints. Similarly, when a process rolls back and restarts after a failure, a minimal number of additional processes are forced to roll back with it. Our algorithms require each process to store at most two checkpoints in stable storage. This storage requirement is shown to be minimal under general assumptions.
Index Terms: checkpoint, consistent state, distributed systems, fault-tolerance, rollback-recovery.
Recovery in Distributed Systems Using Optimistic Message Logging and Checkpointing
- 1988
Cited by 199 (13 self)
In a distributed system using message logging and checkpointing to provide fault tolerance, there is always a unique maximum recoverable system state, regardless of the message logging protocol used. The proof of this relies on the observation that the set of system states that have occurred during any single execution of a system forms a lattice, with the sets of consistent and recoverable system states as sublattices. The maximum recoverable system state never decreases, and if all messages are eventually logged, the domino effect cannot occur. This paper presents a general model for reasoning about recovery in such a system and, based on this model, an efficient algorithm for determining the maximum recoverable system state at any time. This work unifies existing approaches to fault tolerance based on message logging and checkpointing, and improves on existing methods for optimistic recovery in distributed systems.
Manetho: Transparent Rollback-Recovery with Low Overhead, Limited Rollback and Fast Output Commit
- IEEE Transactions on Computers, 1992
Cited by 181 (10 self)
Manetho is a new transparent rollback-recovery protocol for long-running distributed computations. It uses a novel combination of antecedence graph maintenance, uncoordinated checkpointing, and sender-based message logging. Manetho simultaneously achieves the advantages of pessimistic message logging, namely limited rollback and fast output commit, and the advantage of optimistic message logging, namely low failure-free overhead. These advantages come at the expense of a complex recovery scheme.
The Performance of Consistent Checkpointing
- In Proceedings of the 11th Symposium on Reliable Distributed Systems, 1992
Cited by 181 (9 self)
Consistent checkpointing provides transparent fault tolerance for long-running distributed applications. In this paper we describe performance measurements of an implementation of consistent checkpointing. Our measurements show that consistent checkpointing performs remarkably well. We executed eight compute-intensive distributed applications on a network of 16 diskless Sun-3/60 workstations, comparing the performance without checkpointing to the performance with consistent checkpoints taken at 2-minute intervals. For six of the eight applications, the running time increased by less than 1% as a result of the checkpointing. The highest overhead measured for any of the applications was 5.8%. Incremental checkpointing and copy-on-write checkpointing were the most effective techniques in lowering the running time overhead. These techniques reduce the amount of data written to stable storage and allow the checkpoint to proceed concurrently with the execution of the processes. The overhead ...
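As a rough illustration of the incremental checkpointing technique named in this abstract, the sketch below writes only the pages that changed since the previous checkpoint. The page-dictionary memory model and the names `IncrementalCheckpointer`, `take_checkpoint`, and `restore` are assumptions made for this example; they are not taken from the paper, and copy-on-write checkpointing is not modeled.

```python
class IncrementalCheckpointer:
    """Toy model (assumed for illustration): memory is a dict of page_id -> bytes."""

    def __init__(self):
        self.stable_storage = []  # list of checkpoint deltas, oldest first
        self._last_seen = {}      # page contents as of the previous checkpoint

    def take_checkpoint(self, memory):
        """Save only new or modified pages; return how many pages were written."""
        changed = {pid: data for pid, data in memory.items()
                   if self._last_seen.get(pid) != data}
        # Pages that existed at the previous checkpoint but are gone now.
        removed = [pid for pid in self._last_seen if pid not in memory]
        self.stable_storage.append({"changed": changed, "removed": removed})
        self._last_seen = dict(memory)  # bytes are immutable, a shallow copy suffices
        return len(changed)

    def restore(self):
        """Rebuild the latest checkpointed memory by replaying all deltas in order."""
        memory = {}
        for delta in self.stable_storage:
            memory.update(delta["changed"])
            for pid in delta["removed"]:
                memory.pop(pid, None)
        return memory


cp = IncrementalCheckpointer()
mem = {0: b"aa", 1: b"bb", 2: b"cc"}
first = cp.take_checkpoint(mem)   # every page is new, so all are written
mem[1] = b"BB"
second = cp.take_checkpoint(mem)  # only the modified page is written
```

In this toy run the second checkpoint writes one page instead of three, which is the kind of reduction in data written to stable storage that the abstract credits for the low overhead.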
Sender-based message logging
Cited by 123 (10 self)
Sender-based message logging is a new low-overhead mechanism for providing transparent fault-tolerance in distributed systems. It differs from conventional message logging mechanisms in that each message is logged in volatile memory on the machine from which the message is sent. Keeping the message log in the sender's local memory allows us to recover from a single failure at a time without the expense of synchronously logging each message to stable storage. The message log is then asynchronously written to stable storage, without delaying the computation, as part of the sender's periodic checkpoint. Maintaining the sender-based message log requires at most one extra network packet over non-fault-tolerant reliable message communication and imposes little additional synchronization delay. It can be applied transparently to existing distributed applications and does not require specialized hardware. It is currently being implemented on a network of SUN workstations.
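The idea in this abstract, keeping each message in the sender's volatile memory so that a failed receiver can be replayed from its peers, can be sketched roughly as follows. The `Process` class, the `recover` helper, and the use of a per-receiver delivery counter to fix the replay order are assumptions for illustration only; checkpoints and the asynchronous flush to stable storage are left out, and recovery restarts the failed process from its initial state.

```python
class Process:
    """Toy process (hypothetical): its state is the list of messages it has consumed."""

    def __init__(self, name):
        self.name = name
        self.send_log = []   # volatile log kept on the sending side
        self.state = []
        self.recv_count = 0

    def send(self, dest, payload):
        entry = {"dest": dest.name, "payload": payload, "order": None}
        self.send_log.append(entry)             # log before the message leaves
        entry["order"] = dest.deliver(payload)  # receiver reports its delivery order

    def deliver(self, payload):
        self.recv_count += 1
        self.state.append(payload)
        return self.recv_count

    def crash(self):
        # A failure wipes everything held in volatile memory.
        self.state, self.send_log, self.recv_count = [], [], 0


def recover(failed, peers):
    """Replay the messages logged at surviving senders in their original delivery order."""
    entries = [e for p in peers for e in p.send_log if e["dest"] == failed.name]
    for e in sorted(entries, key=lambda e: e["order"]):
        failed.deliver(e["payload"])


a, b, c = Process("a"), Process("b"), Process("c")
a.send(c, "m1")
b.send(c, "m2")
a.send(c, "m3")
before = list(c.state)
c.crash()
recover(c, [a, b])
```

Because the logs live in the senders' memory, this sketch only survives one failure at a time, matching the limitation stated in the abstract: if `a` crashed together with `c`, the messages `a` had logged would be lost.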
Closure and Convergence: A Foundation of Fault-Tolerant Computing
- IEEE Transactions on Software Engineering, 1993
Cited by 103 (28 self)
We give a formal definition of what it means for a system to "tolerate" a class of "faults". The definition consists of two conditions: One, if a fault occurs when the system state is within a set of "legal" states, the resulting state is within some larger set and, if faults continue occurring, the system state remains within that larger set (Closure). And two, if faults stop occurring, the system eventually reaches a state within the legal set (Convergence). We demonstrate the applicability of our definition for specifying and verifying the fault-tolerance properties of a variety of digital and computer systems. Further, using the definition, we obtain a simple classification of fault-tolerant systems and discuss methods for their systematic design. as traditionally been studied in the context of specifi...
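For a small finite system, the two conditions in this abstract can be checked by brute force. The sketch below is one possible reading, under the assumptions that the program is deterministic (`program_step` returns a single next state) and that faults are given as a function from a state to the set of states a fault may produce; the function names and the countdown example are invented for illustration.

```python
def closed_under(states, successors):
    """True if every successor of every state in `states` stays in `states`."""
    return all(t in states for s in states for t in successors(s))


def converges(span, legal, program_step):
    """True if, once faults stop, program steps lead from any span state into `legal`."""
    for start in span:
        s = start
        # With a deterministic program, len(span) steps suffice: any longer
        # run outside `legal` would have to revisit a state and loop forever.
        for _ in range(len(span)):
            if s in legal:
                break
            s = program_step(s)
        if s not in legal:
            return False
    return True


def tolerates(legal, span, program_step, fault_successors):
    """Closure of the legal set and the larger span, plus convergence back to legal."""
    program = lambda s: {program_step(s)}
    any_step = lambda s: program(s) | fault_successors(s)
    return (legal <= span
            and closed_under(legal, program)
            and closed_under(span, any_step)
            and converges(span, legal, program_step))


# Hypothetical example: a counter that should rest at 0; a fault may set it to 0..3.
countdown = lambda x: max(x - 1, 0)
corrupt = lambda x: {0, 1, 2, 3}
result = tolerates({0}, {0, 1, 2, 3}, countdown, corrupt)
```

Here `result` is `True`: faults never push the counter outside `{0, 1, 2, 3}` (Closure), and counting down always returns it to `0` once faults stop (Convergence). Replacing `countdown` with the identity function breaks Convergence, and shrinking the span to `{0, 1}` breaks Closure.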
Implementing atomic actions on decentralized data
- ACM Transactions on Computer Systems, 1983
Cited by 90 (3 self)
Synchronization of accesses to shared data and recovering the state of such data in the case of failures are really two aspects of the same problem: implementing atomic actions on a related set of data items. This paper describes a mechanism that solves both problems simultaneously in a way that is compatible with the requirements of decentralized systems. In particular, the correct construction and execution of a new atomic action can be accomplished without knowledge of all other atomic actions in the system that might execute concurrently. Further, the mechanisms degrade gracefully if parts of the system fail: only those atomic actions that require resources in failed parts of the system are prevented from executing, and there is no single coordinator that can fail and bring down the whole system.
Randomized Instruction Set Emulation To Disrupt Binary . . .
- ACM Transactions on Information System Security, 2003
Cited by 88 (3 self)
Many remote attacks against computer systems inject binary code into the execution path of a running program, gaining control of the program's behavior. If each defended system or program could use a machine instruction set that was both unique and private, such binary code injection attacks would become extremely difficult if not impossible. A binary-to-binary translator provides an economic and flexible implementation path for realizing that idea. As a proof of concept, we describe a randomized instruction set emulator (RISE) based on the open-source Valgrind x86-to-x86 binary translator. Although currently very slow and memory-intensive, our prototype RISE can indeed disrupt binary code injection attacks against a program without requiring its recompilation, linking, or access to source code. We describe the RISE implementation, give evidence demonstrating that RISE defeats common attacks, consider consequences of the dense x86 instruction set on the method's effects, and discuss limitations of the RISE prototype as well as design tradeoffs and extensions of the underlying idea.
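The core idea in this abstract, a private per-program instruction encoding that injected code cannot match, can be illustrated with a toy interpreter. Everything in the sketch is an assumption made for the example: the four-opcode instruction set, the XOR-with-key scrambling, and the names `scramble` and `run`. It does not reproduce the Valgrind-based RISE prototype.

```python
# Hypothetical four-instruction machine; real x86 is far denser.
VALID_OPCODES = {0x01: "LOAD", 0x02: "ADD", 0x03: "STORE", 0x04: "HALT"}


def scramble(code: bytes, key: bytes) -> bytes:
    """Encode a trusted program at load time by XOR-ing it with a private key."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(code))


def run(memory: bytes, key: bytes) -> list:
    """Fetch, de-scramble, and 'execute' each byte; reject anything that is not an opcode."""
    trace = []
    for i, b in enumerate(memory):
        op = b ^ key[i % len(key)]  # de-scramble at fetch time
        if op not in VALID_OPCODES:
            raise ValueError(f"illegal instruction {op:#04x} at offset {i}")
        trace.append(VALID_OPCODES[op])
        if VALID_OPCODES[op] == "HALT":
            break
    return trace


key = bytes([0xA5, 0x3C, 0x7E, 0x91])   # fixed here so the example is repeatable
program = bytes([0x01, 0x02, 0x03, 0x04])

trusted = run(scramble(program, key), key)  # loaded through the scrambler

try:
    run(program, key)                       # injected raw, never scrambled
    injected_ran = True
except ValueError:
    injected_ran = False
```

The trusted program decodes back to its original instructions, while the same bytes injected without the key decode to an invalid opcode and are rejected. With only four valid opcodes the toy machine faults almost immediately; the abstract notes that the dense x86 instruction set changes how the method behaves in practice.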

