The Recovery Box: Using Fast Recovery to Provide High Availability in the UNIX Environment
- In Proceedings USENIX Summer Conference, 1992
Cited by 52 (2 self)
As organizations with high system availability requirements move to UNIX, the elimination of down-time in the UNIX environment becomes a more important issue. Designing for fast recovery, rather than crash prevention, can provide low-cost, highly available systems without sacrificing performance or simplicity. In Sprite, a UNIX-like distributed operating system, we accomplish this fast recovery in part through the use of a recovery box: a stable area of memory in which the system stores carefully selected pieces of system state, and from which the system can be regenerated quickly. Error detection using checksums allows the system to revert to its traditional reboot sequence if the recovery box data is corrupted during system failure. Recent statistics about the types and frequencies of operating system failures indicate that fast recovery using the recovery box will be possible most of the time. Using our recovery box implementation, a Sprite file server recovers in 26 seconds and a da...
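The checksum-guarded fallback described in this abstract can be illustrated with a small sketch. It is not Sprite's implementation: the `RecoveryBox` class, its method names, and the use of CRC-32 are assumptions chosen for illustration, and an ordinary Python dictionary stands in for the stable memory area.

```python
import zlib


class RecoveryBox:
    """Toy stand-in for a recovery box (hypothetical API, not Sprite's)."""

    def __init__(self):
        # key -> (payload bytes, CRC-32 computed when the item was stored)
        self._items = {}

    def insert(self, key, payload):
        self._items[key] = (payload, zlib.crc32(payload))

    def corrupt(self, key):
        # Simulate damage during a crash: alter the first byte of a
        # non-empty payload but leave the stored checksum untouched.
        payload, crc = self._items[key]
        damaged = bytes([payload[0] ^ 0xFF]) + payload[1:]
        self._items[key] = (damaged, crc)

    def recover(self):
        """Return the stored state if every checksum matches, else None."""
        state = {}
        for key, (payload, crc) in self._items.items():
            if zlib.crc32(payload) != crc:
                return None
            state[key] = payload
        return state


def restart(box):
    """Pick the recovery path: fast if the box is intact, full reboot otherwise."""
    state = box.recover()
    if state is None:
        return ("full_reboot", {})
    return ("fast_recovery", state)
```

In this sketch a single bad checksum discards the whole box, which mirrors the abstract's statement that the system reverts to its traditional reboot sequence when the recovery box data is corrupted.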
On-Line Software Version Change
- 1994
Cited by 27 (0 self)
what constitutes an "acceptable" behavior of such a process. We capture this notion in our definition of the validity of an on-line change. We define an on-line change to be valid if some time after the change, the process reaches a reachable state of the new program version. Thus, validity ensures that following a change, the process starts behaving like the new version of the program after a "transition period". We first consider validity of on-line changes to programs written in sequential, procedure-based languages. For this purpose, a very simple model in which procedures and functions are not allowed is first considered. State is modelled as a mapping from variable names to values. For this model, we show that it is undecidable to find whether or not a given on-line change is valid. This result has important consequences. It means that computable necessary and sufficient conditions for validity of change cannot be obtained. Undecidability in this simple model also
Software Environments for Cluster-based Display Systems
- First IEEE/ACM International Symposium on Cluster Computing and the Grid, 2001
Cited by 21 (2 self)
An inexpensive way to construct a scalable display wall system is to use a cluster of PCs with commodity graphics accelerators to drive an array of projectors. A challenge is to bring off-the-shelf sequential applications to run on such a display wall efficiently without using expensive, high-performance interconnects. This paper studies two execution models for a scalable display wall system: master-slave and synchronized execution models. We have designed and implemented four software tools, two for each execution model, including VDD (Virtual Display Driver), GLP (GL-DLL Replacement), SSE (System-level Synchronized Execution), and ASE (Application-level Synchronized Execution). In order to support the synchronized execution model, we have also designed a broadcast, speculative file cache to provide scalable I/O performance. The paper reports our experimental results with several 3D applications on the display wall to understand the performance implications and tradeoffs of these methods.
Avoiding the Babbling-Idiot Failure in a Time-Triggered Communication System
- In International Symposium on Fault-Tolerant Computing (FTCS), 1998
Cited by 19 (0 self)
In a distributed hard real-time system based on a broadcast bus for inter-node communication, it is important to prevent a single faulty node from monopolizing the communication bus. In a time-triggered system, in which messages are broadcast according to a predetermined transmission pattern, this kind of failure is characterized by the faulty node transmitting messages at arbitrary points in time, thus corrupting the transmissions on the bus. This type of failure is known as the babbling-idiot failure. Within the presented approach, a special device, the bus guardian, is added to each node to protect the communication bus from the babbling-idiot failure. The regular transmission pattern of a time-triggered system is exploited in order to enforce a fail-silent behaviour of the node in the time domain. The paper describes the requirements imposed on the bus guardian to enforce fail-silent behaviour of the node. The mechanisms of the bus guardian are presented along with the node archit...
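The idea of enforcing fail-silence in the time domain can be sketched as a gate that lets a node drive the bus only inside its own slots of a fixed, repeating round. This is a simplified illustration rather than the paper's design: the `Slot` and `BusGuardian` names, the microsecond units, and the single shared clock are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slot:
    node: str    # node that owns this slot
    start: int   # offset from the start of the round, in microseconds
    length: int  # slot duration, in microseconds


class BusGuardian:
    """Hypothetical guardian: opens the bus only during the node's own slots."""

    def __init__(self, node, schedule, round_length):
        self.node = node
        self.round_length = round_length
        # Precompute the [start, end) windows that belong to this node.
        self.windows = [
            (s.start, s.start + s.length) for s in schedule if s.node == node
        ]

    def may_transmit(self, now):
        # Map absolute time onto the repeating round, then test the windows.
        offset = now % self.round_length
        return any(start <= offset < end for start, end in self.windows)
```

A babbling node that tries to send at an arbitrary time is simply blocked whenever `may_transmit` returns `False`, so its fault cannot corrupt the slots of other nodes. A real guardian would also need its own clock source, independent of the node it guards, which this sketch does not model.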
Dynamic Verification of Cache Coherence Protocols
- 2001
Cited by 17 (0 self)
A method for improving the fault-tolerance of cache coherent multiprocessors is proposed. By dynamically verifying coherence operations in hardware, errors caused by manufacturing faults, soft errors, and design mistakes can be detected. Analogous to the DIVA concept for single-processor systems, a simple version of the protocol functions as a checker for the aggressive implementation. An example implementation is shown, and the overhead is estimated for a small SMP system.
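The abstract does not spell out the checker, so the following is only a software analogy under stated assumptions: instead of a simplified copy of the protocol, it uses a plain invariant check on MSI states (at most one modified copy of a block, and no shared copies alongside it). The names `swmr_holds`, `checked_apply`, and `CoherenceError` are hypothetical.

```python
class CoherenceError(Exception):
    """Raised when a proposed state change breaks the checked invariant."""


def swmr_holds(states):
    """Check single-writer/multiple-reader for one cache block.

    `states` maps a cache id to 'M' (modified), 'S' (shared) or 'I' (invalid).
    """
    modified = sum(1 for s in states.values() if s == "M")
    shared = sum(1 for s in states.values() if s == "S")
    return modified == 0 or (modified == 1 and shared == 0)


def checked_apply(states, cache, new_state):
    """Apply a state change proposed by the fast path only if the checker agrees."""
    candidate = dict(states)
    candidate[cache] = new_state
    if not swmr_holds(candidate):
        raise CoherenceError(f"{cache} -> {new_state} violates the invariant")
    return candidate
```

The point of the sketch is the division of labour: the fast path proposes a transition, and a much simpler check accepts or rejects it before it takes effect.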
Tolerating Operational Faults in Cluster-based FPGAs
- In 8th International ACM/SIGDA Symposium on Field Programmable Gate Arrays, 2000
Cited by 16 (5 self)
In recent years the application space of reconfigurable devices has grown to include many platforms with a strong need for fault tolerance. While these systems frequently contain hardware redundancy to allow for continued operation in the presence of operational faults, the need to recover faulty hardware and return it to full functionality quickly and efficiently is great. In addition to providing functional density, FPGAs provide a level of fault tolerance generally not found in mask-programmable devices by including the capability to reconfigure around operational faults in the field. In this paper, incremental CAD techniques are described that allow functional recovery of FPGA design configurations in the presence of single or multiple operational faults. Our preferred approach to fault recovery takes advantage of device routing hierarchy in architectural families such as Xilinx Virtex [2] and Altera Apex [3] to quickly swap unused logic and routing resources in place of faulty ones within logic clusters. These algorithms allow for straightforward implementation within a local fault-tolerant system without the need to access a remote processing location. If initial recovery attempts through localized swapping fail, an incremental router based on the widely used PathFinder maze routing algorithm [10] can be applied remotely in an attempt to form connections between newly allocated logic and interconnect based on the history of the initial design route.
System Support for Software Fault Tolerance in Highly Available Database Management Systems
- 1992
Cited by 11 (0 self)
Today, software errors are the leading cause of outages in fault tolerant systems. System availability can be improved despite software errors by fast error detection and recovery techniques that minimize total downtime after an outage. This dissertation analyzes software errors in three commercial systems and describes the implementation and evaluation of several techniques for early error detection and fast recovery in a database management system (DBMS). The software error study examines errors reported by customers in three IBM systems programs: the MVS operating system and the IMS DBMS and DB2 DBMS. The study classifies errors by the type of coding mistake and the circumstances in the customer's environment that caused the error to arise. It observes a higher availability impact from addressing errors, such as uninitialized pointers, than from software errors as a whole. It also details the frequencies and types of addressing errors and characterizes the damage they do. The error detec...
Transparent Fault-Tolerant Java Virtual Machine
- 2003
Cited by 6 (0 self)
Replication is one of the prominent approaches for obtaining fault tolerance. Implementing replication on commodity hardware and in a transparent fashion, i.e., without changing the programming model, has many challenges. Deciding at what level to implement the replication has ramifications on development costs and portability of the programs. Other difficulties lie in the coordination of the copies in the face of non-determinism.
Adaptive Fault Recovery for Networked Reconfigurable Systems
- In FCCM ’03: Proceedings of the 11th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, 2003
Cited by 3 (0 self)
The device-level size and complexity of reconfigurable architectures make fault tolerance an important concern in system design. In this paper, we introduce a fully automated fault recovery system for networked systems which contain FPGAs. If a fault is detected that cannot be addressed locally, fault information is transferred to a reconfiguration server. Following design recompilation to avoid the fault, a new FPGA configuration is returned to the remote system and computation is reinitiated. To illustrate the benefit of this approach, we have implemented a complete fault recovery system which requires no manual intervention. An important part of the system is a timing-driven incremental router for Xilinx Virtex devices. This router is directly interfaced to Xilinx JBits and uses no CAD tools from the standard Xilinx Alliance tool flow. Our completed system has been applied to three benchmark designs and exhibits complete fault recovery in up to 12 less time than the standard incremental Xilinx PAR flow.
Software Environments For Running Desktop Applications On A Scalable High-Resolution Display Wall
- 2000
Cited by 2 (0 self)
An inexpensive way to construct a scalable display wall system is to use a cluster of PCs with commodity graphics accelerators to drive an array of projectors. A challenge is to bring off-the-shelf sequential applications to run on such a display wall efficiently without using expensive, high-performance interconnects. This paper studies two execution models for a scalable display wall system: master-slave and synchronized execution models. We have designed and implemented four software tools, two for each execution model, including VDD (Virtual Display Driver), GLP (GL-DLL Replacement), SSE (System-level Synchronized Execution), and ASE (Application-level Synchronized Execution). In order to support the synchronized execution model, we have also designed a broadcast, speculative file cache to provide scalable I/O performance. The paper reports our experimental results with several 2D and 3D applications on the display wall to understand the performance implications and tradeoffs of ...

