Results 1 -
5 of
5
Commercial Fault Tolerance: A Tale of Two Systems
- IEEE Transactions on Dependable and Secure Computing
, 2004
"... Abstract—This paper compares and contrasts the design philosophies and implementations of two computer system families: the IBM S/360 and its evolution to the current zSeries line, and the Tandem (now HP) NonStop1 Server. Both systems have a long history; the initial IBM S/360 machines were shipped ..."
Abstract
-
Cited by 37 (0 self)
- Add to MetaCart
Abstract—This paper compares and contrasts the design philosophies and implementations of two computer system families: the IBM S/360 and its evolution to the current zSeries line, and the Tandem (now HP) NonStop1 Server. Both systems have a long history; the initial IBM S/360 machines were shipped in 1964, and the Tandem NonStop System was first shipped in 1976. They were aimed at similar markets, what would today be called enterprise-class applications. The requirement for the original S/360 line was for very high availability; the requirement for the NonStop platform was for single fault tolerance against unplanned outages. Since their initial shipments, availability expectations for both platforms have continued to rise and the system designers and developers have been challenged to keep up. There were and still are many similarities in the design philosophies of the two lines, including the use of redundant components and extensive error checking. The primary difference is that the S/360-zSeries focus has been on localized retry and restore to keep processors functioning as long as possible, while the NonStop developers have based systems on a loosely coupled multiprocessor design that supports a “fail-fast ” philosophy implemented through a combination of hardware and software, with workload being actively taken over by another resource when one fails. Index Terms—Computer systems implementation, fault tolerance, high availability. 1
Configurable isolation: building high availability systems with commodity multi-core processors
- In Proceedings of the 34th Annual International Symposium on Computer Architecture
, 2007
"... High availability is an increasingly important requirement for enterprise systems, often valued more than performance. Systems designed for high availability typically use redundant hardware for error detection and continued uptime in the event of a failure. Chip multiprocessors with an abundance of ..."
Abstract
-
Cited by 21 (0 self)
- Add to MetaCart
High availability is an increasingly important requirement for enterprise systems, often valued more than performance. Systems designed for high availability typically use redundant hardware for error detection and continued uptime in the event of a failure. Chip multiprocessors with an abundance of identical resources like cores, cache and interconnection networks would appear to be ideal building blocks for implementing high availability solutions on chip. However, doing so poses significant challenges with respect to error containment and faulty component replacement. Increasing silicon and transient fault rates with future technology scaling exacerbate the problem. This paper proposes a novel, costeffective, architecture for high availability systems built from future multi-core processors. We propose a new chip multiprocessor architecture that provides configurable isolation for fault containment and component retirement, based upon cost-effective modifications to commodity designs. The design is evaluated for a state-of-the-art industrial fault model and the proposed architecture is shown to provide effective fault isolation and graceful degradation even when the failure rate is high.
REDAC: Distributed, Asynchronous Redundancy in Shared Memory Servers
"... The emergence of multi-core architectures—driven by continued technology scaling—has led to concerns about increasing soft- and hard-error rates in commodity designs. Because modern chip designs consist of multiple high-speed clock domains, conventional lockstepped redundant execution is no longer p ..."
Abstract
-
Cited by 1 (1 self)
- Add to MetaCart
The emergence of multi-core architectures—driven by continued technology scaling—has led to concerns about increasing soft- and hard-error rates in commodity designs. Because modern chip designs consist of multiple high-speed clock domains, conventional lockstepped redundant execution is no longer practical. Recent work suggests an asynchronous approach to redundant execution, where processor pairs independently execute an instruction stream and treat any differences like soft errors, invoking rollback recovery. Because prior designs buffer instruction results within the out-of-order instruction window, they are limited to tightly coupled redundancy within a single chip, which limits availability and serviceability in the presence of hard errors. We propose REDAC, a set of lightweight mechanisms for distributed, asynchronous redundancy within a sharedmemory multiprocessor. REDAC provides scalable buffering for unchecked state updates, permitting the distribution of redundant execution across multiple nodes of a scalable shared-memory server. The REDAC mechanisms achieve high performance by enabling speculation across common serializing instructions and mitigating the effects of input incoherence. We evaluate REDAC using cycle-accurate fullsystem simulation of common enterprise workloads and show that performance overheads average just 10 % when compared to a non-redundant system. These results are comparable to the performance of a similarly configured lockstep design, but offer the substantial benefits of asynchronous redundancy. 1
The Granularity of Soft-Error Containment in Shared Memory Multiprocessors
"... Abstract—Concerns over rising soft-error rates in processor logic has led to numerous proposals for error tolerance mechanisms. In this paper, we examine the role of soft-error containment in a shared memory multiprocessor. We study a range of design alternatives based on how far outside the process ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
Abstract—Concerns over rising soft-error rates in processor logic has led to numerous proposals for error tolerance mechanisms. In this paper, we examine the role of soft-error containment in a shared memory multiprocessor. We study a range of design alternatives based on how far outside the processor core errors are allowed to propagate. We discuss tradeoffs in recovery complexity and error-free performance that arise from the choice of containment granularity. I.
Chip-Level Redundancy in Distributed Shared-Memory Multiprocessors
"... Abstract—Distributed shared-memory (DSM) multiprocessors provide a scalable hardware platform, but lack the necessary redundancy for mainframe-level reliability and availability. Chip-level redundancy in a DSM server faces a key challenge: the increased latency to check results among redundant compo ..."
Abstract
- Add to MetaCart
Abstract—Distributed shared-memory (DSM) multiprocessors provide a scalable hardware platform, but lack the necessary redundancy for mainframe-level reliability and availability. Chip-level redundancy in a DSM server faces a key challenge: the increased latency to check results among redundant components. To address performance overheads, we propose a checking filter that reduces the number of checking operations impeding the critical path of execution. Furthermore, we propose to decouple checking operations from the coherence protocol, which simplifies the implementation and permits reuse of existing coherence controller hardware. Our simulation results of commercial workloads indicate average performance overhead is within 4 % (9 % maximum) of tightly coupled DMR solutions. I.

