Why Do Computers Stop And What Can Be Done About It?
1985
Cited by 171 (0 self)
An analysis of the failure statistics of a commercially available fault-tolerant system shows that administration and software are the major contributors to failure. Various approaches to software fault-tolerance are then discussed -- notably process-pairs, transactions and reliable storage. It is pointed out that faults in production software are often soft (transient) and that a transaction mechanism combined with persistent process-pairs provides fault-tolerant execution -- the key to software fault-tolerance.
Fault Tolerance In Tandem Computer Systems
1986
Cited by 35 (0 self)
Tandem builds single-fault-tolerant computer systems. At the hardware level, the system is designed as a loosely coupled multi-processor with fail-fast modules connected via dual paths. It is designed for online diagnosis and maintenance. A range of CPUs may be inter-connected via a hierarchical fault-tolerant local network. A variety of peripherals needed for online transaction processing are attached via dual ported controllers. A novel disc subsystem allows a choice between low cost-per-megabyte and low cost-per-access. System software provides processes and messages as the basic structuring mechanism. Processes provide software modularity and fault isolation. Process pairs tolerate hardware and transient software failures. Applications are structured as requesting processes making remote procedure calls to server processes. Process server classes utilize multi-processors. The resulting process abstractions provide a distributed system which can utilize thousands of processors. High-level networking protocols such as SNA, OSI, and a proprietary network are built atop this base. A relational database provides distributed data and distributed transactions. An application generator allows users to develop fault-tolerant applications as though the system were a conventional computer. The resulting system has price/performance competitive with conventional systems.
Efficient Transparent Application Recovery In Client-Server Information Systems
In Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, 1998
Cited by 17 (2 self)
Database systems recover persistent data, providing high database availability. However, database applications, typically residing on client or "middle-tier" application-server machines, may lose work because of a server failure. This prevents the masking of server failures from the human user and substantially degrades application availability. This paper aims to enable high application availability with an integrated method for database server recovery and transparent application recovery in a client-server system. The approach, based on application message logging, is similar to earlier work on distributed system fault tolerance. However, we exploit advanced database logging and recovery techniques and request/reply messaging properties to significantly improve efficiency. Forced log I/Os, frequently required by other methods, are usually avoided. Restart time, for both failed server and failed client, is reduced by checkpointing and log truncation. Our method ensures that a server...
On Energy Management, Load Balancing and Replication
Cited by 6 (5 self)
In this paper we investigate some opportunities and challenges that arise in energy-aware computing in a cluster of servers running data-intensive workloads. We leverage the insight that servers in a cluster are often underutilized, which makes it attractive to consider powering down some servers and redistributing their load to others. Of course, powering down servers naively will render data stored only on powered down servers inaccessible. While data replication can be exploited to power down servers without losing access to data, unfortunately, care must be taken in the design of the replication and server power down schemes to avoid creating load imbalances on the remaining “live” servers. Accordingly, in this paper we study the interaction between energy management, load balancing, and replication strategies for data-intensive cluster computing. In particular, we show that Chained Declustering – a replication strategy proposed more than 20 years ago – can support very flexible energy management schemes.
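As a rough illustration of why a chained placement tolerates powering servers down, the sketch below assumes the usual textbook form of Chained Declustering: the primary copy of fragment i lives on node i and its backup on node (i + 1) mod N. The function names and the six-node cluster are hypothetical and are not taken from the paper; load redistribution, which the abstract also discusses, is not modeled here.

```python
def chained_declustering(num_nodes):
    """Map fragment i to (primary node, backup node).

    Assumed placement rule: primary on node i, backup on node (i + 1) mod N.
    """
    return {i: (i, (i + 1) % num_nodes) for i in range(num_nodes)}


def unavailable_fragments(placement, powered_down):
    """Return fragments whose primary and backup nodes are both powered down."""
    return [
        fragment
        for fragment, (primary, backup) in placement.items()
        if primary in powered_down and backup in powered_down
    ]


placement = chained_declustering(6)

# Powering down every other node keeps one copy of each fragment online.
alternate_off = unavailable_fragments(placement, {0, 2, 4})

# Powering down two adjacent nodes loses the fragment they share.
adjacent_off = unavailable_fragments(placement, {0, 1})

print(alternate_off, adjacent_off)
```

Under this assumed placement, any set of powered-down nodes that contains no two neighbours in the chain leaves every fragment reachable.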
BUILDING DISTRIBUTED DATABASE SYSTEMS
Cited by 1 (0 self)
Several design principles necessary to build high performance and reliable distributed database systems have evolved from conceptual research, prototype implementations, and experimentation during the eighties. This paper focuses on the important aspects of transaction processing, including: communication, concurrency, atomicity, replication, and recovery as they apply to distributed systems. Implementations of these in eleven experimental and commercial distributed systems are briefly presented. Details of a twelfth system called RAID that has been implemented by us are included to show our design and implementation strategies. The relationship between database and operating systems is discussed along with the desirable features in communication software for reliable processing. This material has been presented to demonstrate the practicality of certain successful design and implementation choices so as to benefit those who have the responsibility for making distributed systems work.
NonStop SQL, A Distributed, High-Performance, High-Availability
1987
NonStop SQL is an implementation of ANSI SQL on Tandem Computer Systems. It provides distributed data and distributed execution. It can run on small computers and has been benchmarked at over 200 transactions per second on a large system. Hence, it is usable in both the information center and in production environments. NonStop SQL provides high-availability through a combination of NonStop device support and transaction mechanisms. The combination of SQL semantics and a message-based distributed operating system gives a surprising result: the message savings of a relational interface pay for the extra semantics of the SQL language when compared to record-at-a-time interfaces. This paper presents the system's design rationale, and contrasts it to previous research prototypes and to other SQL implementations.
PN87614 Tandem TR 85.7 Why Do Computers Stop and What Can Be Done About It?
1985
An analysis of the failure statistics of a commercially available fault-tolerant system shows that administration and software are the major contributors to failure. Various approaches to software fault-tolerance are then discussed -- notably process-pairs, transactions and reliable storage. It is pointed out that faults in production software are often soft (transient) and that a transaction mechanism combined with persistent process-pairs provides fault-tolerant execution -- the key to software fault-tolerance. DISCLAIMER This paper is not an "official" Tandem statement on fault-tolerance.

