The Transaction Concept: Virtues and Limitations
, 1981
Cited by 235 (0 self)
ABSTRACT: A transaction is a transformation of state which has the properties of atomicity (all or nothing), durability (effects survive failures) and consistency (a correct transformation). The transaction concept is key to the structuring of data management applications. The concept may have applicability to programming systems in general. This paper restates the transaction concepts and attempts to put several implementation approaches in perspective. It then describes some areas which require further study: (1) the integration of the transaction concept with the notion of abstract data type, (2) some techniques to allow transactions to be composed of subtransactions,
Fundamentals of Fault-Tolerant Distributed Computing in Asynchronous Environments
- ACM Computing Surveys
, 1999
Cited by 57 (9 self)
Fault tolerance in distributed computing is a wide area with a significant body of literature that is vastly diverse in methodology and terminology. This paper aims at structuring the area and thus guiding readers into this interesting field. We use a formal approach to define important terms like fault, fault tolerance, and redundancy. This leads to four distinct forms of fault tolerance and to two main phases in achieving them: detection and correction. We show that this can help to reveal inherently fundamental structures that contribute to understanding and unifying methods and terminology. By doing this, we survey many existing methodologies and discuss their relations. The underlying system model is the close-to-reality asynchronous message-passing model of distributed computing.
Ficus: A Very Large Scale Reliable Distributed File System
- UNIVERSITY OF CALIFORNIA, LOS ANGELES
, 1991
Cited by 45 (7 self)
The dissertation presents the issues addressed in the design of Ficus, a large scale wide area distributed file system currently operational on a modest scale at UCLA. Key aspects of providing such a service include toleration of partial operation in virtually all areas; support for large scale, optimistic data replication; and a flexible, extensible modular design. Ficus incorporates a "stackable layers" modular architecture and full support for optimistic replication. Replication is provided by a pair of layers operating in concert above a traditional filing service. A "volume" abstraction and on-the-fly volume "grafting" mechanism are used to manage the large scale file name space. The replication service uses a f...
Commercial Fault Tolerance: A Tale of Two Systems
- IEEE Transactions on Dependable and Secure Computing
, 2004
Cited by 37 (0 self)
This paper compares and contrasts the design philosophies and implementations of two computer system families: the IBM S/360 and its evolution to the current zSeries line, and the Tandem (now HP) NonStop Server. Both systems have a long history; the initial IBM S/360 machines were shipped in 1964, and the Tandem NonStop System was first shipped in 1976. They were aimed at similar markets, what would today be called enterprise-class applications. The requirement for the original S/360 line was for very high availability; the requirement for the NonStop platform was for single fault tolerance against unplanned outages. Since their initial shipments, availability expectations for both platforms have continued to rise and the system designers and developers have been challenged to keep up. There were and still are many similarities in the design philosophies of the two lines, including the use of redundant components and extensive error checking. The primary difference is that the S/360-zSeries focus has been on localized retry and restore to keep processors functioning as long as possible, while the NonStop developers have based systems on a loosely coupled multiprocessor design that supports a “fail-fast” philosophy implemented through a combination of hardware and software, with workload being actively taken over by another resource when one fails. Index Terms: Computer systems implementation, fault tolerance, high availability.
Fast Cluster Failover Using Virtual Memory-Mapped Communication
- In Proc. 13th International Conference on Supercomputing
, 1999
Cited by 25 (3 self)
This paper proposes a novel way to use virtual memory-mapped communication (VMMC) to reduce the failover time on clusters. With the VMMC model, applications' virtual address space can be efficiently mirrored on remote memory either automatically or via explicit messages. When a machine fails, its applications can restart from the most recent checkpoints on the failover node with minimal memory copying and disk I/O overhead. This method requires little change to applications' source code. We developed two fast failover protocols: deliberate update failover protocol (DU) and automatic update failover protocol (AU). The first can run on any system that supports VMMC, whereas the other requires special network interface support. We implemented these two protocols...
Software Environments for Cluster-based Display Systems
- First IEEE/ACM International Symposium on Cluster Computing and the Grid
, 2001
Cited by 21 (2 self)
An inexpensive way to construct a scalable display wall system is to use a cluster of PCs with commodity graphics accelerators to drive an array of projectors. A challenge is to bring off-the-shelf sequential applications to run on such a display wall efficiently without using expensive, high-performance interconnects. This paper studies two execution models for a scalable display wall system: master-slave and synchronized execution models. We have designed and implemented four software tools, two for each execution model, including VDD (Virtual Display Driver), GLP (GL-DLL Replacement), SSE (System-level Synchronized Execution), and ASE (Application-level Synchronized Execution). In order to support the synchronized execution model, we have also designed a broadcast, speculative file cache to provide scalable I/O performance. The paper reports our experimental results with several 3D applications on the display wall to understand the performance implications and tradeoffs of these methods.
Multi-site Declustering Strategies for Very High Database Service Availability
- Ph.D. thesis, The Norwegian Institute of Technology
, 1995
Cited by 8 (1 self)
The thesis introduces the concept of multi-site declustering strategies with self repair for databases demanding very high service availability. Existing work on declustering strategies is centered on providing high performance and reliability inside a small geographical area (site). Applications demanding robustness against site failures, such as fire and power outages, cannot use these methods. Such applications will often replicate information inside one site and then replicate the whole site at another site, resulting in unnecessarily high redundancy cost. Multi-site declustering provides robustness against site failures with only two replicas of data without compromising performance and reliability. Self repair is proposed for reducing the probability of double failures causing data loss and reducing the need for rapid replacement of failed hardware. A prerequisite for multi-site declustering with self repair is fast, long-distance communication networks like ATM. The thesis shows how existing declustering strategies like Mirrored, Interleaved, Chained, and HypRa declustering can be used as multi-site declustering strategies. In addition, a new strategy called Q-rot declustering is proposed. Compared with the others it gives
Hardware-Supported Fault Tolerance for Multiprocessors
- In Architecture of Computing Systems (ARCS’97)
, 1997
Cited by 2 (0 self)
For a computing system to be dependable, fault tolerance mechanisms have to be included. Massive parallelism in particular represents a new challenge for fault tolerance. In this paper we discuss basic hardware fault tolerance measures for massively parallel multiprocessors and solutions realized for and integrated into different multiprocessor architectures. Further, we present our validation technique for dependability, which is based on simulation-based fault injection.

