Understanding Fault-Tolerant Distributed Systems
- Communications of the ACM
, 1993
Cited by 296 (23 self)
We propose a small number of basic concepts that can be used to explain the architecture of fault-tolerant distributed systems and we discuss a list of architectural issues that we find useful to consider when designing or examining such systems. For each issue we present known solutions and design alternatives, we discuss their relative merits and we give examples of systems which adopt one approach or the other. The aim is to introduce some order in the complex discipline of designing and understanding fault-tolerant distributed systems.
1 Introduction
Computing systems consist of a multitude of hardware and software components that are bound to fail eventually. In many systems, such component failures can lead to unanticipated, potentially disruptive failure behavior and to service unavailability. Some systems are designed to be fault-tolerant: they either exhibit a well-defined failure behavior when components fail or mask component failures to users, that is, continue to provid...
Reaching Agreement on Processor Group Membership in Synchronous Distributed Systems
- Distributed Computing
, 1991
Cited by 125 (14 self)
Reaching agreement on the identity of correctly functioning processors of a distributed system in the presence of random communication delays, failures and processor joins is a fundamental problem in fault-tolerant distributed systems. Assuming a synchronous communication network that is not subject to partition occurrences, we specify the processor-group membership problem and we propose three simple protocols for solving it. The protocols provide all correct processors with consistent views of the processor-group membership and guarantee bounded processor failure detection and join delays.
Key words: Communication network -- Distributed system -- Failure detection -- Fault tolerance -- Real time system -- Replicated data
1 Introduction
When designing a computing service that must remain available despite component failures, a key idea is to replicate service state information at several servers running on distinct processors. The service state typically consists of the ser...
Maintaining Availability in Partitioned Replicated Databases
- ACM Transactions on Database Systems
, 1989
Cited by 91 (3 self)
In a replicated database, a data item may have copies residing on several sites. A replica control protocol is necessary to ensure that data items with several copies behave as if they consist of a single copy, as far as users can tell. We describe a new replica control protocol that allows the accessing of data in spite of site failures and network partitioning. This protocol provides the database designer with a large degree of flexibility in deciding the degree of data availability, as well as the cost of accessing data.
Lazy Replication: Exploiting the Semantics of Distributed Services
- IN IEEE COMPUTER SOCIETY TECHNICAL COMMITTEE ON OPERATING SYSTEMS AND APPLICATION ENVIRONMENTS
, 1990
Cited by 86 (2 self)
To provide high availability for services such as mail or bulletin boards, data must be replicated. One way to guarantee consistency of replicated data is to force service operations to occur in the same order at all sites, but this approach is expensive. In this paper, we propose lazy replication as a way to preserve consistency by exploiting the semantics of the service's operations to relax the constraints on ordering. Three kinds of operations are supported: operations for which the clients define the required order dynamically during the execution, operations for which the service defines the order, and operations that must be globally ordered with respect to both client ordered and service ordered operations. The method performs well in terms of response time, amount of stored state, number of messages, and availability. It is especially well suited to applications in which most operations require only the client-defined order.
RAMBO: A Reconfigurable Atomic Memory Service for Dynamic Networks
- In DISC
, 2002
Cited by 85 (11 self)
This paper presents an algorithm that emulates atomic read/write shared objects in a dynamic network setting. To ensure availability and fault-tolerance, the objects are replicated. To ensure atomicity, reads and writes are performed using quorum configurations, each of which consists of a set of members plus sets of read-quorums and write-quorums. The algorithm is reconfigurable: the quorum configurations may change during computation, and such changes do not cause violations of atomicity. Any quorum configuration may be installed at any time. The algorithm tolerates processor stopping failure and message loss. The algorithm performs three major tasks, all concurrently: reading and writing objects, introducing new configurations, and "garbage-collecting" obsolete configurations.
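The quorum configuration described in this abstract can be illustrated with a small sketch. The `Configuration` class and `is_valid` function are hypothetical names chosen for illustration, not taken from the paper. The check assumes the usual quorum-system requirement that every read-quorum shares at least one member with every write-quorum, which is what lets a read observe the latest completed write.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Configuration:
    """Hypothetical model of a quorum configuration: members plus quorum sets."""
    members: frozenset
    read_quorums: tuple   # tuple of frozensets of members
    write_quorums: tuple  # tuple of frozensets of members


def is_valid(cfg: Configuration) -> bool:
    """Return True if all quorums are drawn from the members and every
    read-quorum intersects every write-quorum (assumed requirement)."""
    all_quorums = cfg.read_quorums + cfg.write_quorums
    drawn_from_members = all(q <= cfg.members for q in all_quorums)
    # An empty intersection is falsy, so any disjoint pair fails the check.
    pairwise_intersect = all(
        r & w for r, w in product(cfg.read_quorums, cfg.write_quorums)
    )
    return drawn_from_members and pairwise_intersect


# Majority quorums over three members: any two majorities overlap.
majorities = (frozenset("ab"), frozenset("bc"), frozenset("ac"))
majority_cfg = Configuration(frozenset("abc"), majorities, majorities)

# A broken configuration: the read-quorum {a} misses the write-quorum {b}.
disjoint_cfg = Configuration(frozenset("abc"), (frozenset("a"),), (frozenset("b"),))
```

Reconfiguration, as the abstract describes it, would replace one such configuration with another at run time; that machinery is not modeled here.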
An economic paradigm for query processing and data migration in Mariposa
- Proc. 3rd International Conf. Parallel and Distributed Information Systems
, 1994
Cited by 80 (1 self)
Many new database applications require very large volumes of data. Mariposa is a database system under construction at Berkeley responding to this need. Mariposa objects can be stored over thousands of autonomous sites and on memory hierarchies with very large capacity. This scale of the system leads to complex query execution and storage management issues, unsolvable in practice with traditional techniques. We propose an economic paradigm as the solution. A query receives a budget which it spends to obtain the answers. Each site attempts to maximize income by buying and selling storage objects, and processing queries for locally stored objects. We present the protocols which underlie the Mariposa economy.
Dynamic quorum adjustment for partitioned data
- ACM Transactions on Database Systems
, 1987
Cited by 68 (2 self)
A partition occurs when functioning sites in a distributed system are unable to communicate. This paper introduces a new method for managing replicated data objects in the presence of partitions. Each operation provided by a replicated object has a set of quorums, which are sets of sites whose cooperation suffices to execute the operation. The method permits an object’s quorums to be adjusted dynamically in response to failures and recoveries. A transaction that is unable to progress using one set of quorums may switch to another, more favorable set, and transactions in different partitions may progress using different sets. This method has three novel aspects: (1) it supports a wider range of quorums than earlier proposals, (2) it scales up effectively to large systems because quorum adjustments do not require global reconfiguration, and (3) it systematically exploits the semantics of typed objects to support more flexible quorum adjustment.
Viewstamped replication: A new primary copy method to support highly available distributed systems
- In 7th Symp. on Princ. of Distr. Comp. (PODC)
, 1988
Cited by 59 (14 self)
One of the potential benefits of distributed systems is their use in providing highly-available services that are likely to be usable when needed. Availability is achieved through replication. By having more than one copy of information, a service continues to be usable even when some copies are inaccessible, for example, because of a crash of the computer where a copy was stored. This paper presents a new replication algorithm that has desirable performance properties. Our approach is based on the primary copy technique. Computations run at a primary, which notifies its backups of what it has done. If the primary crashes, the backups are reorganized, and one of the backups becomes the new primary. Our method works in a general network with both node crashes and partitions. Replication causes little delay in user computations and little information is lost in a reorganization; we use a special kind of timestamp called a viewstamp to detect lost information.
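The primary copy scheme in this abstract can be illustrated with a toy, single-process sketch. It is a simplification under stated assumptions, not the paper's protocol: a viewstamp is modeled as a (view, timestamp) pair compared lexicographically, the hypothetical `Group` class stands in for the network, and `reorganize` simply promotes the surviving replica with the greatest viewstamp.

```python
from typing import NamedTuple


class Viewstamp(NamedTuple):
    """Assumed model: view number first, then a per-view timestamp."""
    view: int
    ts: int


class Group:
    """Hypothetical replica group: one primary, the rest are backups."""

    def __init__(self, names):
        self.logs = {name: [] for name in names}  # per-replica (stamp, op) log
        self.view = 1
        self.ts = 0
        self.primary = names[0]

    def execute(self, op, reachable_backups=None):
        """Run op at the primary and notify the backups that can be reached."""
        self.ts += 1
        stamp = Viewstamp(self.view, self.ts)
        backups = [n for n in self.logs if n != self.primary]
        if reachable_backups is not None:
            backups = [n for n in backups if n in reachable_backups]
        for name in [self.primary] + backups:
            self.logs[name].append((stamp, op))
        return stamp

    def latest(self, name):
        """Greatest viewstamp a replica has recorded."""
        log = self.logs[name]
        return log[-1][0] if log else Viewstamp(0, 0)

    def reorganize(self, crashed):
        """Drop the crashed replica and promote the most up-to-date survivor."""
        del self.logs[crashed]
        self.primary = max(self.logs, key=self.latest)
        self.view += 1
        self.ts = 0
        return self.primary


group = Group(["a", "b", "c"])
group.execute("x")                           # reaches a, b and c
group.execute("y", reachable_backups={"b"})  # c misses this one
new_primary = group.reorganize(crashed="a")  # b knows more than c
```

Comparing viewstamps is what shows that replica `c` is missing the second operation; stamps issued in the new view sort after every stamp from the old one because the view number is compared first.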
High-Level Data Races
- JOURNAL ON SOFTWARE TESTING, VERIFICATION & RELIABILITY (STVR)
, 2003
Cited by 52 (15 self)
Data races are a common problem in concurrent programming. Experience shows that the notion of data race is not powerful enough to capture certain types of inconsistencies occurring in practice. In this paper we investigate data races on a higher abstraction layer. This enables us to detect inconsistent uses of shared variables, even if no classical race condition occurs. For example, a data structure representing a coordinate pair may have to be treated atomically. By lifting

