Consul: A Communication Substrate for Fault-Tolerant Distributed Programs
- DISTRIBUTED SYSTEMS ENGINEERING JOURNAL, 1991
Cited by 118 (22 self)
Replicating important services on multiple processors in a distributed architecture is a common technique for constructing dependable computing systems. This paper describes a communication substrate, called Consul, that facilitates the development of such systems by providing a collection of fundamental abstractions for constructing fault-tolerant programs based on replicated processing. These abstractions include a multicast service, a membership service, and a recovery service. Consul is unique in two respects. First, its services are implemented using a collection of algorithms that exploit the partial (or causal) ordering of messages exchanged in the system. Such algorithms are generally more efficient than those that depend on a total ordering of events. Second, its underlying architecture is configurable, thereby allowing a system to be structured according to the needs of the application. The paper sketches Consul's architecture, presents the algorithms used by its pr...
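To make the idea of causally ordered multicast concrete, here is a minimal sketch of a receiver that buffers messages until everything they depend on has been delivered. It uses vector timestamps, which is a common textbook technique and an assumption here; the abstract does not say how Consul tracks causality, and the class and method names are hypothetical.

```python
class CausalReceiver:
    """One replica that delivers multicast messages in causal order.

    Assumption: every message carries a vector timestamp `stamp`, where
    stamp[k] counts the messages from process k that the sender had
    delivered (or sent, for k == sender) when it multicast the message.
    """

    def __init__(self, n):
        self.n = n                  # number of processes in the group
        self.clock = [0] * n        # messages delivered so far, per sender
        self.pending = []           # received but not yet deliverable
        self.delivered = []         # payloads in delivery order

    def _deliverable(self, sender, stamp):
        # It must be the next message from `sender`, and nothing it
        # causally depends on may still be missing here.
        return stamp[sender] == self.clock[sender] + 1 and all(
            stamp[k] <= self.clock[k] for k in range(self.n) if k != sender
        )

    def receive(self, sender, stamp, payload):
        self.pending.append((sender, stamp, payload))
        progressed = True
        while progressed:           # one delivery may unblock others
            progressed = False
            for msg in list(self.pending):
                s, st, p = msg
                if self._deliverable(s, st):
                    self.pending.remove(msg)
                    self.clock[s] += 1
                    self.delivered.append(p)
                    progressed = True
```

Messages that are causally unrelated are delivered as soon as they arrive, in whatever order each replica happens to receive them. That freedom is what a partial order allows and a total order does not.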
Coyote: A System for Constructing Fine-Grain Configurable Communication Services
- ACM Transactions on Computer Systems, 1998
Cited by 85 (15 self)
Communication-oriented abstractions such as atomic multicast, group RPC, and protocols for location-independent mobile computing can simplify the development of complex applications built on distributed systems. This paper describes Coyote, a system that supports the construction of highly modular and configurable versions of such abstractions. Coyote extends the notion of protocol objects and hierarchical composition found in existing systems with support for finer-grain objects called micro-protocols that implement individual semantic properties of the target service. A customized service is constructed by selecting micro-protocols based on their semantic guarantees and configuring them together with a standard runtime system to form a composite protocol implementing the service. Micro-protocols within a composite protocol can share data and are executed using an event-driven paradigm that enhances configurability. The overall approach is described and illustrated with exampl...
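The composition model described above can be illustrated with a small sketch: a composite protocol dispatches events to whichever micro-protocols were selected, and the micro-protocols share data through the composite. This is not Coyote's API; `CompositeProtocol`, `register`, `raise_event`, the event name `msg_from_net`, and the two micro-protocols are all hypothetical names chosen for illustration.

```python
from collections import defaultdict


class CompositeProtocol:
    """Hypothetical runtime: dispatches named events to registered handlers."""

    def __init__(self):
        self.handlers = defaultdict(list)
        self.shared = {}            # data visible to all micro-protocols

    def register(self, event, handler):
        self.handlers[event].append(handler)

    def raise_event(self, event, msg):
        # Handlers run in registration order; a handler that returns
        # False stops further processing of this event occurrence.
        for handler in self.handlers[event]:
            if handler(self, msg) is False:
                return False
        return True


def dedup(proto, msg):
    """Micro-protocol for one property: drop messages with a repeated id."""
    seen = proto.shared.setdefault("seen", set())
    if msg["id"] in seen:
        return False
    seen.add(msg["id"])


def deliver(proto, msg):
    """Micro-protocol for one property: hand the body to the application."""
    proto.shared.setdefault("delivered", []).append(msg["body"])


def build(*micro_protocols):
    """Configure a service by choosing which micro-protocols to include."""
    proto = CompositeProtocol()
    for mp in micro_protocols:
        proto.register("msg_from_net", mp)
    return proto
```

Calling `build(dedup, deliver)` yields a service that suppresses duplicates, while `build(deliver)` yields one that does not. The only difference between the two configurations is which micro-protocols were selected.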
A System for Constructing Configurable High-Level Protocols
- In Proceedings of SIGCOMM '95, 1995
Cited by 47 (8 self)
New distributed computing applications are driving the development of more specialized protocols, as well as demanding greater control over the communication substrate. Here, a network subsystem that supports modular, fine-grained construction of high-level protocols such as atomic multicast and group RPC is described. The approach is based on extending the standard hierarchical model of the x-kernel with composite protocols in which micro-protocol objects are composed within a standard runtime framework. Each micro-protocol realizes a separate semantic property, leading to a highly modular and configurable implementation. In contrast with similar systems, this approach provides finer granularity and more flexible inter-object communication. The design and prototype implementation running on Mach are described. Performance results are also given for a micro-protocol suite implementing variants of group RPC.
Tspaces: The next wave
- Hawaii Intl. Conf. on System Sciences (HICSS-32), 1999
Cited by 45 (0 self)
Millions of small heterogeneous computers are poised to spread into the infrastructure of our society. Though mostly inconspicuous today, disguised as nothing more than PIM (personal information management) computers, these tiny processors will eventually pervade most aspects of civilized life. The one thing holding them back from being everyone's portal to the new electronic society and the access point to an infinite store of information is the lack of a high-quality logical link to the world's network backbone. Enter T Spaces, a network middleware package for the new age of ubiquitous computing. T Spaces is a tuplespace-based network communication buffer with database capabilities that enables communication between applications and devices in a network of heterogeneous computers and operating systems. With T Spaces, it is possible to connect all computers together, which leads the way towards an infinitely large cluster of cooperating machines. In this paper we describe the TSpaces package and explore some distributed applications that use T Spaces.
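For readers unfamiliar with tuple spaces, the following is a minimal in-memory sketch of the idea: producers write tuples into a shared space, and consumers retrieve them by matching against a template rather than by addressing the producer. It is not the T Spaces API. The class, its method names, and the matching rules (`None` as a wildcard, a type as a typed wildcard) are assumptions made for illustration, and unlike Linda's blocking operations these calls simply return `None` when nothing matches.

```python
class TupleSpace:
    """Minimal in-memory tuple space with template matching (illustrative)."""

    def __init__(self):
        self._tuples = []

    def write(self, tup):
        self._tuples.append(tuple(tup))

    @staticmethod
    def _matches(template, tup):
        # None matches anything; a type matches any value of that type;
        # any other template field must equal the tuple field.
        if len(template) != len(tup):
            return False
        for want, have in zip(template, tup):
            if want is None:
                continue
            if isinstance(want, type):
                if not isinstance(have, want):
                    return False
            elif want != have:
                return False
        return True

    def read(self, template):
        """Return a matching tuple without removing it, or None."""
        for tup in self._tuples:
            if self._matches(template, tup):
                return tup
        return None

    def take(self, template):
        """Remove and return a matching tuple, or None."""
        tup = self.read(template)
        if tup is not None:
            self._tuples.remove(tup)
        return tup
```

Because a consumer names only the shape of the data it wants, the writer and the reader never need to know about each other, which is what makes the model attractive for heterogeneous devices.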
LIME: A coordination model and middleware supporting mobility of hosts and agents
- ACM Transactions on Software Engineering and Methodology, 2006
Cited by 31 (6 self)
Lime (Linda in a Mobile Environment) is a model and middleware supporting the development of applications that exhibit physical mobility of hosts, logical mobility of agents, or both. Lime adopts a coordination perspective inspired by work on the Linda model. The context for computation, represented in Linda by a globally accessible, persistent tuple space, is refined in Lime to transient sharing of identically-named tuple spaces carried by individual mobile units. Tuple spaces are also extended with a notion of location and programs are given the ability to react to specified states. The resulting model provides a minimalist set of abstractions that facilitate rapid and dependable development of mobile applications. In this paper, we illustrate the model underlying Lime, provide a formal semantic characterization for the operations it makes available to the application developer, present its current design and implementation, and discuss lessons learned in developing applications that involve physical mobility.
Adaptive Scheduling for Task Farming with Grid Middleware
- 1999
Cited by 29 (5 self)
Scheduling in metacomputing environments is an active field of research as the vision of a Computational Grid becomes more concrete. An important class of Grid applications consists of long-running parallel computations with large numbers of somewhat independent tasks (Monte-Carlo simulations, parameter-space searches, etc.). A number of Grid middleware projects are available to implement such applications, but scheduling strategies are still open research issues. This is mainly due to the diversity of both Grid resource types and their availability patterns. The purpose of this work is to develop and validate a general adaptive scheduling algorithm for task farming applications along with a user interface that makes the algorithm accessible to domain scientists. Our algorithm is general in that it is not tailored to a particular Grid middleware and that it requires very few assumptions concerning the nature of the resources. Our first testbed is NetSolve as it allows quick and ea...
LIME: A Coordination Middleware Supporting Mobility of Hosts and Agents
- 2003
Cited by 22 (7 self)
LIME (Linda in a Mobile Environment) is a middleware supporting the development of applications that exhibit physical mobility of hosts, logical mobility of agents, or both. LIME adopts a coordination perspective inspired by work on the Linda model. The context for computation, represented in Linda by a globally accessible, persistent tuple space, is refined in LIME to transient sharing of identically-named tuple spaces carried by individual mobile units. Tuple spaces are also extended with a notion of location and programs are given the ability to react to specified states. The resulting model provides a minimalist set of abstractions that promise to facilitate rapid and dependable development of mobile applications. In this paper, we illustrate the model underlying LIME, provide a formal semantic characterization for the operations it makes available to the application developer, present its current design and implementation, and discuss lessons learned in developing applications that involve physical mobility.
PLinda 2.0: A Transactional/Checkpointing Approach to Fault Tolerant Linda
- In Proceedings of the 13th Symposium on Reliable Distributed Systems, 1994
Cited by 22 (2 self)
Robust parallel computation in Linda requires both tuple space and processes to be resilient to failure. In this paper, we present PLinda 2.0, a set of extensions to Linda to support robust parallel computation on loosely coupled processors communicating over a network. The principal extensions of PLinda 2.0 to Linda are transaction mechanisms for reliable tuple space and process-private logging mechanisms for resilient processes. The transaction mechanisms support two kinds of tuple space: stable tuple space, always guaranteed to reflect the state as of the last committed transaction, and unstable tuple space, protected by a transaction-consistent checkpoint. The process-private logging mechanisms are provided as tools for a process checkpointing scheme. These mechanisms allow the customization of checkpointing and recovery operations in each process to achieve low runtime overhead.
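The guarantee that a stable tuple space reflects the state as of the last committed transaction can be sketched as follows. This is a toy model and not PLinda's interface: the class and method names are hypothetical, only one transaction is assumed to be active at a time, and no concurrency control or logging is modelled.

```python
class TransactionalSpace:
    """Toy tuple space whose visible state changes only on commit."""

    def __init__(self):
        self.committed = []         # state as of the last committed transaction

    def begin(self):
        return Transaction(self)


class Transaction:
    def __init__(self, space):
        self.space = space
        self.working = list(space.committed)    # private copy to work on

    def out(self, tup):
        """Add a tuple inside the transaction."""
        self.working.append(tup)

    def inp(self, tup):
        """Remove an exact tuple if present; return whether it was found."""
        if tup in self.working:
            self.working.remove(tup)
            return True
        return False

    def commit(self):
        # Publish all of the transaction's effects at once.
        self.space.committed = list(self.working)

    def abort(self):
        # Discard the effects; the committed state was never touched.
        self.working = list(self.space.committed)
```

If a worker takes a task tuple and fails before committing, the task is still present in the committed state, so another worker can pick it up. The withdrawal of the task and the publication of its result become visible together or not at all.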
Algorithm-Based Diskless Checkpointing for Fault Tolerant Matrix Operations
- 1995
Cited by 20 (6 self)
This paper is an exploration of diskless checkpointing for distributed scientific computations. With the widespread use of the "Network Of Workstation" (NOW) platform for distributed computing, long-running scientific computations need to tolerate the changing and often faulty nature of NOW environments. We present high-performance implementations of several algorithms for distributed scientific computing, including Cholesky factorization, LU factorization, QR factorization, and preconditioned conjugate gradient. These implementations are able to run on PVM networks of at least N processors, and can complete with low overhead as long as any N processors remain functional. We discuss the details of how the algorithms are tuned for fault-tolerance, and present the performance results on a PVM network of SUN workstations, and on the IBM SP2.
Fault Tolerant Matrix Operations for Networks of Workstations Using Diskless Checkpointing
- 1997
Cited by 19 (11 self)
Networks of workstations (NOWs) offer a cost-effective platform for high-performance, long-running parallel computations. However, these computations must be able to tolerate the changing and often faulty nature of NOW environments. We present high-performance implementations of several fault-tolerant algorithms for distributed scientific computing. The fault-tolerance is based on diskless checkpointing, a paradigm that uses processor redundancy rather than stable storage as the fault-tolerant medium. These algorithms are able to run on clusters of workstations that change over time due to failure, load or availability. As long as there are at least n processors in the cluster, and failures occur singly, the computation will complete in an efficient manner. We discuss the details of how the algorithms are tuned for fault-tolerance and present the performance results on a PVM network of Sun workstations connected by a fast, switched Ethernet.
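One way to see how processor redundancy can stand in for stable storage is a parity checkpoint: an extra processor holds a combination of the other processors' states, from which any single lost state can be rebuilt. The sketch below uses bytewise XOR as the combination. That encoding is an assumption made for illustration, since the abstract does not say which encoding the paper uses, and the function names are hypothetical.

```python
from functools import reduce


def xor_bytes(a, b):
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def parity_checkpoint(states):
    """Checkpoint kept in the extra processor's memory: XOR of all states.

    Assumes every state has the same length.
    """
    return reduce(xor_bytes, states)


def recover(surviving_states, parity):
    """Rebuild the single lost state from the survivors and the parity."""
    return reduce(xor_bytes, surviving_states, parity)
```

XORing the parity with every surviving state cancels those states out and leaves exactly the missing one. This works only when a single processor is lost between checkpoints, which matches the condition that failures occur singly.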

