Results 11 - 20 of 61
Efficient and flexible object sharing
- In Proceedings of the 1996 International Conference on Parallel Processing
, 1996
"... ..."
Deploying Fault Tolerance and Task Migration with NetSolve
, 1999
"... Computational power grids are computing environments with massive resources for processing and storage. While these resources may be pervasive, harnessing them is a major challenge for the average user. NetSolve is a software environment that addresses this concern. A fundamental feature of NetSolve ..."
Abstract - Cited by 11 (2 self)
Computational power grids are computing environments with massive resources for processing and storage. While these resources may be pervasive, harnessing them is a major challenge for the average user. NetSolve is a software environment that addresses this concern. A fundamental feature of NetSolve is its integration of fault-tolerance and task migration in a way that is transparent to the end user. In this paper, we discuss how NetSolve's structure allows for the seamless integration of fault-tolerance and migration in grid applications, and present the specific approaches that have been and are currently being implemented within NetSolve. Keywords: Fault-tolerance, Scientific Computing, Computational Servers, Checkpointing, Migration.
The region trap library: Handling traps on application-defined regions of memory
- In Proceedings of the 1999 USENIX Annual Technical Conference
, 1999
"... User-level virtual memory (VM) primitives are used in many different application domains including distributed shared memory, persistent objects, garbage collection, and checkpointing. Unfortunately, VM primitives only allow traps to be handled at the granularity of fixedsized pages defined by the o ..."
Abstract - Cited by 8 (1 self)
User-level virtual memory (VM) primitives are used in many different application domains including distributed shared memory, persistent objects, garbage collection, and checkpointing. Unfortunately, VM primitives only allow traps to be handled at the granularity of fixed-sized pages defined by the operating system and architecture. In many cases, this results in a size mismatch between pages and application-defined objects that can lead to a significant loss in performance. In this paper we describe the design and implementation of a library that provides, at the granularity of application-defined regions, the same set of services that are commonly available at a page granularity using VM primitives. Applications that employ the interface of this library, called the Region Trap Library (RTL), can create and use multiple objects with different levels of protection (i.e., invalid, read-only, or read-write) that reside on the same virtual memory page, and trap only on read/write references to objects in an invalid state or write references to objects in a read-only state. All other references to these objects proceed at hardware speeds. Benchmarks of an implementation on five different OS/architecture combinations are presented, along with a case study using region trapping within a distributed shared memory (DSM) system to implement a region-based version of the lazy release consistency (LRC) coherence protocol. Together, the benchmark results and the DSM case study suggest that region trapping mechanisms provide a feasible region-granularity alternative for application domains that commonly rely on page-based virtual memory primitives.
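The abstract's central idea can be sketched without real VM hardware: keep a per-region protection table so that several regions on one page can carry different protections, and consult it to decide which accesses trap. A real implementation would use mprotect() and a SIGSEGV handler; the plain-Python table below only illustrates the trap decision, and all names are hypothetical, not the RTL API.

```python
# Sketch of the per-region protection bookkeeping the abstract describes.
# Only accesses that violate a region's protection trap; everything else
# would proceed at hardware speed in the real library.

INVALID, READ_ONLY, READ_WRITE = "invalid", "read-only", "read-write"

class RegionTable:
    def __init__(self):
        self.regions = []  # [start, end, protection] triples

    def register(self, start, size, prot):
        self.regions.append([start, start + size, prot])

    def access(self, addr, is_write):
        """Return True if this access would trap (need a handler)."""
        for start, end, prot in self.regions:
            if start <= addr < end:
                if prot == INVALID:
                    return True           # any access to an invalid region traps
                if prot == READ_ONLY and is_write:
                    return True           # writes to a read-only region trap
                return False              # permitted: no trap
        return False                      # untracked memory: never traps

# Two regions that could share a single VM page, with different protections:
table = RegionTable()
table.register(0x1000, 64, READ_ONLY)
table.register(0x1040, 64, INVALID)
```

The point of the table is exactly the size-mismatch fix the paper claims: a page-granularity mprotect() would force both regions to share one protection, while the region table distinguishes them.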
The Boundary-Restricted Coherence Protocol for Scalable and Highly Available Distributed Shared Memory Systems
, 1996
"... Larger size networks require Distributed Shared Memory (DSM) coherence protocols which scale well. Fault-tolerance in terms of high availability is required for data access and for uninterrupted DSM service since large-scale environments have a greater number of potentially malfunctioning components ..."
Abstract - Cited by 8 (2 self)
Larger networks require Distributed Shared Memory (DSM) coherence protocols that scale well. Fault-tolerance in terms of high availability is required for data access and for uninterrupted DSM service, since large-scale environments have a greater number of potentially malfunctioning components. We present a new class of coherence protocols for DSM systems whose instances offer highly available access to shared data at low operation costs. The proposed protocols scale well; an increase in the number of client sites does not increase the operation costs after a certain threshold has been reached. The results presented in this paper give strong guidelines for the overall design of DSM systems which offer highly available, uninterrupted services. Keywords: Distributed Systems, Scalability, Fault-Tolerance, Availability, Distributed Shared Memory, Coherence Protocols
Deriving Optimal Checkpoint Protocols for Distributed Shared Memory Architectures
- In Proceedings of the 1995 ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing (PODC)
, 1995
"... . Uncoordinated checkpointing is one technique used to build processes that can recover to a consistent state after crashing. This technique requires each process to periodically record its state in a checkpoint. Furthermore, the threads executing on each process log any nondeterministic action that ..."
Abstract - Cited by 7 (5 self)
Uncoordinated checkpointing is one technique used to build processes that can recover to a consistent state after crashing. This technique requires each process to periodically record its state in a checkpoint. Furthermore, the threads executing on each process log any nondeterministic action that they take following the latest checkpointed state. When a process crashes, a new process, initialized with the appropriate recorded local state, is created in its place. The new process restarts executing, and whenever one of its threads confronts a nondeterministic choice, the thread references the log in order to reproduce the same action performed before the crash. Thus, uncoordinated checkpointing implements an abstraction of a resilient process in which the crash of a process is translated into intermittent unavailability of that process. We give a specification of the consistency property "no orphan threads" in the context of multithreaded processes running on a shared memory multipro...
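The checkpoint-plus-log mechanism the abstract outlines can be sketched in a few lines: record a checkpoint, log each nondeterministic choice taken after it, and on restart replay the log so the recovered process reproduces its pre-crash actions. The class below is a minimal single-process illustration with made-up names; it is not the paper's protocol, which handles multiple threads and orphan detection.

```python
# Sketch of checkpoint + nondeterminism-log recovery: after a crash, the
# process restarts from its checkpoint and replays logged choices instead
# of making fresh ones, reproducing its pre-crash execution.
import copy
import random

class ResilientProcess:
    def __init__(self, state=0):
        self.state = state
        self.checkpoint_state = copy.deepcopy(state)
        self.log = []            # nondeterministic choices since the checkpoint
        self.replay_pos = None   # set while recovering

    def checkpoint(self):
        self.checkpoint_state = copy.deepcopy(self.state)
        self.log.clear()         # earlier entries can never be replayed again

    def choose(self):
        """One nondeterministic step: logged normally, replayed on recovery."""
        if self.replay_pos is not None and self.replay_pos < len(self.log):
            value = self.log[self.replay_pos]    # deterministic replay
            self.replay_pos += 1
        else:
            self.replay_pos = None
            value = random.randint(0, 9)         # fresh nondeterminism
            self.log.append(value)
        self.state += value
        return value

    def recover(self):
        """Simulate a crash: reload the checkpoint, then replay the log."""
        self.state = copy.deepcopy(self.checkpoint_state)
        self.replay_pos = 0

p = ResilientProcess()
p.checkpoint()
pre_crash = [p.choose(), p.choose()]   # two nondeterministic steps, logged
pre_crash_state = p.state
p.recover()                            # crash and restart
replayed = [p.choose(), p.choose()]    # same values come back from the log
```

To an outside observer the crash looks like temporary unavailability: `replayed` equals `pre_crash` and the recovered state matches, which is exactly the "resilient process" abstraction the abstract describes.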
Design and Analysis of Highly Available and Scalable Coherence Protocols for Distributed Shared Memory Systems Using Stochastic Modeling
- Proceedings of the 24th International Conference on Parallel Processing
, 1995
"... Larger size networks require DSM coherence protocols which scale well. Fault-tolerance in terms of high availability is required for data access and for uninterrupted DSM service since large-scale environments have a greater number of potentially malfunctioning components. We present a new class ..."
Abstract - Cited by 7 (4 self)
Larger networks require DSM coherence protocols that scale well. Fault-tolerance in terms of high availability is required for data access and for uninterrupted DSM service, since large-scale environments have a greater number of potentially malfunctioning components. We present a new class of coherence protocols for DSM systems whose instances offer highly available access to shared data at low operation costs. The proposed protocols scale well; an increase in the number of client sites does not increase the operation costs after a certain threshold has been reached. The results presented in this paper give strong guidelines for the overall design of DSM systems which offer highly available, uninterrupted services.
Limited-size Logging for Fault-Tolerant Distributed Shared Memory with Independent Checkpointing
, 2000
"... This paper presents a fault tolerance algorithm for a home-based lazy release consistency distributed shared memory (DSM) system based on volatile logging and independent checkpointing. The proposed approach targets large-scale distributed shared-memory computing on local-area clusters of comput ..."
Abstract - Cited by 4 (2 self)
This paper presents a fault tolerance algorithm for a home-based lazy release consistency distributed shared memory (DSM) system based on volatile logging and independent checkpointing. The proposed approach targets large-scale distributed shared-memory computing on local-area clusters of computers as well as collaborative shared-memory applications on wide-area meta-clusters over the Internet. The challenge in building such systems lies in controlling the size of the logs and in garbage-collecting unnecessary checkpoints in the absence of global coordination. In this paper we define a set of rules for lazy log trimming (LLT) and checkpoint garbage collection (CGC) and prove that they do not affect the recoverability of the system. We have implemented our logging algorithm in a home-based DSM system and showed on three representative applications that our scheme effectively bounds the size of the logs and the number of checkpointed page versions kept in stable storage.
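The bounding idea behind log trimming can be sketched simply: each logged record belongs to a checkpoint interval, and once every node that might roll back has checkpointed past that interval, the record can never be replayed and is safe to drop. The actual LLT/CGC rules are in the paper; this is only an illustration of why the log stays bounded, with invented names.

```python
# Sketch of interval-based log trimming under independent checkpointing:
# records older than the slowest node's checkpointed interval are dropped,
# which bounds the log without any global coordination round.

class NodeLog:
    def __init__(self):
        self.records = []   # (interval, payload) pairs, oldest first

    def append(self, interval, payload):
        self.records.append((interval, payload))

    def trim(self, min_checkpointed_interval):
        """Drop records from intervals no node can roll back to anymore
        (i.e., older than the minimum checkpointed interval)."""
        self.records = [(i, p) for (i, p) in self.records
                        if i >= min_checkpointed_interval]

log = NodeLog()
for interval in range(6):
    log.append(interval, f"page-diff-{interval}")

# Suppose the slowest node has checkpointed through interval 3:
log.trim(3)
```

After trimming, only records from intervals 3-5 remain; as nodes keep checkpointing independently, the trim point advances and the log length stays proportional to the checkpointing lag, not to the run length.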
Using Peer Support to Reduce Fault-Tolerant Overhead in Distributed Shared Memories
- in Distributed Shared Memories."TR 626, URCSD
, 1996
"... We present a peer logging system for reducing performance overhead in fault-tolerant distributed shared memory systems. Our system provides fault-tolerant shared memory using individual checkpointing and rollback. Peer logging logs DSM modification messages to remote nodes instead of to local disks. ..."
Abstract - Cited by 4 (0 self)
We present a peer logging system for reducing performance overhead in fault-tolerant distributed shared memory systems. Our system provides fault-tolerant shared memory using individual checkpointing and rollback. Peer logging logs DSM modification messages to remote nodes instead of to local disks. We present results for implementations of our fault-tolerant technique using simulations of both TreadMarks, a software-only DSM, and Cashmere, a DSM using memory-mapped hardware. We compare simulations with no fault tolerance to simulations with local disk logging and peer logging. We present results showing that fault-tolerant TreadMarks can be achieved with an average of 17% overhead for peer logging. We also present results showing that while almost any DSM protocol can be made fault tolerant, systems with localized DSM page metadata have much lower overheads.
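The peer-logging idea is that the network is faster than the local disk: a modification record is shipped to a peer node's volatile memory, and a crashed node rebuilds its log from its peer rather than from stable storage. The sketch below shows only that replication-and-recovery pattern with invented names; the real system pairs this with individual checkpointing and tolerates single-node failures.

```python
# Sketch of peer logging: each node ships its DSM modification records to
# a peer instead of writing them to local disk; after a crash, the node's
# log is reconstructed from the copy its peer holds.

class Node:
    def __init__(self, name):
        self.name = name
        self.local_log = []    # this node's own modification records
        self.peer_log = {}     # records held on behalf of other nodes
        self.peer = None       # the node that backs up our log

    def log_modification(self, record):
        """Instead of a local disk write, ship the record to the peer."""
        self.local_log.append(record)
        self.peer.peer_log.setdefault(self.name, []).append(record)

    def recover(self):
        """After a crash the local log is gone; fetch it from the peer."""
        self.local_log = list(self.peer.peer_log.get(self.name, []))

a, b = Node("a"), Node("b")
a.peer, b.peer = b, a

a.log_modification("page 7: diff #1")
a.log_modification("page 7: diff #2")
a.local_log = []    # simulate the crash wiping node a's volatile memory
a.recover()         # log comes back from node b
```

This tolerates a single node failure (the record survives on the peer); losing a node and its peer simultaneously would lose the log, which is the availability trade-off against disk logging.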
Ensuring Correct Rollback Recovery In Distributed Shared Memory Systems
- Journal of Parallel and Distributed Computing
, 1995
"... Distributed shared memory (DSM) implemented on a cluster of workstations is an increasingly attractive platform for executing parallel scientific applications. Checkpointing and rollback techniques can be used in such a system to allow the computation to progress in spite of the temporary failure of ..."
Abstract - Cited by 4 (0 self)
Distributed shared memory (DSM) implemented on a cluster of workstations is an increasingly attractive platform for executing parallel scientific applications. Checkpointing and rollback techniques can be used in such a system to allow the computation to progress in spite of the temporary failure of one or more processing nodes. This paper presents the design of an independent checkpointing method for DSM that takes advantage of DSM's specific properties to reduce error-free and rollback overhead. The scheme reduces the dependencies that need to be considered for correct rollback to those resulting from transfers of pages. Furthermore, in-transit messages can be recovered without the use of logging. We extend the scheme to a DSM implementation using lazy release consistency, where the frequency of dependencies is further reduced.
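The abstract's key reduction is that only page transfers create recovery dependencies: when node A sends node B a page, B's state comes to depend on A's current checkpoint interval, and a rollback of A forces a rollback only of nodes that received pages from the lost interval. The sketch below illustrates that dependency rule with invented names; the paper's full scheme also recovers in-transit messages without logging.

```python
# Sketch of page-transfer dependency tracking under independent
# checkpointing: a node must roll back only if it received a page from an
# interval that a failed node loses by restarting from its checkpoint.

class DSMNode:
    def __init__(self, name):
        self.name = name
        self.interval = 0     # current checkpoint interval
        self.deps = set()     # (sender, sender_interval) pairs

    def checkpoint(self):
        self.interval += 1    # start a new interval; old one is on disk
        self.deps.clear()     # our own dependencies are now checkpointed

    def send_page(self, receiver):
        """Transferring a page makes the receiver depend on our interval."""
        receiver.deps.add((self.name, self.interval))

def must_roll_back(node, failed_name, failed_interval):
    """True if the node depends on state the failed node has lost
    (an interval at or after the one it restarts into)."""
    return any(s == failed_name and i >= failed_interval
               for s, i in node.deps)

a, b, c = DSMNode("a"), DSMNode("b"), DSMNode("c")
a.send_page(b)     # b depends on a's interval 0
a.checkpoint()     # interval 0 is now safely on disk; a enters interval 1
a.send_page(c)     # c depends on a's (uncheckpointed) interval 1
```

If `a` now crashes and restarts from its checkpoint, interval 1's work is lost: `c` must roll back, while `b` need not, because the state it saw was captured by the checkpoint. Tracking only these page-transfer edges is what keeps the failure-free overhead low.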
Checkpointing Distributed Shared Memory
, 1997
"... Distributed shared memory (DSM) is a very promising programming model for exploiting the parallelism of distributed memory systems, since it provides a higher level of abstraction than simple message passing. Although the nodes of standard distributed systems exhibit high crash rates only very few D ..."
Abstract - Cited by 4 (0 self)
Distributed shared memory (DSM) is a very promising programming model for exploiting the parallelism of distributed memory systems, since it provides a higher level of abstraction than simple message passing. Although the nodes of standard distributed systems exhibit high crash rates, very few DSM environments have any kind of support for fault-tolerance. In this paper, we present a checkpointing mechanism for a DSM system that is efficient and portable. It is portable because it is built on top of MPI and uses only the services offered by MPI and a POSIX-compliant local file system. As far as we know, this is the first real implementation of such a scheme for DSM. Along with the description of the algorithm, we present experimental results obtained on a cluster of workstations. We hope that our research shows that efficient, transparent and portable checkpointing is viable for DSM systems. Keywords: Distributed Shared Memory, Checkpointing, Fault-Tolerance, Portability
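The portability claim rests on using only two commodity layers: message passing for coordination and a POSIX file system for stable storage. A coordinated checkpoint in that style is simply "synchronize, then every rank writes its state to a plain file". The sketch below simulates the MPI barrier in-process (no mpi4py dependency) and uses invented file names, so it shows the pattern rather than the paper's implementation.

```python
# Sketch of a coordinated, file-system-based checkpoint: ranks synchronize
# (MPI_Barrier in a real run), then each writes its state to an ordinary
# POSIX file, so recovery needs nothing beyond MPI and local files.
import os
import pickle
import tempfile

def barrier():
    """Stand-in for MPI_Barrier: all ranks would block here, so the saved
    states form a mutually consistent cut."""
    pass  # single-process simulation: nothing to synchronize

def checkpoint_all(states, directory):
    barrier()  # coordinate before writing, so no message is in flight
    for rank, state in enumerate(states):
        with open(os.path.join(directory, f"ckpt.{rank}"), "wb") as f:
            pickle.dump(state, f)

def restore_all(nranks, directory):
    restored = []
    for rank in range(nranks):
        with open(os.path.join(directory, f"ckpt.{rank}"), "rb") as f:
            restored.append(pickle.load(f))
    return restored

with tempfile.TemporaryDirectory() as d:
    states = [{"rank": r, "pages": {r: b"data"}} for r in range(3)]
    checkpoint_all(states, d)
    restored = restore_all(3, d)
```

Because the barrier guarantees a consistent cut, recovery is just reading the files back; no message logging or dependency tracking is needed, at the cost of coordinating all nodes at checkpoint time.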