A Recoverable Distributed Shared Memory Integrating Coherence and Recoverability
, 1995
Cited by 43 (4 self)
Large-scale distributed systems are very attractive for the execution of parallel applications requiring huge computing power. However, their high probability of site failure is unacceptable, especially for long-running applications. In this paper, we address this problem and propose a checkpointing mechanism relying on a recoverable distributed shared memory (DSM) in order to tolerate single node failures. Although most recoverable DSMs require specific hardware to store recovery data, our scheme uses standard memories to store both current and recovery data. Moreover, the management of recovery data is merged with the management of current data by extending the DSM's coherence protocol. This approach takes advantage of the data replication provided by a DSM in order to limit the number of pages transferred during checkpointing. The paper also presents an implementation and a preliminary performance evaluation of our recoverable DSM on a 56-node Intel Paragon.
Lightweight Logging for Lazy Release Consistent Distributed Shared Memory
- In Proc. of the USENIX 2nd Symp. on Operating Systems Design and Implementation
, 1996
Cited by 33 (1 self)
This paper presents a new logging and recovery algorithm for lazy release consistent distributed shared memory (DSM). The new algorithm tolerates single node failures by maintaining a distributed log of data dependencies in the volatile memory of processes. The algorithm adds very little overhead to the memory consistency protocol: it sends no additional messages during failure-free periods; it adds only a minimal amount of data to one of the DSM protocol messages; it introduces no forced rollbacks of non-faulty processes; and it performs no communication-induced accesses to stable storage. Furthermore, the algorithm logs only a very small amount of data, because it uses the log of memory accesses already maintained by the memory consistency protocol. The algorithm was implemented in TreadMarks, a state-of-the-art DSM system. Experimental results show that the algorithm has near zero time overhead and very low space overhead during failure-free execution, thus refuting the common belie...
A Survey Of Recoverable Distributed Shared Memory Systems
- IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS
, 1995
Cited by 21 (10 self)
Distributed Shared Memory (dsm) systems provide a shared memory abstraction on distributed memory architectures (distributed memory multicomputers, networks of workstations). Such systems ease parallel application programming since the shared memory programming model is often more natural than the message-passing paradigm. However, the probability of failure of a dsm system increases with the number of sites. Thus, fault tolerance mechanisms must be implemented in order to allow processes to continue their execution in the event of a failure. This paper gives an overview of recoverable dsm systems (rdsm) that provide a checkpointing mechanism to restart parallel computations, after a site failure.
A Memory Approach to Consistent, Reliable Distributed Shared Memory
, 1995
Cited by 9 (5 self)
Fault-tolerant distributed shared memory systems do not always need to support a complete and consistent recovery after a failure. We describe a framework within which different approaches to, and different degrees of, consistency and recoverability can be understood. The addition of consistent failure recovery may be approached from two different viewpoints: either an application-oriented view or a memory-oriented view. The major characteristics used in our framework are variations of availability, consistency, and application support. This paper explains the basic model, which is used in Reliable Mirage+, and describes how the framework can be used by other researchers to understand and classify solutions to the reliable DSM problem. The model distinguishes a recoverable system, which must be able to survive any single-site failure, from a reliable system, which also ensures consistency after the recovery. Since consistency requirements may impose a high penalty on standard op...
Overview of distributed shared memory
- Trinity College Dublin
, 1998
Cited by 8 (0 self)
"So much has already been written about everything that you can't find out anything about it." (James Thurber, Lanterns and Lances, 1961)
Loosely-coupled distributed systems have evolved using message passing as the main paradigm for sharing information. Other paradigms used in loosely-coupled distributed systems, such as rpc, are usually implemented on top of an underlying message-passing system. On the other hand, in tightly-coupled architectures, such as multi-processor machines, the paradigm is usually based on shared memory with its attractively simple programming model. The shared-memory paradigm has recently been extended for use in more loosely-coupled architectures and is known as distributed shared memory (dsm [153, 178, 58]) in this context. This chapter discusses some of the issues involved in the design and implementation of such a dsm in loosely-coupled distributed systems and briefly discusses related work in other fields. In dsm systems, processes share data transparently across node boundaries: data faulting, location, and movement are handled by the dsm system. Among other things, this allows parallel programs designed to use the shared-memory abstraction to execute without modification on a
The Boundary-Restricted Coherence Protocol for Scalable and Highly Available Distributed Shared Memory Systems
, 1996
Cited by 8 (2 self)
Larger networks require Distributed Shared Memory (DSM) coherence protocols which scale well. Fault-tolerance in terms of high availability is required for data access and for uninterrupted DSM service since large-scale environments have a greater number of potentially malfunctioning components. We present a new class of coherence protocols for DSM systems whose instances offer highly available access to shared data at low operation costs. The protocols proposed scale well; an increase in the number of client sites does not increase the operation costs after a certain threshold has been reached. The results presented in this paper give strong guidelines for the overall design of DSM systems which offer highly available, uninterrupted services.
Keywords: Distributed Systems, Scalability, Fault-Tolerance, Availability, Distributed Shared Memory, Coherence Protocols
1 Introduction Distributed Shared Memory (DSM) systems provide an attractive programming model to application pr...
Evaluation of OO7 as a system and an application benchmark
, 1995
Cited by 7 (0 self)
OO7 has been widely used by developers to benchmark commercial Object Oriented Data Bases (OODB) and by researchers as a realistic workload for experimenting with persistent object systems (POS). These uses of OO7 levy very different requirements; the former needs an application benchmark while the latter needs a system benchmark. This paper describes our experiences with using OO7 both as an application benchmark and a system benchmark. Based on this experience, we outline a framework of the features needed for both kinds of benchmarks. We evaluated OO7 using this framework and found it unsuitable for these tasks. 1 Introduction The construction of many large applications that manipulate complex data structures has motivated significant academic research and industrial development of Object Oriented Databases (OODB) and Persistent Object Stores (POS). In terms of the feature set supported, we will consider an OODB to be a superset of a POS. Both builders and users of such systems wou...
How to Scale Transactional Storage Systems
- In Proceedings of SIGOPS European Workshop on Operating System Support for World Wide Applications (Connemara
, 1996
Cited by 4 (0 self)
Applications of the future will need to support large numbers of clients and will require scalable storage systems that allow state to be shared reliably. Recent research in distributed file systems provides technology that increases the scalability of storage systems. But file systems only support sharing with weak consistency guarantees and cannot support applications that require transactional consistency. The challenge is how to provide scalable storage systems that support transactional applications. We are developing technology for scalable transactional storage systems. Our approach combines scalable caching and coherence techniques developed in serverless file systems and DSM systems with recovery techniques developed in traditional databases. This position paper describes the design rationale for split caching, a new scalable memory management technique for network-based transactional object storage systems, and fragment reconstruction, a new coherence protocol that supports...
Cooperative Caching And Prefetching In Parallel/distributed File Systems
, 1997
Cited by 4 (1 self)
If we examine the structure of the applications that run on parallel machines, we observe that their I/O needs increase tremendously every day. These applications work with very large data sets which, in most cases, do not fit in memory and have to be kept on disk. The input and output data files are also very large and have to be accessed very fast. These large applications also want to be able to checkpoint themselves without wasting too much time. These facts constantly increase the expectations placed on parallel and distributed file systems. Thus, these file systems have to improve their performance to avoid becoming the bottleneck in parallel/distributed environments. On the other hand, while the performance of new processors, interconnection networks and memory increases very rapidly, disk performance does not. This lack of improvement is due to the mechanical parts used to build the disks. These components are slow and limit both the latency and t...
Fragment Reconstruction: Providing Global Cache Coherence in a Transactional Storage System
- In Proceedings of the 17th International Conference on Distributed Computing Systems
, 1997
Cited by 3 (1 self)
Cooperative caching is a promising technique to avoid the increasingly formidable disk bottleneck problem in distributed storage systems; it reduces the number of disk accesses by servicing client cache misses from the caches of other clients. However, existing cooperative caching techniques do not provide adequate support for fine-grained sharing. In this paper, we describe a new storage system architecture, split caching, and a new cache coherence protocol, fragment reconstruction, that combine cooperative caching with efficient support for fine-grained sharing and transactions. We also present the results of performance studies that show that our scheme introduces little overhead over the basic cooperative caching mechanism and provides better performance when there is fine-grained sharing.
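
The basic cooperative caching idea mentioned in the abstract above (serving a client's cache miss from another client's cache before falling back to disk) can be illustrated with a minimal Python sketch. This is not the split caching architecture or the fragment reconstruction protocol from the paper; the class names `Client` and `CooperativeCache`, the LRU replacement policy, and the linear search over peers are all assumptions made for illustration only.

```python
from collections import OrderedDict


class Client:
    """A client with a small LRU block cache (hypothetical model)."""

    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self.cache = OrderedDict()  # block id -> data, least recently used first

    def lookup(self, block):
        """Return cached data for the block, or None on a miss."""
        if block in self.cache:
            self.cache.move_to_end(block)  # mark as most recently used
            return self.cache[block]
        return None

    def store(self, block, data):
        """Insert a block, evicting the least recently used one if full."""
        self.cache[block] = data
        self.cache.move_to_end(block)
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)


class CooperativeCache:
    """Serves local misses from peer caches before reading from disk."""

    def __init__(self, clients, disk):
        self.clients = clients
        self.disk = disk          # assumed: dict mapping block id -> data
        self.disk_reads = 0
        self.remote_hits = 0

    def read(self, client, block):
        data = client.lookup(block)
        if data is not None:
            return data                      # local hit
        for peer in self.clients:            # assumed: naive search of all peers
            if peer is not client:
                data = peer.lookup(block)
                if data is not None:
                    self.remote_hits += 1    # miss served by another client
                    client.store(block, data)
                    return data
        self.disk_reads += 1                 # nobody caches it: go to disk
        data = self.disk[block]
        client.store(block, data)
        return data


disk = {n: f"block-{n}" for n in range(4)}
a, b = Client("a", 2), Client("b", 2)
system = CooperativeCache([a, b], disk)
system.read(a, 0)   # miss everywhere: read from disk
system.read(b, 0)   # local miss on b, served from a's cache
system.read(b, 0)   # local hit on b
print(system.disk_reads, system.remote_hits)
```

In this sketch only the first read touches the disk; the second is satisfied from a peer's memory. A real system would need a directory or hints to locate cached blocks instead of polling every peer, and a coherence protocol to keep copies consistent under writes, which is the problem the paper addresses.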

