Improving Release-Consistent Shared Virtual Memory using Automatic Update
In the 2nd IEEE Symposium on High-Performance Computer Architecture, 1996
Cited by 82 (20 self)
Shared virtual memory is a software technique to provide shared memory on a network of computers without special hardware support. Although several relaxed consistency models and implementations are quite effective, there is still a considerable performance gap between the "software-only" approach and the hardware approach that uses directory-based caches. Automatic update is a simple communication mechanism, implemented in the SHRIMP multicomputer, that forwards local writes to remote memory transparently. In this paper we propose a new protocol based on lazy release consistency (LRC), called Automatic Update Release Consistency (AURC), that uses automatic update to propagate and merge shared memory modifications. We compare the performance of this protocol against a software-only LRC implementation on several SPLASH-2 applications and show that the AURC approach can substantially improve the performance of LRC. For 16 processors, the average speedup increases from 5.9 under LRC to 8.3 under AURC.
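To put the reported 16-processor speedups in perspective, the short sketch below converts them into parallel efficiency and relative gain. The helper names are illustrative only. The efficiency definition (speedup divided by processor count) is the standard one and is assumed here rather than taken from the abstract.

```python
def parallel_efficiency(speedup: float, processors: int) -> float:
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / processors


def relative_gain(new_speedup: float, old_speedup: float) -> float:
    """Relative improvement of one speedup over another (0.25 means 25% better)."""
    return new_speedup / old_speedup - 1.0


# Figures quoted in the abstract: average speedup on 16 processors.
LRC_SPEEDUP, AURC_SPEEDUP, PROCESSORS = 5.9, 8.3, 16

if __name__ == "__main__":
    print(f"LRC efficiency:  {parallel_efficiency(LRC_SPEEDUP, PROCESSORS):.1%}")
    print(f"AURC efficiency: {parallel_efficiency(AURC_SPEEDUP, PROCESSORS):.1%}")
    print(f"AURC gain over LRC: {relative_gain(AURC_SPEEDUP, LRC_SPEEDUP):.1%}")
```

Under these definitions, the quoted figures correspond to an efficiency of roughly 37% for LRC and 52% for AURC, a relative gain of about 41%.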
Home-based Shared Virtual Memory
1998
Cited by 51 (4 self)
In this dissertation, I investigate how to improve the performance of shared virtual memory (SVM) by examining consistency models, protocols, hardware support and applications. The main conclusion of this research is that the performance of shared virtual memory can be significantly improved when performance-enhancing techniques from all these areas are combined. This dissertation proposes home-based lazy release consistency as a simple, effective, and scalable way to build shared virtual memory systems. In home-based protocols, each shared page has a home to which all writes are propagated and from which all copies are derived. Two home-based protocols are described, implemented and evaluated on two hardware and software platforms: Automatic Update Release Consistency (AURC), which requires hardware support for fine-grained remote writes (automatic updates), and Home-based Lazy Release Consistency (HLRC), which is implemented exclusively in software. The dissertation investigates the ...
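The home-based idea in the abstract above (every write is propagated to a page's home, and every copy is derived from that home) can be illustrated with a small single-process sketch. It is not the dissertation's implementation. The class names, the tiny page size, and the twin-and-diff bookkeeping used to find modified words are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

PAGE_SIZE = 8  # words per page; tiny on purpose, an assumption for illustration


@dataclass
class HomeNode:
    """Holds the master copy of every page it is the home for."""
    pages: dict = field(default_factory=dict)

    def fetch(self, page_id):
        # Every cached copy is derived from the home's master copy.
        return list(self.pages.setdefault(page_id, [0] * PAGE_SIZE))

    def apply_diff(self, page_id, diff):
        # Writes from any node are merged into the master copy.
        master = self.pages.setdefault(page_id, [0] * PAGE_SIZE)
        for offset, value in diff.items():
            master[offset] = value


@dataclass
class Worker:
    home: HomeNode
    cache: dict = field(default_factory=dict)  # page_id -> local copy
    twins: dict = field(default_factory=dict)  # page_id -> copy taken before first write

    def _ensure_cached(self, page_id):
        if page_id not in self.cache:
            self.cache[page_id] = self.home.fetch(page_id)

    def read(self, page_id, offset):
        self._ensure_cached(page_id)
        return self.cache[page_id][offset]

    def write(self, page_id, offset, value):
        self._ensure_cached(page_id)
        # Keep a pristine twin so the modified words can be found at release time.
        self.twins.setdefault(page_id, list(self.cache[page_id]))
        self.cache[page_id][offset] = value

    def release(self):
        # Propagate only the words that changed since the twin was taken.
        for page_id, twin in self.twins.items():
            page = self.cache[page_id]
            diff = {i: new for i, (new, old) in enumerate(zip(page, twin)) if new != old}
            self.home.apply_diff(page_id, diff)
        self.twins.clear()

    def acquire(self):
        # Simplification: drop every clean cached page so later reads refetch from home.
        self.cache = {pid: pg for pid, pg in self.cache.items() if pid in self.twins}
```

Two workers that write different words of the same page and then release will both see their updates merged at the home. A worker that calls `acquire()` afterwards refetches the merged page on its next read.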
Designing Memory Consistency Models for Shared-Memory Multiprocessors
1993
Cited by 51 (8 self)
The memory consistency model (or memory model) of a shared-memory multiprocessor system influences both the performance and the programmability of the system. The simplest and most intuitive model for programmers, sequential consistency, restricts the use of many performance-enhancing optimizations exploited by uniprocessors. For higher performance, several alternative models have been proposed. However, many of these are hardware-centric in nature and difficult to program. Further, the multitude of seemingly unrelated memory models inhibits portability. We use the 3P criteria of programmability, portability, and performance to assess memory models, and find current models lacking in one or more of these criteria. This thesis establishes a unifying framework for reasoning about memory models that leads to models that adequately satisfy the 3P criteria. The first contribution of this thesis is a programmer-centric methodology, called sequential consistency normal form (SCNF), for specifying memory models. This methodology is based on the observation that performance-enhancing optimizations can be allowed without violating sequential consistency if the system is given some information about the program. An SCNF model is a contract between the system and the programmer, where the system guarantees both high performance and sequential consistency only if the programmer provides certain information about the program. Insufficient information gives lower performance, but incorrect information ...
Using Memory-Mapped Network Interfaces to Improve the Performance of Distributed Shared Memory
In the 2nd IEEE Symposium on High-Performance Computer Architecture, 1996
Cited by 35 (7 self)
Shared memory is widely believed to provide an easier programming model than message passing for expressing parallel algorithms. Distributed Shared Memory (DSM) systems provide the illusion of shared memory on top of standard message passing hardware at very low implementation cost, but provide acceptable performance for only a limited class of applications. We argue that the principal sources of overhead in DSM systems can be dramatically reduced with modest amounts of hardware support (substantially less than is required for hardware cache coherence). Specifically, we present and evaluate a family of protocols designed to exploit hardware support for a global, but noncoherent, physical address space. We consider systems both with and without remote cache fills, fine-grain access faults, "doubled" writes to local and remote memory, and merging write buffers. We also consider varying levels of latency and bandwidth. We evaluate our protocols using execution-driven simulation, comparing...
Software Cache Coherence for Large Scale Multiprocessors
In Proceedings of the First International Symposium on High Performance Computer Architecture, 1994
Cited by 23 (9 self)
Shared memory is an appealing abstraction for parallel programming. It must be implemented with caches in order to perform well, however, and caches require a coherence mechanism to ensure that processors reference current data. Hardware coherence mechanisms for large-scale machines are complex and costly, but existing software mechanisms for message-passing machines have not provided a performance-competitive solution. We claim that an intermediate hardware option---memory-mapped network interfaces that support a global physical address space---can provide most of the performance benefits of hardware cache coherence. We present a software coherence protocol that runs on this class of machines and greatly narrows the performance gap between hardware and software coherence. We compare the performance of the protocol to that of existing software and hardware alternatives and evaluate the tradeoffs among various cache-write policies. We also observe that simple program changes can greatly...
High performance software coherence for current and future architectures
Journal of Parallel and Distributed Computing, 1995
Cited by 15 (9 self)
Shared memory provides an attractive and intuitive programming model for large-scale parallel computing, but requires a coherence mechanism to allow caching for performance while ensuring that processors do not use stale data in their computation. Implementation options range from distributed shared memory emulations on networks of workstations to tightly coupled fully cache-coherent distributed shared memory multiprocessors. Previous work indicates that performance varies dramatically from one end of this spectrum to the other. Hardware cache coherence is fast, but also costly and time-consuming to design and implement, while DSM systems provide acceptable performance on only a limited class of applications. We claim that an intermediate hardware option (memory-mapped network interfaces that support a global physical address space, without cache coherence) can provide most of the performance benefits of fully cache-coherent hardware, at a fraction of the cost. To support this claim we present a software coherence protocol that runs on this class of machines, and use simulation to conduct a performance study. We look at both programming and architectural issues in the context of software and hardware coherence protocols. Our results suggest that software coherence on NCC-NUMA machines is a more cost-effective approach to large-scale shared-memory multiprocessing than either pure distributed shared memory or hardware cache coherence. © 1995 Academic Press, Inc.
Efficient Replication of Large Data Objects
In Proceedings of the 17th International Symposium on Distributed Computing (DISC-17), 2003
Cited by 8 (1 self)
We present a new distributed data replication algorithm tailored especially for large-scale read/write data objects such as files. The algorithm guarantees atomic data consistency, while incurring low latency costs. The key idea of the algorithm is to maintain copies of the data objects separately from information about the locations of up-to-date copies. Because it performs most of its work using only the location information, our algorithm needs to access only a few copies of the actual data; specifically, only one copy during a read and only f + 1 copies during a write, where f is an assumed upper bound on the number of copies that can fail. These bounds are optimal. The algorithm works in an asynchronous message-passing environment. It does not use additional mechanisms such as group communication or distributed locking. It is suitable for implementation in WANs as well as LANs. We also present two lower bounds on the costs of data replication. The first lower bound is on the number of low-level writes required during a read operation on the data. The second bound is on the minimum space complexity of a class of efficient replication algorithms. These lower bounds suggest that some of the techniques used in our algorithm are necessary. They are also of independent interest.
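The access counts stated in the abstract above (one data copy touched per read, f + 1 per write) follow from keeping location information apart from the data itself. The toy below illustrates only that separation. It is a failure-free, single-process sketch with a centralized directory, not the paper's asynchronous algorithm, and every name in it is hypothetical.

```python
class ReplicatedStore:
    """Toy store: a small directory tracks which replicas hold the latest copy."""

    def __init__(self, num_replicas: int, f: int):
        if num_replicas < f + 1:
            raise ValueError("need at least f + 1 replicas")
        self.f = f
        self.replicas = [dict() for _ in range(num_replicas)]  # bulky data copies
        self.directory = {}      # name -> (version, ids of replicas holding it)
        self.data_accesses = 0   # counts reads and writes of actual data copies

    def write(self, name, value):
        version = self.directory.get(name, (0, ()))[0] + 1
        # Pick f + 1 distinct replicas, rotating the choice with the version number.
        targets = [(version + i) % len(self.replicas) for i in range(self.f + 1)]
        for replica_id in targets:
            self.replicas[replica_id][name] = (version, value)
            self.data_accesses += 1
        # Only the lightweight location record changes after the data is stored.
        self.directory[name] = (version, tuple(targets))

    def read(self, name):
        version, holders = self.directory[name]
        # The directory already names an up-to-date holder, so one data copy suffices.
        stored_version, value = self.replicas[holders[0]][name]
        self.data_accesses += 1
        assert stored_version == version
        return value
```

With five replicas and f = 2, each write in this sketch touches three data copies and each read touches one. Stale copies left on other replicas are simply never consulted, because the directory no longer points at them.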
Distributed Shared Memory for New Generation Networks
1995
Cited by 6 (3 self)
Shared memory is widely believed to provide an easier programming model than message passing for expressing parallel algorithms. Distributed Shared Memory (DSM) systems provide the illusion of shared memory on top of standard message passing hardware at very low implementation cost, but provide acceptable performance on only a limited class of applications. In this paper we study the main sources of overhead found in software-coherent, distributed shared-memory systems and argue that recent revolutionary changes in network technology now allow us to design protocols that minimize such overheads and that approach the performance of full hardware coherence. Specifically, we claim that memory-mapped network interfaces that support a global physical address space can greatly improve the performance of DSM systems. To support this claim we study a variety of coherence protocols that can take advantage of the global physical address space and compare their performance with the best known pro...
Design and Evaluation of a Software-Controlled COMA
1996
Cited by 2 (1 self)
Traditionally, cache coherence in multiprocessors has been maintained in hardware with the support of a snooping protocol. However, the cost-effectiveness of building machines with hardwired protocols has recently been questioned. Research in Virtual Shared Memory (VSM) systems, in which the software takes advantage of the virtual memory system of the kernel to implement sharing, has shown that, even with the large latencies of kernel calls, software-based shared memory is a viable alternative. We have developed a software protocol for a COMA (Cache-Only Memory Architecture) on a distributed network of processing elements. We call the system SCCOMA for Software-Controlled COMA, to emphasize that the coherence controllers are emulated by software executed on the node processor. Unlike VSM systems, SCCOMA does not rely on kernel services and runs directly on the hardware, an approach we call Direct Software Emulation (or DSE) of shared memory. The software emulation layer has been written...
Efficient Shared Memory with Minimal Hardware Support
Cited by 1 (0 self)
Shared memory is widely regarded as a more intuitive model than message passing for the development of parallel programs. A shared memory model can be provided by hardware, software, or some combination of both. One of the most important problems to be solved in shared memory environments is that of cache coherence. Experience indicates, unsurprisingly, that hardware-coherent multiprocessors greatly outperform distributed shared-memory (DSM) emulations on message-passing hardware. Intermediate options, however, have received considerably less attention. We argue in this position paper that one such option---a multiprocessor or network that provides a global physical address space in which processors can make non-coherent accesses to remote memory without trapping into the kernel or interrupting remote processors---can provide most of the performance of hardware cache coherence at little more monetary or design cost than traditional DSM systems. To support this claim we have developed the...

