Scope Consistency : A Bridge between Release Consistency and Entry Consistency
In Proceedings of the 8th Annual ACM Symposium on Parallel Algorithms and Architectures, 1996
Cited by 135 (12 self)
The large granularity of communication and coherence in shared virtual memory systems causes problems with false sharing and extra communication. Relaxed memory consistency models have been used to alleviate these problems, but at a cost in programming complexity. Release Consistency (RC) and Lazy Release Consistency (LRC) are accepted to offer a reasonable tradeoff between performance and programming complexity. Entry Consistency (EC) offers a more relaxed consistency model, but it requires explicit association of shared data objects with synchronization variables. The programming burden of providing such associations can be substantial. This paper proposes a new consistency model for shared virtual memory, called Scope Consistency (ScC), which offers most of the potential performance advantages of the EC model without requiring explicit bindings between data and synchronization variables. Instead, ScC dynamically detects the bindings implied by the programmer allowing a programming i...
Understanding Application Performance on Shared Virtual Memory Systems
In Proceedings of the 23rd Annual Symposium on Computer Architecture, 1996
Cited by 56 (20 self)
Many researchers have proposed interesting protocols for shared virtual memory (SVM) systems, and demonstrated performance improvements on parallel programs. However, there is still no clear understanding of the performance potential of SVM systems for different classes of applications. This paper begins to fill this gap, by studying the performance of a range of applications in detail and understanding it in light of application characteristics. We first develop a brief classification of the inherent data sharing patterns in the applications, and how they interact with system granularities to yield the communication patterns relevant to SVM systems. We then use detailed simulation to compare the performance of two SVM approaches, Lazy Release Consistency (LRC) and Automatic Update Release Consistency (AURC), with each other and with an all-hardware CC-NUMA approach. We examine how performance is affected by problem size, machine size, key system parameters, and the use of less opt...
Home-based Shared Virtual Memory
1998
Cited by 51 (4 self)
In this dissertation, I investigate how to improve the performance of shared virtual memory (SVM) by examining consistency models, protocols, hardware support and applications. The main conclusion of this research is that the performance of shared virtual memory can be significantly improved when performance-enhancing techniques from all these areas are combined. This dissertation proposes home-based lazy release consistency as a simple, effective, and scalable way to build shared virtual memory systems. In home-based protocols each shared page has a home to which all writes are propagated and from which all copies are derived. Two home-based protocols are described, implemented and evaluated on two hardware and software platforms: Automatic Update Release Consistency (AURC), which requires hardware support for fine-grained remote writes (automatic updates), and Home-based Lazy Release Consistency (HLRC), which is implemented exclusively in software. The dissertation investigates the ...
VM-Based Shared Memory on Low-Latency, Remote-Memory-Access Networks
1997
Cited by 41 (11 self)
Recent technological advances have produced network interfaces that provide users with very low-latency access to the memory of remote machines. We examine the impact of such networks on the implementation and performance of software DSM. Specifically, we compare two DSM systems, Cashmere and TreadMarks, on a 32-processor DEC Alpha cluster connected by a Memory Channel network. Both Cashmere and TreadMarks use virtual memory to maintain coherence on pages, and both use lazy, multi-writer release consistency. The systems differ dramatically, however, in the mechanisms used to track sharing information and to collect and merge concurrent updates to a page, with the result that Cashmere communicates much more frequently, and at a much finer grain. Our principal conclusion is that low-latency networks make DSM based on fine-grain communication competitive with more coarse-grain approaches, but that further hardware improvements will be needed before such systems can provide consistently superior performance. In our experiments, Cashmere scales slightly better than TreadMarks for applications with false sharing. At the same time, it is severely constrained by limitations of the current Memory Channel hardware. In general, performance is better for TreadMarks.
Using Network Interface Support to Avoid Asynchronous Protocol Processing in Shared Virtual Memory Systems
In Proceedings of the 26th International Symposium on Computer Architecture, 1999
Cited by 40 (7 self)
The performance of page-based software shared virtual memory (SVM) is still far from that achieved on hardware-coherent distributed shared memory (DSM) systems. The interrupt cost for asynchronous protocol processing has been found to be a key source of performance loss and complexity. This paper shows that by providing simple and general support for asynchronous message handling in a commodity network interface (NI), and by altering SVM protocols appropriately, protocol activity can be decoupled from asynchronous message handling and the need for interrupts or polling can be eliminated. The NI mechanisms needed are generic, not SVM-dependent. They also require neither visibility into the node memory system nor code instrumentation to identify memory operations. We prototype the mechanisms and such a synchronous home-based LRC protocol, called GeNIMA (GEneral-purpose Network Interface support in a shared Memory Abstraction), on a cluster of SMPs with a programmable NI, though the mechan...
Shared Virtual Memory: Progress and Challenges
Proceedings of the IEEE, 1999
Cited by 33 (4 self)
This paper is a survey of the first 12 years of research in SVM, placing the multi-track flow of ideas and results obtained so far in a comprehensive framework. The contributions indicated in Figure 1 are classified in four categories, each belonging primarily to one layer: relaxed consistency models, protocol laziness, architectural support, and applications and application-driven research. A section of the paper is devoted to each category. The last section discusses other important emerging issues related to SVM: the alternative of fine-grained software coherence, hybrid protocols that implement software shared memory across multiple hardware-coherent multiprocessors, and scalability. The paper summarizes comparative performance results from the literature, discusses their limitations, places existing protocols in a framework based on laziness, and identifies the lessons learned so far and some key outstanding questions.
Runtime Optimizations for a Java DSM Implementation
In Java Grande, 2001
Cited by 27 (7 self)
Jackal is a fine-grained distributed shared memory implementation of the Java programming language. Jackal implements Java’s memory model and allows multithreaded Java programs to run unmodified on distributed-memory systems. This paper focuses on Jackal’s runtime system, which implements a multiple-writer, home-based consistency protocol. Protocol actions are triggered by software access checks that Jackal’s compiler inserts before object and array references. We describe optimizations for Jackal’s runtime system, which mainly consist of discovering opportunities to dispense with flushing of cached data. We give performance results for different runtime optimizations, and compare their impact with the impact of one compiler optimization. We find that our runtime optimizations are necessary for good Jackal performance, but only in conjunction with the Jackal compiler optimizations described in [24]. As a yardstick, we compare the performance of Java applications run on Jackal with the performance of equivalent applications that use a fast implementation of Java’s Remote Method Invocation (RMI) instead of shared memory.
Shared Virtual Memory Across SMP Nodes Using Automatic Update: Protocols and Performance
Cited by 19 (7 self)
As the workstation market moves from single processor to small-scale shared memory multiprocessors, it is very attractive to construct larger-scale multiprocessors by connecting widely available symmetric multiprocessors (SMPs) in a less tightly coupled way. Using a shared virtual memory (SVM) layer for this purpose preserves the shared memory programming abstraction across nodes. We explore the feasibility and performance implications of one such approach by extending the AURC (Automatic Update Release Consistency) protocol, used in the SHRIMP multicomputer, to connect hardware-coherent SMPs rather than uniprocessors. We describe the extended AURC protocol, and compare its performance both with the AURC uniprocessor node case and with an all-software Lazy Release Consistency (LRC) protocol extended for SMPs. We present results based on detailed simulations of two protocols (AURC and LRC) and two architectural configurations of a system with 16 processors: one with one processor per node (16 nodes) and one with four processors per node (4 nodes). We find that, unless the bandwidth of the network interface is increased, the network interface becomes the bottleneck in a clustered architecture, especially for AURC. While an LRC protocol can benefit from the reduction in per-processor communication in a clustered architecture, the write-through traffic in AURC significantly increases the communication demands per network interface. This causes more traffic contention and either prevents the performance of AURC from improving under SMP or hurts it severely for applications with significant communication requirements. Thus, while AURC performs better than LRC, for applications with high communication needs, the reverse may be true in clustered architectures. Among possi...
Improving the Performance of Shared Virtual Memory on System Area Networks
1998
Cited by 10 (2 self)
As clusters of workstations, uniprocessor or symmetric multiprocessors (SMPs), become important platforms for parallel computing, there is increasing research interest in supporting the attractive, shared address space programming model across them in software. The reason is that it may provide successful low-cost, high-performance alternatives to both tightly-coupled, hardware-coherent distributed shared memory machines and to scalable servers. In both these cases, the clusters are formed with off-the-shelf, high-end PCs or workstations and system area networks that track technologies well. Given that a shared memory abstraction is an attractive programming model for this architecture, there has been a lot of research in fast communication on clusters connected with system area networks and in protocols for supporting software shared memory across them. However, the end performance of applications that were written for the more proven hardware-coherent shared memory is still not...
Removing the Overhead from Software-Based Shared Memory
2001
Cited by 10 (5 self)
The implementation presented in this paper, DSZOOM-WF, is a sequentially consistent, fine-grained distributed software-based shared memory. It demonstrates a protocol-handling overhead below a microsecond for all the actions involved in a remote load operation, to be compared to the fastest implementation to date of around ten microseconds. The all-software protocol is implemented assuming some basic low-level primitives in the cluster interconnect and an operating system bypass functionality, similar to the emerging InfiniBand standard. All interrupt- and/or poll-based asynchronous protocol processing is completely removed by running the entire coherence protocol in the requesting processor. This not only removes the asynchronous overhead, but also makes use of a processor that otherwise would stall. The technique is applicable to both page-based and fine-grain software-based shared memory. DSZOOM-WF consistently demonstrates performance comparable to hardware-based distributed shared memory implementations.

