Results 1 - 10
of
123
Implementation and performance of Munin
- IN PROCEEDINGS OF THE 13TH ACM SYMPOSIUM ON OPERATING SYSTEMS PRINCIPLES
, 1991
"... Munin is a distributed shared memory (DSM) system that allows shared memory parallel programs to be executed efficiently on distributed memory multiprocessors. Munin is unique among existing DSM systems in its use of multiple consistency protocols and in its use of release consistency. In Munin, sha ..."
Abstract
-
Cited by 523 (23 self)
- Add to MetaCart
Munin is a distributed shared memory (DSM) system that allows shared memory parallel programs to be executed efficiently on distributed memory multiprocessors. Munin is unique among existing DSM systems in its use of multiple consistency protocols and in its use of release consistency. In Munin, shared program variables are annotated with their expected access pattern, and these annotations are then used by the runtime system to choose a consistency protocol best suited to that access pattern. Release consistency allows Munin to mask network latency and reduce the number of messages required to keep memory consistent. Munin's multiprotocol release consistency is implemented in software using a delayed update queue that buffers and merges pending outgoing writes. A sixteen-processor prototype of Munin is currently operational. We evaluate its implementation and describe the execution of two Munin programs that achieve performance within ten percent of message passing implementations of the same programs. Munin achieves this level of performance with only minor annotations to the shared memory programs.
Evaluation of Release Consistent Software Distributed Shared Memory on Emerging Network Technology
"... We evaluate the effect of processor speed, network characteristics, and software overhead on the performance of release-consistent software distributed shared memory. We examine five different protocols for implementing release consistency: eager update, eager invalidate, lazy update, lazy invalidat ..."
Abstract
-
Cited by 413 (43 self)
- Add to MetaCart
We evaluate the effect of processor speed, network characteristics, and software overhead on the performance of release-consistent software distributed shared memory. We examine five different protocols for implementing release consistency: eager update, eager invalidate, lazy update, lazy invalidate, and a new protocol called lazy hybrid. This lazy hybrid protocol combines the benefits of both lazy update and lazy invalidate. Our simulations indicate that with the processors and networks that are becoming available, coarse-grained applications such as Jacobi and TSP perform well, more or less independent of the protocol used. Medium-grained applications, such as Water, can achieve good performance, but the choice of protocol is critical. For sixteen processors, the best protocol, lazy hybrid, performed more than three times better than the worst, the eager update. Fine-grained applications such as Cholesky achieve little speedup regardless of the protocol used because of the frequency of synchronization operations and the high latency involved. While the use of relaxed memory models, lazy implementations, and multiple-writer protocols has reduced the impact of false sharing, synchronization latency remains a serious problem for software distributed shared memory systems. These results suggest that future work on software DSMs should concentrate on reducing the amount ofsynchronization or its effect.
Scope Consistency : A Bridge between Release Consistency and Entry Consistency
- In Proceedings of the 8th Annual ACM Symposium on Parallel Algorithms and Architectures
, 1996
"... The large granularity of communication and coherence in shared virtual memory systems causes problems with false sharing and extra communication. Relaxed memory consistency models have been used to alleviate these problems, but at a cost in programming complexity. Release Consistency (RC) and Lazy R ..."
Abstract
-
Cited by 135 (12 self)
- Add to MetaCart
The large granularity of communication and coherence in shared virtual memory systems causes problems with false sharing and extra communication. Relaxed memory consistency models have been used to alleviate these problems, but at a cost in programming complexity. Release Consistency (RC) and Lazy Release Consistency (LRC) are accepted to offer a reasonable tradeoff between performance and programming complexity. Entry Consistency (EC) offers a more relaxed consistency model, but it requires explicit association of shared data objects with synchronization variables. The programming burden of providing such associations can be substantial. This paper proposes a new consistency model for shared virtual memory, called Scope Consistency (ScC), which offers most of the potential performance advantages of the EC model without requiring explicit bindings between data and synchronization variables. Instead, ScC dynamically detects the bindings implied by the programmer allowing a programming i...
Adaptive Cache Coherency for Detecting Migratory Shared Data
- In Proceedings of the 20th Annual International Symposium on Computer Architecture
, 1993
"... Parallel programs exhibit a small number of distinct data-sharing patterns. A common data-sharing pattern, migratory access, is characterized by exclusive read and write access by one processor at a time to a shared datum. We describe a family of adaptive cache coherency protocols that dynamically i ..."
Abstract
-
Cited by 112 (3 self)
- Add to MetaCart
Parallel programs exhibit a small number of distinct data-sharing patterns. A common data-sharing pattern, migratory access, is characterized by exclusive read and write access by one processor at a time to a shared datum. We describe a family of adaptive cache coherency protocols that dynamically identify migratory shared data in order to reduce the cost of moving them. The protocols use a standard memory model and processor-cache interface. They do not require any compile-time or run-time software support. We describe implementations for bus-based multiprocessors and for shared-memory multiprocessors that use directory-based caches. These implementations are simple and would not significantly increase hardware cost. We use trace- and execution-driven simulation to compare the performance of the adaptive protocols to standard write-invalidate protocols. These simulations indicate that, compared to conventional protocols, the use of the adaptive protocol can almost halve the number of i...
Memory Consistency Models
, 1993
"... This paper discusses memory consistency models and their influence on software in the context of parallel machines. In the first part we review previous work on memory consistency models. The second part discusses the issues that arise due to weakening memory consistency. We are especially intereste ..."
Abstract
-
Cited by 98 (0 self)
- Add to MetaCart
This paper discusses memory consistency models and their influence on software in the context of parallel machines. In the first part we review previous work on memory consistency models. The second part discusses the issues that arise due to weakening memory consistency. We are especially interested in the influence that weakened consistency models have on language, compiler, and runtime system design. We conclude that tighter interaction between those parts and the memory system might improve performance considerably.
Object Distribution in Orca using Compile-Time and Run-Time Techniques
, 1993
"... Orca is a language for parallel programming on distributed systems. Communication in Orca is based on shared data-objects, which is a form of distributed shared memory. The performance of Orca programs depends strongly on how shared dataobjects are distributed among the local physical memories of th ..."
Abstract
-
Cited by 77 (20 self)
- Add to MetaCart
Orca is a language for parallel programming on distributed systems. Communication in Orca is based on shared data-objects, which is a form of distributed shared memory. The performance of Orca programs depends strongly on how shared dataobjects are distributed among the local physical memories of the processors. This paper studies a new and efficient solution to this problem, based on an integration of compile-time and run-time techniques. The Orca compiler has been extended to determine the access patterns of processes to shared objects. The compiler passes a summary of this information to the run-time system, which uses it to make good decisions about which objects to replicate and where to store nonreplicated objects. Measurements show that the new system gives better overall performance than any previous implementation of Orca. 3333333333333333 1 This research was supported in part by a PIONIER grant from the Netherlands Organization for Scientific Research (N.W.O.). 2 This re...
SoftFLASH: Analyzing the Performance of Clustered Distributed Virtual Shared Memory
- In Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems
, 1996
"... One potentially attractive way to build large-scale shared-memory machines is to use small-scale to medium-scale shared-memory machines as clusters that are interconnected with an off-the-shelf network. To create a shared-memory programming environment across the clusters, it is possible to use a vi ..."
Abstract
-
Cited by 76 (0 self)
- Add to MetaCart
One potentially attractive way to build large-scale shared-memory machines is to use small-scale to medium-scale shared-memory machines as clusters that are interconnected with an off-the-shelf network. To create a shared-memory programming environment across the clusters, it is possible to use a virtual shared-memory software layer. Because of the low latency and high bandwidth of the interconnect available within each cluster, there are clear advantages in making the clusters as large as possible. The critical question then becomes whether the latency and bandwidth of the top-level network and the software system are sufficient to support the communication demands generated by the clusters. To explore these questions, we have built an aggressive kernel implementation of a virtual shared-memory system using SGI multiprocessors and 100Mbyte/sec HIPPI interconnects. The system obtains speedups on 32 processors (four nodes, eight
MGS: A multigrain shared memory system
- In Proceedings of the 23rd Annual International Symposium on Computer Architecture
, 1996
"... Abstract Parallel workstations, each comprising 10-100 processors, promisecost-effective general-purpose multiprocessing. This paper explores the coupling of such small- to medium-scale shared mem-ory multiprocessors through software over a local area network to synthesize larger shared memory syste ..."
Abstract
-
Cited by 66 (6 self)
- Add to MetaCart
Abstract Parallel workstations, each comprising 10-100 processors, promisecost-effective general-purpose multiprocessing. This paper explores the coupling of such small- to medium-scale shared mem-ory multiprocessors through software over a local area network to synthesize larger shared memory systems. We call these systemsDistributed Scalable Shared-memory Multiprocessors (DSSMPs). This paper introduces the design of a shared memory system thatuses multiple granularities of sharing, and presents an implementation on the Alewife multiprocessor, called MGS. Multigrain sharedmemory enables the collaboration of hardware and software shared memory, and is effective at exploiting a form of locality called multi-grain locality. The system provides efficient support for fine-grain cache-line sharing, and resorts to coarse-grain page-level sharingonly when locality is violated. A framework for characterizing application performance on DSSMPs is also introduced. Using MGS, an in-depth study of several shared memory ap-plications is conducted to understand the behavior of DSSMPs. We find that unmodified shared memory applications can exploitmultigrain sharing. Keeping the number of processors fixed, applications execute up to 85 % faster when each DSSMP node is amultiprocessor as opposed to a uniprocessor. We also show that tightly-coupled multiprocessors hold a significant performance ad-vantage over DSSMPs on unmodified applications. However, a best-effort implementation of a kernel from one of the applicationsallows a DSSMP to almost match the performance of a tightlycoupled multiprocessor. 1 Introduction Large-scale shared memory multiprocessors have traditionally beenbuilt using custom communication interfaces, high performance VLSI networks, and special-purpose hardware support for sharedmemory. These systems achieve good performance on a wide range of applications; however, they are costly. Despite attempts to makecost (in addition to performance) scalable, fundamental obstacles prevent large tightly-coupled systems from being cost effective.Power distribution, clock distribution, cooling, and other packagAppears in Proceedings of the 23rd Annual InternationalSymposium on Computer Architecture, May 1996.
Message Passing Versus Distributed Shared Memory on Networks of Workstations
- IN PROCEEDINGS SUPERCOMPUTING '95
, 1995
"... We compare two paradigms for parallel programming on networks of workstations: message passing and distributed shared memory. We present results for nine applications that were implemented using both paradigms. The message passing programs are executed with the Parallel Virtual Machine (PVM) library ..."
Abstract
-
Cited by 52 (10 self)
- Add to MetaCart
We compare two paradigms for parallel programming on networks of workstations: message passing and distributed shared memory. We present results for nine applications that were implemented using both paradigms. The message passing programs are executed with the Parallel Virtual Machine (PVM) library and the shared memory programs are executed using TreadMarks. The programs are Water and Barnes-Hut from the SPLASH benchmark suite; 3-D FFT, Integer Sort (IS) and Embarrassingly Parallel (EP) from the NAS benchmarks; ILINK, a widely used genetic linkage analysis program; and Successive Over-Relaxation (SOR), Traveling Salesman (TSP), and Quicksort (QSORT). Two different input data sets were used for Water (Water-288 and Water-1728), IS (IS-Small and IS-Large), and SOR (SOR-Zero and SOR-NonZero). Our execution environment is a set of eight HP735 workstations connected by a 100Mbits per second FDDI network. For Water-1728, EP, ILINK, SOR-Zero, and SOR-NonZero, the performance of TreadMarks i...
Home-based Shared Virtual Memory
, 1998
"... In this dissertation, I investigate how to improve the performance of shared virtual memory (SVM) by examining consistency models, protocols, hardware support and applications. The main conclusion of this research is that the performance of shared virtual memory can be significantly improved when pe ..."
Abstract
-
Cited by 51 (4 self)
- Add to MetaCart
In this dissertation, I investigate how to improve the performance of shared virtual memory (SVM) by examining consistency models, protocols, hardware support and applications. The main conclusion of this research is that the performance of shared virtual memory can be significantly improved when performance-enhancing techniques from all these areas are combined. This dissertation proposes home-based lazy release consistency as a simple, effective, and scalable way to build shared virtual memory systems. In home-based protocols each shared page has a home to which all writes are propagated and from which all copies are derived. Two home-based protocols are described, implemented and evaluated on two hardware and software platforms: Automatic Update Release Consistency (AURC), which requires hardware support for fine-grained remote writes (automatic updates), and Homebased Lazy Release Consistency (HLRC), which is implemented exclusively in software. The dissertation investigates the ...

