Results 1 - 10
of
70
Implementation and performance of Munin
- IN PROCEEDINGS OF THE 13TH ACM SYMPOSIUM ON OPERATING SYSTEMS PRINCIPLES
, 1991
"... Munin is a distributed shared memory (DSM) system that allows shared memory parallel programs to be executed efficiently on distributed memory multiprocessors. Munin is unique among existing DSM systems in its use of multiple consistency protocols and in its use of release consistency. In Munin, sha ..."
Abstract
-
Cited by 523 (23 self)
- Add to MetaCart
Munin is a distributed shared memory (DSM) system that allows shared memory parallel programs to be executed efficiently on distributed memory multiprocessors. Munin is unique among existing DSM systems in its use of multiple consistency protocols and in its use of release consistency. In Munin, shared program variables are annotated with their expected access pattern, and these annotations are then used by the runtime system to choose a consistency protocol best suited to that access pattern. Release consistency allows Munin to mask network latency and reduce the number of messages required to keep memory consistent. Munin's multiprotocol release consistency is implemented in software using a delayed update queue that buffers and merges pending outgoing writes. A sixteen-processor prototype of Munin is currently operational. We evaluate its implementation and describe the execution of two Munin programs that achieve performance within ten percent of message passing implementations of the same programs. Munin achieves this level of performance with only minor annotations to the shared memory programs.
Evaluation of Release Consistent Software Distributed Shared Memory on Emerging Network Technology
"... We evaluate the effect of processor speed, network characteristics, and software overhead on the performance of release-consistent software distributed shared memory. We examine five different protocols for implementing release consistency: eager update, eager invalidate, lazy update, lazy invalidat ..."
Abstract
-
Cited by 413 (43 self)
- Add to MetaCart
We evaluate the effect of processor speed, network characteristics, and software overhead on the performance of release-consistent software distributed shared memory. We examine five different protocols for implementing release consistency: eager update, eager invalidate, lazy update, lazy invalidate, and a new protocol called lazy hybrid. This lazy hybrid protocol combines the benefits of both lazy update and lazy invalidate. Our simulations indicate that with the processors and networks that are becoming available, coarse-grained applications such as Jacobi and TSP perform well, more or less independent of the protocol used. Medium-grained applications, such as Water, can achieve good performance, but the choice of protocol is critical. For sixteen processors, the best protocol, lazy hybrid, performed more than three times better than the worst, the eager update. Fine-grained applications such as Cholesky achieve little speedup regardless of the protocol used because of the frequency of synchronization operations and the high latency involved. While the use of relaxed memory models, lazy implementations, and multiple-writer protocols has reduced the impact of false sharing, synchronization latency remains a serious problem for software distributed shared memory systems. These results suggest that future work on software DSMs should concentrate on reducing the amount ofsynchronization or its effect.
Munin: Distributed Shared Memory Based on Type-Specific Memory Coherence
, 1990
"... We are developing Munin, a system that allows programs written for shared memory multiprocessors to be executed efficiently on distributed memory machines. Munin attempts to overcome the architectural limitations of shared memory machines, while maintaining their advantages in terms of ease of progr ..."
Abstract
-
Cited by 260 (15 self)
- Add to MetaCart
We are developing Munin, a system that allows programs written for shared memory multiprocessors to be executed efficiently on distributed memory machines. Munin attempts to overcome the architectural limitations of shared memory machines, while maintaining their advantages in terms of ease of programming. Our system is unique in its use of loosely coherent memory, based on the partial order specified by a shared memory parallel program, and in its use of type-specific memory coherence. Instead of a single memory coherence mechanism for all shared data objects, Munin employs several different mechanisms, each appropriate for a different class of shared data object. These type-specific mechanisms are part of a runtime system that accepts hints from the user or the compiler to determine the coherence mechanism to be used for each object. This paper focuses on the design and use of Munin's memory coherence mechanisms, and compares our approach to previous work in this area.
LimitLESS Directories: A Scalable Cache Coherence Scheme
, 1991
"... Caches enhance the performance of multiprocessors by reducing network tra#c and average memory access latency. However, cache-based systems must address the problem of cache coherence. We propose the LimitLESS directory protocol to solve this problem. The LimitLESS scheme uses a combination of hardw ..."
Abstract
-
Cited by 195 (24 self)
- Add to MetaCart
Caches enhance the performance of multiprocessors by reducing network tra#c and average memory access latency. However, cache-based systems must address the problem of cache coherence. We propose the LimitLESS directory protocol to solve this problem. The LimitLESS scheme uses a combination of hardware and software techniques to realize the performance of a full-map directory with the memory overhead of a limited directory. This protocol is supported by Alewife, a large-scale multiprocessor. We describe the architectural interfaces needed to implement the LimitLESS directory, and evaluate its performance through simulations of the Alewife machine.
Performance Evaluation of Memory Consistency Models for Shared-Memory Multiprocessors
"... The memory consistency model supported by a multiprocessor architecture determines the amount of buffering and pipelining that may be used to hide or reduce the latency of memory accesses. Several different consistency models have been proposed. These range from sequential consistency on one end, al ..."
Abstract
-
Cited by 146 (9 self)
- Add to MetaCart
The memory consistency model supported by a multiprocessor architecture determines the amount of buffering and pipelining that may be used to hide or reduce the latency of memory accesses. Several different consistency models have been proposed. These range from sequential consistency on one end, allowing very limited buffering, to release consistency on the other end, allowing extensive buffering and pipelining. The processor consistency and weak consistency models fall in between. The advantage of the less strict models is increased performance potential. The disadvantage is increased hardware complexity and a more complex programming model. To make an informed decision on the above tradeoff requires performance data for the various models. This paper addresses the issue of performance benefits from the above four consistency models. Our results are based on simulation studies done for three applications. The results show that in an environment where processor reads are blocking and writes are buffered, a significant performance increase is achieved from allowing reads to bypass previous writes. Pipelining of writes, which determines the rate at which writes are retired from the write buffer, is of secondary importance. As a result, we show that the sequential consistency model performs poorly relative to all other models, while the processor consistency model provides most of the benefits of the weak and release consistency models.
The MIT Alewife Machine: A Large-Scale Distributed-Memory Multiprocessor
- In Proceedings of Workshop on Scalable Shared Memory Multiprocessors
, 1991
"... The Alewife multiprocessor project focuses on the architecture and design of a large-scale parallel machine. The machine uses a low-dimensional direct interconnection network to provide scalable communication bandwidth, while allowing the exploitation of locality. Despite its distributed-memory arch ..."
Abstract
-
Cited by 138 (22 self)
- Add to MetaCart
The Alewife multiprocessor project focuses on the architecture and design of a large-scale parallel machine. The machine uses a low-dimensional direct interconnection network to provide scalable communication bandwidth, while allowing the exploitation of locality. Despite its distributed-memory architecture, Alewife allows efficient shared-memory programming through a multilayered approach to locality management. A new scalable cache-coherence scheme called LimitLESS directories allows the use of caches for reducing communication latency and network bandwidth requirements. Alewife also employs run-time and compile-time methods for partitioning and placement of data and processes to enhance communication locality. While the above methods attempt to minimize communication latency, communication with distant processors cannot be completely avoided. Alewife's processor, Sparcle, is designed to tolerate these latencies by rapidly switching between threads of computation. This paper describe...
False Sharing and Spatial Locality in Multiprocessor Caches
- IEEE Transactions on Computers
, 1992
"... The performance of the data cache in shared-memory multiprocessors has been shown to be different from that in uniprocessors. In particular, cache miss rates in multiprocessors do not show the sharp drop typical of uniprocessors when the size of the cache block increases. The resulting high cache mi ..."
Abstract
-
Cited by 114 (4 self)
- Add to MetaCart
The performance of the data cache in shared-memory multiprocessors has been shown to be different from that in uniprocessors. In particular, cache miss rates in multiprocessors do not show the sharp drop typical of uniprocessors when the size of the cache block increases. The resulting high cache miss rate is a cause of concern, since it can significantly limit the performance of multiprocessors. Some researchers have speculated that this effect is due to false sharing, the coherence transactions that result when different processors update different words of the same cache block in an interleaved fashion. While the analysis of six applications in this paper confirms that false sharing has a significant impact on the miss rate, the measurements also show that poor spatial locality among accesses to shared data has an even larger impact. To mitigate false sharing and to enhance spatial locality, we optimize the layout of shared data in cache blocks in a programmer-transparent manner. We...
Adaptive Software Cache Management for Distributed Shared Memory Architectures
- In Proceedings of the 17th Annual International Symposium on Computer Architecture
, 1990
"... An adaptive cache coherence mechanism exploits semantic information about the expected or observed access behavior of particular data objects. We contend that, in distributed shared memory systems, adaptive cache coherence mechanisms will outperform static cache coherence mechanisms. We have examine ..."
Abstract
-
Cited by 112 (15 self)
- Add to MetaCart
An adaptive cache coherence mechanism exploits semantic information about the expected or observed access behavior of particular data objects. We contend that, in distributed shared memory systems, adaptive cache coherence mechanisms will outperform static cache coherence mechanisms. We have examined the sharing and synchronization behavior of a variety of shared memory parallel programs. We have found that the access patterns of a large percentage of shared data objects fall in a small number of categories for which efficient software coherence mechanisms exist. In addition, we have performed a simulation study that provides two examples of how an adaptive caching mechanism can take advantage of semantic information. 1 Introduction We are developing Munin [4], a system that will allow programs written for shared memorymultiprocessors to be executed efficiently on distributed memory machines. What distinguishes Munin from previous distributed shared memory systems [6, 12, 14] is the ...
Adaptive Cache Coherency for Detecting Migratory Shared Data
- In Proceedings of the 20th Annual International Symposium on Computer Architecture
, 1993
"... Parallel programs exhibit a small number of distinct data-sharing patterns. A common data-sharing pattern, migratory access, is characterized by exclusive read and write access by one processor at a time to a shared datum. We describe a family of adaptive cache coherency protocols that dynamically i ..."
Abstract
-
Cited by 110 (3 self)
- Add to MetaCart
Parallel programs exhibit a small number of distinct data-sharing patterns. A common data-sharing pattern, migratory access, is characterized by exclusive read and write access by one processor at a time to a shared datum. We describe a family of adaptive cache coherency protocols that dynamically identify migratory shared data in order to reduce the cost of moving them. The protocols use a standard memory model and processor-cache interface. They do not require any compile-time or run-time software support. We describe implementations for bus-based multiprocessors and for shared-memory multiprocessors that use directory-based caches. These implementations are simple and would not significantly increase hardware cost. We use trace- and execution-driven simulation to compare the performance of the adaptive protocols to standard write-invalidate protocols. These simulations indicate that, compared to conventional protocols, the use of the adaptive protocol can almost halve the number of i...
Techniques for reducing consistency-related communication in distributed shared memory systems
- ACM Transactions on Computer Systems. Toappear
"... Distributed shared memory (DSM) is a software abstraction of shared memory on a distributed memory machine. One of the key problems in building an e cient DSM system is to reduce the amount ofcommunication needed to keep the distributed memories consistent. In this paper we presentfour techniques fo ..."
Abstract
-
Cited by 93 (13 self)
- Add to MetaCart
Distributed shared memory (DSM) is a software abstraction of shared memory on a distributed memory machine. One of the key problems in building an e cient DSM system is to reduce the amount ofcommunication needed to keep the distributed memories consistent. In this paper we presentfour techniques for doing so: (1) software release consistency � (2) multiple consistency protocols � (3) write-shared protocols � and (4) an update-with-timeout mechanism. These techniques have been implemented in the Munin DSM system. We compare the performance of seven application programs on Munin, rst to their performance when implemented using message passing and then to their performance when running on a conventional DSM system that does not embody the above techniques. On a 16-processor cluster of workstations, Munin's performance is within 5 % of message passing for four out of the seven applications. For the other three, performance is within 29 % to 33%. Detailed analysis of two of these three applications indicates that the addition of a function shipping capability would bring their performance within 7 % of the message passing performance. Compared to a conventional DSM system, Munin achieves performance improvements ranging from a few to several hundred percent, depending on the application.

