Results 1 - 10
of
24
Shasta: A Low Overhead, Software-Only Approach . . . .
- IN PROCEEDINGS OF THE SEVENTH INTERNATIONAL CONFERENCE ON ARCHITECTURAL SUPPORT FOR PROGRAMMING LANGUAGES AND OPERATING SYSTEMS
, 1996
"... This paper describes Shasta, a system that supports a shared address space in software on clusters of computers with physically distributed memory. A unique aspect of Shasta compared to most other software distributed shared memory systems is that shared data can be kept coherent at a fine granu ..."
Abstract
-
Cited by 207 (5 self)
- Add to MetaCart
This paper describes Shasta, a system that supports a shared address space in software on clusters of computers with physically distributed memory. A unique aspect of Shasta compared to most other software distributed shared memory systems is that shared data can be kept coherent at a fine granularity. In addition, the system allows the coherence granularity to vary across different shared data structures in a single application. Shasta implements the shared address space by transparently rewriting the application executable to intercept loads and stores. For each shared load or store, the inserted code checks to see if the data is available locally and communicates with other processors if necessary. The system uses numerous techniques to reduce the run-time overhead of these checks. Since Shasta is implemented entirely in software, it also provides tremendous flexibility in supporting different types of cache coherence protocols. We have implemented an efficient cache co...
Application Restructuring and Performance Portability on Shared Virtual Memory and Hardware-Coherent Multiprocessors
- In Proceedings of the 6th ACM Symposium on Principles and Practice of Parallel Programming
, 1997
"... The performance portability of parallel programs across a wide range of emerging coherent shared address space systems is not well understood. Programs that run well on efficient, hardware cache-coherent systems often do not perform well on less optimal or more commodity-based communication architec ..."
Abstract
-
Cited by 58 (17 self)
- Add to MetaCart
The performance portability of parallel programs across a wide range of emerging coherent shared address space systems is not well understood. Programs that run well on efficient, hardware cache-coherent systems often do not perform well on less optimal or more commodity-based communication architectures. This paper studies this issue of performance portability, with the commodity communication architecture of interest being page-grained shared virtual memory. We begin with applications that perform well on moderate-scale hardware cache-coherent systems, and find that they do not do so well on SVM systems. Then, we examine whether and how the applications can be improved for SVM systems ---through data structuring or algorithmic enhancements---and the nature and difficulty of the optimizations. Finally, we examine the impact of the successful optimizations on hardware-coherent platforms themselves, to see whether they are helpful, harmful or neutral on those platforms. We develop a sys...
Relaxed Consistency and Coherence Granularity in DSM Systems: A Performance Evaluation
- In Proceedings of the 6th ACM Symposium on Principles and Practice of Parallel Programming
, 1997
"... During the past few years, two main approaches have been taken to improve the performance of software shared memory implementations: relaxing consistency models and providing fine-grained access control. Their performance tradeoffs, however, are not well understood. This paper studies these tradeoff ..."
Abstract
-
Cited by 43 (6 self)
- Add to MetaCart
During the past few years, two main approaches have been taken to improve the performance of software shared memory implementations: relaxing consistency models and providing fine-grained access control. Their performance tradeoffs, however, are not well understood. This paper studies these tradeoffs on a platform that provides access control in hardware but runs coherence protocols in software. We compare the performance of three protocols across four coherence granularities, using 12 applications on a 16-node cluster of workstations. Our results show that no single combination of protocol and granularity performs best for all the applications. The combination of a sequentially consistent (SC) protocol and fine granularity works well with 7 of the 12 applications. The combination of a multiple-writer, homebased lazy release consistency (HLRC) protocol and page granularity works well with 8 out of the 12 applications. For applications that suer performance losses in moving to coarser granularity under sequential consistency, the performance can usually be regained quite effectively using relaxed protocols, particularly HLRC. We also find that the HLRC protocol performs substantially better than a single-writer lazy release consistent (SW-LRC) protocol at coarse granularity for many irregular applications. For our applications and platform, when we use the original versions of the applications ported directly from hardware-coherent shared memory, we nd that the SC protocol with 256-byte granularity performs best on average. However, when the best versions of the applications are compared, the balance shifts in favor of HLRC at page granularity.
Using Network Interface Support to Avoid Asynchronous Protocol Processing in Shared Virtual Memory Systems
- In Proceedings of the 26th International Symposium on Computer Architecture
, 1999
"... The performance of page-based software shared virtual memory (SVM) is still far from that achieved on hardwarecoherent distributed shared memory (DSM) systems. The interrupt cost for asynchronous protocol processing has been found to be a key source of performance loss and complexity. This paper sho ..."
Abstract
-
Cited by 41 (7 self)
- Add to MetaCart
The performance of page-based software shared virtual memory (SVM) is still far from that achieved on hardwarecoherent distributed shared memory (DSM) systems. The interrupt cost for asynchronous protocol processing has been found to be a key source of performance loss and complexity. This paper shows that by providing simple and general support for asynchronous message handling in a commodity network interface (NI), and by altering SVM protocols appropriately, protocol activity can be decoupled from asynchronous message handling and the need for interrupts or polling can be eliminated. The NI mechanisms needed are generic, not SVM-dependent. They also require neither visibility into the node memory system nor code instrumentation to identify memory operations. We prototype the mechanisms and such a synchronous home-based LRC protocol, called GeNIMA (GEneral-purpose Network Interface support in a shared Memory Abstraction), on a cluster of SMPs with a programmable NI, though the mechan...
Fine-Grain Software Distributed Shared Memory on SMP Clusters
, 1997
"... Commercial SMP nodes are an attractive building block for software distributed shared memory systems. The advantages of using SMP nodes include fast communication among processors within the same SMP node and potential gains from clustering where remote data fetched by one processor is used by o ..."
Abstract
-
Cited by 41 (4 self)
- Add to MetaCart
Commercial SMP nodes are an attractive building block for software distributed shared memory systems. The advantages of using SMP nodes include fast communication among processors within the same SMP node and potential gains from clustering where remote data fetched by one processor is used by other processors on the same node. The widespread availability of SMP servers with small numbers of processors has led several researchers to consider their use as building blocks for Shared Virtual Memory (SVM) systems. These systems exploit the SMP cache-coherence hardware to support fine-grain communication within a node, and use software to support communication across nodes at a coarser page-size granularity. Our goal is to explore the use of SMP nodes in the context of the Shasta system. A unique aspect of Shasta compared to most other software distributed shared memory systems is that shared data can be kept coherent at a fine granularity. Shasta implements this coherence by i...
Sirocco: Cost-Effective Fine-Grain Distributed Shared Memory
- IN PROCEEDINGS OF THE 6TH INTERNATIONAL CONFERENCE ON PARALLEL ARCHITECTURES AND COMPILATION TECHNIQUES
, 1998
"... Software fine-grain distributed shared memory (FGDSM) provides a simplified shared-memory programming interface with minimal or no hardware support. Originally software FGDSMs targeted uniprocessor-node parallel machines. This paper presents Sirocco, a family of software FGDSMs implemented on a netw ..."
Abstract
-
Cited by 21 (4 self)
- Add to MetaCart
Software fine-grain distributed shared memory (FGDSM) provides a simplified shared-memory programming interface with minimal or no hardware support. Originally software FGDSMs targeted uniprocessor-node parallel machines. This paper presents Sirocco, a family of software FGDSMs implemented on a network of low-cost SMPs. Sirocco takes full advantage of SMP nodes by implementing inter-node sharing directly in hardware and overlapping computation with protocol execution. To maintain correct shared-memory semantics, however, SMP nodes require mechanisms to guarantee atomic coherence operations. Multiple SMP processors may also result in contention for shared resources and reduce performance. SMP nodes also impact the cost trade-off. While SMPs typically charge higher price-premiums, for a given system size SMP nodes substantially reduce networking hardware requirement as compared to uniprocessor nodes. In this paper
Address Translation Mechanisms in Network Interfaces
, 1998
"... Good network hardware performance is often squandered by overheads for accessing the network interface (NI) within a host. NIs that support user-level messaging avoid frequent operating system (OS) action yet unnecessary copying can still result in low performance. We explore improving application m ..."
Abstract
-
Cited by 14 (1 self)
- Add to MetaCart
Good network hardware performance is often squandered by overheads for accessing the network interface (NI) within a host. NIs that support user-level messaging avoid frequent operating system (OS) action yet unnecessary copying can still result in low performance. We explore improving application messaging performance by eliminating all unnecessary copies (minimal messaging). For minimal messaging, NIs must support address translation and must do so more richly than has been done in the past. NI address translation should flexibly support higher-level abstractions, map all user space, exploit translation locality, and degrade gracefully when locality is poor. We classify NI address translation implementations based on where the lookup and the miss handling are performed (CPU or NI). We present alternative designs and we consider how they interact with the OS. We provide simulation results that evaluate the alternative design points and we demonstrate feasibility with a real implement...
Removing the Overhead from Software-Based Shared Memory
, 2001
"... The implementation presented in this paper---DSZOOM-WF--- is a sequentially consistent, fine-grained distributed software-based shared memory. It demonstrates a protocol-handling overhead below a microsecond for all the actions involved in a remote load operation, to be compared to the fastest imple ..."
Abstract
-
Cited by 10 (5 self)
- Add to MetaCart
The implementation presented in this paper---DSZOOM-WF--- is a sequentially consistent, fine-grained distributed software-based shared memory. It demonstrates a protocol-handling overhead below a microsecond for all the actions involved in a remote load operation, to be compared to the fastest implementation to date of around ten microseconds. The all-software protocol is implemented assuming some basic low-level primitives in the cluster interconnect and an operating system bypass functionality, similar to the emerging InfiniBand standard. All interrupt- and/or poll-based asynchronous protocol processing is completely removed by running the entire coherence protocol in the requesting processor. This not only removes the asynchronous overhead, but also makes use of a processor that otherwise would stall. The technique is applicable to both page-based and fine-grain software-based shared memory. DSZOOM-WF consistently demonstrates performance comparable to hardware-based distributed shared memory implementations.
Mechanisms for Efficient Shared-Memory, Lock-Based Synchronization
- PhD thesis,University of Wisconsin,Madison,1999
, 1999
"... Efficient locking synchronization primitives are essential for achieving high performance in fine-grain, shared-memory parallel programs. One function of locking primitives is to enable exclusive access to shared data and critical sections of code. In this dissertation, I make the following six cont ..."
Abstract
-
Cited by 7 (1 self)
- Add to MetaCart
Efficient locking synchronization primitives are essential for achieving high performance in fine-grain, shared-memory parallel programs. One function of locking primitives is to enable exclusive access to shared data and critical sections of code. In this dissertation, I make the following six contributions. (1) I propose a framework, the synchronization period, in which to reason about the inefficiencies of locking primitives. (2) I identify four previously proposed locking mechanisms (local spinning, queue-based locking, collocation, and synchronous prefetch) and uses them to classify existing locking primitives according to which of these mechanisms they incorporate. (3) With detailed simulations, I show the extent to which these four mechanisms can improve the performance of sharedmemory programs. I evaluate the space of these mechanisms using sixteen synchronization constructs, which are formed from six base types of locks (test&set, test&test&set, MCS, LH, M, and QOLB). I show t...
Mechanisms for Distributed Shared Memory
, 1996
"... Distributed shared memory (DSM) systems simplify the task of writing distributedmemory parallel programs by automating data distribution and communication. Unfortunately, DSM systems control memory and communication using fixed policies, even when programmers or compilers could manage these resource ..."
Abstract
-
Cited by 7 (1 self)
- Add to MetaCart
Distributed shared memory (DSM) systems simplify the task of writing distributedmemory parallel programs by automating data distribution and communication. Unfortunately, DSM systems control memory and communication using fixed policies, even when programmers or compilers could manage these resources more efficiently. This thesis proposes a new approach that lets users efficiently manage communication and memory on DSM systems. Systems provide primitive DSM mechanisms without binding them to fixed protocols (policies). Standard shared-memory programs use default protocols similar to those found in current DSM machines. Unlike current systems, these protocols are implemented in unprivileged software. Programmers and compilers are free to modify or replace them with optimized custom protocols that manage memory and communication directly and efficiently. To explore this new approach, this thesis: . identifies a set of mechanisms for distributed shared memory, . develops Tempest, a portab...

