TreadMarks: Distributed Shared Memory on Standard Workstations and Operating Systems
In Proceedings of the 1994 Winter USENIX Conference, 1994
Cited by 465 (17 self)
TreadMarks is a distributed shared memory (DSM) system for standard Unix systems such as SunOS and Ultrix. This paper presents a performance evaluation of TreadMarks running on Ultrix using DECstation-5000/240's that are connected by a 100-Mbps switch-based ATM LAN and a 10-Mbps Ethernet. Our objective is to determine the efficiency of a user-level DSM implementation on commercially available workstations and operating systems. We achieved good speedups on the 8-processor ATM network for Jacobi (7.4), TSP (7.2), Quicksort (6.3), and ILINK (5.7). For a slightly modified version of Water from the SPLASH benchmark suite, we achieved only moderate speedups (4.0) due to the high communication and synchronization rate. Speedups decline on the 10-Mbps Ethernet (5.5 for Jacobi, 6.5 for TSP, 4.2 for Quicksort, 5.1 for ILINK, and 2.1 for Water), reflecting the bandwidth limitations of the Ethernet. These results support the contention that, with suitable networking technology, DSM is a...
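The speedups quoted above are easier to compare as parallel efficiency, E = S / p, the fraction of ideal linear speedup achieved. The short Python sketch below computes it from the reported figures. It assumes, as the abstract implies but does not state outright for the Ethernet runs, that every figure is for 8 processors.

```python
# Speedups quoted in the TreadMarks abstract. Assumption: the Ethernet
# figures, like the ATM ones, were measured on 8 processors.
PROCESSORS = 8
SPEEDUPS = {
    "ATM": {"Jacobi": 7.4, "TSP": 7.2, "Quicksort": 6.3, "ILINK": 5.7, "Water": 4.0},
    "Ethernet": {"Jacobi": 5.5, "TSP": 6.5, "Quicksort": 4.2, "ILINK": 5.1, "Water": 2.1},
}


def efficiency(speedup: float, processors: int = PROCESSORS) -> float:
    """Parallel efficiency: fraction of ideal linear speedup achieved."""
    return speedup / processors


def efficiency_table() -> dict:
    """Map (network, application) to parallel efficiency."""
    return {
        (net, app): efficiency(s)
        for net, apps in SPEEDUPS.items()
        for app, s in apps.items()
    }


if __name__ == "__main__":
    for (net, app), e in efficiency_table().items():
        print(f"{net:<8} {app:<9} E = {e:.2f}")
```

By this measure Jacobi on ATM reaches roughly 92.5% of linear speedup, while Water on Ethernet reaches about 26%, consistent with the abstract's remark about Ethernet bandwidth limits.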
Lazy Release Consistency for Distributed Shared Memory
1995
Cited by 95 (0 self)
A software distributed shared memory (DSM) system allows shared memory parallel programs to execute on networks of workstations. This thesis presents a new class of protocols that has lower communication requirements than previous DSM protocols, and can consequently achieve higher performance. The lazy release consistent protocols achieve this reduction in communication by piggybacking consistency information on top of existing synchronization transfers. Some of the protocols also improve performance by speculatively moving data. We evaluate the impact of these features by comparing the performance of a software DSM using lazy protocols with that of a DSM using previous eager protocols. We found that seven of our eight applications performed better on the lazy system, and four of the applications showed performance speedups of at least 18%. As part of this comparison, we show that the cost of executing the slightly more complex code of the lazy protocols is far less important than the ...
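The idea of piggybacking consistency information on synchronization transfers can be illustrated with a toy model. The Python sketch below assumes a single lock, write notices recorded at page granularity, and vector timestamps used to decide which notices the acquirer has not yet seen. The class names (Interval, Process, Lock) are hypothetical, and the sketch omits diffs, data fetching, barriers, and garbage collection; it is a minimal illustration of the general lazy release consistency idea, not the thesis's actual protocol. An eager protocol would instead push notices to every process at release time.

```python
class Interval:
    """A closed interval of one process's execution and its write notices."""

    def __init__(self, proc: int, index: int, pages: frozenset):
        self.proc = proc      # process that created the interval
        self.index = index    # interval number on that process
        self.pages = pages    # write notices: pages modified in the interval


class Process:
    def __init__(self, pid: int, nprocs: int):
        self.pid = pid
        self.vc = [0] * nprocs   # vector timestamp: intervals seen per process
        self.known = []          # all intervals this process knows about
        self.dirty = set()       # pages written in the current, open interval
        self.invalid = set()     # pages whose local copy is stale

    def write(self, page: int) -> None:
        # Fetching a stale page before writing it is not modeled here.
        self.dirty.add(page)

    def close_interval(self) -> None:
        # At a release, close the interval and record its write notices
        # locally. Nothing is sent to other processes: that is the lazy part.
        self.vc[self.pid] += 1
        self.known.append(Interval(self.pid, self.vc[self.pid], frozenset(self.dirty)))
        self.dirty.clear()


class Lock:
    def __init__(self):
        self.last_releaser = None

    def acquire(self, p: Process) -> int:
        """Grant the lock to p; return how many write notices were piggybacked."""
        r = self.last_releaser
        if r is None or r is p:
            return 0
        # Send only intervals p has not seen, judged by p's vector timestamp.
        missing = [iv for iv in r.known if iv.index > p.vc[iv.proc]]
        for iv in missing:
            p.known.append(iv)
            p.invalid |= iv.pages   # invalidate; data would be fetched on access
        p.vc = [max(a, b) for a, b in zip(p.vc, r.vc)]
        return sum(len(iv.pages) for iv in missing)

    def release(self, p: Process) -> None:
        p.close_interval()
        self.last_releaser = p
```

In a short run where process 0 writes pages 3 and 5 under the lock, process 2 receives nothing at that release. It learns of those pages, together with process 1's later write to page 7, only when it acquires the lock itself.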
Software Write Detection for a Distributed Shared Memory
In Proceedings of the First USENIX Symposium on Operating System Design and Implementation, 1994
Cited by 93 (0 self)
Most software-based distributed shared memory (DSM) systems rely on the operating system's virtual memory interface to detect writes to shared data. Strategies based on virtual memory page protection create two problems for a DSM system. First, writes can have high overhead because they are detected with a page fault. As a result, a page must be written many times to amortize the cost of that fault. Second, a virtual memory page is too large to serve as the unit of coherency, which induces false sharing. Mechanisms to handle false sharing can increase runtime overhead and may cause data to be communicated unnecessarily between processors. In this paper, we present a new method for write detection that solves these problems. Our method relies on the compiler and runtime system to detect writes to shared data without invoking the operating system. We measure and compare implementations of a distributed shared memory system using both strategies, virtual memory and compiler/runtime, run...
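The contrast between page-fault detection and compiler/runtime detection can be sketched as follows. The Python model below assumes a scheme in which each store to shared data is preceded by code that sets a per-block dirty flag, which the runtime scans at synchronization points. The 64-byte block size, the class SharedSegment, and the function conflicting_units are all hypothetical; the abstract does not give the paper's actual instrumentation, block size, or data structures, and Python cannot show the real cost difference, only the logic.

```python
BLOCK_SIZE = 64     # hypothetical coherence-block size in bytes (assumption)
PAGE_SIZE = 4096    # a common VM page size, used only for comparison


class SharedSegment:
    """Shared data plus one dirty flag per block, maintained in software."""

    def __init__(self, size: int):
        self.data = bytearray(size)
        self.dirty = bytearray((size + BLOCK_SIZE - 1) // BLOCK_SIZE)

    def store(self, addr: int, value: int) -> None:
        # Stand-in for code a compiler could emit before each shared store:
        # mark the enclosing block dirty, then write. No page fault occurs.
        self.dirty[addr // BLOCK_SIZE] = 1
        self.data[addr] = value

    def collect_dirty_blocks(self) -> list:
        # At a synchronization point the runtime scans the flags to find
        # what changed, then resets them for the next interval.
        blocks = [i for i, flag in enumerate(self.dirty) if flag]
        for i in blocks:
            self.dirty[i] = 0
        return blocks


def conflicting_units(writes_a, writes_b, unit: int) -> set:
    """Coherence units written by both processes at the given granularity."""
    return {addr // unit for addr in writes_a} & {addr // unit for addr in writes_b}
```

Two processes writing bytes 0-7 and 128-135 collide on page 0 but touch different 64-byte blocks; this is the false sharing that a finer coherence unit avoids.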
Improving the Performance of DSM Systems via Compiler Involvement
In Proceedings of Supercomputing '94, 1994
Cited by 26 (1 self)
Distributed shared memory (DSM) systems provide an illusion of shared memory on distributed-memory systems such as workstation networks and some parallel computers, such as the Cray T3D and Convex SPP-1. This illusion is provided by hardware enhancements, software, or a combination of the two. On these systems, users can write programs in a shared-memory style instead of using message passing, which is tedious and error-prone. Our experience with one such system, TreadMarks, has shown that a large class of applications does not perform well on these systems. TreadMarks is a software distributed shared memory system designed by Rice University researchers to run on networks of workstations and massively parallel computers. Because the memory is physically distributed, shared-memory synchronization primitives such as locks and barriers often cause significant amounts of communication. We have provided a set of powerful primitives that will alleviate the problems with...
Region-Oriented Main Memory Management in Shared-Memory NUMA Multiprocessors
1992
Cited by 9 (1 self)
The need to achieve higher performance through greater degrees of parallelism necessitates distributing the memory throughout a multiprocessor system to reduce contention and increase scalability. Unfortunately, such Non-Uniform Memory Access time (NUMA) multiprocessors introduce complications for programmers, who must now be concerned with the physical distribution of their data in order to extract good performance from the system. The impact of remote memory accesses can be reduced through replication and migration, either in processor caches or in main memory. However, the effectiveness of caches is limited for large data sets due to capacity misses, while dynamic virtual memory page management suffers from a mismatch between the pages being replicated and the data structures in programs. In this thesis we propose that data be partitioned into Shared Regions reflecting the granularity of data sharing in programs, and that special synchronization calls be added to enforce...
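The mismatch between replicated pages and program data structures can be made concrete by mapping a few data structures, modeled as named byte ranges, onto pages. The Python sketch below illustrates only the problem the thesis targets, not its Shared Regions mechanism, whose interface the abstract does not give. The 4096-byte page size, the function names, and the example layout are assumptions.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


def pages_spanned(start: int, length: int, page_size: int = PAGE_SIZE) -> list:
    """Page numbers covered by the byte range [start, start + length)."""
    first = start // page_size
    last = (start + length - 1) // page_size
    return list(range(first, last + 1))


def regions_per_page(regions: dict, page_size: int = PAGE_SIZE) -> dict:
    """For each page, the named data structures (byte ranges) that overlap it."""
    table = {}
    for name, (start, length) in regions.items():
        for page in pages_spanned(start, length, page_size):
            table.setdefault(page, []).append(name)
    return table
```

With a hypothetical layout of row_a at bytes 0-999, row_b at 1000-1999, and grid at 2000-11999, page 0 holds parts of all three structures. Replicating or migrating that page on behalf of row_a therefore also moves row_b and the start of grid, whether or not they are being shared in the same way.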
Improving the Compiler/Software DSM Interface: Preliminary Results
In Proceedings of the First SUIF Compiler Workshop, 1996
Cited by 7 (3 self)
Current parallelizing compilers for message-passing machines support only a limited class of data-parallel applications. One way to remove this restriction is to combine powerful shared-memory parallelizing compilers with software distributed shared memory (DSM) systems. Preliminary results show that simply combining the parallelizer and software DSM yields very poor performance. The compiler/software DSM interface can be improved with relatively little compiler input by: 1) combining the communication of synchronization and parallelism information at parallel task invocation, 2) employing customized routines for evaluating reduction operations, and 3) selecting a hybrid update protocol that presends data by flushing updates at barriers. These optimizations yield decent speedups for program kernels but are not sufficient for entire programs. Based on our experimental results, we point out areas where additional compiler analysis and software DSM improvements are necessary to achieve goo...
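The abstract mentions customized reduction routines without describing them. One common approach, assumed here rather than taken from the paper, is to have each worker reduce its share privately and update shared data once per worker instead of once per element, which cuts the lock and page transfers a software DSM must perform. The sequential Python sketch below only counts shared updates and does not model the DSM itself; the function names are hypothetical.

```python
import operator
from functools import reduce


def shared_accumulator(chunks):
    """Every element updates one shared total under a lock.

    Returns (result, lock_acquisitions). On a software DSM each acquisition
    can move the lock and the page holding the total between processors.
    """
    total, acquisitions = 0, 0
    for chunk in chunks:          # one chunk per worker
        for x in chunk:
            acquisitions += 1     # stands for: lock; total += x; unlock
            total += x
    return total, acquisitions


def private_partials(chunks, op=operator.add, identity=0):
    """Each worker reduces its chunk privately; partials are combined once.

    Returns (result, shared_updates): one shared update per worker.
    """
    partials = [reduce(op, chunk, identity) for chunk in chunks]
    return reduce(op, partials, identity), len(partials)
```

For three workers holding [1, 2, 3], [4, 5], and [6], both versions compute 21, but the private-partials version performs 3 shared updates instead of 6. The operator must be associative for the regrouping to be valid.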
Region-Oriented Memory Management in Shared-Memory Multiprocessors
1992
Cited by 3 (1 self)
Effective caching and memory locality are essential for good performance of parallel applications on non-uniform memory access (NUMA) shared-memory multiprocessors. The need for multi-level memory coherence requires that consistency of shared data be maintained both within each level and across adjacent levels of the memory hierarchy. Existing static program analysis techniques for managing shared data at the cache level are limited because dynamic program data sharing behaviour is difficult to determine at compile time. Traditional schemes for managing shared data at the main memory level are limited because they manage shared data at page granularity and without knowledge of application behaviour. We introduce the notion of shared regions as the natural unit of data sharing in parallel applications, and show that this concept may be used in effective cache and main memory management. Algorithms for region-oriented cache and main memory management are presented. Region-oriented cache ...
A Framework for Multiprocessor Performance Characterization and Calibration
1992
Cited by 2 (2 self)
By Arun K. Nanda
In parallel programs using the shared-variable paradigm, run-time communication overhead manifests itself along three principal dimensions: shared data accesses (including memory contention, cache misses, and non-local memory access latencies), inter-process synchronization operations, and global barrier synchronizations. Performance measurements that quantify the rate at which an algorithm's communication costs increase as more processors are used are integral to the study of its efficiency and scalability. In this thesis, we explore the problem of performance characterization of a multiprocessor in the context of the shared-variable programming model, with emphasis on characterizing the dynamic run-time behavior. We have developed a hierarchical model to characterize multiprocessor system performance using a multi-phase computation structure with concurrent asynchronous exec...

