Performance Evaluation of the Orca Shared Object System
ACM Transactions on Computer Systems, 1998
Cited by 63 (42 self)
Orca is a portable, object-based distributed shared memory system. This paper studies and evaluates the design choices made in the Orca system and compares Orca with other DSMs. The paper gives a quantitative analysis of Orca's coherence protocol (based on write-updates with function shipping), the totally-ordered group communication protocol, the strategy for object placement, and the all-software, user-space architecture. Performance measurements for ten parallel applications illustrate the tradeoffs made in the design of Orca, and also show that essentially the right design decisions have been made. A write-update protocol with function shipping is effective for Orca, especially since it is used in combination with techniques that avoid replicating objects that have a low read/write ratio. The overhead of totally-ordered group communication on application performance is low. The Orca system is able to make near-optimal decisions for object placement and replication. In addition, the...
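As a rough illustration of what "write-updates with function shipping" means, the sketch below keeps a copy of a shared object on every replica, serves reads locally, and performs a write by sending the operation itself (not the modified data) to all replicas in one agreed order. This is a minimal single-process model, not Orca's implementation: the names `Replica`, `TotalOrderBroadcast`, `add`, and `double` are hypothetical, and the totally-ordered group communication is simulated by a plain loop.

```python
class Replica:
    """One machine's local copy of a shared object (hypothetical model)."""

    def __init__(self):
        self.state = {"count": 0}

    def read(self, key):
        # Reads are served from the local copy; no messages are needed.
        return self.state[key]

    def apply(self, operation, args):
        # Run the shipped operation against the local copy.
        operation(self.state, *args)


class TotalOrderBroadcast:
    """Stand-in for totally-ordered group communication (assumption)."""

    def __init__(self, replicas):
        self.replicas = replicas
        self.delivered = 0

    def send(self, operation, args=()):
        # Operations are delivered one at a time, so every replica
        # applies them in the same order.
        for replica in self.replicas:
            replica.apply(operation, args)
        self.delivered += 1


def add(state, amount):
    state["count"] += amount


def double(state):
    state["count"] *= 2


replicas = [Replica() for _ in range(3)]
group = TotalOrderBroadcast(replicas)
group.send(add, (5,))   # ship the operation and its argument, not the new value
group.send(double)
values = [r.read("count") for r in replicas]
```

Because `add` and `double` do not commute, the replicas only stay identical if all of them apply the two writes in the same order, which is the role the total ordering plays in this model.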
OpenMP for Networks of SMPs
Cited by 45 (0 self)
In this paper, we present the first system that implements OpenMP on a network of shared-memory multiprocessors. This system enables the programmer to rely on a single, standard, shared-memory API for parallelization within a multiprocessor and between multiprocessors. It is implemented via a translator that converts OpenMP directives to appropriate calls to a modified version of the TreadMarks software distributed shared memory system (SDSM). In contrast to previous SDSM systems for SMPs, the modified TreadMarks uses POSIX threads for parallelism within an SMP node. This approach greatly simplifies the changes required to the SDSM in order to exploit the intra-node hardware shared memory. We present performance results for six applications (SPLASH-2 Barnes-Hut and Water, NAS 3D-FFT, SOR, TSP and MGS) running on an SP2 with four four-processor SMP nodes. A comparison between the threaded implementation and the original implementation of TreadMarks shows that using the hardware shared memory within an SMP node significantly reduces the amount of data and the number of messages transmitted between nodes, and consequently achieves speedups up to 30% better than the original versions. We also compare SDSM against message passing. Overall, the speedups of multithreaded TreadMarks programs are within 7–30% of the MPI versions.
JIAJIA: An SVM System Based on a New Cache Coherence Protocol
980001, CENTER OF HIGH PERFORMANCE COMPUTING, INSTITUTE OF COMPUTING TECHNOLOGY, CHINESE ACADEMY OF SCIENCES, 1998
Cited by 37 (12 self)
This paper describes the design and evaluation of a Shared Virtual Memory (SVM) system called JIAJIA. The objective of JIAJIA is two-fold: extending memory space and improving performance. Based on the observation that the system overhead of a complex SVM system may effectively offset the performance gains brought by the additional complexity, JIAJIA is designed to be as simple as possible. Compared to recent SVM systems such as TreadMarks, CVM, and Quarks, JIAJIA distinguishes itself in the following aspects:
• Physical memories of multiple computers are combined to form a larger shared space. In other recent SVM systems such as TreadMarks, CVM, and Quarks, the shared address space is limited by the size of the memory of a single computer. In JIAJIA, the shared space can be as large as the sum of the machines' local memories.
• A lock-based cache coherence protocol for scope consistency is proposed to simplify the design. Our protocol is lock-based because it totally eliminates directory and all ...
Tradeoffs between False Sharing and Aggregation in Software Distributed Shared Memory
PROCEEDINGS OF THE SIXTH SYMPOSIUM ON PRINCIPLES AND PRACTICE OF PARALLEL PROGRAMMING, 1997
Cited by 33 (1 self)
Software Distributed Shared Memory (DSM) systems based on virtual memory techniques traditionally use the hardware page as the consistency unit. The large size of the hardware page is considered to be a performance bottleneck because of the implied false sharing overheads. Instead, we show that in the presence of a relaxed consistency model and a multiple writer protocol, a large consistency unit is generally not detrimental to performance. We study the tradeoffs between false sharing and aggregation effects when using large consistency units. In this context, this paper makes three separate contributions:
1. We document the cost of false sharing in terms of extra messages and extra data being communicated. We find that, for the applications considered, when the virtual memory page is used as the consistency unit, the number of extra messages is small, while the amount of extra data can be substantial.
2. We evaluate the performance when the consistency unit is increased to a multiple of the virtual memory page size. For most applications and data sets, the performance improves, except when the false sharing effects include extra messages or a large amount of extra data.
3. We present a new algorithm for dynamically aggregating pages. In our algorithm, the aggregated pages do not necessarily need to be contiguous. In all cases, the performance of our dynamic aggregation algorithm is similar to that achieved with the best static page size.
These results were obtained by measuring the performance of eight applications on the TreadMarks distributed shared memory system. The hardware platform used is a network of 166 MHz Pentiums connected by a switched 100 Mbps Ethernet network.
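The tradeoff described in this abstract, fewer messages versus more unneeded data as the consistency unit grows, can be made concrete with a toy cost model. The sketch below is an assumption-laden illustration, not the paper's measurement method: it considers a single reader, counts one message per consistency unit touched, and treats every byte in a fetched unit that the reader did not access as extra data. The function `fetch_cost` and the access pattern are hypothetical.

```python
def fetch_cost(addresses, unit_size):
    """Return (messages, extra_bytes) for fetching the given byte addresses.

    Toy model: one message per distinct consistency unit touched, and each
    message carries a whole unit whether or not all of it is needed.
    """
    needed = set(addresses)
    units = {addr // unit_size for addr in needed}
    messages = len(units)
    bytes_moved = messages * unit_size
    extra_bytes = bytes_moved - len(needed)
    return messages, extra_bytes


# Hypothetical access pattern: two 1 KB regions that lie 8 KB apart.
accessed = list(range(0, 1024)) + list(range(8192, 9216))

small = fetch_cost(accessed, 1024)    # unit matches the access granularity
page = fetch_cost(accessed, 4096)     # unit is a typical 4 KB page
large = fetch_cost(accessed, 16384)   # unit is four pages aggregated
```

In this example the 1 KB unit needs two messages with no extra data, the 4 KB unit needs two messages with 6144 extra bytes, and the 16 KB unit needs a single message but moves 14336 extra bytes.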
Performance evaluation of the Orca shared-object system
ACM TRANSACTIONS ON COMPUTER SYSTEMS, 1998
Cited by 32 (4 self)
Orca is a portable, object-based distributed shared memory (DSM) system. This article studies and evaluates the design choices made in the Orca system and compares Orca with other DSMs. The article gives a quantitative analysis of Orca’s coherence protocol (based on write-updates with function shipping), the totally ordered group communication protocol, the strategy for object placement, and the all-software, user-space architecture. Performance measurements for 10 parallel applications illustrate the trade-offs made in the design of Orca and show that essentially the right design decisions have been made. A write-update protocol with function shipping is effective for Orca, especially since it is used in combination with techniques that avoid replicating objects that have a low read/write ratio. The overhead of totally ordered group communication on application performance is low. The Orca system is able to make near-optimal decisions for object placement and replication. In addition, the article compares the performance of Orca with that of a page-based DSM (TreadMarks) and another object-based DSM (CRL). It also analyzes the communication overhead of the DSMs for several applications. All performance measurements are done on a 32-node Pentium Pro cluster with Myrinet and Fast Ethernet networks. The results show that the Orca programs
OpenMP on Networks of Workstations
1998
Cited by 31 (6 self)
We describe an implementation of a sizable subset of OpenMP on networks of workstations (NOWs). By extending the availability of OpenMP to NOWs, we overcome one of its primary drawbacks compared to MPI, namely lack of portability to environments other than hardware shared memory machines. In order to support OpenMP execution on NOWs, our compiler targets a software distributed shared memory system (DSM) which provides multi-threaded execution and memory consistency. This paper presents two contributions. First, we identify two aspects of the current OpenMP standard that make an implementation on NOWs hard, and suggest simple modifications to the standard that remedy the situation. These problems reflect differences in memory architecture between software and hardware shared memory and the high cost of synchronization on NOWs. Second, we present performance results of a prototype implementation of an OpenMP subset on a NOW, and compare them with hand-coded software DSM and MP...
Adaptive Protocols for Software Distributed Shared Memory
1999
Cited by 28 (1 self)
We demonstrate the benefits of software shared memory protocols that adapt at run-time to the memory access patterns observed in the applications. This adaptation is automatic --- no user annotations are required --- and does not rely on compiler support or special hardware. We investigate adaptation between single- and multiple-writer protocols, dynamic aggregation of pages into a larger transfer unit, and adaptation between invalidate and update. Our results indicate that adaptation between single- and multiple-writer and dynamic page aggregation are clearly beneficial. The results for the adaptation between invalidate and update are less compelling, showing at best gains similar to the dynamic aggregation adaptation and at worst serious performance deterioration.
I. Introduction
Many different protocols have been proposed for implementing a software shared memory abstraction on distributed memory hardware. The relative performance of these protocols is application-dependent: the m...
Home Migration in Home-Based Software DSMs
IN FIRST WORKSHOP ON SOFTWARE DISTRIBUTED SHARED MEMORY, 1999
Cited by 13 (5 self)
Home-based software DSMs provide a simple, effective, and scalable way to build software DSMs. However, the performance of home-based software DSMs is sensitive to the distribution of home pages. This paper introduces our work on migrating home pages adaptively according to the application sharing pattern in a home-based software DSM system called JIAJIA. In the scheme, pages that are written by only one processor between two barriers are migrated to that single writing processor. Migration messages are piggybacked on barrier messages, so no additional communication is required for the migration. Though the scheme is very simple, performance evaluation with the SPLASH program suite and the NAS Parallel Benchmarks shows that home migration can reduce diffs dramatically, and that the performance gains obtained by home migration range from several percent to hundreds of percent compared to statically distributing the homes of shared data page-by-page across processors.
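The migration rule in this abstract (a page written by exactly one processor between two barriers moves its home to that processor) can be sketched as a decision made at each barrier. This is a minimal sketch under stated assumptions, not JIAJIA's code: the function `plan_home_migrations` and the dictionaries `homes` and `writers_by_page` are hypothetical, and the piggybacking of the decision on barrier messages is not modeled.

```python
def plan_home_migrations(homes, writers_by_page):
    """Decide home migrations at a barrier.

    homes: {page: processor currently acting as the page's home}
    writers_by_page: {page: set of processors that wrote the page since
                      the previous barrier}
    Returns {page: new_home} for pages with exactly one writer that is
    not already the home.
    """
    migrations = {}
    for page, writers in writers_by_page.items():
        if len(writers) == 1:
            (writer,) = writers
            if homes.get(page) != writer:
                migrations[page] = writer
    return migrations


# Hypothetical state at a barrier: three pages, four processors (0..3).
homes = {0: 0, 1: 1, 2: 0}
writers_by_page = {
    0: {2},      # single writer, not the home: migrate to processor 2
    1: {1},      # single writer that is already the home: nothing to do
    2: {0, 3},   # two writers: the home stays where it is
}

migrations = plan_home_migrations(homes, writers_by_page)
homes.update(migrations)
```

After the barrier only page 0 changes its home, moving to processor 2; in a home-based protocol that writer would then no longer need to send diffs for the page to a remote home.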
A Comparison of MPI, SHMEM and Cache-coherent Shared Address Space Programming Models on the SGI Origin2000
INTERNATIONAL JOURNAL OF PARALLEL PROGRAMMING, 1999
Cited by 11 (3 self)
We compare the performance of three major programming models -- a load-store cache-coherent shared address space (CC-SAS), message passing (MP) and the segmented SHMEM model -- on a modern, 64-processor hardware cache-coherent machine, one of the two major types of platforms upon which high-performance computing is converging. We focus on applications that are either regular and predictable or at least do not require fine-grained dynamic replication of irregularly accessed data. Within this class, we use programs with a range of important communication patterns. We examine whether the basic parallel algorithm and communication structuring approaches needed for best performance are similar or different among the models, whether some models have substantial performance advantages over others as problem size and number of processors change, what the sources of these performance differences are, where the programs spend their time, and whether substantial improvements can be obtained by mo...
Experiences using OpenMP based on Compiler Directed Software DSM On a PC Cluster
2002
Cited by 5 (0 self)
In this work we report on our experiences running OpenMP programs on a commodity cluster of PCs running a software distributed shared memory (DSM) system. We compare the performance of message passing implementations of a subset of the NAS Parallel Benchmarks with their OpenMP counterparts and quantify the difference in performance in terms of remote and local memory access and synchronization time.

