Results 1 - 10 of 46

Performance Evaluation of the Orca Shared Object System
- ACM Transactions on Computer Systems, 1998
"... Orca is a portable, object-based distributed shared memory system. This paper studies and evaluates the design choices made in the Orca system and compares Orca with other DSMs. The paper gives a quantitative analysis of Orca's coherence protocol (based on write-updates with function shipping), ..."
Abstract
-
Cited by 61 (42 self)
- Add to MetaCart
(Show Context)
Abstract: Orca is a portable, object-based distributed shared memory system. This paper studies and evaluates the design choices made in the Orca system and compares Orca with other DSMs. The paper gives a quantitative analysis of Orca's coherence protocol (based on write-updates with function shipping), the totally-ordered group communication protocol, the strategy for object placement, and the all-software, user-space architecture. Performance measurements for ten parallel applications illustrate the tradeoffs made in the design of Orca, and also show that essentially the right design decisions have been made. A write-update protocol with function shipping is effective for Orca, especially since it is used in combination with techniques that avoid replicating objects that have a low read/write ratio. The overhead of totally-ordered group communication on application performance is low. The Orca system is able to make near-optimal decisions for object placement and replication. In addition, the...

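Function shipping is the part of this design that a short example makes concrete. Below is a minimal sketch in C, using invented names such as broadcast_op and apply_op rather than Orca's actual API: instead of writing one copy and shipping the modified data, the writer broadcasts the operation itself (an opcode plus arguments), and every node applies it to its local replica. The loop in broadcast_op stands in for the totally-ordered group communication that keeps replicas consistent.

    /* Write-update with function shipping: ship the operation, not the data.
     * Simulated on one machine; broadcast_op stands in for totally-ordered
     * group communication delivering to every node. */
    #include <stdio.h>

    #define NODES 4

    typedef struct { int value; } shared_obj;
    typedef struct { int opcode; int arg; } op_msg;
    enum { OP_SET, OP_ADD };

    static shared_obj replica[NODES];          /* one replica per node */

    /* Each node applies the shipped operation to its own replica. */
    static void apply_op(shared_obj *obj, const op_msg *m) {
        if (m->opcode == OP_SET) obj->value = m->arg;
        else                     obj->value += m->arg;
    }

    /* Deliver the same operation, in the same order, to every replica. */
    static void broadcast_op(const op_msg *m) {
        for (int n = 0; n < NODES; n++)
            apply_op(&replica[n], m);
    }

    int main(void) {
        op_msg set = { OP_SET, 10 }, add = { OP_ADD, 5 };
        broadcast_op(&set);
        broadcast_op(&add);
        for (int n = 0; n < NODES; n++)        /* all replicas agree: 15 */
            printf("node %d: %d\n", n, replica[n].value);
        return 0;
    }

Because every replica sees the same operations in the same order, no replica ever holds stale data; what decides whether replicating an object pays off is then the read/write ratio, which is exactly the criterion the abstract mentions.
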
OpenMP on Networks of Workstations
- 1998
"... We describe an implementation of a sizable subset of OpenMP on networks of workstations (NOWs). By extending the availability of OpenMP to NOWs, we overcome one of its primary drawbacks compared to MPI, namely lack of portability to environments other than hardware shared memory machines. In orde ..."
Abstract
-
Cited by 43 (6 self)
- Add to MetaCart
Abstract: We describe an implementation of a sizable subset of OpenMP on networks of workstations (NOWs). By extending the availability of OpenMP to NOWs, we overcome one of its primary drawbacks compared to MPI, namely lack of portability to environments other than hardware shared memory machines. In order to support OpenMP execution on NOWs, our compiler targets a software distributed shared memory system (DSM) which provides multi-threaded execution and memory consistency. This paper presents two contributions. First, we identify two aspects of the current OpenMP standard that make an implementation on NOWs hard, and suggest simple modifications to the standard that remedy the situation. These problems reflect differences in memory architecture between software and hardware shared memory and the high cost of synchronization on NOWs. Second, we present performance results of a prototype implementation of an OpenMP subset on a NOW, and compare them with hand-coded software DSM and MP...

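For readers unfamiliar with the programming model, the sketch below shows the kind of loop such a compiler must handle; per the abstract's description, the shared arrays would land in DSM pages, the parallel region would become distributed threads, and the loop's implicit barrier would become a DSM synchronization point. The pragma and code are standard OpenMP, not the paper's particular subset definition.

    /* A standard OpenMP parallel loop; on a NOW, a[] and b[] would live in
     * software-DSM memory and the implicit barrier at the end of the loop
     * would trigger DSM consistency actions. */
    #include <stdio.h>
    #define N 1000000

    int main(void) {
        static double a[N], b[N];
        for (int i = 0; i < N; i++) b[i] = i;

        #pragma omp parallel for shared(a, b)
        for (int i = 0; i < N; i++)
            a[i] = 2.0 * b[i];     /* each worker writes a disjoint slice */

        printf("%f\n", a[N - 1]);
        return 0;
    }
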
Performance evaluation of the Orca shared-object system
- ACM Transactions on Computer Systems, 1998
"... Orca is a portable, object-based distributed shared memory (DSM) system. This article studies and evaluates the design choices made in the Orca system and compares Orca with other DSMs. The article gives a quantitative analysis of Orca’s coherence protocol (based on write-updates with function shipp ..."
Abstract
-
Cited by 38 (6 self)
- Add to MetaCart
Abstract: Orca is a portable, object-based distributed shared memory (DSM) system. This article studies and evaluates the design choices made in the Orca system and compares Orca with other DSMs. The article gives a quantitative analysis of Orca's coherence protocol (based on write-updates with function shipping), the totally ordered group communication protocol, the strategy for object placement, and the all-software, user-space architecture. Performance measurements for 10 parallel applications illustrate the trade-offs made in the design of Orca and show that essentially the right design decisions have been made. A write-update protocol with function shipping is effective for Orca, especially since it is used in combination with techniques that avoid replicating objects that have a low read/write ratio. The overhead of totally ordered group communication on application performance is low. The Orca system is able to make near-optimal decisions for object placement and replication. In addition, the article compares the performance of Orca with that of a page-based DSM (TreadMarks) and another object-based DSM (CRL). It also analyzes the communication overhead of the DSMs for several applications. All performance measurements are done on a 32-node Pentium Pro cluster with Myrinet and Fast Ethernet networks. The results show that the Orca programs

Eliminating Conflict Misses for High Performance Architectures
- In Proceedings of the 1998 ACM International Conference on Supercomputing
"... Many cache misses in scientific programs are due to conflicts caused by limited set associativity. Two data-layout transformations, inter- and intra-variable padding, can eliminate many conflict misses at compile time. We present GroupPad, an inter-variable padding heuristic to preserve group reuse ..."
Abstract
-
Cited by 36 (5 self)
- Add to MetaCart
(Show Context)
Abstract: Many cache misses in scientific programs are due to conflicts caused by limited set associativity. Two data-layout transformations, inter- and intra-variable padding, can eliminate many conflict misses at compile time. We present GroupPad, an inter-variable padding heuristic to preserve group reuse in stencil computations frequently found in scientific codes. We show padding can also improve performance in parallel programs. Our optimizations have been implemented and tested on a collection of kernels and programs for different cache and data sizes. Preliminary results demonstrate GroupPad is able to consistently preserve group reuse among the programs evaluated, though execution time improvements are small for the actual problem and cache sizes tested. Padding improves the performance of parallel versions of programs by approximately the same magnitude as for sequential versions of the same program.

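As a concrete illustration of inter-variable padding (the pad size below is invented for the example; GroupPad computes its own), consider a one-dimensional stencil over two power-of-two-sized arrays that map to the same cache sets:

    /* Without pad[], a[] and b[] typically start a multiple of the cache
     * size apart, so a[i] and b[i] conflict in a direct-mapped cache and
     * the stencil's group reuse of b[i-1], b[i], b[i+1] is destroyed by
     * evictions. Inserting one cache line of padding shifts b's base
     * address and removes the conflicts without changing any loop.
     * (Actual placement is up to the compiler and linker.) */
    #define N (1 << 16)     /* 64K doubles: power-of-two, conflict-prone */
    #define PAD 8           /* one 64-byte cache line of doubles */

    double a[N];            /* 512 KB: a multiple of typical cache sizes */
    double pad[PAD];        /* inter-variable padding; never referenced */
    double b[N + 2];

    void stencil(void) {
        for (int i = 1; i <= N; i++)
            a[i - 1] = b[i - 1] + b[i] + b[i + 1];  /* group reuse on b */
    }

Intra-variable padding is the analogous trick applied inside a single array, e.g. declaring a 2D array with a few extra columns so that successive rows do not map to the same cache sets.
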
Evaluating the Performance of Software Distributed Shared Memory as a Target for Parallelizing Compilers
- 1997
"... In this paper, we evaluate the use of software distributed shared memory (DSM) on a message passing machine as the target for a parallelizing compiler. We compare this approach to compiler-generated message passing, hand-coded software DSM, and hand-coded message passing. For this comparison, we use ..."
Abstract
-
Cited by 36 (1 self)
- Add to MetaCart
Abstract: In this paper, we evaluate the use of software distributed shared memory (DSM) on a message passing machine as the target for a parallelizing compiler. We compare this approach to compiler-generated message passing, hand-coded software DSM, and hand-coded message passing. For this comparison, we use six applications: four that are regular and two that are irregular. Our results are gathered on an 8-node IBM SP/2 using the TreadMarks software DSM system. We use the APR shared-memory (SPF) compiler to generate the shared memory programs, and the APR XHPF compiler to generate the message passing programs. The hand-coded message passing programs run with the IBM PVMe optimized message passing library. On the regular programs, both the compiler-generated and the hand-coded message passing outperform the SPF/TreadMarks combination: the compiler-generated message passing by 5.5% to 40%, and the hand-coded message passing by 7.5% to 49%. On the irregular programs, the SPF/TreadMarks combination outp...

Improving Compiler and Run-Time Support for Adaptive Irregular Codes
- In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 1998
"... Irregular reductions form the core of adaptive irregular codes. On distributed-memory multiprocessors, they are parallelized either using sophisticated run-time systems (e.g., CHAOS, PILAR) or the shared-memory interface supported by software DSMs (e.g., CVM, TreadMarks). We introduce LOCALWRITE, a ..."
Abstract
-
Cited by 30 (8 self)
- Add to MetaCart
(Show Context)
Abstract: Irregular reductions form the core of adaptive irregular codes. On distributed-memory multiprocessors, they are parallelized either using sophisticated run-time systems (e.g., CHAOS, PILAR) or the shared-memory interface supported by software DSMs (e.g., CVM, TreadMarks). We introduce LOCALWRITE, a new technique based on the owner-computes rule which eliminates the need for buffers or synchronized writes but may replicate computation. We evaluate its performance for irregular codes while varying connectivity, locality, and adaptivity. LOCALWRITE improves performance by 50-150% compared to using replicated buffers, and can match or exceed gather/scatter for applications with low locality or high adaptivity.

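A minimal sketch of the owner-computes idea behind LOCALWRITE, with invented names (owner, localwrite_step) rather than the paper's compiler and run-time interfaces: every processor inspects all edges but applies only the updates whose left-hand side it owns, so writes need no buffering or synchronization, at the price of recomputing an edge's interaction on both owners when its endpoints belong to different processors.

    #include <stdio.h>

    #define NODES 8
    #define EDGES 6
    #define PROCS 2

    static int edge[EDGES][2] = {{0,1},{1,2},{2,5},{3,4},{5,6},{6,7}};
    static double x[NODES] = {1,2,3,4,5,6,7,8};
    static double force[NODES];                /* the irregular reduction */

    static int owner(int node) { return node * PROCS / NODES; }

    /* One processor's share of the reduction: writes only owned entries.
     * Edge {2,5} spans both owners, so its interaction is computed twice
     * (replicated computation), once by each processor. */
    static void localwrite_step(int me) {
        for (int e = 0; e < EDGES; e++) {
            int u = edge[e][0], v = edge[e][1];
            double f = x[u] - x[v];             /* pairwise interaction */
            if (owner(u) == me) force[u] += f;  /* unsynchronized: owned */
            if (owner(v) == me) force[v] -= f;
        }
    }

    int main(void) {
        for (int p = 0; p < PROCS; p++)   /* stand-in for parallel procs */
            localwrite_step(p);
        for (int n = 0; n < NODES; n++)
            printf("force[%d] = %g\n", n, force[n]);
        return 0;
    }
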
On the Design and Implementation of DSM-Threads
- In Proc. 1997 International Conference on Parallel and Distributed Processing Techniques and Applications
"... This paper discusses design goals, design decisions, and implementation choices of DSM-Threads, a runtime system to support distributed threads with a distributed shared virtual memory (DSM). DSM-Threads provides a distributed runtime system with a kernel on each node, which relies on POSIX threads ..."
Abstract
-
Cited by 22 (4 self)
- Add to MetaCart
(Show Context)
Abstract: This paper discusses design goals, design decisions, and implementation choices of DSM-Threads, a runtime system to support distributed threads with a distributed shared virtual memory (DSM). DSM-Threads provides a distributed runtime system with a kernel on each node, which relies on POSIX threads locally and a decentralized communication subsystem between nodes. Support for multiple data consistency protocols facilitates the migration from shared-memory POSIX threads to DSM-Threads in a distributed environment on the one hand, and offers opportunities to fine-tune the program for DSM-Threads on the other. The overall approach enhances the portability of the system and allows support for heterogeneous environments without modifications to compilers or operating systems. The paper also describes the support for higher-order distributed language features, using Ada95 as an example. Finally, a first evaluation of the system's performance is given. DSM-Threads is, to our knowledge, the first runtime ...

Using Multicast and Multithreading to Reduce Communication in Software DSM Systems
- 1998
"... This paper examines the performance benefits of employing multicast communication and application-level multithreading in the Brazos software distributed shared memory (DSM) system. Application-level multithreading in Brazos allows programs to transparently take advantage of available local multipro ..."
Abstract
-
Cited by 22 (2 self)
- Add to MetaCart
Abstract: This paper examines the performance benefits of employing multicast communication and application-level multithreading in the Brazos software distributed shared memory (DSM) system. Application-level multithreading in Brazos allows programs to transparently take advantage of available local multiprocessing. Brazos uses multicast communication to reduce the number of consistency-related messages, and employs two adaptive mechanisms that reduce the detrimental side effects of using multicast communication. We compare three software DSM systems running on identical hardware: (1) a single-threaded point-to-point system, (2) a multithreaded point-to-point system, and (3) Brazos, which incorporates both multithreading and multicast communication. For the six applications studied, multicast and multithreading improve speedup on eight processors by an average of 38%.

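The traffic argument is easy to see in code. The sketch below uses plain IPv4 UDP multicast from the POSIX sockets API; the group address, port, and message format are made up for the example and are not Brazos's protocol. The point is that one sendto() reaches every sharer that joined the group, where a point-to-point system would send one message per sharer.

    #include <arpa/inet.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void) {
        int s = socket(AF_INET, SOCK_DGRAM, 0);

        struct sockaddr_in group;
        memset(&group, 0, sizeof group);
        group.sin_family = AF_INET;
        group.sin_addr.s_addr = inet_addr("239.1.2.3"); /* made-up group */
        group.sin_port = htons(9999);                   /* made-up port  */

        /* One consistency update (e.g., a page diff) reaches all sharers
         * that joined the multicast group, instead of one unicast message
         * per sharer. */
        const char update[] = "page 42: diff bytes";
        sendto(s, update, sizeof update, 0,
               (struct sockaddr *)&group, sizeof group);

        close(s);
        return 0;
    }
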
Update Protocols and Iterative Scientific Applications
- In Proceedings of the International Parallel Processing Symposium, 1998
"... Software DSMs have been a research topic for over a decade. While good performance has been achieved in some cases, consistent performance has continued to elude researchers. This paper investigates the performance of DSM protocols running highly regular scientific applications. Such applications sh ..."
Abstract
-
Cited by 22 (2 self)
- Add to MetaCart
(Show Context)
Abstract: Software DSMs have been a research topic for over a decade. While good performance has been achieved in some cases, consistent performance has continued to elude researchers. This paper investigates the performance of DSM protocols running highly regular scientific applications. Such applications should be ideal targets for DSM research because past behavior gives complete, or nearly complete, information about future behavior. We show that a modified home-based protocol can significantly outperform more general protocols in this application domain because of reduced protocol complexity. Nonetheless, such protocols still do not perform as well as expected. We show that one of the major factors limiting performance is interaction with the operating system on page faults and page protection changes. We further optimize our protocol by completely eliminating such memory manipulation calls from the steady-state execution. Our resulting protocol improves average application performance by a further 34%, on top of the 19% improvement gained by our initial modification of the home-based protocol.

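The operating-system cost the abstract refers to is the page-protection machinery page-based DSMs are built on. The generic POSIX sketch below (not the paper's protocol) shows the two kernel crossings paid per detected access: the SIGSEGV taken on a protected page and the mprotect() that re-enables access. Eliminating these calls from steady-state execution is exactly the optimization the paper describes.

    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    static char  *page;
    static size_t pagesize;

    static void on_fault(int sig, siginfo_t *si, void *ctx) {
        (void)sig; (void)si; (void)ctx;
        /* A real DSM would fetch the page or apply diffs here. */
        mprotect(page, pagesize, PROT_READ | PROT_WRITE); /* crossing #2 */
    }

    int main(void) {
        pagesize = (size_t)sysconf(_SC_PAGESIZE);
        page = mmap(NULL, pagesize, PROT_NONE,            /* inaccessible */
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_sigaction = on_fault;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);

        page[0] = 1;      /* faults into on_fault: kernel crossing #1 */
        printf("page[0] = %d\n", page[0]);
        return 0;
    }
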
Strings: A High-Performance Distributed Shared Memory for Symmetrical Multiprocessor Clusters
- In Proceedings of the Seventh IEEE International Symposium on High Performance Distributed Computing, 1998
"... This paper describes Strings, a multi-threaded DSM developed by us. The distinguishing feature of Strings is that it incorporates Posix1.c threads multiplexed on kernel light-weight processes for better performance. The kernel can schedule multiple threads across multiple processors using these ligh ..."
Abstract
-
Cited by 21 (11 self)
- Add to MetaCart
(Show Context)
Abstract: This paper describes Strings, a multi-threaded DSM developed by us. The distinguishing feature of Strings is that it incorporates POSIX.1c threads multiplexed on kernel light-weight processes for better performance. The kernel can schedule multiple threads across multiple processors using these light-weight processes. Thus, Strings is designed to exploit data parallelism at the application level and task parallelism at the DSM system level. We show how using multiple kernel threads can improve performance even in the presence of false sharing, using matrix multiplication as a case study. We also show performance results with benchmark programs from the SPLASH-2 suite [17]. Though similar work has been demonstrated with SoftFLASH [18], our implementation is completely in user space and thus more portable. Some other research has studied the effect of clustering in SMPs using simulations [19]. We have shown results from runs on an actual network of SMPs.

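The "multiplexed on kernel light-weight processes" point corresponds to POSIX contention scope. The generic sketch below (not Strings' actual runtime) creates threads with PTHREAD_SCOPE_SYSTEM, so each thread is backed by a kernel-schedulable entity and the kernel can run the threads on different processors of an SMP node, which is how the DSM system's task parallelism can overlap with the application's data parallelism.

    #include <pthread.h>
    #include <stdio.h>

    static void *worker(void *arg) {
        printf("worker %ld on a kernel-scheduled thread\n", (long)arg);
        return NULL;
    }

    int main(void) {
        pthread_attr_t attr;
        pthread_attr_init(&attr);
        /* System scope: one kernel light-weight process per thread,
         * schedulable across the SMP's processors in parallel. */
        pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);

        pthread_t t[4];
        for (long i = 0; i < 4; i++)
            pthread_create(&t[i], &attr, worker, (void *)i);
        for (int i = 0; i < 4; i++)
            pthread_join(t[i], NULL);

        pthread_attr_destroy(&attr);
        return 0;
    }
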