Results 1 - 10
of
36
Evaluation of Release Consistent Software Distributed Shared Memory on Emerging Network Technology
"... We evaluate the effect of processor speed, network characteristics, and software overhead on the performance of release-consistent software distributed shared memory. We examine five different protocols for implementing release consistency: eager update, eager invalidate, lazy update, lazy invalidat ..."
Abstract
-
Cited by 413 (43 self)
- Add to MetaCart
We evaluate the effect of processor speed, network characteristics, and software overhead on the performance of release-consistent software distributed shared memory. We examine five different protocols for implementing release consistency: eager update, eager invalidate, lazy update, lazy invalidate, and a new protocol called lazy hybrid. This lazy hybrid protocol combines the benefits of both lazy update and lazy invalidate. Our simulations indicate that with the processors and networks that are becoming available, coarse-grained applications such as Jacobi and TSP perform well, more or less independent of the protocol used. Medium-grained applications, such as Water, can achieve good performance, but the choice of protocol is critical. For sixteen processors, the best protocol, lazy hybrid, performed more than three times better than the worst, the eager update. Fine-grained applications such as Cholesky achieve little speedup regardless of the protocol used because of the frequency of synchronization operations and the high latency involved. While the use of relaxed memory models, lazy implementations, and multiple-writer protocols has reduced the impact of false sharing, synchronization latency remains a serious problem for software distributed shared memory systems. These results suggest that future work on software DSMs should concentrate on reducing the amount ofsynchronization or its effect.
RSIM: An Execution-Driven Simulator for ILP-Based Shared-Memory Multiprocessors and Uniprocessors
, 1997
"... This paper describes RSIM -- the Rice Simulator for ILP Multiprocessors -- Version 1.0. RSIM simulates shared-memory multiprocessors (and uniprocessors) built from processors that aggressively exploit instruction-level parallelism (ILP). RSIM is execution-driven and models state-of-the-art ILP proc ..."
Abstract
-
Cited by 69 (4 self)
- Add to MetaCart
This paper describes RSIM -- the Rice Simulator for ILP Multiprocessors -- Version 1.0. RSIM simulates shared-memory multiprocessors (and uniprocessors) built from processors that aggressively exploit instruction-level parallelism (ILP). RSIM is execution-driven and models state-of-the-art ILP processors, an aggressive memory system, and a multiprocessor coherence protocol and interconnect, including contention at all resources. Although originally designed as a research tool, RSIM is also being used successfully in both undergraduate and graduate computer architecture courses at Rice University. RSIM version 1.0 is publicly available. 1 Introduction This paper describes RSIM -- the Rice Simulator for ILP Multiprocessors -- Version 1.0. RSIM is primarily designed to study shared-memory multiprocessor architectures built from processors that aggressively exploit instruction-level parallelism (ILP). It models state-of-the-art ILP processors, an aggressive memory system, and a multipr...
Software Versus Hardware Shared-Memory Implementation: A Case Study
- In Proceedings of the 21st Annual International Symposium on Computer Architecture
, 1994
"... We compare the performance of software-supported shared memory on a general-purpose network to hardware-supported shared memory on a dedicated interconnect. Up to eight processors, our results are based on the execution of a set of application programs on a SGI 4D/480 multiprocessor and on TreadMark ..."
Abstract
-
Cited by 66 (1 self)
- Add to MetaCart
We compare the performance of software-supported shared memory on a general-purpose network to hardware-supported shared memory on a dedicated interconnect. Up to eight processors, our results are based on the execution of a set of application programs on a SGI 4D/480 multiprocessor and on TreadMarks, a distributed shared memory system that runs on a Fore ATM LAN of DECstation-5000/240s. Since the DECstation and the 4D/480 use the same processor, primary cache, and compiler, the shared-memory implementation is the principal di erence between the systems. Our results show that TreadMarks performs comparably to the 4D/480 for applications with moderate amounts of synchronization, but the di erence in performance grows as the synchronization frequency increases. For applications that require a large amount of memory bandwidth, TreadMarks can perform better than the SGI 4D/480. Beyond eight processors, our results are based on execution-driven simulation. Speci cally, we compare a software implementation on a general-purpose network of uniprocessor nodes, a hardware implementation using a directory-based protocol on a dedicated interconnect, and a combined implementation using software to provide shared memory between multiprocessor nodes with hardware implementing shared memory within a node. For the modest size of the problems that we can simulate, the hardware implementation scales well and the software implementation scales poorly. The combined approach delivers performance close to that of the hardware implementation for applications with small to moderate synchronization rates and good locality. Reductions in communi-
Diffracting trees
- In Proceedings of the 5th Annual ACM Symposium on Parallel Algorithms and Architectures. ACM
, 1994
"... Shared counters are among the most basic coordination structures in multiprocessor computation, with applications ranging from barrier synchronization to concurrent-data-structure design. This article introduces diffracting trees, novel data structures for shared counting and load balancing in a dis ..."
Abstract
-
Cited by 52 (10 self)
- Add to MetaCart
Shared counters are among the most basic coordination structures in multiprocessor computation, with applications ranging from barrier synchronization to concurrent-data-structure design. This article introduces diffracting trees, novel data structures for shared counting and load balancing in a distributed/parallel environment. Empirical evidence, collected on a simulated distributed shared-memory machine and several simulated message-passing architectures, shows that diffracting trees scale better and are more robust than both combining trees and counting networks, currently the most effective known methods for implementing concurrent counters in software. The use of a randomized coordination method together with a combinatorial data structure overcomes the resiliency drawbacks of combining trees. Our simulations show that to handle the same load, diffracting trees and counting networks should have a similar width w, yet the depth of a diffracting tree is O(log w), whereas counting networks have depth O(log 2 w). Diffracting trees have already been used to implement highly efficient producer/consumer queues, and we believe diffraction will prove to be an effective alternative paradigm to combining and queue-locking in the design of many concurrent data structures.
An Evaluation of Memory Consistency Models for Shared-Memory Systems with ILP Processors
, 1997
"... The memory consistency model of a shared-memory multiprocessor determines the extent to which memory operations may be overlapped or reordered for better performance. Studies on previous-generation shared-memory multiprocessors have shown that relaxed memory consistency models like release consisten ..."
Abstract
-
Cited by 43 (13 self)
- Add to MetaCart
The memory consistency model of a shared-memory multiprocessor determines the extent to which memory operations may be overlapped or reordered for better performance. Studies on previous-generation shared-memory multiprocessors have shown that relaxed memory consistency models like release consistency (RC) can significantly outperform the conceptually simpler model of sequential consistency (SC). Current and next-generation multiprocessors use commodity microprocessors that aggressively exploit instruction-level parallelism (ILP) using methods such as multiple issue, dynamic scheduling, and non-blocking reads. For such processors, researchers have conjectured that two techniques, hardware-controlled non-binding prefetching and speculative reads, have the potential to equalize the hardware performance of memory consistency models. These techniques have recently begun to appear in commercial microprocessors, and re-open the question of whether the performance benefits of release consiste...
Parallelized direct execution simulation of message-passing parallel programs
- IEEE Transactions on Parallel and Distributed Systems
, 1996
"... As massively parallel computers proliferate, there is growing interest in �nding ways by which performance of massively parallel codes can be e�ciently predicted. This problem arises in diverse contexts such as parallelizing compilers, parallel performance monitoring, and parallel algorithm developm ..."
Abstract
-
Cited by 42 (10 self)
- Add to MetaCart
As massively parallel computers proliferate, there is growing interest in �nding ways by which performance of massively parallel codes can be e�ciently predicted. This problem arises in diverse contexts such as parallelizing compilers, parallel performance monitoring, and parallel algorithm development. In this paper we describe one solution where one directly executes the application code, but uses a discrete-event simulator to model details of the presumed parallel machine, such as operating system and communication network behavior. Because this approach is computationally expensive, we are interested in its own parallelization, speci�cally the parallelization of the discrete-event simulator. We describe methods suitable for parallelized direct execution simulation of message-passing parallel programs, and report on the performance of such a system, LAPSE �Large Application Parallel Simulation Environment�, wehave built on the Intel Paragon. On all codes measured to date, LAPSE predicts performance well, typically within 10 � relative error. Depending on the nature of the application code, we have observed low slowdowns �relative to natively executing code � and high relative speedups using up to 64 processors.
A Distributed Memory LAPSE: Parallel Simulation of Message-Passing Programs
- In Workshop on Parallel and Distributed Simulation
, 1993
"... As massively parallel computers become increasingly available, users' interest in the scalability of their parallel codes is growing. However, such computers are in high demand, and access to them is restricted. In this paper we develop techniques that allow one to use a small number of parallel ..."
Abstract
-
Cited by 31 (6 self)
- Add to MetaCart
As massively parallel computers become increasingly available, users' interest in the scalability of their parallel codes is growing. However, such computers are in high demand, and access to them is restricted. In this paper we develop techniques that allow one to use a small number of parallel processors to simulate the behavior of a message-passing code running on a large number of processors. Such methods will allow a user to performance tune a code for massive parallelism before actually using large numbers of processors. Other potential applications include parallel simulation of large distributed systems driven by directly executing application code, and high-performance simulations of network designs driven by directly executing application code. We distribute simulation processes and application processes among multitasking processors. Calls by application processes to message-passing routines are trapped by simulator processes, which are responsible for scheduling a...
MPI-SIM: Using Parallel Simulation To Evaluate MPI Programs
, 1998
"... This paper describes the design and implementation of MPI-SIM, a library for the execution driven parallel simulation of MPI programs. MPI-LITE, a portable library that supports multithreaded MPI is also described. MPI-SIM, which is built on top of MPI-LITE, can be used to predict the performance of ..."
Abstract
-
Cited by 29 (11 self)
- Add to MetaCart
This paper describes the design and implementation of MPI-SIM, a library for the execution driven parallel simulation of MPI programs. MPI-LITE, a portable library that supports multithreaded MPI is also described. MPI-SIM, which is built on top of MPI-LITE, can be used to predict the performance of existing MPI programs as a function of architectural characteristics including number of processors and message communication latencies. The simulation models can be executed sequentially or in parallel. Parallel executions of MPI-SIM models are synchronized using a set of asynchronous conservative protocols. MPI-SIM reduces synchronization overheads by exploiting the communication characteristics of the program that it simulates. The paper presents validation and performance results from the use of MPI-SIM to simulate applications from the NAS Parallel Benchmark suite. Using the techniques described in this paper, we were able to reduce the number of synchronizations in the parallel simula...
Execution-Driven Simulation of Multiprocessors: Address and Timing Analysis
- ACM Transactions on Modeling and Computer Simulation
, 1994
"... This paper describes and evaluates an efficient execution-driven technique for the simulation of multiprocessors that includes the simulation of system memory and is driven by real program workloads. The technique produces correctly interleaved address traces at run time without disk access overhead ..."
Abstract
-
Cited by 22 (0 self)
- Add to MetaCart
This paper describes and evaluates an efficient execution-driven technique for the simulation of multiprocessors that includes the simulation of system memory and is driven by real program workloads. The technique produces correctly interleaved address traces at run time without disk access overhead or hardware support, allowing accurate simulation of the effects of a variety of architectural alternatives on programs. We have implemented a simulator based on this technique that offers substantial advantages in terms of reduced time and space overheads when compared to instruction-driven or trace-driven simulation techniques, without significant loss of accuracy. The paper presents the results of several validation experiments used to quantify the accuracy and efficiency of the simulator for sequential, distributed, and shared-memory multiprocessors, and several parallel programs. These experiments show that prediction errors of less than 5% compared to actual execution times and overhe...

