Munin: Distributed Shared Memory Based on Type-Specific Memory Coherence
, 1990
Cited by 261 (15 self)
We are developing Munin, a system that allows programs written for shared memory multiprocessors to be executed efficiently on distributed memory machines. Munin attempts to overcome the architectural limitations of shared memory machines, while maintaining their advantages in terms of ease of programming. Our system is unique in its use of loosely coherent memory, based on the partial order specified by a shared memory parallel program, and in its use of type-specific memory coherence. Instead of a single memory coherence mechanism for all shared data objects, Munin employs several different mechanisms, each appropriate for a different class of shared data object. These type-specific mechanisms are part of a runtime system that accepts hints from the user or the compiler to determine the coherence mechanism to be used for each object. This paper focuses on the design and use of Munin's memory coherence mechanisms, and compares our approach to previous work in this area.
Synthesis: An Efficient Implementation of Fundamental Operating System Services
, 1992
Cited by 79 (1 self)
This dissertation shows that operating systems can provide fundamental services an order of magnitude more efficiently than traditional implementations. It describes the implementation of a new operating system kernel, Synthesis, that achieves this level of performance. The Synthesis kernel combines several new techniques to provide high performance without sacrificing the expressive power or security of the system. The new ideas include:
- Run-time code synthesis: a systematic way of creating executable machine code at runtime to optimize frequently-used kernel routines (queues, buffers, context switchers, interrupt handlers, and system call dispatchers) for specific situations, greatly reducing their execution time.
- Fine-grain scheduling: a new process-scheduling technique based on the idea of feedback that performs frequent scheduling actions and policy adjustments (at submillisecond intervals) resulting in an adaptive, self-tuning system that can support real-ti...
Software Versus Hardware Shared-Memory Implementation: A Case Study
- In Proceedings of the 21st Annual International Symposium on Computer Architecture
, 1994
Cited by 66 (1 self)
We compare the performance of software-supported shared memory on a general-purpose network to hardware-supported shared memory on a dedicated interconnect. Up to eight processors, our results are based on the execution of a set of application programs on a SGI 4D/480 multiprocessor and on TreadMarks, a distributed shared memory system that runs on a Fore ATM LAN of DECstation-5000/240s. Since the DECstation and the 4D/480 use the same processor, primary cache, and compiler, the shared-memory implementation is the principal difference between the systems. Our results show that TreadMarks performs comparably to the 4D/480 for applications with moderate amounts of synchronization, but the difference in performance grows as the synchronization frequency increases. For applications that require a large amount of memory bandwidth, TreadMarks can perform better than the SGI 4D/480. Beyond eight processors, our results are based on execution-driven simulation. Specifically, we compare a software implementation on a general-purpose network of uniprocessor nodes, a hardware implementation using a directory-based protocol on a dedicated interconnect, and a combined implementation using software to provide shared memory between multiprocessor nodes with hardware implementing shared memory within a node. For the modest size of the problems that we can simulate, the hardware implementation scales well and the software implementation scales poorly. The combined approach delivers performance close to that of the hardware implementation for applications with small to moderate synchronization rates and good locality. Reductions in communi-
Comparison of Hardware and Software Cache Coherence Schemes
, 1991
Cited by 36 (4 self)
We use mean value analysis models to compare representative hardware and software cache coherence schemes for a large-scale shared-memory system. Our goal is to identify the workloads for which either of the schemes is significantly better. Our methodology improves upon previous analytical studies and complements previous simulation studies by developing a common high-level workload model that is used to derive separate sets of low-level workload parameters for the two schemes. This approach allows an equitable comparison of the two schemes for a specific workload. Our results show that software schemes are comparable (in terms of processor efficiency) to hardware schemes for a wide class of programs. The only cases for which software schemes perform significantly worse than hardware schemes are when there is a greater than 15% reduction in hit rate due to inaccurate prediction of memory access conflicts, or when there are many writes in the program that are not executed at runtime. Fo...
Replication Techniques For Speeding Up Parallel Applications On Distributed Systems
, 1992
Cited by 30 (6 self)
This paper discusses the design choices involved in replicating objects and their effect on performance. Important issues are: how to maintain consistency among different copies of an object; how to implement changes to objects; and which strategy for object replication to use. We have implemented several options to determine which ones are most efficient.
Cache Coherence for Shared Memory Multiprocessors Based on Virtual Memory Support
, 1992
Cited by 24 (3 self)
This paper presents a software cache coherence scheme that uses virtual memory (VM) support to maintain cache coherency for shared memory multiprocessors and requires no special hardware to do so. Traditional VM translation hardware in each processor is used to detect memory access attempts that would violate cache coherence, and system software is used to enforce coherence. The implementation of this class of coherence schemes is extremely economical: it requires neither special multiprocessor hardware nor compiler support, and easily incorporates different consistency models. We evaluated two consistency models for the VM-based approach: sequential consistency and lazy release consistency. The VM-based schemes are compared with a bus-based snoopy caching architecture, and our trace-driven simulation results show that the VM-based cache coherence schemes are practical for small-scale, shared memory multiprocessors. Keywords: shared memory, multiprocessors, cache coherence, memory manag...
Multiprocessor Cache Coherence Based on Virtual Memory Support
, 1995
Cited by 17 (1 self)
Virtual memory based cache coherence is a mechanism that relies only on hardware that already exists on the microprocessors of a shared memory multiprocessor system, yet dynamically detects and resolves potential cache inconsistencies using virtual memory techniques. The key feature of the approach is that the virtual memory translation hardware on each processor is used to detect shared accesses that could lead to memory incoherencies, and VM page fault handlers execute the appropriate actions to maintain cache coherence. VM-based cache coherence basically trades off design simplicity against increased software overheads. The work presented in this paper evaluates this tradeoff. We show that VM-based cache coherence performs well for scientific applications that require significant aggregate memory bandwidth. Keywords: shared memory, multiprocessors, cache coherence, virtual memory, performance evaluation. Biographies: Karin Petersen is a Member of the Research Staff at Xe...
Techniques for Cache and Memory Simulation Using Address Reference Traces
- Int. J. Comput. Simul
, 1990
Cited by 13 (1 self)
Simulation using address reference traces is one of the primary methods for the performance evaluation of the memory hierarchy of computer systems. In this paper we survey the techniques used in such a simulation. In both the uniprocessor and shared-memory multiprocessor cases, the issues can be divided into trace collection, trace storage, and trace usage. Trace collection can employ several hardware or software methods. Common concerns are that the collection method capture all of the address references of interest, that the execution overhead of the collection method is not excessive, and that the trace is of adequate length. The increasing size of caches heightens the adequate length concern. Trace storage is of concern because of the large size of traces. Techniques for trace compression and trace reduction have been developed. Trace usage is of concern because of the length of a simulation. Under some circumstances it is possible to evaluate multiple cache sizes in a si...
Hector - A Hierarchically Structured Shared Memory Multiprocessor
- IEEE Computer
, 1991
Cited by 12 (3 self)
This paper describes the architecture of a multiprocessor, called Hector, which exploits current microprocessor technology to produce a machine with good cost/performance tradeoff. A key design feature of Hector is an interconnection backplane that scales well with technology. This is achieved with simple hardware that has short critical paths in logic circuits and short lines in the interconnection network. The result is a system characterized by good performance, reliability and flexibility, which can be realized at a relatively low cost. An important aim of the Hector project is to develop an architecture suitable for construction of a general-purpose multiprocessor, where the cost is directly proportional to the size so that a low-cost entry-level machine is economically feasible, yet is scalable to larger sizes. In order to easily accommodate configurations with varying numbers of processors, Hector has a hierarchical structure. It features small bus sections interconnected by bit-parallel rings. The buses and rings can transfer data independently of each other so that the aggregate bandwidth increases proportionally to the number of these units. Hector is suitable for a wide range of applications. It can be effective for single jobs characterized by many parallel tasks, as well as for concurrent execution of multiple jobs consisting of predominantly serial tasks. That is, it is a machine that can be used effectively to run jobs typical of a Unix environment, as well as highly parallel commercial and scientific applications such as transaction systems, finite element analysis, and computer aided design.
A Comparative Evaluation of Techniques for Studying Parallel System Performance
, 1994
Cited by 9 (3 self)
This paper presents a comparative and qualitative survey of techniques for evaluating parallel systems. We also survey metrics that have been proposed for capturing and quantifying the details of complex parallel system interactions. Experimentation, theoretical/analytical modeling and simulation are three frequently used techniques in performance evaluation. Experimentation uses real or synthetic workloads, usually called benchmarks, to measure and analyze their performance on actual hardware. Theoretical and analytical models are used to abstract details of a parallel system, providing the view of a simplified system parameterized by a limited number of degrees of freedom that are kept tractable. Simulation and related performance monitoring/visualization tools have become extremely popular because of their ability to capture the dynamic nature of the interaction between applications and architectures. We first present the figures of merit that are important for any performance evaluation technique. With respect to these figures of merit, we survey the three techniques and make a qualitative comparison of their pros and cons. In particular, for each of the above techniques we discuss: representative case studies; the underlying models that are used for the workload and the architecture; the feasibility and ease of quantifying standard performance metrics from the available statistics; the accuracy/validity of the output statistics; and the cost/effort that is expended in each evaluation strategy.

