Using the SimOS Machine Simulator to Study Complex Computer Systems
- ACM Transactions on Modeling and Computer Simulation
, 1997
Cited by 144 (5 self)
... This paper identifies two challenges that machine simulators such as SimOS must overcome in order to effectively analyze large complex workloads: handling long workload execution times and collecting data effectively. To study long-running workloads, SimOS includes multiple interchangeable simulation models for each hardware component. By selecting the appropriate combination of simulation models, the user can explicitly control the tradeoff between simulation speed and simulation detail. To handle the large amount of low-level data generated by the hardware simulation models, SimOS contains flexible annotation and event classification mechanisms that map the data back to concepts meaningful to the user. SimOS has been extensively used to study new computer hardware designs, to analyze application performance, and to study operating systems. We include two case studies that demonstrate how a low-level machine simulator such as SimOS can be used to study large and complex workloads.
The Augmint Multiprocessor Simulation Toolkit for Intel x86 Architectures
, 1996
Cited by 51 (4 self)
Most publicly available simulation tools only simulate RISC architectures. These tools cannot capture the instruction mix and memory reference patterns of CISC architectures. In this paper, we present an overview of Augmint, an execution-driven multiprocessor simulation toolkit that fills this gap by supporting Intel x86 architectures. Augmint also supports trace-driven simulation for uniprocessors as well as multiprocessors, with minor effort on the part of simulator developers. Augmint runs m4-macro-extended C and C++ applications such as those in the SPLASH and SPLASH-2 benchmark suites. Augmint supports a thread-based programming model with a shared global address space and private stack space. Augmint supports a simulator interface compatible with that of the MINT simulation toolkit for MIPS architectures, thus allowing the reuse of most architecture simulators written for MINT. Augmint simulations run on x86-based uniprocessor systems under UNIX or Windows NT. The source code ...
Using Simple Page Placement Policies to Reduce the Cost of Cache Fills in Coherent Shared-Memory Systems
- In Proceedings of the Ninth International Parallel Processing Symposium
, 1994
Cited by 43 (11 self)
The cost of a cache miss depends heavily on the location of the main memory that backs the missing line. For certain applications, this cost is a major factor in overall performance. We report on the utility of OS-based page placement as a mechanism to increase the frequency with which cache fills access local memory in distributed shared memory multiprocessors. Even with the very simple policy of first-use placement, we find significant improvements over round-robin placement for many applications on both hardware- and software-coherent systems. For most of our applications, first-use placement allows 35 to 75 percent of cache fills to be performed locally, resulting in performance improvements of up to 40 percent with respect to round-robin placement. We were surprised to find no performance advantage in more sophisticated policies, including page migration and page replication. In fact, in many cases the performance of our applications suffered under these policies.
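The difference between the two placement policies named in this abstract can be illustrated with a small sketch. It is not the paper's simulator: the function names, the trace format (a list of `(node, page)` cache fills), the synthetic trace, and the choice to implement round-robin as `page % num_nodes` are all assumptions made for illustration, and the numbers it produces say nothing about the paper's measured results.

```python
def local_fill_fraction(trace, num_nodes, policy):
    """Return the fraction of cache fills served by the requesting node's own memory.

    trace  : list of (node, page) pairs, one per cache fill (hypothetical format)
    policy : "first-use"   -> a page is homed at the first node that touches it
             "round-robin" -> a page is homed at node (page % num_nodes)
    """
    home = {}   # page -> node whose memory holds it
    local = 0
    for node, page in trace:
        if page not in home:
            if policy == "first-use":
                home[page] = node
            elif policy == "round-robin":
                home[page] = page % num_nodes
            else:
                raise ValueError("unknown policy: " + policy)
        if home[page] == node:
            local += 1
    return local / len(trace)


def synthetic_trace(num_nodes=4, pages_per_node=4, repeats=3, shared_page=100):
    """Build an invented trace: each node fills its own block of pages several
    times, then touches one page that every node shares."""
    trace = []
    for node in range(num_nodes):
        for _ in range(repeats):
            for j in range(pages_per_node):
                trace.append((node, node * pages_per_node + j))
        trace.append((node, shared_page))
    return trace


if __name__ == "__main__":
    trace = synthetic_trace()
    for policy in ("first-use", "round-robin"):
        print(policy, local_fill_fraction(trace, 4, policy))
```

On this mostly private trace, first-use placement homes each private page at the node that uses it, while round-robin spreads each node's pages across all nodes regardless of who touches them.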
Software Cache Coherence for Large Scale Multiprocessors
- In Proceedings of the First International Symposium on High Performance Computer Architecture
, 1994
Cited by 23 (9 self)
Shared memory is an appealing abstraction for parallel programming. It must be implemented with caches in order to perform well, however, and caches require a coherence mechanism to ensure that processors reference current data. Hardware coherence mechanisms for large-scale machines are complex and costly, but existing software mechanisms for message-passing machines have not provided a performance-competitive solution. We claim that an intermediate hardware option (memory-mapped network interfaces that support a global physical address space) can provide most of the performance benefits of hardware cache coherence. We present a software coherence protocol that runs on this class of machines and greatly narrows the performance gap between hardware and software coherence. We compare the performance of the protocol to that of existing software and hardware alternatives and evaluate the tradeoffs among various cache-write policies. We also observe that simple program changes can greatly...
Execution-Driven Simulation of Multiprocessors: Address and Timing Analysis
- ACM Transactions on Modeling and Computer Simulation
, 1994
Cited by 22 (0 self)
This paper describes and evaluates an efficient execution-driven technique for the simulation of multiprocessors that includes the simulation of system memory and is driven by real program workloads. The technique produces correctly interleaved address traces at run time without disk access overhead or hardware support, allowing accurate simulation of the effects of a variety of architectural alternatives on programs. We have implemented a simulator based on this technique that offers substantial advantages in terms of reduced time and space overheads when compared to instruction-driven or trace-driven simulation techniques, without significant loss of accuracy. The paper presents the results of several validation experiments used to quantify the accuracy and efficiency of the simulator for sequential, distributed, and shared-memory multiprocessors, and several parallel programs. These experiments show that prediction errors of less than 5% compared to actual execution times and overhe...
Augmint - A Multiprocessor Simulation Environment for Intel x86 architectures
, 1996
Cited by 21 (2 self)
Augmint is a fast execution-driven multiprocessor simulator for Intel x86 architectures. It is based on MINT [1], but provides a user interface similar to that of Tangolite [2]. For the sake of simulation speed, processors are modelled as user-level threads. A user-defined memory hierarchy simulator can be plugged into the simulator to study the behaviour of the architecture under consideration. 1.1 Introduction The design of the perfect memory hierarchy for scalable distributed shared memory architectures remains elusive, even today. Given that hardware prototyping and analytical modelling are either inflexible or inadequate, quite a few simulators have been written in the past to study the behaviour of parallel applications on these machines. Simulating a multiprocessor architecture is a tradeoff between simulation time and the accuracy of the results obtained. The higher the accuracy, the lower the speed. There are simulation systems that perform cycle by cycle simulations, a meth...
Paint: PA Instruction Set Interpreter
, 1996
Cited by 16 (7 self)
This document describes Paint, an instruction set simulator based on Mint[3]. Paint interprets
High performance software coherence for current and future architectures
- Journal of Parallel and Distributed Computing
, 1995
Cited by 15 (9 self)
Shared memory provides an attractive and intuitive programming model for large-scale parallel computing, but requires a coherence mechanism to allow caching for performance while ensuring that processors do not use stale data in their computation. Implementation options range from distributed shared memory emulations on networks of workstations to tightly coupled fully cache-coherent distributed shared memory multiprocessors. Previous work indicates that performance varies dramatically from one end of this spectrum to the other. Hardware cache coherence is fast, but also costly and time-consuming to design and implement, while DSM systems provide acceptable performance on only a limited class of applications. We claim that an intermediate hardware option (memory-mapped network interfaces that support a global physical address space, without cache coherence) can provide most of the performance benefits of fully cache-coherent hardware, at a fraction of the cost. To support this claim we present a software coherence protocol that runs on this class of machines, and use simulation to conduct a performance study. We look at both programming and architectural issues in the context of software and hardware coherence protocols. Our results suggest that software coherence on NCC-NUMA machines is a more cost-effective approach to large-scale shared-memory multiprocessing than either pure distributed shared memory or hardware cache coherence. © 1995 Academic Press, Inc.
Eager Combining: A Coherency Protocol for Increasing Effective Network and Memory Bandwidth in Shared-Memory Multiprocessors
- In 6th IEEE Symposium on Parallel and Distributed Processing
, 1994
Cited by 14 (1 self)
An excessive number of remote accesses or a non-uniform distribution of remote accesses can cause even well-designed multiprocessors to exhibit severe memory and network contention. Producer/consumer data generates a particularly common sharing pattern that results in a non-uniform distribution of references. In this paper we quantify the performance impact of producer/consumer sharing as a function of memory and network bandwidth, and argue that the contention caused by this form of sharing severely impacts performance on large-scale machines. We propose a new coherency protocol, called eager combining, which is designed to alleviate this contention. We use execution-driven simulation of parallel programs on a large-scale multiprocessor to show that eager combining can improve the performance of programs with producer/consumer data by a factor of 4 or more. 1 Introduction One common cause of poor performance in large-scale shared-memory multiprocessors is limited memory or interconne...
Fast Mutual Exclusion, Even With Contention
, 1993
Cited by 13 (0 self)
We present a mutual exclusion algorithm that performs well both with and without contention, on machines with no atomic instructions other than read and write. The algorithm capitalizes on the ability of memory systems to read and write at both full- and half-word granularities. It depends on predictable processor execution rates, but requires no bound on the length of critical sections, performs only O(n) total references to shared memory when arbitrating among conflicting requests (rather than O(n^2) in the general version of Lamport's fast mutual exclusion algorithm), and performs only 2 reads and 4 writes (a new lower bound) in the absence of contention. We provide a correctness proof.
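For context on the read and write counts quoted above, the following is a minimal sketch of the baseline the abstract compares against, Lamport's fast mutual exclusion algorithm, as it is commonly presented. It is not the algorithm of this paper, whose details the abstract does not give. The `CountingMemory` class and the function names are hypothetical instrumentation added here to count shared-memory accesses, and the demo runs a single process with no contention, so the waiting loops are never entered.

```python
class CountingMemory:
    """Shared variables x, y and b[1..n], with counters for reads and writes.
    Hypothetical instrumentation; y == 0 means no process holds the lock."""

    def __init__(self, n):
        self.cells = {"x": 0, "y": 0}
        for j in range(1, n + 1):
            self.cells[("b", j)] = False
        self.reads = 0
        self.writes = 0

    def read(self, name):
        self.reads += 1
        return self.cells[name]

    def write(self, name, value):
        self.writes += 1
        self.cells[name] = value


def acquire(mem, i, n):
    """Entry protocol of Lamport's fast mutex for process i (1 <= i <= n)."""
    while True:
        mem.write(("b", i), True)
        mem.write("x", i)
        if mem.read("y") != 0:
            # Another process is competing or inside: back off and retry.
            mem.write(("b", i), False)
            while mem.read("y") != 0:
                pass
            continue
        mem.write("y", i)
        if mem.read("x") != i:
            # Contention detected: wait for every process to clear its flag.
            mem.write(("b", i), False)
            for j in range(1, n + 1):
                while mem.read(("b", j)):
                    pass
            if mem.read("y") != i:
                while mem.read("y") != 0:
                    pass
                continue
        return  # the critical section may now be entered


def release(mem, i):
    """Exit protocol for process i."""
    mem.write("y", 0)
    mem.write(("b", i), False)


def uncontended_cost(n=4, i=1):
    """Count shared-memory accesses for one acquire/release with no contention."""
    mem = CountingMemory(n)
    acquire(mem, i, n)
    release(mem, i)
    return mem.reads, mem.writes


if __name__ == "__main__":
    reads, writes = uncontended_cost()
    print("reads:", reads, "writes:", writes)
```

In this sketch the uncontended path reads `y` and `x` once each and writes `b[i]`, `x`, `y`, `y`, `b[i]`, one write more than the 2 reads and 4 writes the abstract reports for the new algorithm.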

