Multifacet's General Execution-driven Multiprocessor Simulator (GEMS) Toolset
SIGARCH Comput. Archit. News, 2005
Cited by 124 (13 self)
The Wisconsin Multifacet Project has created a simulation toolset to characterize and evaluate the performance of multiprocessor hardware systems commonly used as database and web servers. We leverage an existing full-system functional simulation infrastructure (Simics [14]) as the basis around which to build a set of timing simulator modules for modeling the timing of the memory system and microprocessors. This simulator infrastructure enables us to run architectural experiments using a suite of scaled-down commercial workloads [3]. To enable other researchers to more easily perform such research, we have released these timing simulator modules as the Multifacet General Execution-driven
Managing Wire Delay in Large Chip-Multiprocessor Caches
IEEE/ACM International Symposium on Microarchitecture, 2004
Cited by 90 (4 self)
In response to increasing (relative) wire delay, architects have proposed various technologies to manage the impact of slow wires on large uniprocessor L2 caches. Block migration (e.g., D-NUCA and NuRapid) reduces average hit latency by migrating frequently used blocks towards the lower-latency banks. Transmission Line Caches (TLC) use on-chip transmission lines to provide low latency to all banks. Traditional stride-based hardware prefetching strives to tolerate, rather than reduce, latency. Chip multiprocessors (CMPs) present additional challenges. First, CMPs often share the on-chip L2 cache, requiring multiple ports to provide sufficient bandwidth. Second, multiple threads mean multiple working sets, which compete for limited on-chip storage. Third, sharing code and data interferes with block migration, since one processor's low-latency bank is another processor's high-latency bank. In this paper, we develop L2 cache designs for CMPs that incorporate these three latency management techniques. We use detailed full-system simulation to analyze the performance trade-offs for both commercial and scientific workloads. First, we demonstrate that block migration is less effective for CMPs because 40-60% of L2 cache hits in commercial workloads are satisfied in the central banks, which are equally far from all processors. Second, we observe that although transmission lines provide low latency, contention for their restricted bandwidth limits their performance. Third, we show that stride-based prefetching between L1 and L2 caches alone improves performance by at least as much as the other two techniques. Finally, we present a hybrid design, combining all three techniques, that improves performance by an additional 2% to 19% over prefetching alone.
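The abstract above mentions stride-based hardware prefetching without describing a mechanism. As a rough illustration only, the Python sketch below shows one common way a per-instruction stride detector can work. The class name `StridePrefetcher`, the `degree` parameter, and the rule that a stride must repeat once before prefetches are issued are assumptions made for this example, not the design evaluated in the paper.

```python
class StridePrefetcher:
    """Toy per-PC stride detector (illustrative assumption, not the paper's design)."""

    def __init__(self, degree=1):
        self.degree = degree   # how many blocks ahead to prefetch (hypothetical knob)
        self.table = {}        # pc -> (last_addr, last_stride)

    def access(self, pc, addr):
        """Record a memory access and return the list of addresses to prefetch."""
        entry = self.table.get(pc)
        if entry is None:
            # First time this instruction is seen: nothing to compare against.
            self.table[pc] = (addr, 0)
            return []
        last_addr, last_stride = entry
        stride = addr - last_addr
        self.table[pc] = (addr, stride)
        # Prefetch only when the same non-zero stride has been observed twice in a row.
        if stride == 0 or stride != last_stride:
            return []
        return [addr + stride * i for i in range(1, self.degree + 1)]


if __name__ == "__main__":
    prefetcher = StridePrefetcher(degree=2)
    for address in (100, 164, 228):
        print(address, prefetcher.access(pc=0x40, addr=address))
```

In this sketch the third access confirms a stride of 64 and triggers two prefetches; a real hardware table would be finite and would typically track a confidence state per entry.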
Token Coherence: Decoupling Performance and Correctness
2003
Cited by 86 (15 self)
Many future shared-memory multiprocessor servers will both target commercial workloads and use highly-integrated "glueless" designs. Implementing low-latency cache coherence in these systems is difficult, because traditional approaches either add indirection for common cache-to-cache misses (directory protocols) or require a totally-ordered interconnect (traditional snooping protocols). Unfortunately, totally-ordered interconnects are difficult to implement in glueless designs. An ideal coherence protocol would avoid indirections and interconnect ordering; however, such an approach introduces numerous protocol races that are difficult to resolve.
Using Destination-Set Prediction to Improve the Latency/Bandwidth Tradeoff in . . .
In Proceedings of the 30th Annual International Symposium on Computer Architecture, 2003
Cited by 48 (7 self)
Destination-set prediction can improve the latency/bandwidth tradeoff in shared-memory multiprocessors. The destination set is the collection of processors that receive a particular coherence request. Snooping protocols send requests to the maximal destination set (i.e., all processors), reducing latency for cache-to-cache misses at the expense of increased traffic. Directory protocols send requests to the minimal destination set, reducing bandwidth at the expense of an indirection through the directory for cache-to-cache misses. Recently proposed hybrid protocols trade off latency and bandwidth by directly sending requests to a predicted destination set.
Simflex: A fast, accurate, flexible full-system simulation framework for performance evaluation of server architecture
SIGMETRICS Performance Evaluation Review, 2004
Cited by 43 (16 self)
The new focus on commercial workloads in simulation studies of server systems has caused a drastic increase in the complexity and decrease in the speed of simulation tools. The complexity of a large-scale full-system model makes development of a monolithic simulation tool a prohibitively difficult task. Furthermore, detailed full-system models simulate so slowly that experimental results must be based on simulations of only fractions of a second of execution of the modelled system. This paper presents SIMFLEX, a simulation framework which uses component-based design and rigorous statistical sampling to enable development of complex models and ensure representative measurement results with fast simulation turnaround. The novelty of SIMFLEX lies in its combination of a unique, compile-time approach to component interconnection and a methodology for obtaining accurate results from sampled simulations on a platform capable of evaluating unmodified commercial workloads.
Adaptive Cache Compression for High-Performance Processors
In Proc. ISCA, 2004
Cited by 35 (3 self)
Modern processors use two or more levels of cache memories to bridge the rising disparity between processor and memory speeds. Compression can improve cache performance by increasing effective cache capacity and eliminating misses. However, decompressing cache lines also increases cache access latency, potentially degrading performance. In this paper, we develop an adaptive policy that dynamically adapts to the costs and benefits of cache compression. We propose a two-level cache hierarchy where the L1 cache holds uncompressed data and the L2 cache dynamically selects between compressed and uncompressed storage. The L2 cache is 8-way set-associative with LRU replacement, where each set can store up to eight compressed lines but has space for only four uncompressed lines. On each L2 reference, the LRU stack depth and compressed size determine whether compression (could have) eliminated a miss or incurs an unnecessary decompression overhead. Based on this outcome, the adaptive policy updates a single global saturating counter, which predicts whether to allocate lines in compressed or uncompressed form. We evaluate adaptive cache compression using full-system simulation and a range of benchmarks. We show that compression can improve performance for memory-intensive commercial workloads by up to 17%. However, always using compression hurts performance for low-miss-rate benchmarks because of unnecessary decompression overhead, degrading performance by up to 18%. By dynamically monitoring workload behavior, the adaptive policy achieves comparable benefits from compression, while never degrading performance by more than 0.4%.
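The global saturating counter described in the abstract above can be sketched in Python as follows. This is one reading of the abstract, not the authors' implementation: the class name, the penalty values, the saturation limit, and the rule that a positive counter means "compress" are all assumptions chosen for illustration. Only the four-uncompressed-lines-per-set figure comes from the abstract.

```python
UNCOMPRESSED_WAYS = 4  # from the abstract: a set has room for four uncompressed lines


class GlobalCompressionPredictor:
    """Single global saturating counter; all numeric defaults are assumed values."""

    def __init__(self, miss_penalty=400, decompression_penalty=5, limit=4096):
        self.miss_penalty = miss_penalty                      # assumed benefit of an avoided miss
        self.decompression_penalty = decompression_penalty    # assumed cost of decompressing a line
        self.limit = limit                                    # assumed saturation bound
        self.counter = 0

    def _bump(self, delta):
        # Saturating add: clamp the counter to [-limit, +limit].
        self.counter = max(-self.limit, min(self.limit, self.counter + delta))

    def record_hit(self, stack_depth, line_is_compressed):
        """stack_depth is the 1-based LRU stack position of the line that hit."""
        if stack_depth > UNCOMPRESSED_WAYS:
            # The line only fit because the set holds compressed lines: a miss was avoided.
            self._bump(self.miss_penalty)
        elif line_is_compressed:
            # The line would have been resident anyway: decompression was wasted work.
            self._bump(-self.decompression_penalty)

    def record_avoidable_miss(self):
        # A miss that compression could have eliminated also counts in favor of compressing.
        self._bump(self.miss_penalty)

    def should_compress(self):
        return self.counter > 0


if __name__ == "__main__":
    predictor = GlobalCompressionPredictor()
    predictor.record_hit(stack_depth=6, line_is_compressed=True)
    print(predictor.counter, predictor.should_compress())
```

Weighting avoided misses much more heavily than decompressions reflects the general point that a miss costs far more than a decompression; the specific ratio used here is arbitrary.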
Coherence Ordering for Ring-based Chip Multiprocessors
2006
Cited by 20 (3 self)
Ring interconnects may be an attractive solution for future chip multiprocessors because they can enable faster links than buses and simpler switches than arbitrary switched interconnects. Moreover, a ring naturally orders requests sufficiently to enable directory-less coherence, but not in the total order that buses provide for snooping coherence. Existing cache coherence protocols for rings either establish a (total) ordering point (ORDERING-POINT) or use a greedy order (GREEDY-ORDER) with unbounded retries. In this work, we propose a new class of ring protocols, RING-ORDER, in which requests complete in ring position order to achieve two benefits. First, RING-ORDER improves performance relative to ORDERING-POINT by activating requests immediately instead of waiting for them to reach the ordering point. Second, it improves performance stability relative to GREEDY-ORDER by not using retries. Thus, the new RING-ORDER combines the best of ORDERING-POINT (good performance stability) with the best of GREEDY-ORDER (good average performance).
Enhancing multiprocessor architecture simulation speed using matched-pair comparison
In Proceedings of the International Symposium on Performance Analysis of Systems and Software, 2005
Cited by 18 (0 self)
While cycle-level, full-system architecture simulation tools are capable of estimating performance at arbitrary accuracy, the time to simulate an entire application is usually prohibitive. Moreover, simulating multi-threaded applications further exacerbates this problem as most simulation tools are single-threaded. Recently, statistical sampling techniques, such as SMARTS, have managed to bring down the simulation time significantly by making it possible to only simulate about 1% of the code with sufficient accuracy. However, thousands of simulation points throughout the benchmark must still be simulated. First of all, we propose to use the well-established statistical method matched-pair comparison and motivate why this will bring down the number of simulation points needed to achieve a given accuracy. We apply it to single-processor
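To illustrate why the matched-pair comparison named in the abstract above can need fewer simulation points, the Python sketch below compares two hypothetical designs measured at the same simulation points. The function names, the sample values, and the use of a normal-approximation factor `z = 1.96` are assumptions for this example (a Student's t quantile would be more appropriate for small samples); none of them comes from the paper.

```python
import math
import statistics


def matched_pair_interval(base, variant, z=1.96):
    """Mean difference and confidence half-width from per-point differences."""
    if len(base) != len(variant) or len(base) < 2:
        raise ValueError("need two equally long samples with at least two points")
    diffs = [v - b for b, v in zip(base, variant)]
    mean = statistics.mean(diffs)
    half_width = z * statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean, half_width


def unpaired_interval(base, variant, z=1.96):
    """Same comparison, but treating the two samples as independent."""
    mean = statistics.mean(variant) - statistics.mean(base)
    half_width = z * math.sqrt(
        statistics.variance(base) / len(base)
        + statistics.variance(variant) / len(variant)
    )
    return mean, half_width


if __name__ == "__main__":
    # Hypothetical CPI values at the same four simulation points for two designs.
    base_cpi = [1.0, 2.0, 3.0, 4.0]
    variant_cpi = [1.1, 2.1, 3.1, 4.1]
    print("paired:  ", matched_pair_interval(base_cpi, variant_cpi))
    print("unpaired:", unpaired_interval(base_cpi, variant_cpi))
```

Because both designs see the same program phases at each point, the phase-to-phase variation cancels in the differences, so the paired interval is much narrower than the unpaired one for the same number of points.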
Understanding the Propagation of Hard Errors to Software and Implications for Resilient System Design
In Proc. Intl. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2008
Cited by 18 (2 self)
With continued CMOS scaling, future shipped hardware will be increasingly vulnerable to in-the-field faults. To be broadly deployable, the hardware reliability solution must incur low overheads, precluding use of expensive redundancy. We explore a cooperative hardware-software solution that watches for anomalous software behavior to indicate the presence of hardware faults. Fundamental to such a solution is a characterization of how hardware faults in different microarchitectural structures of a modern processor propagate through the application and OS. This paper aims to provide such a characterization, resulting in identifying low-cost detection methods and providing guidelines for implementation of the recovery and diagnosis components of such a reliability solution. We focus on hard faults because they are increasingly important and have different system implications than
Automatic logging of operating system effects to guide application-level architecture simulation
In Proceedings of SIGMetrics/Performance 2006, 2006
Cited by 17 (6 self)
Modern architecture research relies heavily on application-level detailed pipeline simulation. A time-consuming part of building a simulator is correctly emulating the operating system effects, which is required even if the goal is to simulate just the application code, in order to achieve functional correctness of the application's execution. Existing application-level simulators require hand-coding the emulation of every possible system effect (e.g., system call, interrupt, DMA transfer) that can impact the application's execution. Developing such an emulator for a given operating system is a tedious exercise, and it can also be costly to maintain it to support newer versions of that operating system. Furthermore, porting the emulator to a completely different operating system might involve building it altogether from scratch. In this paper, we describe a tool that can automatically log operating system effects to guide architecture simulation of application code. The benefits of our approach are: (a) we do not have to build or maintain any infrastructure for emulating the operating system effects, (b) we can support simulation of more complex applications on our application-level simulator, including those applications that use asynchronous interrupts, DMA transfers, etc., and (c) using the system effects logs collected by our tool, we can deterministically re-execute the application to guide architecture simulation that has reproducible results.

