SMARTS: Accelerating Microarchitecture Simulation via Rigorous Statistical Sampling
In Proceedings of the 30th Annual International Symposium on Computer Architecture, 2003
Cited by 153 (19 self)
Current software-based microarchitecture simulators are many orders of magnitude slower than the hardware they simulate. Hence, most microarchitecture design studies draw their conclusions from drastically truncated benchmark simulations that are often inaccurate and misleading. This paper presents the Sampling Microarchitecture Simulation (SMARTS) framework as an approach to enable fast and accurate performance measurements of full-length benchmarks. SMARTS accelerates simulation by selectively measuring in detail only an appropriate benchmark subset. SMARTS prescribes a statistically sound procedure for configuring a systematic sampling simulation run to achieve a desired quantifiable confidence in estimates. Analysis of 41 of the 45 possible SPEC2K benchmark/input combinations shows CPI and energy per instruction (EPI) can be estimated to within ±3% with 99.7% confidence by measuring fewer than 50 million instructions per benchmark. In practice, inaccuracy in microarchitectural state initialization introduces an additional uncertainty which we empirically bound to ~2% for the tested benchmarks. Our implementation of SMARTS achieves an actual average error of only 0.64% on CPI and 0.59% on EPI for the tested benchmarks, running with average speedups of 35 and 60 over detailed simulation of 8-way and 16-way out-of-order processors, respectively.
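The sampling idea in this abstract can be illustrated with a minimal sketch. It uses the textbook normal-approximation confidence interval for a sample mean, with z = 3 corresponding to roughly 99.7% confidence. The function names and the per-unit CPI list are hypothetical and are not taken from the SMARTS implementation; the abstract does not spell out the paper's exact procedure.

```python
import math
import statistics


def systematic_sample(values, interval, offset=0):
    """Pick every `interval`-th measurement unit, starting at `offset`."""
    return values[offset::interval]


def estimate_with_confidence(sample, z=3.0):
    """Return (mean, relative half-width of the confidence interval).

    Assumes the normal approximation; z=3.0 is about 99.7% confidence.
    """
    n = len(sample)
    mean = statistics.fmean(sample)
    stdev = statistics.stdev(sample)  # sample standard deviation
    rel_half_width = z * (stdev / mean) / math.sqrt(n)
    return mean, rel_half_width


def required_sample_size(cv, rel_error, z=3.0):
    """Sample count needed so that z * cv / sqrt(n) <= rel_error."""
    return math.ceil((z * cv / rel_error) ** 2)
```

Under these assumptions, a coefficient of variation of 1.0 and a ±3% target at z = 3 call for roughly 10,000 measured units, since (3 × 1.0 / 0.03)² = 10,000.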
Picking Statistically Valid and Early Simulation Points
2003
Cited by 91 (14 self)
Modern architecture research relies heavily on detailed pipeline simulation. Simulating the full execution of an industry standard benchmark can take weeks to months to complete. To address this issue we have recently proposed using Simulation Points (found by only examining basic block execution frequency profiles) to increase the efficiency and accuracy of simulation. Simulation points are a small set of execution samples that when combined represent the complete execution of the program.
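One way to read "when combined represent the complete execution" is as a weighted average over the selected samples. The sketch below assumes each simulation point carries a weight equal to the fraction of execution it stands for. Both the weighting scheme and the function name are illustrative assumptions, since the abstract does not say how the points are combined.

```python
def combine_simulation_points(points):
    """Estimate whole-program CPI from (cpi, weight) pairs.

    Assumption: each weight is the share of total execution that the
    simulation point represents; weights are normalized here.
    """
    total_weight = sum(weight for _, weight in points)
    return sum(cpi * weight for cpi, weight in points) / total_weight
```

For example, three points with CPIs of 1.0, 2.0, and 4.0 and weights of 0.5, 0.25, and 0.25 would give an overall estimate of 2.0.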
Reducing State Loss for Effective Trace Sampling of Superscalar Processors
In Proceedings of the 1996 International Conference on Computer Design (ICCD), 1996
Cited by 88 (2 self)
There is a wealth of technological alternatives that can be incorporated into a processor design. These include reservation station designs, functional unit duplication, and processor branch handling strategies. The performance of a given design is measured through the execution of application programs and other workloads. Presently, trace-driven simulation is the most popular method of processor performance analysis in the development stage of system design. Current techniques of trace-driven simulation, however, are extremely slow and expensive. In this paper, a fast and accurate method for statistical trace sampling of superscalar processors is proposed.
A Framework for Statistical Modeling of Superscalar Processor Performance
1997
Cited by 43 (0 self)
This dissertation presents a statistical approach to modeling superscalar processor performance. Instead of directly modeling an execution trace, as with standard simulation-based performance models, a statistical model works with the probabilities of instruction types, instruction sequences, and processor states. The program trace and machine are analyzed separately, and the performance is computed from these two inputs. The statistical flow graph is introduced as a compact representation for program traces. The characterization of a specific processor and the statistical flow graph for a specific benchmark are combined to form a Markov chain. In order to reduce the state space size, this Markov chain is partitioned into several smaller submodels. Simulation-based techniques require extremely long run times, especially as traces reach lengths in the billions of instructions. The statistical approach presented here dramatically reduces the time required to explore a microarchitectural design space. Separating the program and machine models allows the time-consuming part of the modeling process,
Quantifying the Impact of Input Data Sets on Program Behavior and its Applications
Journal of Instruction-Level Parallelism, 2003
Cited by 38 (15 self)
Having a representative workload of the target domain of a microprocessor is extremely important throughout its design. The composition of a workload involves two issues: (i) which benchmarks to select and (ii) which input data sets to select per benchmark. Unfortunately, it is impossible to select a huge number of benchmarks and respective input sets due to the large instruction counts per benchmark and due to limitations on the available simulation time. In this paper, we use statistical data analysis techniques such as principal components analysis (PCA) and cluster analysis to efficiently explore the workload space. Within this workload space, different input data sets for a given benchmark can be displayed, a distance can be measured between program-input pairs that gives us an idea about their mutual behavioral differences, and representative input data sets can be selected for the given benchmark. This methodology is validated by showing that program-input pairs that are close to each other in this workload space indeed exhibit similar behavior. The final goal is to select a limited set of representative benchmark-input pairs that span the complete workload space. Next to workload composition, we discuss two other possible applications, namely gaining insight into the impact of input data sets on program behavior and evaluating the representativeness of sampled traces.
TurboSMARTS: Accurate microarchitecture simulation sampling in minutes
SIGMETRICS Performance Evaluation Review, 2005
Cited by 21 (0 self)
Recent research proposes accelerating processor microarchitecture simulation through statistical sampling. These proposals advocate detailed microarchitecture simulation of a large number (e.g., 10,000) of brief (e.g., 1000-instruction) execution windows to minimize instructions simulated and achieve high confidence in performance estimates. Unfortunately, correct measurement of such short execution windows requires highly accurate model state before each measurement. Prior techniques construct this state by continuously warming large microarchitectural structures (e.g., caches and the branch predictor) while emulating billions of instructions between measurements in an approach called functional warming. Although current sampling proposals require only minutes of detailed simulation, functional warming increases total turnaround time to hours.
Accelerating multiprocessor simulation with a memory timestamp record
In ISPASS-2005, 2005
Cited by 14 (4 self)
We introduce a fast and accurate technique for initializing the directory and cache state of a multiprocessor system based on a novel software structure called the memory timestamp record (MTR). The MTR is a versatile, compressed snapshot of memory reference patterns which can be rapidly updated during fast-forwarded simulation, or stored as part of a checkpoint. We evaluate MTR using a full-system simulation of a directory-based cache-coherent multiprocessor running a range of multithreaded workloads. Both MTR and a multiprocessor version of functional fast-forwarding (FFW) make similar performance estimates, usually within 15% of our detailed model. In addition to other benefits, we show that MTR has up to a 1.45× speedup over FFW, and a 7.7× speedup over our detailed baseline.
Simulation sampling with live-points
In ISPASS ’06: Proceedings of the 2006 International Symposium on Performance Analysis of Systems and Software, 2006
Cited by 12 (3 self)
Current simulation-sampling techniques construct accurate model state for each measurement by continuously warming large microarchitectural structures (e.g., caches and the branch predictor) while functionally simulating the billions of instructions between measurements. This approach, called functional warming, is the main performance bottleneck of simulation sampling and requires hours of runtime while the detailed simulation of the sample requires only minutes. Existing simulators can avoid functional simulation by jumping directly to particular instruction stream locations with architectural state checkpoints. To replace functional warming, these checkpoints must additionally provide microarchitectural model state that is accurate and reusable across experiments while meeting tight storage constraints. In this paper, we present a simulation-sampling framework that replaces functional warming with live-points without sacrificing accuracy. A live-point stores the bare minimum of functionally warmed state for accurate simulation of a limited execution window while placing minimal restrictions on microarchitectural configuration. Live-points can be processed in random rather than program order, allowing simulation results and their statistical confidence to be reported while simulations are in progress. Our framework matches the accuracy of prior simulation-sampling techniques (i.e., ±3% error with 99.7% confidence), while estimating the performance of an 8-way out-of-order superscalar processor running SPEC CPU2000 in 91 seconds per benchmark, on average, using a 12 GB live-point library.
Accuracy and Speed-Up of Parallel Trace-Driven Architectural Simulation
In Proc. Int’l Parallel Processing Symp., IEEE Computer Soc, 1997
Cited by 12 (0 self)
Trace-driven simulation continues to be one of the main evaluation methods in the design of high performance processor-memory sub-systems. In this paper, we examine the varying speed-up opportunities available by processing a given trace in parallel on an IBM SP-2 machine. We also develop a simple, yet effective method of correcting for cold-start cache miss errors, by the use of overlapped trace chunks. We then report selected experimental results to validate our expectations. We show that it is possible to achieve near-perfect speed-up without loss of accuracy. Next, in order to achieve further reduction in simulation cost, we combine uniform sampling methods with parallel trace processing with a slight loss of accuracy for finite-cache timer runs. We then show that by using warm-start sequences from preceding trace chunks, it is possible to reduce the errors back to acceptable bounds. 1. Introduction The ever-increasing sizes of real workloads is making the use of trace...
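The overlapped-chunk idea can be sketched on a toy direct-mapped cache. Each chunk is simulated from a cold cache, but it first replays the tail of the preceding chunk as an uncounted warm-up sequence. The cache model, function names, and parameters below are illustrative assumptions, not the paper's simulator.

```python
def count_misses(trace, num_lines, warmup=()):
    """Count misses in a direct-mapped cache indexed by block address.

    Accesses in `warmup` update the cache state but are not counted.
    """
    lines = [None] * num_lines
    misses = 0
    for counted, refs in ((False, warmup), (True, trace)):
        for addr in refs:
            idx, tag = addr % num_lines, addr // num_lines
            if lines[idx] != tag:
                lines[idx] = tag
                if counted:
                    misses += 1
    return misses


def chunked_misses(trace, chunk_len, overlap, num_lines):
    """Sum misses over independently simulated chunks.

    Each chunk starts cold and is warmed with the last `overlap`
    references of the preceding chunk (hypothetical parameter name).
    """
    total = 0
    for start in range(0, len(trace), chunk_len):
        warm = trace[max(0, start - overlap):start]
        total += count_misses(trace[start:start + chunk_len], num_lines, warm)
    return total
```

On a trace that loops over four blocks, for instance `[0, 1, 2, 3] * 8` with an 8-line cache, simulating the whole trace at once gives 4 misses. Splitting it into cold chunks of 8 references inflates that to 16, and an overlap of 4 references restores the count to 4. Because the chunks share no state, each could be handed to a separate worker.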
Applying Programming Language Implementation Techniques to Processor Simulation
2000
Cited by 8 (2 self)
This memoization makes the simulator run 5-12 times faster, with no change in simulation results (e.g., cycle count). Combining direct-execution and memoization, FastSim simulates a MIPS R10000-like microarchitecture with a 190-360 times slowdown (i.e., simulation time over native benchmark execution time on the host), which is an order of magnitude faster than SimpleScalar.

