Results 1 - 10
of
40
Wattch: A Framework for Architectural-Level Power Analysis and Optimizations
- In Proceedings of the 27th Annual International Symposium on Computer Architecture
, 2000
"... Power dissipation and thermal issues are increasingly significant in modern processors. As a result, it is crucial that power/performance tradeoffs be made more visible to chip architects and even compiler writers, in addition to circuit designers. Most existing power analysis tools achieve high ..."
Abstract
-
Cited by 843 (34 self)
- Add to MetaCart
Power dissipation and thermal issues are increasingly significant in modern processors. As a result, it is crucial that power/performance tradeoffs be made more visible to chip architects and even compiler writers, in addition to circuit designers. Most existing power analysis tools achieve high accuracy by calculating power estimates for designs only after layout or floorplanning are complete In addition to being available only late in the design process, such tools are often quite slow, which compounds the difficulty of running them for a large space of design possibilities.
Block Distribution Analysis to Find Periodic Behavior and Simulation Points in Applications
, 2001
"... Modern architecture research relies heavily on detailed pipeline simulation. Simulating the full execution of an industry standard benchmark can take weeks to months to complete. To overcome this problem researchers choose a very small portion of a program's execution to evaluate their results, rath ..."
Abstract
-
Cited by 230 (31 self)
- Add to MetaCart
Modern architecture research relies heavily on detailed pipeline simulation. Simulating the full execution of an industry standard benchmark can take weeks to months to complete. To overcome this problem researchers choose a very small portion of a program's execution to evaluate their results, rather than simulating the entire program. In this paper we propose Basic Block Distribution Analysis as an automated approach for finding these small portions of the program to simulate that are representative of the entire program's execution. This approach is based upon using profiles of a program's code structure (basic blocks) to uniquely identify different phases of execution in the program. We show that the periodicity of the basic block frequency profile reflects the periodicity of detailed simulation across several different architectural metrics (e.g., IPC, branch miss rate, cache miss rate, value misprediction, address misprediction, and reorder buffer occupancy). Since basic block frequencies can be collected using very fast profiling tools, our approach provides a practical technique for finding the periodicity and simulation points in applications.
A Large, Fast Instruction Window for Tolerating Cache Misses
"... Instruction window size is an important design parameter for many modern processors. Large instruction windows offer the potential advantage of exposing large amounts of instruction level parallelism. Unfortunately, naively scaling conventional window designs can significantly degrade clock cycle ti ..."
Abstract
-
Cited by 94 (1 self)
- Add to MetaCart
Instruction window size is an important design parameter for many modern processors. Large instruction windows offer the potential advantage of exposing large amounts of instruction level parallelism. Unfortunately, naively scaling conventional window designs can significantly degrade clock cycle time, undermining the benefits of increased parallelism. This paper presents a new instruction window design targeted at achieving the latency tolerance of large windows with the clock cycle time of small windows. The key observation is that instructions dependent on a long latency operation (e.g., cache miss) cannot execute until that source operation completes. These instructions are moved out of the conventional, small, issue queue to a much larger waiting instruction buffer (WIB). When the long latency operation completes, the instructions are reinserted into the issue queue. In this paper, we focus specifically on load cache misses and their dependent instructions. Simulations reveal that, for an 8-way processor, a 2K-entry WIB with a 32entry issue queue can achieve speedups of 20%, 84%, and 50% over a conventional 32-entry issue queue for a subset of the SPEC CINT2000, SPEC CFP2000, and Olden benchmarks, respectively.
Choosing Representative Slices of Program Execution for Microarchitecture Simulations: A Preliminary Application to the Data Stream
- In Workload Characterization of Emerging Applications
, 2000
"... Microarchitecture simulations are aimed at providing results representative of the behavior of a processor running an application. Due to CPU time constaints, only a few execution slices of a large application can be simulated. The aim of this paper is to propose a technique to choose a few prog ..."
Abstract
-
Cited by 65 (1 self)
- Add to MetaCart
Microarchitecture simulations are aimed at providing results representative of the behavior of a processor running an application. Due to CPU time constaints, only a few execution slices of a large application can be simulated. The aim of this paper is to propose a technique to choose a few program execution slices representative of the entire execution. We characterize the behavior of each consecutive slice executed. Then we use a statistical classification method to discriminate the execution slices and select the representative ones. In this paper, we detail this approach and apply it to the data stream. Using data cache simulations on the SPEC95 programs, we show that slices representing 1.46 % (average upon all the SPEC95 but one) of the overall program activity are as representative as trace sampling using a 10 % sampling ratio. keywords: micro-architecture simulation, trace-driven simulation, on-thefly simulation, trace sampling, clustering, classification, data stre...
Out-of-order commit processors
- In Proceedings of the 10th International Symposium on High Performance Computer Architecture
, 2004
"... Modern out-of-order processors tolerate long latency memory operations by supporting a large number of inflight instructions. This is particularly useful in numerical applications where branch speculation is normally not a problem and where the cache hierarchy is not capable of delivering the data s ..."
Abstract
-
Cited by 49 (8 self)
- Add to MetaCart
Modern out-of-order processors tolerate long latency memory operations by supporting a large number of inflight instructions. This is particularly useful in numerical applications where branch speculation is normally not a problem and where the cache hierarchy is not capable of delivering the data soon enough. In order to support more in-flight instructions, several resources have to be up-sized, such as the Reorder Buffer (ROB), the general purpose instructions queues, the Load/Store queue and the number of physical registers in the processor. However, scaling-up the number of entries in these resources is impractical because of area, cycle time, and power consumption constraints. In this paper we propose to increase the capacity of future processors by augmenting the number of in-flight instructions. Instead of simply up-sizing resources, we push for new and novel microarchitectural structures that achieve the same performance benefits but with a much lower need for resources. Our main contribution is a new checkpointing mechanism that is capable of keeping thousands of in-flight instructions at a practically constant cost. We also propose a queuing mechanism that takes advantage of the differences in waiting time of the instructions in the flow. Using these two mechanisms our processor has a performance degradation of only 10 % for SPEC2000fp over a conventional processor requiring more than an order of magnitude additional entries in the ROB and instruction queues, and about a 200 % improvement over a current processor with a similar number of entries. 1.
A Statistically Rigorous Approach for Improving Simulation Methodology
- IN PROCEEDINGS OF THE NINTH INTERNATIONAL SYMPOSIUM ON HIGH PERFORMANCE COMPUTER ARCHITECTURE (HPCA-9
, 2002
"... Due to cost, time, and flexibility constraints, simulators are often used to explore the design space when developing new processor architectures, as well as when evaluating the performance of new processor enhancements. However, despite this dependence on simulators, statistically rigorous simulati ..."
Abstract
-
Cited by 40 (7 self)
- Add to MetaCart
Due to cost, time, and flexibility constraints, simulators are often used to explore the design space when developing new processor architectures, as well as when evaluating the performance of new processor enhancements. However, despite this dependence on simulators, statistically rigorous simulation methodologies are not typically used in computer architecture research. A formal methodology can provide a sound basis for drawing conclusions gathered from simulation results by adding statistical rigor, and consequently, can increase confidence in the simulation results. This paper demonstrates the application of a rigorous statistical technique to the setup and analysis phases of the simulation process. Specifically, we apply a Plackett and Burman design to: 1) identify key processor parameters, 2) classify benchmarks based on how they affect the processor, and 3) analyze the effect of processor performance enhancements. Our technique expands on previous work by applying a statistical method to improve the simulation methodology instead of applying a statistical model to estimate the performance of the processor.
Speculative Updates of Local and Global Branch History: A Quantitative Analysis
- Journal of Instruction-Level Parallelism
, 1998
"... In today's wide-issue processors, even small branch-misprediction rates introduce substantial performance penalties. Worse yet, inadequate branch prediction creates a bottleneck at the fetch stage, restricting other opportunities for improving performance. The choice of how to predict conditional- ..."
Abstract
-
Cited by 34 (8 self)
- Add to MetaCart
In today's wide-issue processors, even small branch-misprediction rates introduce substantial performance penalties. Worse yet, inadequate branch prediction creates a bottleneck at the fetch stage, restricting other opportunities for improving performance. The choice of how to predict conditional-branch outcomes is the primary lever on prediction accuracy. But the choice of when to update the predictor with branch outcomes is a second powerful lever, and the subject of this paper. In history-based predictors like gshare, many mispredictions result from commit-time update of the history: typical pipelined processors predict branches in the fetch stage, but update the predictor in the commit stage, making the predictor's state temporarily outof -date. As pipelines grow longer---in particular, when branches can spend many cycles in the instruction window waiting to issue---this problem becomes worse. Prior work on this subject has discussed the need for speculative update in a gl...
Minimal Subset Evaluation: Rapid Warm-up for Simulated Hardware State
- In Proceedings of the 2001 International Conference on Computer Design
, 2001
"... This paper introduces minimal subset evaluation (MSE) as a way to reduce time spent on large-structure warm-up during the fastforwarding portion of processor simulations. Warm up is commonly used prior to full-detail simulation to avoid cold-start bias in large structures like caches and branch pred ..."
Abstract
-
Cited by 22 (1 self)
- Add to MetaCart
This paper introduces minimal subset evaluation (MSE) as a way to reduce time spent on large-structure warm-up during the fastforwarding portion of processor simulations. Warm up is commonly used prior to full-detail simulation to avoid cold-start bias in large structures like caches and branch predictors. Unfortunately, warm up can be very time consuming, often representing 50% or more of total simulation time. Previous techniques have used the entire fast-forward interval to obtain accurate warm up, which may be prohibitive for large parameter-space searches, or chosen a short but ad-hoc warm-up length that reduces simulation time but may sacrifice accuracy. MSE probabilistically determines a minimally sufficient fraction of the set of fast-forward transactions that must be executed for warm up to accurately produce state as it would have appeared had the entire fast-forward interval been used for warm up. The paper describes the mathematical underpinnings of MSE and demonstrates its effectiveness for both single-large-sample and multiple-sample simulation styles. In our experiments, MSE yields errors of less than 1% in IPC measurements with cycle-accurate simulation, while reducing simulation times by an average factor of two or more. 1
Memory Reference Reuse Latency: Accelerated Sampled Microarchitecture Simulation
- In Proceedings of the 2003 IEEE International Symposium on Performance Analysis of Systems and Software
, 2002
"... This paper explores techniques for speeding up sampled microprocessor simulations by exploiting the observation that of the memory references that precede a sample, references that occur nearest to the sample are more likely to be germane during the sample itself. This means that accurately warming ..."
Abstract
-
Cited by 17 (0 self)
- Add to MetaCart
This paper explores techniques for speeding up sampled microprocessor simulations by exploiting the observation that of the memory references that precede a sample, references that occur nearest to the sample are more likely to be germane during the sample itself. This means that accurately warming up simulated cache and branch predictor state only requires that a subset of the memory references and control-flow instructions immediately preceding a simulation sample need to be modeled. Our technique takes measurements of the memory reference reuse latencies (MRRLs) and uses these data to choose a point prior to each sample to engage cache hierarchy and branch predictor modeling.
A Taxonomy of Branch Mispredictions, and Alloyed Prediction as a Robust Solution to Wrong-History Mispredictions
- In Proceedings of the 2000 International Conference on Parallel Architectures and Compilation Techniques
, 2000
"... The need for accurate conditional-branch prediction is well known: mispredictions waste large numbers of cycles, inhibit outof -order execution, and waste power on mis-speculated computation. Prior work on branch-predictor organization has focused mainly on how to reduce conflicts in the branch-pred ..."
Abstract
-
Cited by 12 (4 self)
- Add to MetaCart
The need for accurate conditional-branch prediction is well known: mispredictions waste large numbers of cycles, inhibit outof -order execution, and waste power on mis-speculated computation. Prior work on branch-predictor organization has focused mainly on how to reduce conflicts in the branch-predictor structures, while relatively little work has explored other causes of mispredictions. Some prior work has identified other categories of mispredictions, but this paper organizes these categories into a broad taxonomy of misprediction types. Using the taxonomy, this paper goes on to show that other categories---especially wronghistory mispredictions---are often more important than conflicts. This is true even if just a very simple conflict-reduction technique is used. Based on these observations, this paper proposes alloying local and global history together in a two-level branch predictor structure. This simple technique, a generalization of the bi-mode predictor, attacks wrong-histo...

