Results 1 - 10
of
58
Structures for Phase Classification
, 2004
"... Most programs are repetitive, where similar behavior can be seen at different execution times. Proposed algorithms automatically group these similar intervals of execution into phases, where all the intervals in a phase have homogeneous behavior and similar resource requirements. In this paper we ex ..."
Abstract
-
Cited by 49 (11 self)
- Add to MetaCart
Most programs are repetitive, where similar behavior can be seen at different execution times. Proposed algorithms automatically group these similar intervals of execution into phases, where all the intervals in a phase have homogeneous behavior and similar resource requirements. In this paper we examine different program structures for capturing phase behavior. The goal is to compare the size and accuracy of these structures for performing phase classification. We focus on profiling the frequency of program level structures that are independent from underlying architecture performance metrics. This allows the phase classification to be used across different hardware designs that support the same instruction set (ISA). We compare using basic blocks, loop branches, procedures, opcodes, register usage, and memory address information for guiding phase classification. We compare these different structures in terms of their ability to create homogeneous phases, and evaluate the accuracy of using these structures to pick simulation points for SimPoint.
The strong correlation between code signatures and performance
- In IEEE International Symposium on Performance Analysis of Systems and Software
, 2005
"... A recent study [1] examined the use of sampled hardware counters to create sampled code signatures. This approach is attractive because sampled code signatures can be quickly gathered for any application. The conclusion of their study was that there exists a fuzzy correlation between sampled code si ..."
Abstract
-
Cited by 38 (10 self)
- Add to MetaCart
A recent study [1] examined the use of sampled hardware counters to create sampled code signatures. This approach is attractive because sampled code signatures can be quickly gathered for any application. The conclusion of their study was that there exists a fuzzy correlation between sampled code signatures and performance predictability. The paper raises the question of how much information is lost in the sampling process, and our paper focuses on examining this issue. We first focus on showing that there exists a strong correlation between code signatures and performance. We then examine the relationship between sampled and full code signatures, and how these affect performance predictability. Our results confirm that there is a fuzzy correlation found in recent work for the SPEC programs with sampled code signatures, but that a strong correlation exists with full code signatures. In addition, we propose converting the sampled instruction counts, used in the prior work, into sampled code signatures representing loop and procedure execution frequencies. These sampled loop and procedure code signatures allow phase analysis to more accurately and easily find patterns, and they correlate better with performance. 1
Simpoint 3.0: Faster and more flexible program analysis
- Journal of Instruction Level Parallelism
, 2005
"... This paper describes the new features available in the Sim-Point 3.0 release. The release provides two techniques for drastically reducing the run-time of SimPoint: faster searching to find the best clustering, and efficiently clustering large numbers of intervals. SimPoint 3.0 also provides an opti ..."
Abstract
-
Cited by 38 (2 self)
- Add to MetaCart
This paper describes the new features available in the Sim-Point 3.0 release. The release provides two techniques for drastically reducing the run-time of SimPoint: faster searching to find the best clustering, and efficiently clustering large numbers of intervals. SimPoint 3.0 also provides an option to output only the simulation points that represent the majority of execution, which can reduce simulation time without much increase in error. Finally, this release provides support for correctly clustering variable length intervals, taking into consideration the weight of each interval during clustering. This paper describes SimPoint 3.0’s new features, how to use them, and points out some common pitfalls. 1
Transition phase classification and prediction
- In 11th International Symposium on High Performance Computer Architecture
, 2005
"... Most programs are repetitive, where similar behavior can be seen at different execution times. Proposed on-line systems automatically group these similar intervals of execution into phases, where the intervals in a phase have homogeneous behavior and similar resource requirements. These systems are ..."
Abstract
-
Cited by 34 (7 self)
- Add to MetaCart
Most programs are repetitive, where similar behavior can be seen at different execution times. Proposed on-line systems automatically group these similar intervals of execution into phases, where the intervals in a phase have homogeneous behavior and similar resource requirements. These systems are driven by algorithms that dynamically classify intervals of execution into phases and predict phase changes. In this paper, we examine several improvements to dynamic phase classification and prediction. The first improvement is to appropriately deal with phase transitions. This modification identifies phase transitions for what they are, instead of classifying them into a new phase, which increases phase prediction accuracy. We also describe an adaptive system that dynamically adjusts classification thresholds and splits phases with poor homogeneity. This modification increase the homogeneity of the hardware metrics across the intervals in each phase. We improve phase prediction accuracy by applying confidence to phase prediction, and we develop architectures that can accurately predict the outcome of the next phase change, and the length of the next phase. 1
Formal Online Methods for Voltage/Frequency Control in Multiple Clock Domain Microprocessors
- in ASPLOS-XI: Proceedings of the 11th international conference on Architectural Support for Programming Languages and Operating Systems
, 2004
"... Multiple Clock Domain (MCD) processors are a promising future alternative to today’s fully synchronous designs. Dynamic Voltage and Frequency Scaling (DVFS) in an MCD processor has the extra flexibility to adjust the voltage and frequency in each domain independently. Most existing DVFS approaches a ..."
Abstract
-
Cited by 32 (3 self)
- Add to MetaCart
Multiple Clock Domain (MCD) processors are a promising future alternative to today’s fully synchronous designs. Dynamic Voltage and Frequency Scaling (DVFS) in an MCD processor has the extra flexibility to adjust the voltage and frequency in each domain independently. Most existing DVFS approaches are profile-based offline schemes which are mainly suitable for applications whose execution characteristics are constrained and repeatable. While some work has been published about online DVFS schemes, the prior approaches are typically heuristic-based. In this paper, we present an effective online DVFS scheme for an MCD processor which takes a formal analytic approach, is driven by dynamic workloads, and is suitable for all applications. In our approach, we model an MCD processor as a queue-domain network and the online DVFS as a feedback control problem with issue queue occupancies as feedback signals. A dynamic stochastic queuing model is first proposed and linearized through an accurate linearization technique. A controller is then designed and verified by stability analysis. Finally we evaluate our DVFS scheme through a cycle-accurate simulation with a broad set of applications selected from MediaBench and SPEC2000 benchmark suites. Compared to the best-known prior approach, which is heuristicbased, the proposed online DVFS scheme is substantially more effective due to its automatic regulation ability. For example, we have achieved a 2-3 fold increase in efficiency in terms of energy-delay product improvement. In addition, our control theoretic technique is more resilient, requires less tuning effort, and has better scalability as compared to prior online DVFS schemes. We believe that the techniques and methodology described in this paper can be generalized for energy control in processors other than MCD, such as tiled stream processors.
Discovering and Exploiting Program Phases
, 2003
"... he way a program traverses the code during execution. We can find this phase behavior and classify it by examining only the ratios in which different regions of code are being executed over time. We can simply and quickly collect this information using basic block vector profiles for off-line classi ..."
Abstract
-
Cited by 30 (0 self)
- Add to MetaCart
he way a program traverses the code during execution. We can find this phase behavior and classify it by examining only the ratios in which different regions of code are being executed over time. We can simply and quickly collect this information using basic block vector profiles for off-line classification or through dynamic branch profiling for online classification. In addition, accurately capturing phase behavior through the computation of a single metric, independent of the underlying architectural details, means that it is pos- Timothy Sherwood Santa Barbara Erez Perelman Greg Hamerly Suleyman Sair North Carolina State University Brad Calder IN A SINGLE SECOND, A MODERN PROCESSOR CAN EXECUTE BILLIONS OF INSTRUCTIONS AND A PROGRAM'S BEHAVIOR CAN CHANGE MANY TIMES. SOME PROGRAMS CHANGE BEHAVIOR DRASTICALLY, SWITCHING BETWEEN PERIODS OF HIGH AND LOW PERFORMANCE, YET SYSTEM DESIGN AND OPTIMIZATION TYPICALLY FOCUS ON AVERAGE SYSTEM BEHAVIOR. INSTEAD OF ASSUMING
Fire-and-Forget: Load/Store Scheduling with No Store Queue at all
- In Proceedings of the 39th International Symposium on Microarchitecture
, 2006
"... Modern processors use CAM-based load and store queues (LQ/SQ) to support out-of-order memory scheduling and store-to-load forwarding. However, the LQ and SQ scale poorly for the sizes required for large-window, high-ILP processors. Past research has proposed ways to make the SQ more scalable by reor ..."
Abstract
-
Cited by 22 (3 self)
- Add to MetaCart
Modern processors use CAM-based load and store queues (LQ/SQ) to support out-of-order memory scheduling and store-to-load forwarding. However, the LQ and SQ scale poorly for the sizes required for large-window, high-ILP processors. Past research has proposed ways to make the SQ more scalable by reorganizing the CAMs or using non-associative structures. In particular, the Store Queue Index Prediction (SQIP) approach allows for load instructions to predict the exact SQ index of a sourcing store and access the SQ in a much simpler and more scalable RAMbased fashion. The reason why SQIP works is that loads that receive data directly from stores will usually receive the data from the same store each time. In our work, we take a slightly different view on the underlying observation used by SQIP: a store that forwards data to a load usually forwards to the same load each time. This subtle change in perspective leads to our “Fire-and-Forget ” (FnF) scheme for load/store scheduling and forwarding that results in the complete elimination of the store queue. The idea is that stores issue out of the reservation stations like regular instructions, and any store that forwards data to a load will use a predicted LQ index to directly write the value to the LQ entry without any associative logic. Any mispredictions/misforwardings are detected by a low-overhead pre-commit re-execution mechanism. Our original goal for FnF was to design a more scalable memory scheduling microarchitecture than the previously proposed approaches without degrading performance. The relative infrequency of store-to-load forwarding, accurate LQ index prediction, and speculative cloaking actually combine to enable FnF to slightly out-perform the competition. Specifically, our simulation results show that our SQless Fire-and-Forget provides a 3.3 % speedup over a processor using a conventional fully-associative SQ. 1.
Motivation for variable length intervals and hierarchical phase behavior
- In IEEE International Symposium on Performance Analysis of Systems and Software
, 2005
"... Most programs are repetitive, where similar behavior can be seen at different execution times. Proposed algorithms automatically group similar portions of a program’s execution into phases, where the intervals in each phase have homogeneous behavior and similar resource requirements. These prior tec ..."
Abstract
-
Cited by 21 (6 self)
- Add to MetaCart
Most programs are repetitive, where similar behavior can be seen at different execution times. Proposed algorithms automatically group similar portions of a program’s execution into phases, where the intervals in each phase have homogeneous behavior and similar resource requirements. These prior techniques focus on fixed length intervals (such as a hundred million instructions) to find phase behavior. Fixed length intervals can make a program’s periodic phase behavior difficult to find, because the fixed interval length can be out of sync with the period of the program’s actual phase behavior. In addition, a fixed interval length can only express one level of phase behavior. In this paper, we graphically show that there exists a hierarchy of phase behavior in programs and motivate the need for variable length intervals. We describe the changes applied to SimPoint to support variable length intervals. We finally conclude by providing an initial study into using variable length intervals to guide SimPoint. 1
The fuzzy correlation between code and performance predictability
- In Proceedings of the 37th International Symposium on Microarchitecture (MICRO
, 2004
"... Recent studies have shown that most SPEC CPU2K benchmarks exhibit strong phase behavior, and the Cycles per Instruction (CPI) performance metric can be accurately predicted based on program’s control-flow behavior, by simply observing the sequencing of the program counters, or extended instruction p ..."
Abstract
-
Cited by 18 (1 self)
- Add to MetaCart
Recent studies have shown that most SPEC CPU2K benchmarks exhibit strong phase behavior, and the Cycles per Instruction (CPI) performance metric can be accurately predicted based on program’s control-flow behavior, by simply observing the sequencing of the program counters, or extended instruction pointers (EIPs). One motivation of this paper is to see if server workloads also exhibit such phase behavior. In particular, can EIPs effectively predict CPI in server workloads? We propose using regression trees to measure the theoretical upper bound on the accuracy of predicting the CPI using EIPs, where accuracy is measure by the explained variance of CPI with EIPs. Our results show that for most server workloads and, surprisingly, even for CPU2K benchmarks, the accuracy of predicting CPI from EIPs varies widely. We classify the benchmarks into four quadrants based on their CPI variance and predictability of CPI using EIPs. Our results indicate that no single sampling technique can be broadly applied to a large class of applications. We propose a new methodology that selects the best-suited sampling technique to accurately capture the program behavior. 1.
Enhancing multiprocessor architecture simulation speed using matched-pair comparison
- In Proceedings of the International Symposium on Performance Analysis of Systems and Software
, 2005
"... While cycle-level, full-system architecture simulation tools are capable of estimating performance at arbitrary accuracy, the time to simulate an entire application is usually prohibitive. Moreover, simulating multi-threaded applications further exacerbates this problem as most simulation tools are ..."
Abstract
-
Cited by 18 (0 self)
- Add to MetaCart
While cycle-level, full-system architecture simulation tools are capable of estimating performance at arbitrary accuracy, the time to simulate an entire application is usually prohibitive. Moreover, simulating multi-threaded applications further exacerbates this problem as most simulation tools are single-threaded. Recently, statistical sampling techniques, such as SMARTS, have managed to bring down the simulation time significantly by making it possible to only simulate about 1 % of the code with sufficient accuracy. However, thousands of simulation points throughout the benchmark must still be simulated. First of all, we propose to use the well-established statistical method matched-pair comparison and motivate why this will bring down the number of simulation points needed to achieve a given accuracy. We apply it to singleprocessor

