Pin: building customized program analysis tools with dynamic instrumentation
- In PLDI ’05: Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation
, 2005
Cited by 416 (20 self)
Robust and powerful software instrumentation tools are essential for program analysis tasks such as profiling, performance evaluation, and bug detection. To meet this need, we have developed a new instrumentation system called Pin. Our goals are to provide easy-to-use, portable, transparent, and efficient instrumentation. Instrumentation tools (called Pintools) are written in C/C++ using Pin’s rich API. Pin follows the model of ATOM, allowing the tool writer to analyze an application at the instruction level without the need for detailed knowledge of the underlying instruction set. The API is designed to be architecture independent whenever possible, making Pintools source compatible across different architectures. However, a Pintool can access architecture-specific details when necessary. Instrumentation with Pin is mostly transparent as the application and Pintool observe the application’s original, uninstrumented behavior. Pin uses dynamic compilation to instrument executables while they are running. For efficiency, Pin uses several techniques, including inlining, register re-allocation, liveness analysis, and instruction scheduling to optimize instrumentation. This fully automated approach delivers significantly better instrumentation performance than similar tools. For example, Pin is 3.3x faster than Valgrind and 2x faster than DynamoRIO for basic-block counting. To illustrate Pin’s versatility, we describe two Pintools in daily use to analyze production software. Pin is publicly available for Linux platforms on four architectures: IA32 (32-bit x86), EM64T (64-bit x86), Itanium®, and ARM. In the ten months since Pin 2 was released in July 2004, there have been over 3000 downloads from its website. Categories and Subject Descriptors D.2.5 [Software Engineering]: Testing and Debugging-code inspections and walk-throughs,
Phase Tracking and Prediction
, 2003
Cited by 157 (19 self)
In a single second a modern processor can execute billions of instructions. Obtaining a bird's eye view of the behavior of a program at these speeds can be a difficult task when all that is available is cycle by cycle examination. In many programs, behavior is anything but steady state, and understanding the patterns of behavior, at run-time, can unlock a multitude of optimization opportunities.
SMARTS: Accelerating Microarchitecture Simulation via Rigorous Statistical Sampling
- in Proceedings of the 30th annual international symposium on Computer architecture
, 2003
Cited by 153 (19 self)
Current software-based microarchitecture simulators are many orders of magnitude slower than the hardware they simulate. Hence, most microarchitecture design studies draw their conclusions from drastically truncated benchmark simulations that are often inaccurate and misleading. This paper presents the Sampling Microarchitecture Simulation (SMARTS) framework as an approach to enable fast and accurate performance measurements of full-length benchmarks. SMARTS accelerates simulation by selectively measuring in detail only an appropriate benchmark subset. SMARTS prescribes a statistically sound procedure for configuring a systematic sampling simulation run to achieve a desired quantifiable confidence in estimates. Analysis of 41 of the 45 possible SPEC2K benchmark/input combinations shows CPI and energy per instruction (EPI) can be estimated to within ±3% with 99.7% confidence by measuring fewer than 50 million instructions per benchmark. In practice, inaccuracy in microarchitectural state initialization introduces an additional uncertainty which we empirically bound to ~2% for the tested benchmarks. Our implementation of SMARTS achieves an actual average error of only 0.64% on CPI and 0.59% on EPI for the tested benchmarks, running with average speedups of 35 and 60 over detailed simulation of 8-way and 16-way out-of-order processors, respectively.
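The sampling idea in this abstract can be illustrated with textbook sampling statistics. The sketch below is not the SMARTS implementation: the function names are hypothetical, and it assumes a normal approximation in which z = 3 corresponds to roughly 99.7% confidence.

```python
import math
import statistics

def systematic_sample(values, interval, offset=0):
    """Take every `interval`-th measurement, starting at index `offset`."""
    return values[offset::interval]

def estimate_mean(sample, z=3.0):
    """Return (mean, relative half-width of the confidence interval).

    Assumes a normal approximation; z=3.0 is roughly 99.7% confidence.
    """
    n = len(sample)
    mean = statistics.fmean(sample)
    cv = statistics.stdev(sample) / mean  # coefficient of variation
    return mean, z * cv / math.sqrt(n)

def required_sample_size(cv, rel_error, z=3.0):
    """Smallest n for which z * cv / sqrt(n) <= rel_error."""
    return math.ceil((z * cv / rel_error) ** 2)
```

Under these assumptions, the number of measurements needed grows with the square of the coefficient of variation and shrinks with the square of the tolerated relative error, which is why a small, well-chosen sample can stand in for a full-length run.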
Reducing power density through activity migration
- In Proceedings of the International Symposium on Low Power Electronics and Design
, 2003
Cited by 99 (1 self)
Power dissipation is unevenly distributed in modern microprocessors leading to localized hot spots with significantly greater die temperature than surrounding cooler regions. Excessive junction temperature reduces reliability and can lead to catastrophic failure. We examine the use of activity migration which reduces peak junction temperature by moving computation between multiple replicated units. Using a thermal model that includes the temperature dependence of leakage power, we show that sustainable power dissipation can be increased by nearly a factor of two for a given junction temperature limit. Alternatively, peak die temperature can be reduced by 12.4 °C at the same clock frequency. The model predicts that migration intervals of around 20–200 µs are required to achieve the maximum sustainable power increase. We evaluate several different forms of replication and migration policy control.
Picking Statistically Valid and Early Simulation Points
, 2003
Cited by 91 (14 self)
Modern architecture research relies heavily on detailed pipeline simulation. Simulating the full execution of an industry standard benchmark can take weeks to months to complete. To address this issue we have recently proposed using Simulation Points (found by only examining basic block execution frequency profiles) to increase the efficiency and accuracy of simulation. Simulation points are a small set of execution samples that when combined represent the complete execution of the program.
Characterizing and Predicting Program Behavior and its Variability
- In International Conference on Parallel Architectures and Compilation Techniques
, 2003
Cited by 83 (3 self)
To reach the next level of performance and energy efficiency, optimizations are increasingly applied in a dynamic and adaptive manner. Current adaptive systems are typically reactive and optimize hardware or software in response to detecting a shift in program behavior. We argue that program behavior variability requires adaptive systems to be predictive rather than reactive. In order to be effective, systems need to adapt according to future rather than most recent past behavior. In this paper we explore the potential of incorporating prediction into adaptive systems. We study the time-varying behavior of programs using metrics derived from hardware counters on two different micro-architectures. Our evaluation shows that programs do indeed exhibit significant behavior variation even at a granularity of millions of instructions. In addition, while the actual behavior across metrics may be different, periodicity in the behavior is shared across metrics. We exploit these characteristics in the design of on-line statistical and table-based predictors. We introduce a new class of predictors, cross-metric predictors, that use one metric to predict another, thus making possible an efficient coupling of multiple predictors. We evaluate these predictors on the SPECcpu2000 benchmark suite and show that table-based predictors outperform statistical predictors by as much as 69% on benchmarks with high variability.
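The contrast between a simple statistical predictor and a table-based one can be sketched as follows. This is an illustration only, not the predictor designs from the paper: the function names are hypothetical, and the metric is assumed to be already quantized into discrete levels so that values can serve as table keys.

```python
def last_value_predictions(series):
    """Predict each value as the one before it (a simple baseline)."""
    return [series[i - 1] for i in range(1, len(series))]

def table_predictions(series, history_len=2):
    """Predict each value from a table keyed by the preceding history.

    Falls back to the last value when the history has not been seen yet.
    """
    table = {}
    preds = []
    for i in range(1, len(series)):
        key = tuple(series[max(0, i - history_len):i])
        preds.append(table.get(key, series[i - 1]))
        table[key] = series[i]  # remember what followed this history
    return preds

def mismatches(series, preds):
    """Count predictions that differ from the actual next value."""
    return sum(p != a for p, a in zip(preds, series[1:]))
```

On a periodic series such as `[1, 2, 3, 1, 2, 3, 1, 2, 3]`, the last-value baseline mispredicts every step, while the table stops mispredicting once each history pattern has been seen, which is the kind of periodicity the abstract says table-based predictors exploit.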
Positional Adaptation of Processors: Application to Energy Reduction
- In International Symposium on Computer Architecture
, 2003
Cited by 69 (2 self)
Although adaptive processors can exploit application variability to improve performance or save energy, effectively managing their adaptivity is challenging. To address this problem, we introduce a new approach to adaptivity: the Positional approach. In this approach, both the testing of configurations and the application of the chosen configurations are associated with particular code sections. This is in contrast to the currently-used Temporal approach to adaptation, where both the testing and application of configurations are tied to successive intervals in time.
Pinpointing representative portions of large Intel Itanium programs with dynamic instrumentation
- In International Symposium on Microarchitecture
, 2004
Cited by 52 (4 self)
Detailed modeling of the performance of commercial applications is difficult. The applications can take a very long time to run on real hardware and it is impractical to simulate them to completion on performance models. Furthermore, these applications have complex execution environments that cannot easily be reproduced on a simulator, making porting the applications to simulators difficult. We attack these problems using the well-known SimPoint methodology to find representative portions of an application to simulate, and a dynamic instrumentation framework called Pin to avoid porting altogether. Our system uses dynamic instrumentation instead of simulation to find representative portions, called PinPoints, for simulation. We have developed a toolkit that automatically detects PinPoints, validates whether they are representative using hardware performance counters, and generates traces for large Itanium® programs. We compared SimPoint-based selection to random selection of simulation points. We found for 95% of the SPEC2000 programs we tested, the PinPoints prediction was within 8% of the actual whole-program CPI, as opposed to 18% for random selection. We measure the end-to-end error, comparing real hardware to a performance model, and have a simple and efficient methodology to determine the step that introduced the error. Finally, we evaluate the system in the context of multiple configurations of real hardware, commercial applications, and industrial-strength performance models to understand the behavior of a complete and practical workload collection system. We have successfully used our system with many commercial Itanium® programs, some running for trillions of instructions, and have used the resulting traces for predicting performance of those applications on future Itanium processors.
Measuring program similarity: Experiments with SPEC CPU benchmark suites
- In Proceedings of the 2005 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS’05)
, 2005
Cited by 50 (8 self)
It is essential that a subset of benchmark programs used to evaluate an architectural enhancement is well distributed within the target workload space rather than clustered in specific areas. Past efforts for identifying subsets have primarily relied on using microarchitecture-dependent metrics of program performance, such as cycles per instruction and cache miss-rate. The shortcoming of this technique is that the results could be biased by the idiosyncrasies of the chosen configurations. The objective of this paper is to present a methodology to measure similarity of programs based on their inherent microarchitecture-independent characteristics which will make the results applicable to any microarchitecture. We apply our methodology to the SPEC CPU2000 benchmark suite and demonstrate that a subset of 8 programs can be used to effectively represent the entire suite. We validate the usefulness of this subset by using it to estimate the average IPC and L1 data cache miss-rate of the entire suite. The average IPC of 8-way and 16-way issue superscalar processor configurations could be estimated with 3.9% and 4.4% error respectively. This methodology is applicable not only to find subsets from a benchmark suite, but also to identify programs for a benchmark suite from a list of potential candidates. Studying the four generations of SPEC CPU benchmark suites, we find that other than a dramatic increase in the dynamic instruction count and increasingly poor temporal data locality, the inherent program characteristics have more or less remained the same.
Structures for Phase Classification
, 2004
Cited by 49 (11 self)
Most programs are repetitive, where similar behavior can be seen at different execution times. Proposed algorithms automatically group these similar intervals of execution into phases, where all the intervals in a phase have homogeneous behavior and similar resource requirements. In this paper we examine different program structures for capturing phase behavior. The goal is to compare the size and accuracy of these structures for performing phase classification. We focus on profiling the frequency of program level structures that are independent from underlying architecture performance metrics. This allows the phase classification to be used across different hardware designs that support the same instruction set (ISA). We compare using basic blocks, loop branches, procedures, opcodes, register usage, and memory address information for guiding phase classification. We compare these different structures in terms of their ability to create homogeneous phases, and evaluate the accuracy of using these structures to pick simulation points for SimPoint.
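Grouping intervals into phases by the frequency of a program structure can be sketched as below. This is a minimal illustration, not the classification algorithm evaluated in the paper: it assumes each interval is summarized as a non-empty list of execution counts (for example, one count per basic block), and the threshold-based grouping and all names are hypothetical.

```python
def normalize(counts):
    """Scale raw execution counts so they sum to 1 (assumes a nonzero total)."""
    total = sum(counts)
    return [c / total for c in counts]

def manhattan(a, b):
    """Manhattan distance between two equal-length frequency vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def classify(intervals, threshold=0.5):
    """Give each interval the id of the first phase it is close to.

    An interval that is farther than `threshold` from every existing
    phase representative starts a new phase.
    """
    representatives = []
    phase_ids = []
    for counts in intervals:
        vec = normalize(counts)
        for pid, rep in enumerate(representatives):
            if manhattan(vec, rep) <= threshold:
                phase_ids.append(pid)
                break
        else:
            representatives.append(vec)
            phase_ids.append(len(representatives) - 1)
    return phase_ids
```

Because the vectors hold only counts of program-level structures, the same grouping could in principle be reused across hardware designs that share an instruction set, which matches the architecture-independence goal stated in the abstract.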

