The strong correlation between code signatures and performance
- In IEEE International Symposium on Performance Analysis of Systems and Software
, 2005
Cited by 38 (10 self)
A recent study [1] examined the use of sampled hardware counters to create sampled code signatures. This approach is attractive because sampled code signatures can be quickly gathered for any application. The conclusion of their study was that there exists a fuzzy correlation between sampled code signatures and performance predictability. The paper raises the question of how much information is lost in the sampling process, and our paper focuses on examining this issue. We first focus on showing that there exists a strong correlation between code signatures and performance. We then examine the relationship between sampled and full code signatures, and how these affect performance predictability. Our results confirm that there is a fuzzy correlation found in recent work for the SPEC programs with sampled code signatures, but that a strong correlation exists with full code signatures. In addition, we propose converting the sampled instruction counts, used in the prior work, into sampled code signatures representing loop and procedure execution frequencies. These sampled loop and procedure code signatures allow phase analysis to more accurately and easily find patterns, and they correlate better with performance.
TurboSMARTS: Accurate microarchitecture simulation sampling in minutes
- SIGMETRICS Performance Evaluation Review
, 2005
Cited by 21 (0 self)
Recent research proposes accelerating processor microarchitecture simulation through statistical sampling. These proposals advocate detailed microarchitecture simulation of a large number (e.g., 10,000) of brief (e.g., 1000-instruction) execution windows to minimize instructions simulated and achieve high confidence in performance estimates. Unfortunately, correct measurement of such short execution windows requires highly accurate model state before each measurement. Prior techniques construct this state by continuously warming large microarchitectural structures (e.g., caches and the branch predictor) while emulating billions of instructions between measurements in an approach called functional warming. Although current sampling proposals require only minutes of detailed simulation, functional warming increases total turnaround time to hours.
Motivation for variable length intervals and hierarchical phase behavior
- In IEEE International Symposium on Performance Analysis of Systems and Software
, 2005
Cited by 21 (6 self)
Most programs are repetitive, where similar behavior can be seen at different execution times. Proposed algorithms automatically group similar portions of a program’s execution into phases, where the intervals in each phase have homogeneous behavior and similar resource requirements. These prior techniques focus on fixed length intervals (such as a hundred million instructions) to find phase behavior. Fixed length intervals can make a program’s periodic phase behavior difficult to find, because the fixed interval length can be out of sync with the period of the program’s actual phase behavior. In addition, a fixed interval length can only express one level of phase behavior. In this paper, we graphically show that there exists a hierarchy of phase behavior in programs and motivate the need for variable length intervals. We describe the changes applied to SimPoint to support variable length intervals. We finally conclude by providing an initial study into using variable length intervals to guide SimPoint.
Simulation sampling with live-points
- In ISPASS ’06: Proceedings of the 2006 International Symposium on Performance Analysis of Systems and Software
, 2006
Cited by 12 (3 self)
Current simulation-sampling techniques construct accurate model state for each measurement by continuously warming large microarchitectural structures (e.g., caches and the branch predictor) while functionally simulating the billions of instructions between measurements. This approach, called functional warming, is the main performance bottleneck of simulation sampling and requires hours of runtime while the detailed simulation of the sample requires only minutes. Existing simulators can avoid functional simulation by jumping directly to particular instruction stream locations with architectural state checkpoints. To replace functional warming, these checkpoints must additionally provide microarchitectural model state that is accurate and reusable across experiments while meeting tight storage constraints. In this paper, we present a simulation-sampling framework that replaces functional warming with live-points without sacrificing accuracy. A live-point stores the bare minimum of functionally-warmed state for accurate simulation of a limited execution window while placing minimal restrictions on microarchitectural configuration. Live-points can be processed in random rather than program order, allowing simulation results and their statistical confidence to be reported while simulations are in progress. Our framework matches the accuracy of prior simulation-sampling techniques (i.e., ±3% error with 99.7% confidence), while estimating the performance of an 8-way out-of-order superscalar processor running SPEC CPU2000 in 91 seconds per benchmark, on average, using a 12 GB live-point library.
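The idea of reporting results and their statistical confidence while samples are still being processed in random order can be illustrated with a running mean and confidence interval. This is a minimal sketch and not the paper's framework: the function names, the `min_n` guard, and the use of z = 3.0 (roughly 99.7% confidence under a normal approximation) are assumptions made for illustration, and the per-sample values are assumed to be positive CPI measurements.

```python
import math


def running_estimate(samples, z=3.0):
    """Yield (n, mean, relative half-width) after each sample from the second on.

    Uses Welford's online update, so samples can arrive in any order and
    none of them has to be stored. z=3.0 corresponds to roughly 99.7%
    confidence under a normal approximation (an assumption of this sketch).
    """
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in samples:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n >= 2:
            std_err = math.sqrt(m2 / (n - 1) / n)
            yield n, mean, z * std_err / mean


def samples_needed(samples, target=0.03, z=3.0, min_n=30):
    """Return (n, mean, rel_half_width) at the first point the target is met, else None."""
    for n, mean, rel in running_estimate(samples, z):
        if n >= min_n and rel <= target:
            return n, mean, rel
    return None
```

Calling `samples_needed(cpi_values)` on a shuffled list of per-window CPI values stops as soon as the interval half-width falls to 3% of the mean, which is the sense in which a result can be reported before every sample has been simulated.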
Characterizing Microarchitecture Soft Error Vulnerability Phase Behavior
, 2006
Cited by 12 (4 self)
Computer systems increasingly depend on exploiting program dynamic behavior to optimize performance, power and reliability. Prior studies have shown that program execution exhibits phase behavior in both performance and power domains. Reliability-oriented program phase behavior, however, remains largely unexplored. As semiconductor transient faults (soft errors) emerge as a critical challenge to reliable system design, characterizing program phase behavior from a reliability perspective is crucial in order to apply dynamic fault-tolerant mechanisms and to optimize performance/reliability trade-offs. In this paper, we compute run-time program vulnerability to soft errors on four microarchitecture structures (i.e., instruction window, reorder buffer, function units and wakeup table) in a high-performance out-of-order execution superscalar processor. Experimental results on the SPEC2000 benchmarks show a considerable amount of time varying behavior in reliability measurements. Our study shows that a single performance metric, such as IPC, cache miss or branch misprediction, is not a good indicator for program vulnerability. The vulnerabilities of the studied microarchitecture structures are then correlated with program code-structure and run-time events to identify vulnerability phase behavior. We observed that both program code-structure and run-time events appear promising in classifying program reliability phase behavior. Overall, performance counter based schemes achieved an average Coefficient of Variation (COV) of 3.5%, 4.5%, 4.3% and 5.7% on the instruction queue, reorder buffer, function units and the wakeup table, while basic block vectors offer COVs of 4.9%, 5.8%, 5.4% and 6% on the four studied microarchitecture structures respectively. We found that in general, tracking performance metrics performs better than tracking control flow in identifying reliability phase behavior of applications. To our knowledge, this paper is the first to characterize program reliability phase behavior at the microarchitecture level.
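A Coefficient of Variation figure like those quoted above measures how homogeneous a metric is inside each phase: the lower the value, the better the phase classification groups intervals with similar vulnerability. The following is a minimal sketch; the abstract does not say how the per-phase values are averaged, so weighting each phase by its share of intervals is an assumption, as are the function and parameter names.

```python
from collections import defaultdict
from statistics import mean, pstdev


def phase_cov(phase_ids, values):
    """Average within-phase coefficient of variation (std / mean).

    phase_ids[i] is the phase assigned to interval i and values[i] is the
    metric measured in that interval. Each phase is weighted by its share
    of intervals, which is an assumption of this sketch.
    """
    groups = defaultdict(list)
    for pid, v in zip(phase_ids, values):
        groups[pid].append(v)

    total = len(values)
    weighted = 0.0
    for vals in groups.values():
        mu = mean(vals)
        cov = pstdev(vals) / mu if mu else 0.0
        weighted += cov * len(vals) / total
    return weighted
```

A classification that puts identical intervals together yields 0.0, while one that mixes dissimilar intervals in the same phase yields a larger value.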
Characterizing phases in service-oriented applications
, 2004
Cited by 8 (6 self)
The behavior of service-oriented programs depends strongly on the input. A compiler, for example, behaves differently when compiling different functions. Similar input dependences can be seen in interpreters, compression and encoding utilities, databases, and dynamic content servers. Because their behavior is hard to predict, these programs pose a special challenge for dynamic adaptation mechanisms, which attempt to enhance performance by modifying hardware or software to fit application needs. We present a new technique to detect phases (periods of distinctive behavior) in service-oriented programs. We begin by using special inputs to induce a repeating pattern of behavior. We then employ frequency-based filtering on basic block traces to detect both top-level and second-level repetitions, which we mark via binary rewriting. When the instrumented program runs, on arbitrary input, the inserted markers divide execution into phases of varying length. Experiments with service-oriented programs from the Spec95 and Spec2K benchmark suites indicate that program behavior within phases is surprisingly predictable in many (though not all) cases. This in turn suggests that dynamic adaptation, either in hardware or in software, may be applicable to a wider class of programs than previously believed.
Efficiently evaluating speedup using sampled processor simulation
- Computer Architecture Letters
, 2004
Cited by 7 (1 self)
Cycle accurate simulation of processors is extremely time consuming. Sampling can greatly reduce simulation time while retaining good accuracy. Previous research on sampled simulation has been focusing on the accuracy of CPI. However, most simulations are used to evaluate the benefit of some microarchitectural enhancement, in which the speedup is a more important metric than CPI. We employ the ratio estimator from statistical sampling theory to design efficient sampling to measure speedup and to quantify its error. We show that to achieve a given relative error limit for speedup, it is not necessary to estimate CPI to the same accuracy. In our experiment, estimating speedup requires about 9X fewer instructions to be simulated in detail in comparison to estimating CPI for the same relative error limit. Therefore using the ratio estimator to evaluate speedup is much more cost-effective and offers great potential for reducing simulation time. We also discuss the reason for this interesting and important result.
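The reason a ratio can be estimated more cheaply than either CPI can be seen from the textbook ratio estimator: when the baseline and enhanced CPI of the same execution windows move together, their sample-to-sample variation largely cancels in the ratio. The sketch below is an illustration under stated assumptions and not the paper's procedure: speedup is taken as baseline CPI divided by enhanced CPI over matched windows, the variance uses the standard linearized approximation without a finite-population correction, and z = 1.96 (about 95% confidence under a normal approximation) and all names are hypothetical.

```python
import math


def ratio_estimate(base_cpi, enhanced_cpi, z=1.96):
    """Return (speedup, relative error bound) from matched per-window CPI samples."""
    n = len(base_cpi)
    if n != len(enhanced_cpi) or n < 2:
        raise ValueError("need at least two matched samples")
    x_bar = sum(enhanced_cpi) / n
    y_bar = sum(base_cpi) / n
    r = y_bar / x_bar  # ratio estimator of the speedup
    # Linearized variance: only the residuals y - r*x contribute, so
    # variation shared by both configurations cancels out.
    resid_var = sum((y - r * x) ** 2 for x, y in zip(enhanced_cpi, base_cpi)) / (n - 1)
    std_err = math.sqrt(resid_var / n) / x_bar
    return r, z * std_err / r


def mean_estimate(values, z=1.96):
    """Return (mean, relative error bound) for a plain sample mean, e.g. CPI."""
    n = len(values)
    m = sum(values) / n
    var = sum((v - m) ** 2 for v in values) / (n - 1)
    return m, z * math.sqrt(var / n) / m
```

With windows whose enhanced CPI is always half the baseline CPI, `ratio_estimate` reports a speedup of 2.0 with zero estimated error, while `mean_estimate` on the same baseline samples still shows a wide interval because CPI itself varies strongly from window to window.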
IDDCA: A new clustering approach for sampling
- In MoBS: Workshop on Modeling, Benchmarking, and Simulation
, 2005
Cited by 5 (1 self)
Clustering methods are machine-learning algorithms that can be used to easily select the most representative samples within a huge program trace. k-means is a popular clustering method for sampling. While k-means performs well, it has several shortcomings: (1) it depends on a random initialization, so that clustering results may vary across runs; (2) the maximal number of clusters is a user-selected parameter, but its optimal value can be benchmark/trace-dependent; (3) k-means is a multi-pass algorithm which may be less practical for a large number of intervals. To solve these issues, we adapted an alternative clustering method, called DCA, to the issue of sampling. Unlike k-means, DCA and its sampling-specific adaptation ID-DCA do not require the user to be exposed to internal clustering parameters: it dynamically defines the number of clusters for each target program and the method parameters dynamically adapt to the target program. For an ordered input (e.g., a trace of intervals), the method is deterministic. Finally, it is an online and thus single-pass algorithm, resulting in a significant execution time gain over an existing and popular k-means implementation. Within the context of a variable-size sampling approach, we show that IDDCA can achieve an average CPI error of 1.62% over the 26 SPEC benchmarks, with a maximum error of 5.72% and an average of 403 million instructions.
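The contrast with k-means (no random initialization, no preset cluster count, one pass over the trace) can be illustrated with a simple threshold-based online clustering. This is explicitly not DCA or IDDCA, whose details are not given in the abstract: it is a generic "leader"-style stand-in, and unlike the method described above it exposes a fixed `threshold` parameter. The function name, the Manhattan distance, and the running-centroid update are all assumptions of this sketch.

```python
def online_cluster(vectors, threshold):
    """Single-pass clustering of interval vectors in trace order.

    Each vector joins the nearest existing cluster if its Manhattan
    distance to that cluster's centroid is below `threshold`; otherwise
    it starts a new cluster. No random initialization is involved, so an
    ordered input always gives the same result.
    """
    centroids = []  # running centroid of each cluster
    counts = []     # number of vectors absorbed by each cluster
    labels = []     # cluster index assigned to each input vector
    for v in vectors:
        best, best_d = None, threshold
        for i, c in enumerate(centroids):
            d = sum(abs(a - b) for a, b in zip(v, c))
            if d < best_d:
                best, best_d = i, d
        if best is None:
            centroids.append(list(v))
            counts.append(1)
            labels.append(len(centroids) - 1)
        else:
            counts[best] += 1
            k = counts[best]
            # Incremental mean update keeps the pass single and online.
            centroids[best] = [c + (a - c) / k for a, c in zip(v, centroids[best])]
            labels.append(best)
    return labels, centroids
```

The number of clusters falls out of the data and the threshold rather than being chosen up front, and each interval is examined exactly once.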
Budgeted Region Sampling (BeeRS): Do Not Separate Sampling From Warm-Up, And Then Spend Wisely Your Simulation Budget
- In 5th IEEE International Symposium on Signal Processing and Information Technology
, 2006
Cited by 4 (1 self)
While the recent surge of research articles on sampling started with rather large sample sizes, it has later shifted to very small intervals, and it is now converging to intermediate sizes, and even to varying sizes. With 100M samples, warm-up is not an issue, at least with current cache sizes. However, with significantly smaller samples, warm-up becomes critical, especially when the sampling target accuracy is of the order of a few percent. Yet in most sampling research, warm-up has largely been treated as a separate issue. In this article, we advocate an integrated approach to (simulator-based) warm-up and sampling. Instead of separating warm-up and sampling, we take exactly the opposite approach, provide a common instruction budget for warm-up and sampling, and we attempt to spend it as wisely as possible on either one.
Cross binary simulation points
- In International Symposium on Performance Analysis of Systems and Software (ISPASS)
, 2007
Cited by 2 (2 self)
Architectures are usually compared by running the same workload on each architecture and comparing performance. When a single compiled binary of a program is executed on many different architectures, techniques like SimPoint can be used to find a small set of samples that represent the majority of the program’s execution. Architectures can be compared by simulating their behavior on the code samples selected by SimPoint, to quickly determine which architecture has the best performance. Architectural design space exploration becomes more difficult when different binaries must be used for the same program. These cases arise when evaluating architectures that include ISA extensions, and when evaluating compiler optimizations. This problem domain is the focus of our paper. When multiple binaries are used to evaluate a program, one approach is to create a separate set of simulation points for each binary. This approach works reasonably well for many applications, but breaks down when the simulation points chosen for the different binaries emphasize different parts of the program’s execution. This problem can be avoided if simulation points are selected consistently across the different binaries, to ensure that the same parts of program execution are represented in all binaries. In this paper we present an approach that finds a single set of simulation points to be used across all binaries for a single program. This allows for simulation of the same parts of program execution despite changes in the binary due to ISA changes or compiler optimizations.

