Measuring program similarity: Experiments with SPEC CPU benchmark suites
- In Proceedings of the 2005 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS’05)
, 2005
Cited by 50 (8 self)
It is essential that a subset of benchmark programs used to evaluate an architectural enhancement is well distributed within the target workload space rather than clustered in specific areas. Past efforts for identifying subsets have primarily relied on using microarchitecture-dependent metrics of program performance, such as cycles per instruction and cache miss-rate. The shortcoming of this technique is that the results could be biased by the idiosyncrasies of the chosen configurations. The objective of this paper is to present a methodology to measure similarity of programs based on their inherent microarchitecture-independent characteristics, which will make the results applicable to any microarchitecture. We apply our methodology to the SPEC CPU2000 benchmark suite and demonstrate that a subset of 8 programs can be used to effectively represent the entire suite. We validate the usefulness of this subset by using it to estimate the average IPC and L1 data cache miss-rate of the entire suite. The average IPC of 8-way and 16-way issue superscalar processor configurations could be estimated with 3.9% and 4.4% error respectively. This methodology is applicable not only to find subsets from a benchmark suite, but also to identify programs for a benchmark suite from a list of potential candidates. Studying the four generations of SPEC CPU benchmark suites, we find that other than a dramatic increase in the dynamic instruction count and increasingly poor temporal data locality, the inherent program characteristics have more or less remained the same.
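The subsetting idea in this abstract can be illustrated with a minimal sketch. It assumes each program is described by a vector of microarchitecture-independent characteristics, picks representatives by greedy farthest-point selection (a stand-in chosen for brevity, not the selection procedure used in the paper), and estimates a suite-wide average by weighting each representative by the number of programs closest to it. All function names and data layouts here are hypothetical.

```python
import math

def zscore_columns(rows):
    """Normalize each characteristic (column) to zero mean and unit variance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # A constant column has zero deviation; fall back to 1.0 to avoid dividing by zero.
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

def select_subset(features, k):
    """Pick k representative programs (k <= number of programs).

    features maps program name -> list of characteristics (hypothetical layout).
    Uses greedy farthest-point selection, an assumed stand-in for clustering.
    Returns (representatives, assignment of every program to a representative).
    """
    names = sorted(features)
    points = dict(zip(names, zscore_columns([features[n] for n in names])))
    reps = [names[0]]
    while len(reps) < k:
        # Add the program that is farthest from its nearest representative.
        reps.append(max(
            names,
            key=lambda n: min(math.dist(points[n], points[r]) for r in reps),
        ))
    assignment = {
        n: min(reps, key=lambda r: math.dist(points[n], points[r]))
        for n in names
    }
    return reps, assignment

def estimate_suite_mean(metric, reps, assignment):
    """Estimate the suite average of a metric (e.g. IPC) from representatives only,
    weighting each representative by the number of programs assigned to it."""
    weights = {r: 0 for r in reps}
    for rep in assignment.values():
        weights[rep] += 1
    return sum(metric[r] * w for r, w in weights.items()) / len(assignment)
```

Only the representatives need to be simulated; the estimate is then compared against the full-suite average to judge how much accuracy the subset gives up.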
Measuring Benchmark Similarity Using Inherent Program Characteristics
- Laboratory of Computer Architecture
, 2006
Cited by 13 (2 self)
This paper proposes a methodology for measuring the similarity between programs based on their inherent microarchitecture-independent characteristics, and demonstrates two applications for it: 1) finding a representative subset of programs from benchmark suites and 2) studying the evolution of four generations of SPEC CPU benchmark suites. Using the proposed methodology, we find a representative subset of programs from three popular benchmark suites: SPEC CPU2000, MediaBench, and MiBench. We show that this subset of representative programs can be effectively used to estimate the average benchmark suite IPC, L1 data cache miss-rates, and speedup on 11 machines with different ISAs and microarchitectures, which enables one to save simulation time with little loss in accuracy. From our study of the similarity between the four generations of SPEC CPU benchmark suites, we find that, other than a dramatic increase in the dynamic instruction count and increasingly poor temporal data locality, the inherent program characteristics have more or less remained unchanged. Index Terms: Measurement techniques, modeling techniques, performance of systems, performance attributes.
Improved automatic testcase synthesis for performance model validation
- International Conference on Supercomputing
, 2005
Cited by 8 (3 self)
The latest generation of high-performance IBM PowerPC microprocessors, the POWER5 chip, poses challenges to performance modeling, model simulation, and performance model validation efforts. To achieve accurate performance projections, the performance models for this high-performance processor must be written at a very detailed level, which reduces the efficiency of the model simulation. The detail also exacerbates the problem of performance model validation, which seeks to execute codes and compare results between performance models and hardware or functional models built from hardware descriptions of the machine. The current state of the art is to use simple hand-coded bandwidth and latency testcases, but these are not comprehensive for processors as complex as the POWER5 chip. Applications and benchmark suites such as SPEC CPU are difficult to set up or take too long to execute on functional models or even on detailed performance models. We present an automatic testcase synthesis methodology to address these concerns. By basing testcase synthesis on the workload characteristics of an application, source code is created that largely represents the performance of the application, but which executes in a fraction of the runtime. We synthesize representative PowerPC versions of the SPEC2000, STREAM, TPC-C and Java benchmarks, compile and execute them, and obtain an average IPC within 2.4% of the average IPC of the original benchmarks and with many similar average workload characteristics. The synthetic testcases often execute two orders of magnitude faster than the original applications, typically in less than 300K instructions, making performance model validation for today’s complex processors feasible.
Performance Cloning: A Technique for Disseminating Proprietary Applications as Benchmarks
- Proceedings of the IEEE International Symposium on Workload Characterization, 2006
Cited by 5 (5 self)
Many embedded real world applications are intellectual property, and vendors hesitate to share these proprietary applications with computer architects and designers. This poses a serious problem for embedded microprocessor designers – how do they customize the design of their microprocessor to provide optimal performance for a class of target customer applications? In this paper, we explore a technique that can automatically extract key performance attributes of a real world application and clone them into a synthetic benchmark. The advantage of the synthetic benchmark clone is that it hides functional meaning of the code but exhibits similar performance characteristics as the target application. Unlike previously proposed workload synthesis techniques, we only model microarchitecture-independent performance attributes into the synthetic clone. By using a set of embedded benchmarks from the MediaBench and MiBench suites, we demonstrate that the performance and power consumption of the synthetic clone correlates well with that of the original application across a wide range of microarchitecture configurations.
Four generations of SPEC CPU benchmarks: what has changed and what has not
, 2004
Cited by 3 (0 self)
The Standard Performance Evaluation Corporation (SPEC) CPU benchmark suite, which was first released in 1989 as a collection of 10 computation-intensive benchmark programs (average size of 2.5 billion dynamic instructions per program), is now in its fourth generation and has grown to 26 programs (average size of 230 billion dynamic instructions per program). In order to keep pace with the architectural enhancements, technological advancements, software improvements, and emerging workloads, new programs were added, programs susceptible to compiler attacks were retired, program run times were increased, and memory activity of programs was increased in every generation of the benchmark suite. The objective of this paper is to understand how the inherent characteristics of SPEC benchmark programs have evolved over the last 1.5 decades – which aspects have changed and which have not. We measured and analyzed a collection of microarchitecture-independent metrics related to the instruction mix, data locality, branch predictability, and parallelism to understand the changes in generic workload characteristics with the evolution of benchmark suites. Surprisingly, we find that other than a dramatic increase in the dynamic instruction count and increasingly poor temporal data locality, the inherent program characteristics have pretty much remained unchanged. We also observe that the SPEC CPU2000 benchmark suite is more diverse than its ancestors, but still has over 50% redundancy in programs. Based on our key findings from this study, we (i) make recommendations to SPEC that will be useful in selecting programs for future benchmark suites, (ii) speculate about the trend of future SPEC CPU benchmark workloads, and (iii) provide a scientific methodology for selecting representative workloads should the cost of simulating the entire benchmark be prohibitively high.
Measuring program similarity
- Proceedings of the Intl Symposium on Performance of Systems and Software
, 2005
Cited by 2 (1 self)
Performance evaluation using only a subset of programs from a benchmark suite is commonplace in computer architecture research. This is especially true during early design space exploration when a variety of enhancements need to be evaluated to reach a good microprocessor architecture in a limited amount of time. When such a subset of benchmark programs is used for performance evaluation of architectural enhancements, it is essential that the subset is well distributed within the target workload space rather than clustered in specific areas. Past efforts for identifying subsets have primarily relied on using microarchitecture-dependent metrics of program performance, such as cycles per instruction and cache miss-rate. The shortcoming of this technique is that
Chameleon: A Framework for Observing, Understanding, and Imitating the Memory Behavior of Applications
Cited by 2 (1 self)
In this work, we present an integrated solution to three classic problems in the field of performance analysis: memory modeling, synthetic address trace generation, and the creation of synthetic benchmark proxies for applications. First, we describe an intuitive characterization of memory access locality that can accurately predict an application’s hit rates on arbitrary cache configurations, even when block sizes and cache depths change. We then describe the implementation of a memory tracer that can extract this characterization from applications and a software tool that can generate synthetic address traces to match. Lastly, we describe Chameleon, a fully tunable synthetic benchmark whose memory behavior can be dictated by the traces described above. We show that applications and their Chameleon counterparts display highly similar memory behavior as measured by simulated and observed cache hit rates.
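As a generic illustration of how a locality characterization can predict hit rates for caches of different depths, the sketch below builds an LRU reuse-distance (stack-distance) histogram from an address trace. This is a well-known textbook technique used here as an assumption; it is not the Chameleon characterization itself, and unlike that work it fixes the block size when the histogram is built. The function names are hypothetical.

```python
from collections import Counter

def reuse_distance_histogram(trace, block_size=64):
    """Count LRU stack distances for a sequence of byte addresses.

    A distance d means d distinct blocks were touched since the previous
    access to the same block. The key None counts first-time (cold) accesses.
    """
    stack = []        # LRU stack of block numbers, most recent at the end
    hist = Counter()
    for addr in trace:
        block = addr // block_size
        if block in stack:
            pos = stack.index(block)
            hist[len(stack) - 1 - pos] += 1
            stack.pop(pos)
        else:
            hist[None] += 1
        stack.append(block)
    return hist

def lru_hit_rate(hist, cache_blocks):
    """Hit rate of a fully associative LRU cache that holds cache_blocks blocks.

    An access hits exactly when its reuse distance is smaller than the capacity.
    """
    total = sum(hist.values())
    hits = sum(n for d, n in hist.items() if d is not None and d < cache_blocks)
    return hits / total if total else 0.0
```

One histogram answers the hit-rate question for every cache depth at that block size, which is why this kind of signature is considered machine-independent. The list-based stack is quadratic in the worst case and is only suitable for short traces.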
Distilling the Essence of Proprietary Workloads into Miniature Benchmarks
Cited by 2 (1 self)
Benchmarks set standards for innovation in computer architecture research and industry product development. Consequently, it is of paramount importance that the workload used in computer architecture research and development is representative of real-world applications. However, composing such representative workloads poses practical challenges to application analysis teams and benchmark developers – (1) real-world workloads are intellectual property and vendors hesitate to share these proprietary applications; and (2) porting and reducing these applications to benchmarks that can be simulated in a tractable amount of time is a non-trivial task. In this paper we address this problem by proposing a technique that automatically distills key inherent performance attributes of a proprietary workload and captures them into a miniature synthetic benchmark clone. The advantage of the benchmark clone is that it hides the functional meaning of the code but exhibits similar performance characteristics as the target application. Moreover, the dynamic instruction count of the synthetic benchmark clone is substantially shorter than the proprietary application, greatly reducing overall simulation time – for SPEC CPU, the simulation time reduction is over five orders of magnitude compared to entire benchmark execution. By using a set of benchmarks representative of general-purpose, scientific, and embedded applications, we demonstrate that the power and performance characteristics of the synthetic benchmark clone correlate well with
Accurate memory signatures and synthetic address traces for HPC applications
- In ICS ’08: Proceedings of the 22nd Annual International Conference on Supercomputing
, 2008
Cited by 1 (0 self)
Though the performance of many scientific codes is dominated by memory behavior, our ability to describe, capture, compare, and recreate that behavior is quite limited. This inability underlies much of the complexity in the field of performance analysis: it is fundamentally difficult to relate benchmarks and applications or use realistic workloads to guide system design and procurement. An observable, reproducible, and machine-independent memory characterization is needed. The Chameleon framework is a software suite that includes tools to capture a concise memory signature from any application and produce synthetic memory address traces that mimic that signature. By simultaneously modeling spatial and temporal locality, Chameleon produces uniquely accurate, general-purpose synthetic traces. We demonstrate that the cache hit rates generated by each synthetic trace are nearly identical to those of the application it targets on dozens of memory hierarchies representing many of today’s commercial offerings. We apply the framework to high-performance computing (HPC) by leveraging sampling techniques to capture the memory signatures of full-scale, parallel applications with only a 5x slowdown. The overall result is therefore a concise, observable, and machine-independent representation of the memory requirements of full-scale applications that can be tractably captured and accurately mimicked. Categories and Subject Descriptors D.2.8 [Software Engineering]: Metrics—complexity measures, performance measures
unknown title
a very fast rate and new features are being added to the current ones every day. A computer architect is always looking for quicker ways to evaluate performance of the design on these applications in a relatively shorter time. Simulation is a very popular tool used in the early design phase of a microprocessor. But applications run for a long time and microprocessors are getting complex, which leads to very high simulation times. With a reasonably quick workload characterization we should be able to predict performance of the new application on the system(s). This paper proposes two simple techniques to predict performance based on the similarity of the new application with already characterized benchmarks whose performance numbers are available. Each of these techniques is then used to show how speedup and cache miss-rates can be predicted for a given program.
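The abstract does not name its two techniques, so the following is only one plausible sketch of similarity-based prediction: an inverse-distance-weighted average over the k most similar characterized benchmarks. The function name, the data layout, and the choice of Euclidean distance are all assumptions made for illustration.

```python
import math

def predict_by_similarity(new_features, known, k=3):
    """Predict a metric (e.g. speedup or miss rate) for a new application.

    known maps benchmark name -> (feature_vector, measured_value); this
    layout is hypothetical. The prediction is the inverse-distance-weighted
    average of the k benchmarks nearest to new_features.
    """
    nearest = sorted(
        ((math.dist(new_features, feats), value) for feats, value in known.values()),
        key=lambda pair: pair[0],
    )[:k]
    # An exact match in the characteristic space: reuse its measurement directly.
    for dist, value in nearest:
        if dist == 0.0:
            return value
    total_weight = sum(1.0 / dist for dist, _ in nearest)
    return sum(value / dist for dist, value in nearest) / total_weight
```

In practice the feature vectors would be normalized first so that no single characteristic dominates the distance.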

