Results 1 - 10 of 22
Online performance analysis by statistical sampling of microprocessor performance counters
- Proceedings of the 19th Annual International Conference on Supercomputing (ICS'05), 2005
Abstract - Cited by 40 (6 self)
Hardware performance counters (HPCs) are increasingly being used to analyze performance and identify the causes of performance bottlenecks. However, HPCs are difficult to use for several reasons. Microprocessors do not provide enough counters to simultaneously monitor the many different types of events needed to form an overall understanding of performance. Moreover, HPCs primarily count low-level micro-architectural events, from which it is difficult to extract the high-level insight required for identifying causes of performance problems. We describe two techniques that help overcome these difficulties, allowing HPCs to be used in dynamic real-time optimizers. First, statistical sampling is used to dynamically multiplex HPCs and make a larger set of logical HPCs available. Using real programs, we show experimentally that this sampling can obtain counts of hardware events that are statistically similar (within 15%) to complete non-sampled counts, thus allowing us to provide a much larger set of logical HPCs. Second, we observe that stall cycles are a primary source of inefficiency, and hence should be major targets for software optimization. Based on this observation, we build a simple model in real time that speculatively attributes each stall cycle to the processor component that likely caused the stall. The information needed to produce this model is obtained using our HPC multiplexing facility to monitor a large number of hardware components simultaneously. Our analysis shows that even in an out-of-order superscalar microprocessor, this speculative approach yields a fairly accurate model, with a run-time overhead for collection and computation of under 2%. These results demonstrate that we can effectively analyze the on-line performance of application and system code running at full speed. The stall analysis shows where performance is being lost on a given processor.
Automatic measurement of memory hierarchy parameters
- SIGMETRICS Perform. Eval. Rev., 2005
Abstract - Cited by 26 (3 self)
On modern computers, the running time of many applications is dominated by the cost of memory operations. To optimize such applications for a given platform, it is necessary to have detailed knowledge of the memory hierarchy parameters of that platform. In practice, this information is usually poorly documented, if at all. Moreover, there is growing interest in self-tuning, autonomic software systems that can optimize themselves for different platforms, and these systems must determine memory hierarchy parameters automatically, without human intervention. One solution is to use micro-benchmarks to determine the parameters of the memory hierarchy. In this paper, we argue that existing micro-benchmarks are inadequate, and present novel micro-benchmarks for determining the parameters of all levels of the memory hierarchy, including registers, all cache levels, and the translation look-aside buffer. We have implemented these micro-benchmarks in an integrated tool that can be ported with little effort to new platforms. We present experimental results showing that this tool successfully determines memory hierarchy parameters on many current platforms, and compare its accuracy with that of existing tools.
Accuracy of Performance Counter Measurements
Abstract - Cited by 10 (0 self)
Many workload characterization studies depend on accurate measurements of the cost of executing a piece of code. Often these measurements are conducted using infrastructures that access hardware performance counters. Most modern processors provide such counters to count micro-architectural events such as retired instructions or clock cycles. These counters can be difficult to configure, may not be programmable or readable from user-level code, and cannot discriminate between events caused by different software threads. Various software infrastructures address this problem, providing access to per-thread counters from application code. This paper constitutes the first comparative study of the accuracy of three commonly used measurement infrastructures (perfctr, perfmon2, and PAPI) on three common processors (Pentium D, Core 2 Duo, and AMD Athlon 64 X2). We find significant differences in accuracy across usage patterns, infrastructures, and processors. Based on these results, we provide guidelines for finding the best measurement approach.
Extracting and improving microarchitecture performance on reconfigurable architectures
- International Journal of Parallel Programming, 2005
Abstract - Cited by 10 (9 self)
We describe our experience using reconfigurable architectures to develop an understanding of an application's performance and to enhance its performance with respect to customized, constrained logic. We begin with a standard ISA currently in use for embedded systems. We modify its core to measure performance characteristics, obtaining a system that provides cycle-accurate timings and presents results in the style of gprof, but with absolutely no software overhead. We then provide cache-behavior statistics that are typically unavailable in a generic processor. In contrast with simulation, our approach executes the program at full speed and delivers statistics based on the actual behavior of the cache subsystem. Finally, in response to the performance profile developed on our platform, we evaluate various uses of the FPGA-realized instruction and data caches in terms of the application's performance.
Trees or Grids? Indexing Moving Objects in Main Memory
, 2009
Abstract - Cited by 7 (0 self)
New application areas, such as location-based services, rely on the efficient management of large collections of mobile objects. Maintaining accurate, up-to-date positions of these objects results in massive update loads that must be supported by spatial indexing structures; main-memory indexes are usually necessary to provide high update performance. Traditionally, the R-tree and its variants were ...
MAQAO: Modular Assembler Quality Analyzer and Optimizer for
Abstract - Cited by 5 (2 self)
The quality of the code produced by compilers is essential for high performance, so being able to precisely assess code quality is extremely important. This issue can be successfully tackled by using performance counters and dynamic profiling. In this paper, we advocate that in many interesting cases, a careful static analysis of assembly code can achieve similar results at a much lower cost and with better accuracy. The principles of an automatic tool (MAQAO) for performing such an analysis are presented. Among its key advantages, MAQAO offers versatility (the user can specify a particular analysis using an SQL formalism) and a precise diagnosis capability which can later be used to carefully drive the optimization process. Two case studies on real codes illustrate the power of the tool: in each case, MAQAO helped us locate performance problems easily and define an optimization strategy leading to substantial code improvements (20 to 30% on overall application execution time).
Phase-based Tuning for Better Utilization of Performance-Asymmetric Multicore Processors
Abstract - Cited by 4 (1 self)
The latest trend towards performance asymmetry among cores on a single chip of a multicore processor is posing new challenges. To utilize these performance-asymmetric multicore processors effectively, code sections of a program must be assigned to cores such that the resource needs of code sections closely match the resources available at the assigned core. Determining this assignment manually is tedious, error prone, and significantly complicates software development. To solve this problem, we contribute a transparent and fully automatic process that we call phase-based tuning, which adapts an application to effectively utilize performance-asymmetric multicores. Compared to the stock Linux scheduler, we see a 36% average process speedup, while maintaining fairness and with negligible overheads.
X-Ray: Automatic measurement of hardware parameters
, 2004
Abstract - Cited by 2 (2 self)
There is growing interest in autonomic, self-tuning software that can optimize itself automatically on new platforms without manual intervention. Optimization requires detailed knowledge of the target platform, such as the latency and throughput of instructions, the number of registers, and the organization of the memory hierarchy. An autonomic system needs to determine this kind of platform-specific information on its own. In this paper, we describe the design and implementation of X-Ray, a tool that autonomically measures a large number of such platform-specific parameters. For some of these parameters, we also describe novel algorithms that are more robust than existing ones. X-Ray is written in C for maximum portability, and it is based on accurate timing of a number of carefully designed micro-benchmarks. A novel feature of X-Ray is that it is easily extensible: it provides a simple infrastructure and a code generator that can be used to produce the large number of micro-benchmarks needed for such measurements. Very few existing tools address this problem; our experiments show that X-Ray produces more accurate and more complete results than any of them.
Experiences Understanding Performance in a Commercial Scale-Out Environment
Abstract - Cited by 2 (0 self)
Clusters of loosely connected machines are becoming an important model for commercial computing. The cost/performance ratio makes these scale-out solutions an attractive platform for a class of computational needs. The work we describe in this paper focuses on understanding performance when using a scale-out environment to run commercial workloads. We describe the novel scale-out environment we configured and the workload we ran on it. We explain the unique performance challenges faced in such an environment and the tools we applied and improved for this environment to address those challenges. We present data from the tools that proved useful in optimizing performance on our system. We discuss the lessons we learned applying and modifying existing tools in a commercial scale-out environment, and offer insights into making future performance tools effective in this environment.
Towards a Cross-Platform Microbenchmark Suite for Evaluating Hardware Performance Counter Data
Abstract - Cited by 1 (0 self)
As useful as performance counters are, the meaning of reported aggregate event counts is sometimes questionable. Questions arise due to unanticipated processor behavior, overhead associated with the interface, the granularity of the monitored code, hardware errors, and a lack of standards with respect to event definitions. To explore these issues, we are conducting a sequence of studies using carefully crafted microbenchmarks that permit accurate prediction of event counts and investigation of the differences between hardware-reported and predicted counts. This paper presents the methodology employed, some of the microbenchmarks developed, and some of the information uncovered to date. This work allows application developers to better understand the data provided by hardware performance counters and better utilize it to tune application performance. A goal of this research is to develop a cross-platform microbenchmark suite that application developers can use for these purposes; some of its microbenchmarks are discussed in the paper.