Results 11 - 20 of 22
A fast and accurate framework to analyze and optimize cache memory behavior
- ACM TRANSACTIONS ON PROGRAMMING LANGUAGES AND SYSTEMS, 2004
Cited by 16 (2 self)
The gap between processor and main memory performance increases every year. In order to overcome this problem, cache memories are widely used. However, they are only effective when programs exhibit sufficient data locality. Compile-time program transformations can significantly improve the performance of the cache. To apply most of these transformations, the compiler requires a precise knowledge of the locality of the different sections of the code, both before and after being transformed. Cache miss equations (CMEs) allow us to obtain an analytical and precise description of the cache memory behavior for loop-oriented codes. Unfortunately, a direct solution of the CMEs is computationally intractable due to its NP-complete nature. This article proposes a fast and accurate approach to estimate the solution of the CMEs. We use sampling techniques to approximate the absolute miss ratio of each reference by analyzing a small subset of the iteration space. The size of the subset, and therefore the analysis time, is determined by the accuracy selected by the user. In order to reduce the complexity of the algorithm to solve CMEs, effective mathematical techniques have been developed to analyze the subset of the iteration space that is being considered. These techniques exploit some properties of the particular polyhedra represented by CMEs.
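The sampling idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' CME solver: the function `is_miss_trace` is a hypothetical stand-in that simulates a small direct-mapped cache to label every iteration point as hit or miss, whereas the article analyzes only the sampled points. The sample-size rule is the standard normal-approximation bound for a proportion, assumed here for illustration; all names and parameters are invented for the example.

```python
import math
import random


def sample_size(half_width, z=1.96):
    """Worst-case (p = 0.5) number of samples for an interval of +/- half_width.

    Assumption: normal approximation with z = 1.96 (about 95% confidence).
    """
    return math.ceil(z * z * 0.25 / (half_width * half_width))


def is_miss_trace(n, line_elems=8, num_lines=512):
    """Hypothetical stand-in for a per-point miss check.

    Simulates a direct-mapped cache for a column-order walk over an
    n x n row-major array and returns one boolean per iteration point.
    """
    tags = [None] * num_lines
    misses = []
    for j in range(n):
        for i in range(n):
            line = (i * n + j) // line_elems   # memory line touched by a[i][j]
            slot = line % num_lines            # direct-mapped placement
            misses.append(tags[slot] != line)
            tags[slot] = line
    return misses


def estimate_miss_ratio(misses, half_width, seed=0):
    """Estimate the miss ratio from a random subset of iteration points."""
    rng = random.Random(seed)
    k = min(sample_size(half_width), len(misses))
    picked = rng.sample(range(len(misses)), k)
    return sum(misses[p] for p in picked) / k


if __name__ == "__main__":
    trace = is_miss_trace(64)
    print("exact:", sum(trace) / len(trace))
    print("estimate:", estimate_miss_ratio(trace, half_width=0.05))
```

The requested interval width fixes the number of sampled points regardless of the size of the iteration space, which is why the analysis time follows the accuracy chosen by the user.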
Experiment Management Support for Performance Tuning
- PROCEEDINGS OF THE SC’97 CONFERENCE, 1997
Cited by 14 (2 self)
The development of a high-performance parallel system or application is an evolutionary process -- both the code and the environment go through many changes during a program's lifetime -- and at each change, a key question for developers is: how and how much did the performance change? No existing performance tool provides the necessary functionality to answer this question. This paper reports on the design and preliminary implementation of a tool which views each execution as a scientific experiment and provides the functionality to answer questions about a program's performance which span more than a single execution or environment. We report results of using our tool with an actual performance tuning study and with a scientific application run in changing environments. Our goal is to use historic program performance data to develop techniques for parallel program performance diagnosis.
Synthetic-Perturbation Techniques For Screening Shared Memory Programs
- SOFTWARE---PRACTICE AND EXPERIENCE, 1994
Cited by 8 (2 self)
... this paper is to explain the general approach and to extend it to address specific features that are the main source of poor performance on the shared memory programming model. These include performance degradation due to load imbalance and insufficient parallelism, and overhead introduced by synchronizations and by accessing shared data structures. We illustrate the practicality of SPS by demonstrating its use on two very different case studies: a large image understanding benchmark and a parallel quicksort.
Symbolic Cache Analysis for Real-Time Systems
- 1999
Cited by 6 (4 self)
Caches pose a major problem for predicting execution times of real-time systems since the cache behavior depends on the history of previous memory references. Overly pessimistic assumptions on cache hits can result in worst-case execution time estimates that are prohibitive for real-time systems.
Predicting Application Behavior in Large Scale Shared-memory Multiprocessors
- In Supercomputing '95, 1995
Cited by 5 (0 self)
In this paper we present an analytical framework for parallel program performance prediction. The main thrust of this work is to provide a means for treating realistic applications within a single unified framework. Our approach is based upon the specification of a set of non-linear equations which describe the application, processor configuration, network and memory operations. These equations are solved iteratively since the application execution rate depends on the communication latencies. The iterative solution technique is found to be efficient as it typically requires only a few iterations to reach convergence. Our modeling methodology achieves a good balance between abstraction and accuracy. This is attained by accounting for both time and space dimensions of memory references, while maintaining a simple description of the workload. We demonstrate both the practicality and the accuracy of our approach by comparing predicted results with measurements taken on a commercial mul...
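The circular dependency described in the abstract above (execution rate depends on latency, and latency depends on the load generated at that rate) can be sketched as a fixed-point iteration. The two equations below are a toy single-queue model assumed purely for illustration; they are not the equations from the paper, and the function and parameter names are hypothetical.

```python
def solve_fixed_point(procs, compute_time, service_time, tol=1e-9, max_iter=100):
    """Iterate rate -> latency -> rate until the latency stops changing.

    Toy model (assumption): each processor issues one memory request every
    compute_time + latency time units, and the shared memory behaves like a
    single queue whose latency is service_time / (1 - utilization).
    """
    latency = service_time  # start from the unloaded latency
    for iteration in range(1, max_iter + 1):
        rate = 1.0 / (compute_time + latency)                 # per-processor request rate
        utilization = min(procs * rate * service_time, 0.999)  # clamp below saturation
        new_latency = service_time / (1.0 - utilization)
        if abs(new_latency - latency) < tol:
            return rate, new_latency, iteration
        latency = new_latency
    raise RuntimeError("fixed-point iteration did not converge")


if __name__ == "__main__":
    rate, latency, iterations = solve_fixed_point(4, 100.0, 10.0)
    print(f"rate={rate:.6f} latency={latency:.3f} iterations={iterations}")
```

For lightly loaded configurations such as the one in the usage example, each step shrinks the error by a large factor, so the loop stops after a handful of iterations.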
Visual Assistance for Concurrent Processing
- 2000
Cited by 4 (4 self)
LBF: A Performance Metric for Program Reorganization
- 1998
Cited by 4 (4 self)
We introduce a new performance metric, called Load Balancing Factor (LBF), to assist programmers with evaluating different tuning alternatives. The LBF metric differs from traditional performance metrics since it is intended to measure the performance implications of a specific tuning alternative rather than quantifying where time is spent in the current version of the program. A second unique aspect of the metric is that it provides guidance about moving work within a distributed or parallel program rather than reducing it. A variation of the LBF metric can also be used to predict the performance impact of changing the underlying network. The LBF metric can be computed incrementally and online during the execution of the program to be tuned. We also present a case study that shows that our metric can predict the actual performance gains accurately for a test suite of six programs.
Finding Bottlenecks In Large Scale Parallel Programs
- 1994
Cited by 4 (0 self)
This thesis addresses the problem of trying to locate the source of performance bottlenecks in large-scale parallel and distributed applications. Performance monitoring creates a dilemma: identifying a bottleneck necessitates collecting detailed information, yet collecting all this data can introduce serious data collection bottlenecks. At the same time, users are being inundated with volumes of complex graphs and tables that require a performance expert to interpret. I have developed a new approach that addresses both these problems by combining dynamic on-the-fly selection of what performance data to collect with decision support to assist users with the selection and presentation of performance data. The approach is called the W³ Search Model. To make it possible to implement the W³ Search Model, I have developed a new monitoring technique for parallel programs called Dynamic Instrumentation. The premise of my work is that not only is it possible to do on-line performance debu...
Experiment Management Support for Parallel Performance Tuning
- UNIVERSITY OF WISCONSIN - MADISON, 1999
Mapping Performance Data for High-Level and Data Views of Parallel Program Performance
- 1995
Cited by 2 (0 self)
Programs written in high-level parallel languages need profiling tools that provide performance data in terms of the semantics of the high-level language. But high-level performance data can be incomplete when the cause of a performance problem cannot be explained in terms of the semantics of the language. We also need the ability to view the performance of the underlying mechanisms used by the language and correlate the underlying activity to the language source code. The key technique for providing these performance views is the ability to map low-level performance data up to the language abstractions. We identify the various kinds of mapping information that need to be gathered to support multiple views of performance data and describe how we can mine mapping information from the compiler and run-time environment. We also describe how we use this information to produce performance data at the higher levels, and how we present this data in terms of both the code and parallel data s...

