Results 1-10 of 168
Combined Selection of Tile Sizes and Unroll Factors Using Iterative Compilation
2000
Cited by 108 (9 self)
Abstract:
Loop tiling and unrolling are two important program transformations to exploit locality and expose instruction-level parallelism, respectively. However, these transformations are not independent and each can adversely affect the goal of the other. Furthermore, the best combination will vary dramatically from one processor to the next. In this paper, we therefore address the problem of how to select tile sizes and unroll factors simultaneously. We approach this problem in an architecturally adaptive manner by means of iterative compilation, where we generate many versions of a program and decide upon the best by actually executing them and measuring their execution time. We evaluate several iterative strategies based on genetic algorithms, random sampling and simulated annealing. We compare the levels of optimization obtained by iterative compilation to several well-known static techniques and show that we outperform each of them on a range of benchmarks across a variety of ar...
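The random-sampling strategy this abstract mentions can be sketched as a loop that draws candidate (tile size, unroll factor) pairs, "executes" each version, and keeps the fastest one seen. Everything below is illustrative, not the paper's code: `run_and_time` is a hypothetical stand-in for compiling and timing one program version, and the candidate sets and synthetic cost surface are assumptions made to keep the sketch runnable.

```python
import random

# Hypothetical stand-in for "compile one version, run it, time it".
# A real iterative-compilation driver would invoke the compiler with the
# chosen tile size and unroll factor and measure wall-clock time; here a
# synthetic cost surface (minimised at tile=48, unroll=4) keeps the
# sketch self-contained.
def run_and_time(tile, unroll):
    return abs(tile - 48) + abs(unroll - 4)

def random_search(trials=50, seed=0):
    """Random-sampling strategy: draw candidate (tile, unroll) pairs,
    measure each, and keep the fastest version seen so far."""
    rng = random.Random(seed)
    tiles = [8, 16, 32, 48, 64, 128]   # assumed candidate tile sizes
    unrolls = [1, 2, 4, 8]             # assumed candidate unroll factors
    best = None
    for _ in range(trials):
        cand = (rng.choice(tiles), rng.choice(unrolls))
        t = run_and_time(*cand)
        if best is None or t < best[0]:
            best = (t, cand)
    return best[1]
```

Genetic-algorithm and simulated-annealing strategies differ only in how the next candidate is chosen; the measure-and-keep-best skeleton stays the same.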
Tiling Optimizations for 3D Scientific Computations
2000
Cited by 69 (4 self)
Abstract:
Compiler transformations can significantly improve data locality for many scientific programs. In this paper, we show iterative solvers for partial differential equations (PDEs) in three dimensions require new compiler optimizations not needed for 2D codes, since reuse along the third dimension cannot fit in cache for larger problem sizes. Tiling is a program transformation compilers can apply to capture this reuse, but successful application of tiling requires selection of non-conflicting tiles and/or padding array dimensions to eliminate conflicts. We present new algorithms and cost models for selecting tiling shapes and array pads. We explain why tiling is rarely needed for 2D PDE solvers, but can be helpful for 3D stencil codes. Experimental results show tiling 3D codes can reduce miss rates and achieve performance improvements of 17-121% for key scientific kernels, including a 27% average improvement for the key computational loop nest in the SPEC/NAS benchmark MGRID.
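As an illustration of the loop structure this abstract describes, here is a minimal sketch of one sweep of a 3D stencil, tiled along the outermost dimension. This is not the paper's algorithm or its tile-selection cost model; the 6-point stencil, nested-list grid, and tile size are assumptions chosen to keep the example self-contained, and in Python the tiling demonstrates only the loop restructuring, not an actual cache benefit.

```python
def stencil_3d_tiled(a, tk=4):
    """One Jacobi-style sweep of a 6-point stencil over an n*n*n grid,
    tiled along the k dimension so reuse stays within one tile.
    `a` is a nested list a[i][j][k]; `tk` is the (tunable) tile size.
    Boundary points are copied through unchanged."""
    n = len(a)
    out = [[[a[i][j][k] for k in range(n)] for j in range(n)] for i in range(n)]
    for kk in range(1, n - 1, tk):                 # tile loop over k
        for i in range(1, n - 1):
            for j in range(1, n - 1):
                for k in range(kk, min(kk + tk, n - 1)):   # intra-tile loop
                    out[i][j][k] = (a[i-1][j][k] + a[i+1][j][k] +
                                    a[i][j-1][k] + a[i][j+1][k] +
                                    a[i][j][k-1] + a[i][j][k+1]) / 6.0
    return out
```

Any tile size produces the same numerical result; only the traversal order (and hence cache behavior in a compiled language) changes.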
A Framework for Performance Modeling and Prediction
 In SC 2002
2002
Cited by 60 (7 self)
Abstract:
Cycle-accurate simulation is far too slow for modeling the expected performance of full parallel applications on large HPC systems. And just running an application on a system and observing wall-clock time tells you nothing about why the application performs as it does (and is anyway impossible on yet-to-be-built systems). Here we present a framework for performance modeling and prediction that is faster than cycle-accurate simulation, more informative than simple benchmarking, and is shown useful for performance investigations in several dimensions.
StatCache: A probabilistic approach to efficient and accurate data locality analysis
 In Proceedings of the International Symposium on Performance Analysis of Systems and Software
2004
Cited by 59 (7 self)
Abstract:
The widening memory gap reduces performance of applications with poor data locality. Therefore, there is a need for methods to analyze data locality and help application optimization. In this paper we present StatCache, a novel sampling-based method for performing data-locality analysis on realistic workloads. StatCache is based on a probabilistic model of the cache, rather than a functional cache simulator. It uses statistics from a single run to accurately estimate miss ratios of fully-associative caches of arbitrary sizes and generate working-set graphs. We evaluate StatCache using the SPEC CPU2000 benchmarks and show that StatCache gives accurate results with a sampling rate as low as �. We also provide a proof-of-concept implementation, and discuss potentially very fast implementation alternatives.
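StatCache's actual probabilistic model is more sophisticated than anything shown here, but the quantity it reasons about can be sketched with a toy exhaustive calculation: under fully-associative LRU, an access misses in a cache of C lines exactly when its stack distance (number of distinct addresses touched since the previous access to the same address) is at least C, or the access is cold. All names and the trace format below are illustrative assumptions, not StatCache's implementation.

```python
def stack_distances(trace):
    """LRU stack distance of every access in an address trace.
    None marks a cold (first-ever) access to an address."""
    stack = []          # addresses ordered by recency, most recent last
    dists = []
    for addr in trace:
        if addr in stack:
            # distinct addresses touched since the last access to addr
            dists.append(len(stack) - 1 - stack.index(addr))
            stack.remove(addr)
        else:
            dists.append(None)
        stack.append(addr)
    return dists

def miss_ratio(trace, cache_lines):
    """Miss ratio of a fully-associative LRU cache with `cache_lines`
    lines: an access misses iff it is cold or its distance >= cache_lines."""
    dists = stack_distances(trace)
    misses = sum(1 for d in dists if d is None or d >= cache_lines)
    return misses / len(trace)
```

The point of StatCache is that it estimates this kind of miss ratio from a sparse sample of a single run instead of the full trace walked here, which is what makes the analysis cheap on realistic workloads.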
Performance Optimizations and Bounds for Sparse Matrix-Vector Multiply
 In Proceedings of Supercomputing
2002
Cited by 57 (10 self)
Abstract:
We consider performance tuning, by code and data structure reorganization, of sparse matrix-vector multiply (SpMV), one of the most important computational kernels in scientific applications. This paper addresses the fundamental questions of what limits exist on such performance tuning, and how closely tuned code approaches these limits.
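For context, here is a minimal reference version of the SpMV kernel itself, in compressed sparse row (CSR) form, one common baseline layout such tuning work starts from. This is not the paper's tuned code; the variable names follow the usual CSR convention.

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """y = A @ x for a sparse matrix A stored in CSR form:
    values  - nonzero entries, row by row
    col_idx - column index of each nonzero
    row_ptr - row r's nonzeros occupy values[row_ptr[r]:row_ptr[r+1]]
    """
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for p in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[p] * x[col_idx[p]]
        y.append(acc)
    return y
```

The indirect access `x[col_idx[p]]` is precisely what makes SpMV hard to tune: its memory behavior depends on the matrix's sparsity pattern, which motivates the data-structure reorganizations the abstract refers to.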
Modeling Application Performance by Convolving Machine Signatures with Application Profiles
2001
Cited by 50 (5 self)
Abstract:
This paper presents a performance modeling methodology that is faster than traditional cycle-accurate simulation, more sophisticated than performance estimation based on system peak-performance metrics, and is shown to be effective on a class of High Performance Computing benchmarks. The method yields insight into the factors that affect performance on single-processor and parallel computers.
Data Cache Locking for Higher Program Predictability
 In SIGMETRICS ’03: Proceedings of the 2003 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems
2003
Cited by 48 (3 self)
Abstract:
Caches have become increasingly important with the widening gap between main memory and processor speeds. However, they are a source of unpredictability due to their characteristics, resulting in programs behaving in a different way than expected. Cache locking mechanisms adapt caches to the needs of real-time systems. Locking the cache is a solution that trades performance for predictability: at a cost of generally lower performance, the time of accessing the memory becomes predictable. This paper combines compile-time cache analysis with data cache locking to estimate the worst-case memory performance (WCMP) in a safe, tight and fast way. In order to get predictable cache behavior, we first lock the cache for those parts of the code where the static analysis fails. To minimize the performance degradation, our method loads the cache, if necessary, with data likely to be accessed. Experimental results show that this scheme is fully predictable, without compromising the performance of the transformed program. When compared to an algorithm that assumes compulsory misses when the state of the cache is unknown, our approach eliminates all overestimation for the set of benchmarks, giving an exact WCMP of the transformed program without any significant decrease in performance.
Counting integer points in parametric polytopes using Barvinok’s rational functions
 Algorithmica
2007
Cited by 44 (9 self)
Abstract:
Many compiler optimization techniques depend on the ability to calculate the number of elements that satisfy certain conditions. If these conditions can be represented by linear constraints, then such problems are equivalent to counting the number of integer points in (possibly) parametric polytopes. It is well known that the enumerator of such a set can be represented by an explicit function consisting of a set of quasi-polynomials, each associated with a chamber in the parameter space. Previously, interpolation was used to obtain these quasi-polynomials, but this technique has several disadvantages. Its worst-case computation time for a single quasi-polynomial is exponential in the input size, even for fixed dimensions. The worst-case size of such a quasi-polynomial (measured in bits needed to represent the quasi-polynomial) is also exponential in the input size. Under certain conditions this technique even fails to produce a solution. Our main contribution is a novel method for calculating the required quasi-polynomials analytically. It extends an existing method, based on Barvinok’s decomposition, ...
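The idea of a quasi-polynomial enumerator can be seen on a tiny assumed example: the number of integer points in the parametric polytope {(i, j) : i, j >= 0, i + 2j <= n} has a closed form whose coefficients depend on n mod 2, i.e. a quasi-polynomial of period 2. Both functions below are illustrative checks of that fact, not the paper's analytic method (which avoids brute-force enumeration entirely).

```python
def count_brute(n):
    """Brute-force count of integer points in {(i, j) : i, j >= 0, i + 2j <= n}."""
    return sum(1 for i in range(n + 1)
                 for j in range(n + 1)
                 if i + 2 * j <= n)

def count_quasi(n):
    """Closed-form enumerator: a quasi-polynomial of period 2 in n.
    Even chamber: (n + 2)^2 / 4;  odd chamber: (n + 1)(n + 3) / 4."""
    if n % 2 == 0:
        return (n + 2) ** 2 // 4
    return (n + 1) * (n + 3) // 4
```

In compiler terms, n plays the role of a symbolic loop bound: the enumerator answers "how many iterations does this loop nest execute?" as a function of n, without running the loops.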
Data caches in multitasking hard real-time systems
 In IEEE Real-Time Systems Symposium
2003
Cited by 41 (3 self)
Abstract:
Data caches are essential in modern processors, bridging the widening gap between main memory and processor speeds. However, they yield very complex performance models, which makes it hard to bound execution times tightly. This paper contributes a new technique to obtain predictability in preemptive multitasking systems in the presence of data caches. We explore the use of cache partitioning, dynamic cache locking and static cache analysis to provide worst-case performance estimates in a safe and tight way. Cache partitioning divides the cache among tasks to eliminate inter-task cache interference. We combine static cache analysis and cache locking mechanisms to ensure that all intra-task conflicts, and consequently, memory access times, are exactly predictable. To minimize the performance degradation due to cache partitioning and locking, two strategies are employed. First, the cache is loaded with data likely to be accessed so that their cache utilization is maximized. Second, compiler optimizations such as tiling and padding are applied in order to reduce cache replacement misses. Experimental results show that this scheme is fully predictable, without compromising the performance of the transformed programs. Our method outperforms static cache locking for all analyzed task sets under various cache architectures, with a CPU utilization reduction ranging between 3.8 and 20.0 times for a high performance system.
Let’s Study Whole-Program Cache Behaviour Analytically
 In Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA-8)
2002