Results 1 - 10
of
24
Nonlinear Array Layouts for Hierarchical Memory Systems
, 1999
"... Programming languages that provide multidimensional arrays and a flat linear model of memory must implement a mapping between these two domains to order array elements in memory. This layout function is fixed at language definition time and constitutes an invisible, non-programmable array attribute. ..."
Abstract
-
Cited by 67 (4 self)
- Add to MetaCart
Programming languages that provide multidimensional arrays and a flat linear model of memory must implement a mapping between these two domains to order array elements in memory. This layout function is fixed at language definition time and constitutes an invisible, non-programmable array attribute. In reality, modern memory systems are architecturally hierarchical rather than flat, with substantial differences in performance among different levels of the hierarchy. This mismatch between the model and the true architecture of memory systems can result in low locality of reference and poor performance. Some of this loss in performance can be recovered by re-ordering computations using transformations such as loop tiling. We explore nonlinear array layout functions as an additional means of improving locality of reference. For a benchmark suite composed of dense matrix kernels, we show by timing and simulation that two specific layouts (4D and Morton) have low implementation costs (2--5% of total running time) and high performance benefits (reducing execution time by factors of 1.1-2.5); that they have smooth performance curves, both across a wide range of problem sizes and over representative cache architectures; and that recursion-based control structures may be needed to fully exploit their potential.
A Comparison of Compiler Tiling Algorithms
, 1999
"... Linear algebra codes contain data locality which can be exploited by tiling multiple loop nests. Several approaches to tiling have been suggested for avoiding conflict misses in low associativity caches. We propose a new technique based on intra-variable padding and compare its performance with exis ..."
Abstract
-
Cited by 50 (8 self)
- Add to MetaCart
Linear algebra codes contain data locality which can be exploited by tiling multiple loop nests. Several approaches to tiling have been suggested for avoiding conflict misses in low associativity caches. We propose a new technique based on intra-variable padding and compare its performance with existing techniques. Results show padding improves performance of matrix multiply by over 100 % in some cases over a range of matrix sizes. Comparing the efficacy of different tiling algorithms, we discover rectangular tiles are slightly more efficient than square tiles. Overall, tiling improves performance from 0-250%. Copying tiles at run time proves to be quite effective.
Tiling Optimizations for 3D Scientific Computations
, 2000
"... Compiler transformations can significantly improve data locality for many scientific programs. In this paper, we show iterative solvers for partial differential equations (PDEs) in three dimensions require new compiler optimizations not needed for 2D codes, since reuse along the third dimension cann ..."
Abstract
-
Cited by 43 (4 self)
- Add to MetaCart
Compiler transformations can significantly improve data locality for many scientific programs. In this paper, we show iterative solvers for partial differential equations (PDEs) in three dimensions require new compiler optimizations not needed for 2D codes, since reuse along the third dimension cannot fit in cachefor larger problem sizes. Tiling is a program transformation compilers can apply to capture this reuse, but successful application of tiling requires selection of non-conflicting tiles and/or padding array dimensions to eliminate conflicts. We present new algorithms and cost models for selecting tiling shapes and array pads. We explain why tiling is rarely needed for 2D PDE solvers, but can be helpful for 3D stencil codes. Experimental results show tiling 3D codes can reduce miss rates and achieve performance improvements of 17--121% for key scientific kernels, including a 27% average improvement for the key computational loop nest in the SPEC/NAS benchmark MGRID.
Towards Automatic Synthesis of High-Performance Codes for Electronic Structure Calculations: Data Locality Optimization
- In Proc. of the Intl. Conf. on High Performance Computing
, 2001
"... The goal of our project is the development of a program synthesis system to facilitate the development of high-performance parallel programs for a class of computations encountered in computational chemistry and computational physics. These computations are expressible as a set of tensor contract ..."
Abstract
-
Cited by 28 (22 self)
- Add to MetaCart
The goal of our project is the development of a program synthesis system to facilitate the development of high-performance parallel programs for a class of computations encountered in computational chemistry and computational physics. These computations are expressible as a set of tensor contractions and arise in electronic structure calculations. This paper provides an overview of a planned synthesis system that will take as input a high-level specification of the computation and generate high-performance parallel code for a number of target architectures. We focus on an approach to performing data locality optimization in this context. Preliminary experimental results on an SGI Origin 2000 are encouraging and demonstrate that the approach is effective.
Memory Characteristics of Iterative Methods
, 1999
"... Conventional implementations of iterative numerical algorithms, especially multigrid methods, merely reach a disappointing small percentage of the theoretically available CPU performance when applied to representative large problems. One of the most important reasons for this phenomenon is that th ..."
Abstract
-
Cited by 21 (9 self)
- Add to MetaCart
Conventional implementations of iterative numerical algorithms, especially multigrid methods, merely reach a disappointing small percentage of the theoretically available CPU performance when applied to representative large problems. One of the most important reasons for this phenomenon is that the current DRAM technology cannot provide the data fast enough to keep the CPU busy. Although the fundamentals of cache optimizations are quite simple, current compilers cannot optimize even elementary iterative schemes. In this paper, we analyze the memory and cache behavior of iterative methods with extensive profiling and describe program transformation techniques to improve the cache performance of two- and three-dimensional multigrid algorithms. 1 Introduction Multigrid methods [11, 5] are among the most attractive algorithms for the solution of large sparse systems of equations that arise in the solution of elliptic partial differential equations (PDEs). However, even simple multi...
Loop Optimizations for a Class of Memory-Constrained Computations
, 2001
"... Compute-intensive multi-dimensional summations that involve products of several arrays arise in the modeling of electronic structure of materials. Sometimes several alternative formulations of a computation, representing different spacetime trade-offs, are possible. By computing and storing some int ..."
Abstract
-
Cited by 21 (16 self)
- Add to MetaCart
Compute-intensive multi-dimensional summations that involve products of several arrays arise in the modeling of electronic structure of materials. Sometimes several alternative formulations of a computation, representing different spacetime trade-offs, are possible. By computing and storing some intermediate arrays, reduction of the number of arithmetic operations is possible, but the size of intermediate temporary arrays may be prohibitively large. Loop fusion can be applied to reduce memory requirements, but that could impede effective tiling to minimize memory access costs. This paper develops an integrated model combining loop tiling for enhancing data reuse, and loop fusion for reduction of memory for intermediate temporary arrays. An algorithm is presented that addresses the selection of tile sizes and choice of loops for fusion, with the objective of minimizing cache misses while keeping the total memory usage within a given limit. Experimental results are reported that demonstrate the effectiveness of the combined loop tiling and fusion transformations performed by using the developed framework.
Locality Optimizations for Multi-Level Caches
- in Proceedings of SC99: High-Performance Networking and Computing
, 1999
"... Compiler transformations can significantly improve data locality of scientific programs. In this paper, we examine the impact of multi-level caches on data locality optimizations. We find nearly all the benefits can be achieved by simply targeting the L1 (primary) cache. Most locality transformation ..."
Abstract
-
Cited by 20 (2 self)
- Add to MetaCart
Compiler transformations can significantly improve data locality of scientific programs. In this paper, we examine the impact of multi-level caches on data locality optimizations. We find nearly all the benefits can be achieved by simply targeting the L1 (primary) cache. Most locality transformations are unaffected because they improve reuse for all levels of the cache; however, some optimizations can be enhanced. Inter-variable padding can take advantage of modular arithmetic to eliminate conflict misses and preserve group reuse on multiple cache levels. Loop fusion can balance increasing group reuse for the L2 (secondary) cache at the expense of losing group reuse at the smaller L1 cache. Tiling for the L1 cache also exploits locality available in the L2 cache. Experiments show enhanced algorithms are able to reduce cache misses, but performance improvements are rarely significant. Our results indicate existing compiler optimizations are usually sufficient to achieve good performance...
Software Support For Improving Locality in Scientific Codes
- In Proc. CPC2000
, 2000
"... We propose to develop and evaluate software support for improving locality for advanced scientific applications. We will investigate compiler and run-time techniques needed to achieve high performance on both sequential and parallel machines. We will focus on two areas. First, iterative PDE solvers ..."
Abstract
-
Cited by 13 (1 self)
- Add to MetaCart
We propose to develop and evaluate software support for improving locality for advanced scientific applications. We will investigate compiler and run-time techniques needed to achieve high performance on both sequential and parallel machines. We will focus on two areas. First, iterative PDE solvers for 3D partial differential equations have poor locality because accesses to nearby elements in higher-level dimensions are spread far apart in memory. Careful tiling and padding can frequently recapture such reuse. Second, computations on adaptive meshes and sparse matrices experience many cache misses because they access data in an irregular manner. Data layout and access order can be rearranged according to mesh connections or geometric location to improve locality, with cost models used to guide frequency of transformations for adaptive computations. 1 Introduction From astrophysics to biochemistry, scientists are increasingly using computers to conduct research and development. Fuelin...
Optimizing Program Locality through CMEs and GAs
- IN PROC. PACT
, 2003
"... Caches have become increasingly important with the widening gap between main memory and processor speeds. Small and fast cache memories are designed to bridge this discrepancy. However, they are only effective when programs exhibit sufficient data locality. Performance of ..."
Abstract
-
Cited by 9 (2 self)
- Add to MetaCart
Caches have become increasingly important with the widening gap between main memory and processor speeds. Small and fast cache memories are designed to bridge this discrepancy. However, they are only effective when programs exhibit sufficient data locality. Performance of
Memory Hierarchy Considerations for Fast Transpose and Bit-Reversals
- In Proceedings of HPCS 5
, 1999
"... This paper explores the interplay between algorithm design and a computer's memory hierarchy. Matrix transpose and the bit-reversal reordering are important scientific subroutines which often exhibit severe performance degradation due to cache and TLB associativity problems. We give lower bounds tha ..."
Abstract
-
Cited by 6 (1 self)
- Add to MetaCart
This paper explores the interplay between algorithm design and a computer's memory hierarchy. Matrix transpose and the bit-reversal reordering are important scientific subroutines which often exhibit severe performance degradation due to cache and TLB associativity problems. We give lower bounds that show for typical memory hierarchy designs, extra data movement is unavoidable. We also prescribe characteristics of various levels of the memory hierarchy needed to perform efficient bit-reversals. Insight gained from our analysis leads to the design of a near optimal bit-reversal algorithm. This Cache Optimal Bit Reverse Algorithm (COBRA) is implemented on the Digital Alpha 21164, Sun Ultrasparc 2, and IBM Power2. We show that COBRA is near optimal with respect to execution time on these machines and performs much better than previous best known algorithms. Copyright 1998 IEEE. Published in the Proceedings of HPCA 5, 9-13 January 1999 in Orlando, FL. Personal use of this material is permi...

