Results 1 – 9 of 9
Towards a theory of cache-efficient algorithms
 In Proceedings of the Symposium on Discrete Algorithms
, 2000
Abstract

Cited by 47 (3 self)
We present a model that enables us to analyze the running time of an algorithm on a computer with a memory hierarchy with limited associativity, in terms of various cache parameters. Our cache model, an extension of Aggarwal and Vitter’s I/O model, enables us to establish useful relationships between the cache complexity and the I/O complexity of computations. As a corollary, we obtain cache-efficient algorithms in the single-level cache model for fundamental problems like sorting, FFT, and an important subclass of permutations. We also analyze the average-case cache behavior of mergesort, show that ignoring associativity concerns could lead to inferior performance, and present supporting experimental evidence. We further extend our model to multiple levels of cache with limited associativity and present optimal algorithms for matrix transpose and sorting. Our techniques may be used for systematic exploitation of the memory hierarchy starting from the algorithm design stage.
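The associativity effect the abstract warns about can be made concrete with a toy simulation; the cache geometry and access patterns below are illustrative assumptions, not parameters from the paper.

```python
# A toy direct-mapped cache simulator (an illustrative assumption,
# not the paper's model) showing how limited associativity penalizes
# strided accesses that a fully associative analysis would not.

def direct_mapped_misses(addresses, num_sets, line_size):
    """Count misses in a direct-mapped cache with num_sets lines."""
    tags = [None] * num_sets            # one resident line per set
    misses = 0
    for addr in addresses:
        line = addr // line_size        # memory line holding this address
        s = line % num_sets             # set the line maps to
        if tags[s] != line:
            misses += 1
            tags[s] = line
    return misses

# Walking a column of a 64x64 row-major array of 8-byte words twice:
# the accesses land in only 8 of the 64 sets, so every access misses.
col_walk = [(i * 64) * 8 for i in range(64)] * 2
# Reading the same number of words sequentially touches each line once.
seq_walk = [w * 8 for w in range(128)]
print(direct_mapped_misses(col_walk, num_sets=64, line_size=64))  # 128
print(direct_mapped_misses(seq_walk, num_sets=64, line_size=64))  # 16
```

The two counts differ by 8x even though both traces touch 128 words, which is the kind of gap a model without associativity cannot predict.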
Cache-Efficient Matrix Transposition
Abstract

Cited by 23 (1 self)
We investigate the memory system performance of several algorithms for transposing an N × N matrix in-place, where N is large. Specifically, we investigate the relative contributions of the data cache, the translation lookaside buffer, register tiling, and the array layout function to the overall running time of the algorithms. We use various memory models to capture and analyze the effect of various facets of cache memory architecture that guide the choice of a particular algorithm, and attempt to experimentally validate the predictions of the model. Our major conclusions are as follows: limited associativity in the mapping from main memory addresses to cache sets can significantly degrade running time; the limited number of TLB entries can easily lead to thrashing; the fanciest optimal algorithms are not competitive on real machines even at fairly large problem sizes unless cache miss penalties are quite high; low-level performance tuning “hacks”, such as register tiling and array alignment, can significantly distort the effects of improved algorithms; and hierarchical nonlinear layouts are inherently superior to the standard canonical layouts (such as row- or column-major) for this problem.
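The tiled transposition such studies examine can be sketched roughly as follows; the tile size B and the flat row-major layout are assumptions for illustration, not the paper's tuned choices.

```python
# A hedged sketch of blocked (tiled) in-place transposition of a
# square row-major matrix held in a flat list. The tile size B is an
# assumed tuning knob; real tuning would match it to cache geometry.

def transpose_tiled(a, n, B=32):
    """In-place transpose of an n x n row-major matrix in flat list a."""
    for ii in range(0, n, B):
        for jj in range(ii, n, B):               # visit upper-triangular tiles only
            for i in range(ii, min(ii + B, n)):
                # diagonal tiles: only swap entries above the diagonal
                start = jj if ii != jj else i + 1
                for j in range(start, min(jj + B, n)):
                    a[i * n + j], a[j * n + i] = a[j * n + i], a[i * n + j]
```

Each B x B tile pair is swapped as a unit, so the working set at any moment is two tiles rather than a full row and a full column.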
Cache-oblivious algorithms (Extended Abstract)
 In Proc. 40th Annual Symposium on Foundations of Computer Science
, 1999
Abstract

Cited by 12 (1 self)
This paper presents asymptotically optimal algorithms for rectangular matrix transpose, FFT, and sorting on computers with multiple levels of caching. Unlike previous optimal algorithms, these algorithms are cache oblivious: no variables dependent on hardware parameters, such as cache size and cache-line length, need to be tuned to achieve optimality. Nevertheless, these algorithms use an optimal amount of work and move data optimally among multiple levels of cache. For a cache of size Z with cache-line length L, where Z = Ω(L²), the number of cache misses for an m × n matrix transpose is Θ(1 + mn/L). The number of cache misses for either an n-point FFT or the sorting of n numbers is Θ(1 + (n/L)(1 + log_Z n)). We also give a Θ(mnp)-work algorithm to multiply an m × n matrix by an n × p matrix that incurs Θ(1 + (mn + np + mp)/L + mnp/(L√Z)) cache faults. We introduce an “ideal-cache” model to analyze our algorithms. We prove that an optimal cache-oblivious algorithm designed for two levels of memory is also optimal for multiple levels and that the assumption of optimal replacement in the ideal-cache model can be simulated efficiently by LRU replacement. We also provide preliminary empirical results on the effectiveness of cache-oblivious algorithms in practice.
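The cache-oblivious recursion for transpose can be sketched as follows; the CUTOFF base case is an implementation convenience assumed here, not a tuned parameter (the point of the technique is that no cache parameter appears in the code).

```python
# A minimal sketch of cache-oblivious out-of-place transpose:
# recursively halve the larger dimension, so at some recursion depth
# the subproblem fits in any cache without knowing Z or L.
# CUTOFF merely bounds Python recursion overhead; it is an assumption.

CUTOFF = 8

def co_transpose(A, B, i0, i1, j0, j1):
    """Write the transpose of A[i0:i1][j0:j1] into B (lists of lists)."""
    if (i1 - i0) <= CUTOFF and (j1 - j0) <= CUTOFF:
        for i in range(i0, i1):                  # small block: copy directly
            for j in range(j0, j1):
                B[j][i] = A[i][j]
    elif (i1 - i0) >= (j1 - j0):
        mid = (i0 + i1) // 2                     # split the taller dimension
        co_transpose(A, B, i0, mid, j0, j1)
        co_transpose(A, B, mid, i1, j0, j1)
    else:
        mid = (j0 + j1) // 2                     # split the wider dimension
        co_transpose(A, B, i0, i1, j0, mid)
        co_transpose(A, B, i0, i1, mid, j1)
```

Because the split always halves the larger side, subproblems stay roughly square, which is what yields the Θ(1 + mn/L) miss bound in the ideal-cache analysis.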
Memory Hierarchy Considerations for Fast Transpose and Bit-Reversals
 In Proceedings of HPCA 5
, 1999
Abstract

Cited by 8 (1 self)
This paper explores the interplay between algorithm design and a computer's memory hierarchy. Matrix transpose and the bit-reversal reordering are important scientific subroutines which often exhibit severe performance degradation due to cache and TLB associativity problems. We give lower bounds that show that, for typical memory hierarchy designs, extra data movement is unavoidable. We also prescribe characteristics of various levels of the memory hierarchy needed to perform efficient bit-reversals. Insight gained from our analysis leads to the design of a near-optimal bit-reversal algorithm. This Cache Optimal Bit Reverse Algorithm (COBRA) is implemented on the Digital Alpha 21164, Sun UltraSPARC 2, and IBM Power2. We show that COBRA is near-optimal with respect to execution time on these machines and performs much better than the previous best known algorithms. Copyright 1998 IEEE. Published in the Proceedings of HPCA 5, 9–13 January 1999 in Orlando, FL. Personal use of this material is permi...
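For reference, the bit-reversal index rule that such algorithms reorder by can be sketched as below; this plain version has exactly the scattered access pattern the paper addresses, and it is not COBRA itself (COBRA adds cache-conscious blocking on top of this rule).

```python
# The bit-reversal permutation in its naive form. Each index's bits
# are mirrored; for power-of-two lengths the map is an involution, so
# a gather by reversed indices equals the usual scatter formulation.

def bit_reverse(x, bits):
    """Reverse the low `bits` bits of integer x."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (x & 1)
        x >>= 1
    return r

def bit_reverse_permute(a):
    """Return a reordered by bit-reversed index (len(a) must be 2**k)."""
    n = len(a)
    bits = n.bit_length() - 1
    return [a[bit_reverse(i, bits)] for i in range(n)]

print(bit_reverse_permute(list(range(8))))   # [0, 4, 2, 6, 1, 5, 3, 7]
```

The reads jump by large power-of-two strides, which is why this routine thrashes caches and TLBs of limited associativity on large arrays.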
Portable High Performance Programming via Architecture-Cognizant Divide-and-Conquer Algorithms
, 2000
Abstract

Cited by 5 (0 self)
1 Introduction
  1. Divide-and-Conquer and the Memory Hierarchy
  2. Overview of Architecture-Cognizant Divide-and-Conquer
  3. Overview of Napoleon
  4. What You Can Expect
  5. Contributions
     1. Divide-and-Conquer Algorithms for Performance Programming
     2. The Importance of Architecture-Cognizance
     3. Complexity of Determining Variant-Policy
     4. A Framework and System for Divide-and-Conquer Implementations
     5. The Fastest Portable FFT Algorithm
  6. Outline of Thesis ...
Combining analytical and empirical approaches in tuning matrix transposition
 In: Proc. Parallel Architectures and Compilation Techniques (PACT)
, 2006
Abstract

Cited by 5 (1 self)
Matrix transposition is an important kernel used in many applications. Even though its optimization has been the subject of many studies, an optimization procedure that targets the characteristics of current processor architectures has not been developed. In this paper, we develop an integrated optimization framework that addresses a number of issues, including tiling for the memory hierarchy, effective handling of memory misalignment, utilizing memory subsystem characteristics, and the exploitation of the parallelism provided by the vector instruction sets in current processors. A judicious combination of analytical and empirical approaches is used to determine the most appropriate optimizations. The absence of problem information until execution time is handled by generating multiple versions of the code; the best version is chosen at run time, with assistance from minimal-overhead inspectors. The approach highlights aspects of empirical optimization that are important for similar computations with little temporal reuse. Experimental results on the PowerPC G5 and Intel Pentium 4 demonstrate the effectiveness of the developed framework. Categories and Subject Descriptors: D.3.4 [Programming Languages]: Processors—code generation; compilers; optimization
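The multiple-versions idea can be sketched as a toy in Python; the candidate tile sizes and the timing-based inspector below are assumptions for illustration, not the paper's code generator.

```python
# A toy illustration of generating several variants of a kernel and
# letting a cheap run-time inspector pick the fastest. The tile sizes
# are arbitrary assumptions; a real system would also vary alignment
# handling and vectorization strategy per the paper.
import time

def make_transpose(B):
    """Build an out-of-place tiled transpose variant with tile size B."""
    def transpose(a, n):
        out = [0] * (n * n)
        for ii in range(0, n, B):
            for jj in range(0, n, B):
                for i in range(ii, min(ii + B, n)):
                    for j in range(jj, min(jj + B, n)):
                        out[j * n + i] = a[i * n + j]
        return out
    return transpose

def pick_version(versions, n=128):
    """Time each variant once on a sample input and keep the fastest."""
    sample = list(range(n * n))
    def cost(f):
        t0 = time.perf_counter()
        f(sample, n)
        return time.perf_counter() - t0
    return min(versions, key=cost)

versions = [make_transpose(B) for B in (8, 16, 32, 64)]
best = pick_version(versions)
```

All variants compute the same result; only their memory behavior differs, so the inspector's overhead is amortized over later calls on same-shaped inputs.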
Towards a Theory of Cache-Efficient Algorithms
, 2000
Abstract
We describe a model that enables us to analyze the running time of an algorithm in a computer with a memory hierarchy with limited associativity, in terms of various cache parameters. Our model, an extension of Aggarwal and Vitter’s I/O model, enables us to establish useful relationships between the cache complexity and the I/O complexity of computations. As a corollary, we obtain cache-optimal algorithms for some fundamental problems like sorting, FFT, and an important subclass of permutations in the single-level cache model. We also show that ignoring associativity concerns could lead to inferior performance, by analyzing the average-case cache behavior of mergesort. We further extend our model to multiple levels of cache with limited associativity and present optimal algorithms for matrix transpose and sorting. Our techniques may be used for systematic exploitation of the memory hierarchy starting from the algorithm design stage, and dealing with the hitherto unresolved problem of limited associativity.
Performance Optimization of Tensor Contraction Expressions for Many Body Methods in Quantum Chemistry
Abstract
Complex tensor contraction expressions arise in accurate electronic structure models in quantum chemistry, such as the coupled cluster method. This paper addresses two complementary aspects of performance optimization of such tensor contraction expressions. Transformations using algebraic properties of commutativity and associativity can be used to significantly decrease the number of arithmetic operations required for evaluation of these expressions. The identification of common subexpressions among a set of tensor contraction expressions can result in a reduction of the total number of operations required to evaluate the tensor contractions. The first part of the paper describes an effective algorithm for operation minimization
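Why reassociation shrinks operation counts can be seen on a tiny contraction; this example is illustrative and is not the paper's algorithm. Evaluating S[a,b] = Σ_{c,d} A[a,c] B[c,d] C[d,b] directly costs O(N^4) multiply-adds, while factoring through an intermediate T = A·B costs O(N^3).

```python
# Two evaluations of the same contraction S[a,b] = sum_{c,d}
# A[a,c]*B[c,d]*C[d,b] over N x N operands (plain nested lists).

def contract_direct(A, B, C, N):
    """Direct evaluation: four nested loops, O(N^4) multiplies."""
    S = [[0.0] * N for _ in range(N)]
    for a in range(N):
        for b in range(N):
            for c in range(N):
                for d in range(N):
                    S[a][b] += A[a][c] * B[c][d] * C[d][b]
    return S

def contract_factored(A, B, C, N):
    """Factored evaluation via T = A.B, then S = T.C: O(N^3) each."""
    T = [[sum(A[a][c] * B[c][d] for c in range(N)) for d in range(N)]
         for a in range(N)]
    return [[sum(T[a][d] * C[d][b] for d in range(N)) for b in range(N)]
            for a in range(N)]
```

Both return the same tensor; choosing which intermediates to form, and sharing them across several contractions, is the operation-minimization problem the paper studies at scale.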
Programming and compiling for embedded SIMD architectures
, 2008
"... The dissertation is submitted ..."