Results 1  10
of
56
Nonlinear Array Layouts for Hierarchical Memory Systems
, 1999
"... Programming languages that provide multidimensional arrays and a flat linear model of memory must implement a mapping between these two domains to order array elements in memory. This layout function is fixed at language definition time and constitutes an invisible, nonprogrammable array attribute. ..."
Abstract

Cited by 78 (5 self)
 Add to MetaCart
(Show Context)
Programming languages that provide multidimensional arrays and a flat linear model of memory must implement a mapping between these two domains to order array elements in memory. This layout function is fixed at language definition time and constitutes an invisible, nonprogrammable array attribute. In reality, modern memory systems are architecturally hierarchical rather than flat, with substantial differences in performance among different levels of the hierarchy. This mismatch between the model and the true architecture of memory systems can result in low locality of reference and poor performance. Some of this loss in performance can be recovered by reordering computations using transformations such as loop tiling. We explore nonlinear array layout functions as an additional means of improving locality of reference. For a benchmark suite composed of dense matrix kernels, we show by timing and simulation that two specific layouts (4D and Morton) have low implementation costs (25% of total running time) and high performance benefits (reducing execution time by factors of 1.12.5); that they have smooth performance curves, both across a wide range of problem sizes and over representative cache architectures; and that recursionbased control structures may be needed to fully exploit their potential.
Cacheoblivious priority queue and graph algorithm applications
 In Proc. 34th Annual ACM Symposium on Theory of Computing
, 2002
"... In this paper we develop an optimal cacheoblivious priority queue data structure, supporting insertion, deletion, and deletemin operations in O ( 1 B logM/B N) amortized memory B transfers, where M and B are the memory and block transfer sizes of any two consecutive levels of a multilevel memory hi ..."
Abstract

Cited by 68 (9 self)
 Add to MetaCart
In this paper we develop an optimal cacheoblivious priority queue data structure, supporting insertion, deletion, and deletemin operations in O ( 1 B logM/B N) amortized memory B transfers, where M and B are the memory and block transfer sizes of any two consecutive levels of a multilevel memory hierarchy. In a cacheoblivious data structure, M and B are not used in the description of the structure. The bounds match the bounds of several previously developed externalmemory (cacheaware) priority queue data structures, which all rely crucially on knowledge about M and B. Priority queues are a critical component in many of the best known externalmemory graph algorithms, and using our cacheoblivious priority queue we develop several cacheoblivious graph algorithms.
A memory model for scientific algorithms on graphics processors
 in Proc. of the ACM/IEEE Conference on Supercomputing (SC’06
, 2006
"... We present a memory model to analyze and improve the performance of scientific algorithms on graphics processing units (GPUs). Our memory model is based on texturing hardware, which uses a 2D blockbased array representation to perform the underlying computations. We incorporate many characteristics ..."
Abstract

Cited by 67 (4 self)
 Add to MetaCart
We present a memory model to analyze and improve the performance of scientific algorithms on graphics processing units (GPUs). Our memory model is based on texturing hardware, which uses a 2D blockbased array representation to perform the underlying computations. We incorporate many characteristics of GPU architectures including smaller cache sizes, 2D block representations, and use the 3C’s model to analyze the cache misses. Moreover, we present techniques to improve the performance of nested loops on GPUs. In order to demonstrate the effectiveness of our model, we highlight its performance on three memoryintensive scientific applications – sorting, fast Fourier transform and dense matrixmultiplication. In practice, our cacheefficient algorithms for these applications are able to achieve memory throughput of 30–50 GB/s on a NVIDIA 7900 GTX GPU. We also compare our results with prior GPUbased and CPUbased implementations on highend processors. In practice, we are able to achieve 2–5× performance improvement.
Cacheoblivious mesh layouts
 ACM Trans. Graph
, 2005
"... ACM acknowledges that this contribution was authored or coauthored by a contractor of affiliate of the U.S. Government. As such, the Government retains a nonexclusive, royaltyfree right to publish or reproduce this article, or to allow others to do so, for Government purposes only. ..."
Abstract

Cited by 51 (11 self)
 Add to MetaCart
ACM acknowledges that this contribution was authored or coauthored by a contractor of affiliate of the U.S. Government. As such, the Government retains a nonexclusive, royaltyfree right to publish or reproduce this article, or to allow others to do so, for Government purposes only.
Optimizing Graph Algorithms for Improved Cache Performance
 In Proc. Int’l Parallel and Distributed Processing Symp. (IPDPS 2002), Fort Lauderdale, FL
, 2002
"... In this paper, we develop algorithmic optimizations to improve the cache performance of four fundamental graph algorithms. We present a cacheoblivious implementation of the FloydWarshall Algorithm for the fundamental graph problem of allpairs shortest paths by relaxing some dependencies in the it ..."
Abstract

Cited by 48 (0 self)
 Add to MetaCart
In this paper, we develop algorithmic optimizations to improve the cache performance of four fundamental graph algorithms. We present a cacheoblivious implementation of the FloydWarshall Algorithm for the fundamental graph problem of allpairs shortest paths by relaxing some dependencies in the iterative version. We show that this implementation achieves the lower bound on processormemory traffic of Ω(N 3 / C), where N and C are the problem size and cache size respectively. Experimental results show that this cacheoblivious implementation shows more than 6× improvement in real execution time over that of the iterative implementation with the usual row major data layout, on three stateoftheart architectures. Secondly, we address Dijkstra's algorithm for the singlesource shortest paths problem and Prim's algorithm for Minimum Spanning Tree. For these algorithms, we demonstrate up to 2 × improvement in real execution time by using a simple cachefriendly graph representation, namely adjacency arrays. Finally, we address the matching algorithm for bipartite graphs. We show performance improvements of 2 × 3 × in real execution time by using the technique of making the algorithm initially work on subproblems to generate a suboptimal solution and then solving the whole problem using the suboptimal solution as a starting point.
Recursive Array Layouts and Fast Matrix Multiplication
, 1999
"... The performance of both serial and parallel implementations of matrix multiplication is highly sensitive to memory system behavior. False sharing and cache conflicts cause traditional columnmajor or rowmajor array layouts to incur high variability in memory system performance as matrix size var ..."
Abstract

Cited by 37 (0 self)
 Add to MetaCart
(Show Context)
The performance of both serial and parallel implementations of matrix multiplication is highly sensitive to memory system behavior. False sharing and cache conflicts cause traditional columnmajor or rowmajor array layouts to incur high variability in memory system performance as matrix size varies. This paper investigates the use of recursive array layouts to improve performance and reduce variability. Previous work on recursive matrix multiplication is extended to examine several recursive array layouts and three recursive algorithms: standard matrix multiplication, and the more complex algorithms of Strassen and Winograd. While recursive layouts significantly outperform traditional layouts (reducing execution times by a factor of 1.22.5) for the standard algorithm, they offer little improvement for Strassen's and Winograd's algorithms. For a purely sequential implementation, it is possible to reorder computation to conserve memory space and improve performance between ...
Scanning and traversing: maintaining data for traversals in a memory hierarchy
 In Proceedings of the 10th Annual European Symposium on Algorithms
, 2002
"... Abstract. We study the problem of maintaining a dynamic ordered set subject to insertions, deletions, and traversals of k consecutive elements. This problem is trivially solved on a RAM and on a simple twolevel memory hierarchy. We explore this traversal problem on more realistic memory models: the ..."
Abstract

Cited by 29 (11 self)
 Add to MetaCart
Abstract. We study the problem of maintaining a dynamic ordered set subject to insertions, deletions, and traversals of k consecutive elements. This problem is trivially solved on a RAM and on a simple twolevel memory hierarchy. We explore this traversal problem on more realistic memory models: the cacheoblivious model, which applies to unknown and multilevel memory hierarchies, and sequentialaccess models, where sequential block transfers are less expensive than random block transfers. 1
CacheEfficient Matrix Transposition
"... We investigate the memory system performance of several algorithms for transposing an N N matrix inplace, where N is large. Specifically, we investigate the relative contributions of the data cache, the translation lookaside buffer, register tiling, and the array layout function to the overall runn ..."
Abstract

Cited by 27 (1 self)
 Add to MetaCart
We investigate the memory system performance of several algorithms for transposing an N N matrix inplace, where N is large. Specifically, we investigate the relative contributions of the data cache, the translation lookaside buffer, register tiling, and the array layout function to the overall running time of the algorithms. We use various memory models to capture and analyze the effect of various facets of cache memory architecture that guide the choice of a particular algorithm, and attempt to experimentally validate the predictions of the model. Our major conclusions are as follows: limited associativity in the mapping from main memory addresses to cache sets can significantly degrade running time; the limited number of TLB entries can easily lead to thrashing; the fanciest optimal algorithms are not competitive on real machines even at fairly large problem sizes unless cache miss penalties are quite high; lowlevel performance tuning “hacks”, such as register tiling and array alignment, can significantly distort the effects of improved algorithms; and hierarchical nonlinear layouts are inherently superior to the standard canonical layouts (such as row or columnmajor) for
this problem.
The Cost of CacheOblivious Searching
, 2003
"... Tight bounds on the cost of cacheoblivious searching are proved. It is shown that no cacheoblivious search structure can guarantee that a search performs fewer than lg e logB N block transfers between any two levels of the memory hierarchy. This lowerbound holds even if all of the block sizes ar ..."
Abstract

Cited by 17 (7 self)
 Add to MetaCart
Tight bounds on the cost of cacheoblivious searching are proved. It is shown that no cacheoblivious search structure can guarantee that a search performs fewer than lg e logB N block transfers between any two levels of the memory hierarchy. This lowerbound holds even if all of the block sizes are limited to be powers of 2. A modified version of the van Emde Boas layout is proposed, whose expected block transfers between any two levels of the memory hierarchy arbitrarily close to [lg e + O(lg lg B / lg B)] logB N + O(1).This factor approaches lg e ss 1.443 as B increases. The expectation is taken over the random placement of the first element of the structure in memory. As searching in the Disk Access Model (DAM) can be performed in logB N + 1 block transfers, this result shows a separation between the 2level DAM and cacheoblivious memoryhierarchy models. By extending the DAM model to k levels, multilevel memoryhierarchies can be modelled. It is shown that as k grows, the search costs of the optimal klevel DAM search structure and of the optimal cacheoblivious search structure rapidlyconverge. This demonstrates that for a multilevel memory hierarchy, a simple cacheoblivious structure almost replicates the performance of an optimal parameterized klevel DAM structure.
Efficient Sorting Using Registers and Caches
 in Proceedings of the 4th Workshop on Algorithm Engineering (WAE 2000
, 2000
"... Modern computer systems have increasingly complex memory systems.Common machine models for algorithm analysis do not reflect many of the features... ..."
Abstract

Cited by 16 (4 self)
 Add to MetaCart
(Show Context)
Modern computer systems have increasingly complex memory systems.Common machine models for algorithm analysis do not reflect many of the features...