Results 1  10
of
53
Cacheoblivious algorithms
, 1999
"... requirements for the degree of Master of Science. This thesis presents "cacheoblivious " algorithms that use asymptotically optimal amounts of work, and move data asymptotically optimally among multiple levels of cache. An algorithm is cache oblivious if no program variables dependent on ..."
Abstract

Cited by 90 (1 self)
 Add to MetaCart
(Show Context)
requirements for the degree of Master of Science. This thesis presents "cacheoblivious " algorithms that use asymptotically optimal amounts of work, and move data asymptotically optimally among multiple levels of cache. An algorithm is cache oblivious if no program variables dependent on hardware configuration parameters, such as cache size and cacheline length need to be tuned to minimize the number of cache misses. We show that the ordinary algorithms for matrix transposition, matrix multiplication, sorting, and Jacobistyle multipass filtering are not cache optimal. We present algorithms for rectangular matrix transposition, FFT, sorting, and multipass filters, which are asymptotically optimal on computers with multiple levels of caches. For a cache with size Z and cacheline length L, where Z = (L2), the number of cache misses for an m x n matrix transpose is E(1 + mn/L). The number of cache misses for either an npoint FFT or the sorting of n numbers is 0(1 + (n/L)(1 + logzn)). The cache complexity of computing n time steps of a Jacobistyle multipass filter on an array of size n is E(1 + n/L + n2 /ZL). We also give an 8(mnp)work algorithm to multiply an m x n matrix by an n x p matrix
Lugpu: Efficient algorithms for solving dense linear systems on graphics hardware
 in SC ’05: Proceedings of the 2005 ACM/IEEE Conference on Supercomputing
, 2005
"... We present a novel algorithm to solve dense linear systems using graphics processors (GPUs). We reduce matrix decomposition and row operations to a series of rasterization problems on the GPU. These include new techniques for streaming index pairs, swapping rows and columns and parallelizing the com ..."
Abstract

Cited by 79 (4 self)
 Add to MetaCart
We present a novel algorithm to solve dense linear systems using graphics processors (GPUs). We reduce matrix decomposition and row operations to a series of rasterization problems on the GPU. These include new techniques for streaming index pairs, swapping rows and columns and parallelizing the computation to utilize multiple vertex and fragment processors. We also use appropriate data representations to match the rasterization order and cache technology of graphics processors. We have implemented our algorithm on different GPUs and compared the performance with optimized CPU implementations. In particular, our implementation on a NVIDIA GeForce 7800 GPU outperforms a CPUbased ATLAS implementation. Moreover, our results show that our algorithm is cache and bandwidth efficient and scales well with the number of fragment processors within the GPU and the core GPU clock rate. We use our algorithm for fluid flow simulation and demonstrate that the commodity GPU is a useful coprocessor for many scientific applications. 1
Nonlinear Array Layouts for Hierarchical Memory Systems
, 1999
"... Programming languages that provide multidimensional arrays and a flat linear model of memory must implement a mapping between these two domains to order array elements in memory. This layout function is fixed at language definition time and constitutes an invisible, nonprogrammable array attribute. ..."
Abstract

Cited by 78 (5 self)
 Add to MetaCart
(Show Context)
Programming languages that provide multidimensional arrays and a flat linear model of memory must implement a mapping between these two domains to order array elements in memory. This layout function is fixed at language definition time and constitutes an invisible, nonprogrammable array attribute. In reality, modern memory systems are architecturally hierarchical rather than flat, with substantial differences in performance among different levels of the hierarchy. This mismatch between the model and the true architecture of memory systems can result in low locality of reference and poor performance. Some of this loss in performance can be recovered by reordering computations using transformations such as loop tiling. We explore nonlinear array layout functions as an additional means of improving locality of reference. For a benchmark suite composed of dense matrix kernels, we show by timing and simulation that two specific layouts (4D and Morton) have low implementation costs (25% of total running time) and high performance benefits (reducing execution time by factors of 1.12.5); that they have smooth performance curves, both across a wide range of problem sizes and over representative cache architectures; and that recursionbased control structures may be needed to fully exploit their potential.
Global Static Indexing for Realtime Exploration of Very Large Regular Grids
, 2001
"... In this paper we introduce a new indexing scheme for progressive traversal and visualization of large regular grids. We demonstrate the potential of our approach by providing a tool that displays at interactive rates planar slices of scalar field data with very modest computing resources. We obtain ..."
Abstract

Cited by 63 (9 self)
 Add to MetaCart
(Show Context)
In this paper we introduce a new indexing scheme for progressive traversal and visualization of large regular grids. We demonstrate the potential of our approach by providing a tool that displays at interactive rates planar slices of scalar field data with very modest computing resources. We obtain unprecedented results both in terms of absolute performance and, more importantly, in terms of scalability. On a laptop computer we provide real time interaction with a 2048 3 grid (8 Giganodes) using only 20MB of memory. On an SGI Onyx we slice interactively an 8192 3 grid ( teranodes) using only 60MB of memory. The scheme relies simply on the determination of an appropriate reordering of the rectilinear grid data and a progressive construction of the output slice. The reordering minimizes the amount of I/O performed during the outofcore computation. The progressive and asynchronous computation of the output provides flexible quality/speed tradeoffs and a timecritical and interruptible user interface. 1.
Recursive Array Layouts and Fast Matrix Multiplication
, 1999
"... The performance of both serial and parallel implementations of matrix multiplication is highly sensitive to memory system behavior. False sharing and cache conflicts cause traditional columnmajor or rowmajor array layouts to incur high variability in memory system performance as matrix size var ..."
Abstract

Cited by 37 (0 self)
 Add to MetaCart
(Show Context)
The performance of both serial and parallel implementations of matrix multiplication is highly sensitive to memory system behavior. False sharing and cache conflicts cause traditional columnmajor or rowmajor array layouts to incur high variability in memory system performance as matrix size varies. This paper investigates the use of recursive array layouts to improve performance and reduce variability. Previous work on recursive matrix multiplication is extended to examine several recursive array layouts and three recursive algorithms: standard matrix multiplication, and the more complex algorithms of Strassen and Winograd. While recursive layouts significantly outperform traditional layouts (reducing execution times by a factor of 1.22.5) for the standard algorithm, they offer little improvement for Strassen's and Winograd's algorithms. For a purely sequential implementation, it is possible to reorder computation to conserve memory space and improve performance between ...
CacheEfficient Matrix Transposition
"... We investigate the memory system performance of several algorithms for transposing an N N matrix inplace, where N is large. Specifically, we investigate the relative contributions of the data cache, the translation lookaside buffer, register tiling, and the array layout function to the overall runn ..."
Abstract

Cited by 27 (1 self)
 Add to MetaCart
We investigate the memory system performance of several algorithms for transposing an N N matrix inplace, where N is large. Specifically, we investigate the relative contributions of the data cache, the translation lookaside buffer, register tiling, and the array layout function to the overall running time of the algorithms. We use various memory models to capture and analyze the effect of various facets of cache memory architecture that guide the choice of a particular algorithm, and attempt to experimentally validate the predictions of the model. Our major conclusions are as follows: limited associativity in the mapping from main memory addresses to cache sets can significantly degrade running time; the limited number of TLB entries can easily lead to thrashing; the fanciest optimal algorithms are not competitive on real machines even at fairly large problem sizes unless cache miss penalties are quite high; lowlevel performance tuning “hacks”, such as register tiling and array alignment, can significantly distort the effects of improved algorithms; and hierarchical nonlinear layouts are inherently superior to the standard canonical layouts (such as row or columnmajor) for
this problem.
The CacheOblivious Gaussian Elimination Paradigm: Theoretical Framework, Parallelization and Experimental Evaluation
, 2009
"... We consider triplynested loops of the type that occur in the standard Gaussian elimination algorithm, which we denote by GEP (or the Gaussian Elimination Paradigm). We present two related cacheoblivious methods IGEP and CGEP, both of which reduce the number of I/Os performed by the computation o ..."
Abstract

Cited by 24 (5 self)
 Add to MetaCart
(Show Context)
We consider triplynested loops of the type that occur in the standard Gaussian elimination algorithm, which we denote by GEP (or the Gaussian Elimination Paradigm). We present two related cacheoblivious methods IGEP and CGEP, both of which reduce the number of I/Os performed by the computation over that performed by standard GEP by a factor of √ M, where M is the size of the cache. Cacheoblivious IGEP computes inplace and solves most of the known applications of GEP including Gaussian elimination and LUdecomposition without pivoting and FloydWarshall allpairs shortest paths. Cacheoblivious CGEP uses a modest amount of additional space, but is completely general and applies to any code in GEP form. Both IGEP and CGEP produce systemindependent cacheefficient code, and are potentially applicable to being used by optimizing compilers for loop transformation. We present parallel IGEP and CGEP that achieve good speedup and match the sequential caching performance cacheobliviously for both shared and distributed caches for sufficiently large inputs. We present extensive experimental results for both incore and outofcore performance of our algorithms. We consider both sequential and parallel implementations, and compare them with finelytuned cacheaware BLAS code for matrix multiplication and Gaussian elimination without pivoting. Our results indicate that cacheoblivious GEP offers an attractive tradeoff between efficiency and portability.
Portable HighPerformance Programs
, 1999
"... right notice and this permission notice are preserved on all copies. ..."
Abstract

Cited by 19 (0 self)
 Add to MetaCart
(Show Context)
right notice and this permission notice are preserved on all copies.
RKleene: A HighPerformance DivideandConquer Algorithm for the AllPair Shortest Path for Densely Connected Networks
, 2007
"... We propose a novel divideandconquer algorithm for the solution of the allpair shortestpath problem for directed and dense graphs with no negative cycles. We propose RKleene, a compact and inplace recursive algorithm inspired by Kleene’s algorithm. RKleene delivers a better performance than p ..."
Abstract

Cited by 17 (0 self)
 Add to MetaCart
We propose a novel divideandconquer algorithm for the solution of the allpair shortestpath problem for directed and dense graphs with no negative cycles. We propose RKleene, a compact and inplace recursive algorithm inspired by Kleene’s algorithm. RKleene delivers a better performance than previous algorithms for randomly generated graphs represented by highly dense adjacency matrices, in which the matrix components can have any integer value. We show that RKleene, unchanged and without any machine tuning, yields consistently between 1/7 and 1/2 of the peak performance running on five very different uniprocessor systems.