Results 11  20
of
27
Graph Expansion and Communication Costs of Fast Matrix Multiplication
"... The communication cost of algorithms (also known as I/Ocomplexity) is shown to be closely related to the expansion properties of the corresponding computation graphs. We demonstrate this on Strassen’s and other fast matrix multiplication algorithms, and obtain the first lower bounds on their communi ..."
Abstract

Cited by 13 (11 self)
 Add to MetaCart
The communication cost of algorithms (also known as I/Ocomplexity) is shown to be closely related to the expansion properties of the corresponding computation graphs. We demonstrate this on Strassen’s and other fast matrix multiplication algorithms, and obtain the first lower bounds on their communication costs. For sequential algorithms these bounds are attainable and so optimal. 1.
Cacheoblivious algorithms (Extended Abstract)
 In Proc. 40th Annual Symposium on Foundations of Computer Science
, 1999
"... This paper presents asymptotically optimal algorithms for rectangular matrix transpose, FFT, and sorting on computers with multiple levels of caching. Unlike previous optimal algorithms, these algorithms are cache oblivious: no variables dependent on hardware parameters, such as cache size and cach ..."
Abstract

Cited by 12 (1 self)
 Add to MetaCart
This paper presents asymptotically optimal algorithms for rectangular matrix transpose, FFT, and sorting on computers with multiple levels of caching. Unlike previous optimal algorithms, these algorithms are cache oblivious: no variables dependent on hardware parameters, such as cache size and cacheline length, need to be tuned to achieve optimality. Nevertheless, these algorithms use an optimal amount of work and move data optimally among multiple levels of cache. For a cache with size Z and cacheline length L where Z � Ω � L 2 � the number of cache misses for an m � n matrix transpose is Θ � 1 � mn � L �. The number of cache misses for either an npoint FFT or the sorting of n numbers is Θ � 1 �� � n � L � � 1 � log Z n �� �. We also give an Θ � mnp �work algorithm to multiply an m � n matrix by an n � p matrix that incurs Θ � 1 �� � mn � np � mp � � L � mnp � L � Z � cache faults. We introduce an “idealcache ” model to analyze our algorithms. We prove that an optimal cacheoblivious algorithm designed for two levels of memory is also optimal for multiple levels and that the assumption of optimal replacement in the idealcache model can be simulated efficiently by LRU replacement. We also provide preliminary empirical results on the effectiveness of cacheoblivious algorithms in practice.
Cache oblivious algorithms
 Algorithms for Memory Hierarchies, LNCS 2625
, 2003
"... Abstract. The cache oblivious model is a simple and elegant model to design algorithms that perform well in hierarchical memory models ubiquitous on current systems. This model was first formulated in [22] and has since been a topic of intense research. Analyzing and designing algorithms and data st ..."
Abstract

Cited by 12 (0 self)
 Add to MetaCart
Abstract. The cache oblivious model is a simple and elegant model to design algorithms that perform well in hierarchical memory models ubiquitous on current systems. This model was first formulated in [22] and has since been a topic of intense research. Analyzing and designing algorithms and data structures in this model involves not only an asymptotic analysis of the number of steps executed in terms of the input size, but also the movement of data optimally among the different levels of the memory hierarchy. This chapter is aimed as an introduction to the “idealcache ” model of [22] and techniques used to design cache oblivious algorithms. The chapter also presents some experimental insights and results. Part of this work was done while the author was visiting MPISaarbrücken. The
Towards an Optimal BitReversal Permutation Program
 In Proceeding of IEEE Foundations of Computer Science
, 1998
"... The speed of many computations is limited not by the number of arithmetic operations but by the time it takes to move and rearrange data in the increasingly complicated memory hierarchies of modern computers. Array transpose and the bitreversal permutation  trivial operations on a RAM  present ..."
Abstract

Cited by 11 (2 self)
 Add to MetaCart
The speed of many computations is limited not by the number of arithmetic operations but by the time it takes to move and rearrange data in the increasingly complicated memory hierarchies of modern computers. Array transpose and the bitreversal permutation  trivial operations on a RAM  present nontrivial problems when designing highlytuned scientific library functions, particular for the Fast Fourier Transform. We prove a precise bound for RoCol, a simple pebbletype game that is relevant to implementing these permutations. We use RoCol to give lower bounds on the amount of memory traffic in a computer with fourlevels of memory (registers, cache, TLB, and memory), taking into account such "messy" features as block moves and setassociative caches. The insights from this analysis lead to a bitreversal algorithm whose performance is close to the theoretical minimum. Experiments show it performs significantly better than every program in a comprehensive study of 30 published algo...
Memory Hierarchy Considerations for Fast Transpose and BitReversals
 In Proceedings of HPCS 5
, 1999
"... This paper explores the interplay between algorithm design and a computer's memory hierarchy. Matrix transpose and the bitreversal reordering are important scientific subroutines which often exhibit severe performance degradation due to cache and TLB associativity problems. We give lower bounds tha ..."
Abstract

Cited by 8 (1 self)
 Add to MetaCart
This paper explores the interplay between algorithm design and a computer's memory hierarchy. Matrix transpose and the bitreversal reordering are important scientific subroutines which often exhibit severe performance degradation due to cache and TLB associativity problems. We give lower bounds that show for typical memory hierarchy designs, extra data movement is unavoidable. We also prescribe characteristics of various levels of the memory hierarchy needed to perform efficient bitreversals. Insight gained from our analysis leads to the design of a near optimal bitreversal algorithm. This Cache Optimal Bit Reverse Algorithm (COBRA) is implemented on the Digital Alpha 21164, Sun Ultrasparc 2, and IBM Power2. We show that COBRA is near optimal with respect to execution time on these machines and performs much better than previous best known algorithms. Copyright 1998 IEEE. Published in the Proceedings of HPCA 5, 913 January 1999 in Orlando, FL. Personal use of this material is permi...
Cacheoblivious algorithms and data structures
 IN SWAT
, 2004
"... Frigo, Leiserson, Prokop and Ramachandran in 1999 introduced the idealcache model as a formal model of computation for developing algorithms in environments with multiple levels of caching, and coined the terminology of cacheoblivious algorithms. Cacheoblivious algorithms are described as stand ..."
Abstract

Cited by 8 (1 self)
 Add to MetaCart
Frigo, Leiserson, Prokop and Ramachandran in 1999 introduced the idealcache model as a formal model of computation for developing algorithms in environments with multiple levels of caching, and coined the terminology of cacheoblivious algorithms. Cacheoblivious algorithms are described as standard RAM algorithms with only one memory level, i.e. without any knowledge about memory hierarchies, but are analyzed in the twolevel I/O model of Aggarwal and Vitter for an arbitrary memory and block size and an optimal offline cache replacement strategy. The result are algorithms that automatically apply to multilevel memory hierarchies. This paper gives an overview of the results achieved on cacheoblivious algorithms and data structures since the seminal paper by Frigo et al.
A unified model for multicore architectures
 In Proc. 1st International Forum on NextGeneration Multicore/Manycore Technologies
, 2008
"... With the advent of multicore and many core architectures, we are facing a problem that is new to parallel computing, namely, the management of hierarchical parallel caches. One major limitation of all earlier models is their inability to model multicore processors with varying degrees of sharing of ..."
Abstract

Cited by 7 (1 self)
 Add to MetaCart
With the advent of multicore and many core architectures, we are facing a problem that is new to parallel computing, namely, the management of hierarchical parallel caches. One major limitation of all earlier models is their inability to model multicore processors with varying degrees of sharing of caches at different levels. We propose a unified memory hierarchy model that addresses these limitations and is an extension of the MHG model developed for a single processor with multimemory hierarchy. We demonstrate that our unified framework can be applied to a number of multicore architectures for a variety of applications. In particular, we derive lower bounds on memory traffic between different levels in the hierarchy for financial and scientific computations. We also give a multicore algorithms for a financial
Cacheoptimal algorithms for option pricing
, 2008
"... Today computers have several levels of memory hierarchy. To obtain good performance on these processors it is necessary to design algorithms that minimize I/O traffic to slower memories in the hierarchy. In this paper, we study the computation of option pricing using the binomial and trinomial model ..."
Abstract

Cited by 5 (4 self)
 Add to MetaCart
Today computers have several levels of memory hierarchy. To obtain good performance on these processors it is necessary to design algorithms that minimize I/O traffic to slower memories in the hierarchy. In this paper, we study the computation of option pricing using the binomial and trinomial models on processors with a multilevel memory hierarchy. We derive lower bounds on memory traffic between different levels of hierarchy for these two models. We also develop algorithms for the binomial and trinomial models that have nearoptimal memory traffic between levels. We have implemented these algorithms on an UltraSparc IIIi processor with a 4level of memory hierarchy and demonstrated that our algorithms outperform algorithms without cache blocking by a factor of up to 5 and operate at 70 % of peak performance.
An Optimal CacheOblivious Priority Queue and its Application to Graph Algorithms
 SIAM JOURNAL ON COMPUTING
, 2007
"... We develop an optimal cacheoblivious priority queue data structure, supporting insertion, deletion, and deletemin operations in $O(\frac{1}{B}\log_{M/B}\frac{N}{B})$ amortized memory transfers, where $M$ and $B$ are the memory and block transfer sizes of any two consecutive levels of a multilevel ..."
Abstract

Cited by 5 (0 self)
 Add to MetaCart
We develop an optimal cacheoblivious priority queue data structure, supporting insertion, deletion, and deletemin operations in $O(\frac{1}{B}\log_{M/B}\frac{N}{B})$ amortized memory transfers, where $M$ and $B$ are the memory and block transfer sizes of any two consecutive levels of a multilevel memory hierarchy. In a cacheoblivious data structure, $M$ and $B$ are not used in the description of the structure. Our structure is as efficient as several previously developed external memory (cacheaware) priority queue data structures, which all rely crucially on knowledge about $M$ and $B$. Priority queues are a critical component in many of the best known external memory graph algorithms, and using our cacheoblivious priority queue we develop several cacheoblivious graph algorithms.
Strong i/o lower bounds for binomial and fft computation graphs
, 2010
"... Abstract. Processors on most of the modern computing devices have several levels of memory hierarchy. To obtain good performance on these processors it is necessary to design algorithms that minimize I/O traffic to slower memories in the hierarchy. In this paper, we propose a new technique, the boun ..."
Abstract

Cited by 2 (2 self)
 Add to MetaCart
Abstract. Processors on most of the modern computing devices have several levels of memory hierarchy. To obtain good performance on these processors it is necessary to design algorithms that minimize I/O traffic to slower memories in the hierarchy. In this paper, we propose a new technique, the boundary flow technique, for deriving lower bounds on the memory traffic complexity of problems in multilevel memory hierarchy architectures. The boundary flow technique relies on identifying subcomputation structure corresponding to equal computations with a minimum number of boundary vertices, which in turn is related to the vertex isoperimetric parameter of a computation graph. We demonstrate that this technique results in stronger lower bounds for memory traffic on memory hierarchy architectures for wellknown computation structures: the binomial computation graphs and FFT computation graphs. For binomial computation we improve the lower bound by a factor of three. This reduces the gap between the lower and upper bound from a factor of 4 to a factor of 4/3. For FFT computation, past work has mostly focused on asymptotic lower bounds. We improve the best known previous lower bound for FFT computation by a factor of 8. The lower bound established is almost optimal as there exists a simple FFT algorithm that nearly achieves this bound. 1