Results 1 
4 of
4
AutoBlocking MatrixMultiplication or Tracking BLAS3 Performance from Source Code
 In Proceedings of the Sixth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
, 1997
"... An elementary, machineindependent, recursive algorithm for matrix multiplication C+=A*B provides implicit blocking at every level of the memory hierarchy and tests out faster than classically optimal code, tracking handcoded BLAS3 routines. ``Proof of concept'' is demonstrated by racing the inpla ..."
Abstract

Cited by 76 (6 self)
 Add to MetaCart
An elementary, machineindependent, recursive algorithm for matrix multiplication C+=A*B provides implicit blocking at every level of the memory hierarchy and tests out faster than classically optimal code, tracking handcoded BLAS3 routines. ``Proof of concept'' is demonstrated by racing the inplace algorithm against manufacturer's handtuned BLAS3 routines; it can win. The recursive code bifurcates naturally at the top level into independent blockoriented processes, that each writes to a disjoint and contiguous region of memory. Experience has shown that the indexing vastly improves the patterns of memory access at all levels of the memory hierarchy, independently of the sizes of caches or pages and without ad hoc programming. It also exposed a weakness in SGI's C compilers that merrily unroll loops for the superscalar R8000 processor, but do not analogously unfold the base cases of the most elementary recursions. Such deficiencies might deter programmers from using this rich class of recursive algorithms.
Matrix Algebra and Applicative Programming
 Functional Programming Languages and Computer Architecture (Proceedings
, 1987
"... General Term: Algorithms. The broad problem of matrix algebra is taken up from the perspective of functional programming. Akey question is how arrays should be represented in order to admit good implementations of wellknown e cient algorithms, and whether functional architecture sheds any new ligh ..."
Abstract

Cited by 12 (1 self)
 Add to MetaCart
General Term: Algorithms. The broad problem of matrix algebra is taken up from the perspective of functional programming. Akey question is how arrays should be represented in order to admit good implementations of wellknown e cient algorithms, and whether functional architecture sheds any new light on these or other solutions. It relates directly to disarming the \aggregate update " problem. The major thesis is that 2 dary trees should be used to represent ddimensional arrays � examples are matrix operations (d = 2), and a particularly interesting vector (d = 1) algorithm. Sparse and dense matrices are represented homogeneously, but at some overhead that appears tolerable � encouraging results are reviewed and extended. A Pivot Step algorithm is described which o ers optimal stability at no extra cost for searching. The new results include proposed sparseness measures for matrices, improved performance of stable matrix inversion through repeated pivoting while deep within a matrixtree (extendible to solving linear systems), and a clean matrix derivation of the vector algorithm for the fast Fourier transform. Running code is o ered in the appendices.
Seven at one stroke: Results from a cacheoblivious paradigm for scalable matrix algorithms
 In MSPC ’06: Proc. 2006 Wkshp. Memory System Performance and Correctness
, 2006
"... A blossoming paradigm for blockrecursive matrix algorithms is presented that, at once, attains excellent performance measured by • time, • TLB misses, • L1 misses, • L2 misses, • paging to disk, • scaling on distributed processors, and • portability to multiple platforms. It provides a philosophy a ..."
Abstract

Cited by 7 (2 self)
 Add to MetaCart
A blossoming paradigm for blockrecursive matrix algorithms is presented that, at once, attains excellent performance measured by • time, • TLB misses, • L1 misses, • L2 misses, • paging to disk, • scaling on distributed processors, and • portability to multiple platforms. It provides a philosophy and tools that allow the programmer to deal with the memory hierarchy invisibly, from L1 and L2 to TLB, paging, and interprocessor communication. Used together, they provide a cacheoblivious style of programming. Plots are presented to support these claims on an implementation of Cholesky factorization crafted directly from the paradigm in C with a few intrinsic calls. The results in this paper focus on lowlevel performance, including the new Mortonhybrid representation to take advantage of hardware and compiler optimizations. In particular, this code beats Intel’s Matrix Kernel Library and matches AMD’s Core Math Library, losing a bit on L1 misses while winning decisively on TLBmisses.
Matrix Algorithms using Quadtrees Invited Talk, ATABLE92
 in Proc. ATABLE92
, 1992
"... Many scheduling and synchronization problems for largescale multiprocessing can be overcome using functional (or applicative) programming. With this observation, it is strange that so much attention within the functional programming community has focused on the "aggregate update problem" [10]: esse ..."
Abstract

Cited by 1 (0 self)
 Add to MetaCart
Many scheduling and synchronization problems for largescale multiprocessing can be overcome using functional (or applicative) programming. With this observation, it is strange that so much attention within the functional programming community has focused on the "aggregate update problem" [10]: essentially how to implement FORTRAN arrays. This situation is strange because inplace updating of aggregates belongs more to uniprocessing than to mathematics. Several years ago functional style drew me to treatment of ddimen sional arrays as 2 d ary trees; in particular, matrices become quaternary trees or quadtrees. This convention yields efficient recopyingcumupdate of any array; recursive, algebraic decomposition of conventional arithmetic algorithms; and uniform representations and algorithms for both dense and sparse matrices. For instance, any nonsingular subtree is a candidate as the pivot block for Gaussian elimination; the restriction actually helps identification of pivot b...