Results 1–9 of 9
Auto-Blocking Matrix-Multiplication or Tracking BLAS3 Performance from Source Code
 In Proceedings of the Sixth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
, 1997
Abstract
Cited by 77 (6 self)
An elementary, machine-independent, recursive algorithm for matrix multiplication C += A*B provides implicit blocking at every level of the memory hierarchy and tests out faster than classically optimal code, tracking hand-coded BLAS3 routines. ``Proof of concept'' is demonstrated by racing the in-place algorithm against the manufacturer's hand-tuned BLAS3 routines; it can win. The recursive code bifurcates naturally at the top level into independent block-oriented processes, each of which writes to a disjoint and contiguous region of memory. Experience has shown that the indexing vastly improves the patterns of memory access at all levels of the memory hierarchy, independently of the sizes of caches or pages and without ad hoc programming. It also exposed a weakness in SGI's C compilers, which merrily unroll loops for the superscalar R8000 processor but do not analogously unfold the base cases of the most elementary recursions. Such deficiencies might deter programmers from using this rich class of recursive algorithms.
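The recursion this abstract describes can be sketched in a few lines. The following is an illustrative reconstruction under assumed conventions (quadrant corners passed explicitly, a power-of-2 order, an arbitrary base-case cutoff), not the paper's code:

```python
# Hypothetical sketch of recursive C += A*B: split every operand into
# quadrants and recurse, so blocking falls out of the recursion itself
# rather than from cache-size-specific loop tiling.

def mat_mul_add(C, A, B, ci, cj, ai, aj, bi, bj, n, cutoff=2):
    """Accumulate the n-by-n block product of A and B into C.

    (ci, cj), (ai, aj), (bi, bj) are the top-left corners of the blocks;
    n is assumed to be a power of 2.
    """
    if n <= cutoff:                      # base case: plain triple loop
        for i in range(n):
            for j in range(n):
                for k in range(n):
                    C[ci + i][cj + j] += A[ai + i][aj + k] * B[bi + k][bj + j]
        return
    h = n // 2                           # quadrant size
    for i in (0, h):                     # C quadrant row offset
        for j in (0, h):                 # C quadrant column offset
            for k in (0, h):             # inner (summed) quadrant offset
                mat_mul_add(C, A, B, ci + i, cj + j,
                            ai + i, aj + k, bi + k, bj + j, h, cutoff)
```

The point of the scheme is that no cache size appears anywhere: blocks of every power-of-2 size arise from the recursion, so some level of the recursion fits each level of the memory hierarchy.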
Matrix Algebra and Applicative Programming
 In Functional Programming Languages and Computer Architecture (Proceedings)
, 1987
Abstract
Cited by 13 (1 self)
General Term: Algorithms. The broad problem of matrix algebra is taken up from the perspective of functional programming. A key question is how arrays should be represented in order to admit good implementations of well-known efficient algorithms, and whether functional architecture sheds any new light on these or other solutions. It relates directly to disarming the ``aggregate update'' problem. The major thesis is that 2^d-ary trees should be used to represent d-dimensional arrays; examples are matrix operations (d = 2), and a particularly interesting vector (d = 1) algorithm. Sparse and dense matrices are represented homogeneously, but at some overhead that appears tolerable; encouraging results are reviewed and extended. A Pivot Step algorithm is described which offers optimal stability at no extra cost for searching. The new results include proposed sparseness measures for matrices, improved performance of stable matrix inversion through repeated pivoting while deep within a matrix-tree (extendible to solving linear systems), and a clean matrix derivation of the vector algorithm for the fast Fourier transform. Running code is offered in the appendices.
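The 2^d-ary-tree thesis, specialized to d = 2, can be sketched with quadtrees whose all-zero blocks collapse to a shared Zero leaf. The tuple encoding and constructor names below are illustrative assumptions, not the paper's code:

```python
# A minimal quadtree matrix: either the Zero leaf (None), a scalar leaf
# ('L', x), or a node ('N', nw, ne, sw, se) of four quadrants. Sparse and
# dense matrices share the one representation because any all-zero
# quadrant collapses to the Zero leaf.

def leaf(x):
    return None if x == 0 else ('L', x)

def node(nw, ne, sw, se):
    # collapse an all-zero node back to the Zero leaf (keeps sparsity)
    if nw is None and ne is None and sw is None and se is None:
        return None
    return ('N', nw, ne, sw, se)

def qt_add(a, b):
    if a is None: return b               # 0 + b = b
    if b is None: return a               # a + 0 = a
    if a[0] == 'L': return leaf(a[1] + b[1])
    return node(*(qt_add(x, y) for x, y in zip(a[1:], b[1:])))

def qt_mul(a, b):
    if a is None or b is None: return None   # 0 * anything = 0
    if a[0] == 'L': return leaf(a[1] * b[1])
    anw, ane, asw, ase = a[1:]
    bnw, bne, bsw, bse = b[1:]
    return node(qt_add(qt_mul(anw, bnw), qt_mul(ane, bsw)),
                qt_add(qt_mul(anw, bne), qt_mul(ane, bse)),
                qt_add(qt_mul(asw, bnw), qt_mul(ase, bsw)),
                qt_add(qt_mul(asw, bne), qt_mul(ase, bse)))
```

Because Zero absorbs multiplication and is the identity for addition, sparse and dense matrices flow through the same two functions, with work skipped wherever a zero quadrant appears.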
Seven at one stroke: Results from a cache-oblivious paradigm for scalable matrix algorithms
 In MSPC ’06: Proc. 2006 Wkshp. Memory System Performance and Correctness
, 2006
Abstract
Cited by 8 (2 self)
A blossoming paradigm for block-recursive matrix algorithms is presented that, at once, attains excellent performance measured by time, TLB misses, L1 misses, L2 misses, paging to disk, scaling on distributed processors, and portability to multiple platforms. It provides a philosophy and tools that allow the programmer to deal with the memory hierarchy invisibly, from L1 and L2 to TLB, paging, and interprocessor communication. Used together, they provide a cache-oblivious style of programming. Plots are presented to support these claims on an implementation of Cholesky factorization crafted directly from the paradigm in C with a few intrinsic calls. The results in this paper focus on low-level performance, including the new Morton-hybrid representation to take advantage of hardware and compiler optimizations. In particular, this code beats Intel's Math Kernel Library and matches AMD's Core Math Library, losing a bit on L1 misses while winning decisively on TLB misses.
Matrix Algorithms using Quadtrees
 In Proc. ATABLE-92
, 1992
Abstract
Cited by 5 (0 self)
Many scheduling and synchronization problems for large-scale multiprocessing can be overcome using functional (or applicative) programming. With this observation, it is strange that so much attention within the functional programming community has focused on the ``aggregate update problem'' [10]: essentially, how to implement FORTRAN arrays. This situation is strange because in-place updating of aggregates belongs more to uniprocessing than to mathematics. Several years ago, functional style drew me to treatment of d-dimensional arrays as 2^d-ary trees; in particular, matrices become quaternary trees, or quadtrees. This convention yields efficient recopying-cum-update of any array; recursive, algebraic decomposition of conventional arithmetic algorithms; and uniform representations and algorithms for both dense and sparse matrices. For instance, any nonsingular subtree is a candidate as the pivot block for Gaussian elimination; the restriction actually helps identification of pivot b...
Undulant-Block Elimination and Integer-Preserving Matrix Inversion
, 1995
Abstract
© 1994, 1995 by the author. This work has been accepted for publication by Science.
Morton-order Matrices Deserve Compilers' Support
, 1999
Abstract
A proof of concept is offered for the uniform representation of matrices serially in Morton-order (or Z-order) representation, as well as their divide-and-conquer processing as quaternary trees. Generally, d-dimensional arrays are accessed as 2^d-ary trees. This data structure is important because, at once, it relaxes serious problems of locality and latency, while the tree helps schedule multiprocessing. It enables algorithms that avoid cache misses and page faults at all levels in hierarchical memory, independently of a specific runtime environment. This paper gathers the properties of Morton order and its mappings to other indexings, and outlines compiler support for it. Statistics on matrix multiplication, a critical example, show how the new ordering and block algorithms achieve high flop rates and, indirectly, parallelism without low-level tuning. Perhaps because of the early success of column-major representation with strength reduction, quadtree representation has been reinvented and redeveloped in areas far from the center that is Programming Languages. As target architectures move to multiprocessing, superscalar pipes, and hierarchical memories, compilers must support quadtrees better, so that more programmers invent algorithms that use them to exploit the hardware.
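The Morton (Z-order) offset interleaves the bits of the row and column indices, so the four quadrants of a power-of-2 matrix occupy contiguous quarters of the serial layout. A standard bit-interleaving sketch follows (function names are illustrative; the masks are the usual ones for 16-bit indices):

```python
# Compute the Morton-order serial offset of element (row, col).

def dilate(x):
    """Spread the low 16 bits of x so they occupy even bit positions."""
    x &= 0xFFFF
    x = (x | (x << 8)) & 0x00FF00FF
    x = (x | (x << 4)) & 0x0F0F0F0F
    x = (x | (x << 2)) & 0x33333333
    x = (x | (x << 1)) & 0x55555555
    return x

def morton(row, col):
    """Interleave: row bits land in odd positions, col bits in even ones."""
    return (dilate(row) << 1) | dilate(col)
```

With this convention the scan order within each 2-by-2 quadrant is (0,0), (0,1), (1,0), (1,1), and every aligned power-of-2 block is a contiguous run of offsets, which is what lets one serial layout serve cartesian indexing and the quadtree view at once.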
Auto-Blocking Matrix-Multiplication or Tracking BLAS3 Performance from Source Code
Abstract
Language Support for Morton-order Matrices
Abstract
The uniform representation of 2-dimensional arrays serially in Morton order (or Z order) supports both their iterative scan with cartesian indices and their divide-and-conquer manipulation as quaternary trees. This data structure is important because it relaxes serious problems of locality and latency, and the tree helps to schedule multiprocessing. Results here show how it facilitates algorithms that avoid cache misses and page faults at all levels in hierarchical memory, independently of a specific runtime environment. We have built a rudimentary C-to-C translator that implements matrices in Morton order from source that presumes a row-major implementation. Early performance from LAPACK's reference implementation of dgesv (linear solver) and all its supporting routines (including dgemm matrix multiplication) forms a successful research demonstration. Its performance predicts improvements from new algebra in back-end optimizers. We also present results from a more stylish dgemm algorithm that takes better advantage of this representation. With only routine back-end optimizations inserted by hand (unfolding the base case and passing arguments in registers), we achieve machine performance exceeding ...
Undulant-Block Elimination and Integer-Preserving Matrix Inversion
, 1995
Abstract
A new formulation for $LU$ decomposition allows efficient representation of intermediate matrices while eliminating blocks of various sizes, i.e. during ``undulant-block'' elimination. Its efficiency arises from its design for block encapsulization, implicit in data structures that are convenient both for process scheduling and for memory management. Row/column permutations that can destroy such encapsulizations are deferred. Its algorithms, expressed naturally as functional programs, are well suited to parallel and distributed processing. A given matrix, $A$, is decomposed into two matrices (in the space of just one), plus two permutations. The permutations, $P$ and $Q$, are the row/column rearrangements usual to complete pivoting. The principal results are $L$ and $U'$, where $L$ is properly lower quasi-triangular; $U'$ is upper quasi-triangular with its quasi-diagonal being the inverse of that of $U$ from the usual factorization ($PAQ = (I-L)U$), and its proper upper portion identical to $U$. The matrix result is $L+U'$. Algorithms for solving linear systems and matrix inversion follow directly. An example of a motivating data structure, the quadtree representation for matrices, is reviewed. Candidate pivots for Gaussian elimination under that structure are the subtrees, both constraining and assisting the pivot search, as well as decomposing to independent block/tree operations. The elementary algorithms are provided, coded in Haskell. Finally, an integer-preserving version is presented, replacing Bareiss's algorithm with a parallel equivalent. The decomposition of an integer matrix $A$ to integer matrices $\bar L$, $\bar U'$, and $d = \det A$ follows $L+U'$ decomposition, but the follow-on algorithm to compute $dA^{-1}$ is complicated by the requirement to maintain minimal denominators at every step and to avoid divisions, restricting them to necessarily exact ones.
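For contrast with the integer-preserving variant the abstract describes, classical Bareiss elimination can be illustrated as follows. This is a plain row-oriented sketch, not the paper's quadtree algorithm, and the names are illustrative; its defining property is that every division is exact, so an integer matrix stays integer throughout and the last pivot is the determinant:

```python
# Bareiss's fraction-free elimination: determinant of an integer matrix
# with no fractions appearing at any intermediate step.

def bareiss_det(M):
    """Determinant of a square integer matrix, fraction-free."""
    a = [row[:] for row in M]            # work on a copy
    n = len(a)
    sign, prev = 1, 1                    # row-swap sign; previous pivot
    for k in range(n):
        if a[k][k] == 0:                 # pivot search down column k
            for r in range(k + 1, n):
                if a[r][k] != 0:
                    a[k], a[r] = a[r], a[k]
                    sign = -sign         # each swap flips the sign
                    break
            else:
                return 0                 # whole column is zero: singular
        for i in range(k + 1, n):
            for j in range(k + 1, n):
                # division by the previous pivot is always exact here
                a[i][j] = (a[i][j] * a[k][k] - a[i][k] * a[k][j]) // prev
            a[i][k] = 0
        prev = a[k][k]
    return sign * a[n - 1][n - 1]        # last pivot, up to swap sign
```

The integer-preserving inversion in the paper layers the harder requirement on top of this idea: keeping denominators minimal at every step while restricting all divisions to necessarily exact ones.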