Results 1 - 3 of 3
Auto-Blocking Matrix-Multiplication or Tracking BLAS3 Performance from Source Code
In Proceedings of the Sixth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 1997
Abstract

Cited by 85 (6 self)
An elementary, machine-independent, recursive algorithm for matrix multiplication C += A*B provides implicit blocking at every level of the memory hierarchy and tests out faster than classically optimal code, tracking hand-coded BLAS3 routines. "Proof of concept" is demonstrated by racing the in-place algorithm against the manufacturer's hand-tuned BLAS3 routines; it can win. The recursive code bifurcates naturally at the top level into independent block-oriented processes, each of which writes to a disjoint and contiguous region of memory. Experience has shown that the indexing vastly improves the patterns of memory access at all levels of the memory hierarchy, independently of the sizes of caches or pages and without ad hoc programming. It also exposed a weakness in SGI's C compilers, which merrily unroll loops for the superscalar R8000 processor but do not analogously unfold the base cases of the most elementary recursions. Such deficiencies might deter programmers from using this rich class of recursive algorithms.
Morton-order Matrices Deserve Compilers' Support
, 1999
Abstract
A proof of concept is offered for the uniform representation of matrices serially in Morton-order (or Z-order) representation, as well as their divide-and-conquer processing as quaternary trees. Generally, d-dimensional arrays are accessed as 2^d-ary trees. This data structure is important because, at once, it relaxes serious problems of locality and latency, while the tree helps schedule multiprocessing. It enables algorithms that avoid cache misses and page faults at all levels of hierarchical memory, independently of a specific run-time environment. This paper gathers the properties of Morton order and its mappings to other indexings, and outlines compiler support for it. Statistics on matrix multiplication, a critical example, show how the new ordering and block algorithms achieve high flop rates and, indirectly, parallelism without low-level tuning. Perhaps because of the early success of column-major representation with strength reduction, quadtree representation has been reinvented and redeveloped in areas far from the center that is Programming Languages. As target architectures move to multiprocessing, superscalar pipes, and hierarchical memories, compilers must support quadtrees better, so that more programmers invent algorithms that use them to exploit the hardware.