## Recursive Array Layouts and Fast Parallel Matrix Multiplication (1999)

### Download Links

- [www.cs.duke.edu]
- [ftp.cs.unc.edu]
- DBLP


Venue: Proceedings of the Eleventh Annual ACM Symposium on Parallel Algorithms and Architectures

Citations: 49 (4 self)

### BibTeX

```bibtex
@INPROCEEDINGS{Chatterjee99recursivearray,
  author    = {Siddhartha Chatterjee and Alvin R. Lebeck and Praveen K. Patnala and Mithuna Thottethodi},
  title     = {Recursive Array Layouts and Fast Parallel Matrix Multiplication},
  booktitle = {Proceedings of the Eleventh Annual ACM Symposium on Parallel Algorithms and Architectures},
  year      = {1999},
  pages     = {222--231}
}
```


### Abstract

Matrix multiplication is an important kernel in linear algebra algorithms, and the performance of both serial and parallel implementations is highly dependent on the memory system behavior. Unfortunately, due to false sharing and cache conflicts, traditional column-major or row-major array layouts incur high variability in memory system performance as matrix size varies. This paper investigates the use of recursive array layouts for improving the performance of parallel recursive matrix multiplication algorithms. We extend previous work by Frens and Wise on recursive matrix multiplication to examine several recursive array layouts and three recursive algorithms: standard matrix multiplication, and the more complex algorithms of Strassen and Winograd. We show that while recursive array layouts significantly outperform traditional layouts (reducing execution times by a factor of 1.2--2.5) for the standard algorithm, they offer little improvement for Strassen's and Winograd's algorithms;...
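The standard recursive algorithm the abstract refers to splits each matrix into four quadrants and recurses on the eight quadrant products. As a rough illustration only (not the authors' code, which stops the recursion at cache-sized tiles and stores the quadrants in a recursive memory layout), a minimal serial sketch:

```python
def recmul(A, B):
    """Standard matrix multiply (C = A @ B) by quadrant recursion.

    Sketch under simplifying assumptions: A and B are square lists of
    lists whose order is a power of two, and the recursion goes all the
    way down to scalars rather than switching to a blocked kernel.
    """
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    h = n // 2

    def quad(M, r, c):
        # Extract the h-by-h quadrant whose top-left corner is (r, c).
        return [row[c:c + h] for row in M[r:r + h]]

    def add(X, Y):
        return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

    A11, A12, A21, A22 = quad(A, 0, 0), quad(A, 0, h), quad(A, h, 0), quad(A, h, h)
    B11, B12, B21, B22 = quad(B, 0, 0), quad(B, 0, h), quad(B, h, 0), quad(B, h, h)
    # The eight recursive quadrant products; a parallel version
    # would spawn these calls concurrently.
    C11 = add(recmul(A11, B11), recmul(A12, B21))
    C12 = add(recmul(A11, B12), recmul(A12, B22))
    C21 = add(recmul(A21, B11), recmul(A22, B21))
    C22 = add(recmul(A21, B12), recmul(A22, B22))
    # Stitch the four result quadrants back together.
    return [r1 + r2 for r1, r2 in zip(C11, C12)] + \
           [r1 + r2 for r1, r2 in zip(C21, C22)]
```

Because each recursive call touches one quadrant of each operand, a layout that stores quadrants contiguously keeps every call's working set contiguous in memory, which is the property the recursive layouts studied here provide.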

### Citations

876 |
Accuracy and Stability of Numerical Algorithms
- Higham
Citation Context: ...result of the increased number of memory accesses. We answer this question in Section 5. We do not discuss in this paper numerical issues concerning the fast algorithms, as they are covered elsewhere [17]. We use Cilk [4] to implement parallel versions of these algorithms. Parallelism is exposed in the recursive matrix multiplication calls. The seven or eight calls are spawned in parallel, and these i...

758 |
A set of level 3 basic linear algebra subprograms
- Dongarra, Croz, et al.
- 1990
Citation Context: ...partment of Computer Science, Duke University, Durham, NC 27708-0129. Email: {alvy,mithuna}@cs.duke.edu ...nel in linear algebraic algorithms, and is enshrined in the dgemm routine in the BLAS 3 library [10]. There is an intimate relationship between the layout of the arrays in memory and the performance of the routine. On modern shared-memory multiprocessors with multi-level memory hierarchies, the colu...

721 | A Data Locality Optimizing Algorithm
- Wolf, Lam
- 1991
Citation Context: ...fine-grained computations. Parallel languages and compilers. The parallel compiler literature contains much work on iteration space tiling for gaining parallelism [41] and improving cache performance [5, 40]. Carter et al. [6] discuss hierarchical tiling schemes for a hierarchical shared memory model. Lam, Rothberg, and Wolf [26] discuss the importance of cache optimizations for blocked algorithms. A majo...

567 |
Parallel Computer Architecture: A Hardware/Software Approach
- Culler, Singh, et al.
- 1999
Citation Context: ...spread out in shared memory, and a single shared memory block can contain elements from two quadrants, and thus be written by the two processors computing those quadrants. This leads to false sharing [9]. In a message-passing parallel environment such as those used in implementations of High Performance Fortran [25], typical array distributions would again spread a matrix quadrant over many processor...

565 | Cilk: An Efficient Multithreaded Runtime System
- Blumofe, Joerg, et al.
- 1995
Citation Context: ...plementors wishing to incorporate such layout functions into fine-grained parallel computations. Finally, as a side effect, we provide an evaluation of the strengths and weaknesses of the Cilk system [4], which we used to parallelize our code. The remainder of this paper is organized as follows. Section 2 introduces the algorithms for fast matrix multiplication that we discuss in this paper. Section ...

521 | The cache performance and optimizations of blocked algorithms
- Lam, Rothberg, et al.
- 1991
Citation Context: ...tly in the control structure of the program? • Frens and Wise carried out their quad-tree layout of matrices down to the level of matrix elements, disregarding the result of Lam, Rothberg, and Wolf [26] that a tile that fits in cache and is contiguous in memory space can be organized in a canonical order without compromising locality of reference. This suggests the need for the quadtree decompositio...

469 | FFTW: An adaptive software architecture for the FFT
- Frigo, Johnson
- 1998
Citation Context: ...ng for parallelism. Scientific libraries. Several projects emphasize the generation of self-tuning libraries for specific problems. We are aware of three such efforts: PHiPAC [3], ATLAS [39], and FFTW [13]. [Figure residue omitted: axis ticks and legend from a plot "Standard, N=1000" comparing Column major, Z-Morton, X-Morton, U-Morton, G-Morton, and Hilbert layouts across processor counts.]

387 |
Gaussian elimination is not optimal
- Strassen
- 1969
Citation Context: ...Wise speculated about the "attractive hybrid composed of Strassen's recurrence and this one" [12, p. 215]. This is an interesting variation on the problem for two reasons. First, Strassen's algorithm [36] achieves a lower operation count at the expense of more data accesses in patterns that have worse algorithmic locality of reference. Second, the control structure of Strassen's algorithm is not as si...

387 | Automatically tuned linear algebra software
- Whaley, Dongarra
- 1998
Citation Context: ...tion space tiling for parallelism. Scientific libraries. Several projects emphasize the generation of self-tuning libraries for specific problems. We are aware of three such efforts: PHiPAC [3], ATLAS [39], and FFTW [13]. [Figure residue omitted: axis ticks and legend from a plot "Standard, N=1000" comparing recursive layouts across processor counts.]

295 |
Evaluating Associativity in CPU Caches
- Hill, Smith
- 1989
Citation Context: ...ifier "algorithmic" to emphasize the point that we are reasoning about this issue at an algorithmic level, independent of the architecture of the underlying memory hierarchy. In terms of the 3C model [19] of cache misses, we are reasoning about capacity misses at a high level, not about conflict misses. [Figure labels omitted: "Post-additions", "Recursive calls", "Pre-additions", "Elts of A read to compute C", "Elts of B read to compute ..."]

279 |
Space-Filling Curves
- Sagan
- 1994
Citation Context: ...s a small array tile. The above considerations have led authors to investigate a class of array layout functions variously described as being based either on quadtrees [12] or on space-filling curves [18, 32, 34]. Instances of this family are familiar in parallel computing under the names Morton ordering and Hilbert ordering. Despite the dilation effect described above, canonical layout functions have one maj...

232 | Optimizing matrix multiply using PHiPAC: a Portable, High-Performance, ANSI C coding methodology
- Bilmes, Asanović, et al.
- 1997
Citation Context: ...n and iteration space tiling for parallelism. Scientific libraries. Several projects emphasize the generation of self-tuning libraries for specific problems. We are aware of three such efforts: PHiPAC [3], ATLAS [39], and FFTW [13]. [Figure residue omitted: axis ticks and legend from a plot "Standard, N=1000" comparing recursive layouts across processor counts.]

194 |
Linear clustering of objects with multiple attributes
- Jagadish
- 1990
Citation Context: ...specific application domains [1, 20, 21, 33, 35, 38]. They have also been applied for bandwidth reduction in information theory [2], for graphics applications [14, 27], and for database applications [22]. Most of these applications have far coarser granularity than our test codes. We have shown that the overheads of these layouts can be reduced enough to make them useful for fine-grained computations...

192 | Compiler optimizations for improving data locality
- Carr, Kennedy, et al.
- 1992
Citation Context: ...fine-grained computations. Parallel languages and compilers. The parallel compiler literature contains much work on iteration space tiling for gaining parallelism [41] and improving cache performance [5, 40]. Carter et al. [6] discuss hierarchical tiling schemes for a hierarchical shared memory model. Lam, Rothberg, and Wolf [26] discuss the importance of cache optimizations for blocked algorithms. A majo...

185 |
More Iteration Space Tiling
- Wolfe
- 1989
Citation Context: ...duced enough to make them useful for fine-grained computations. Parallel languages and compilers. The parallel compiler literature contains much work on iteration space tiling for gaining parallelism [41] and improving cache performance [5, 40]. Carter et al. [6] discuss hierarchical tiling schemes for a hierarchical shared memory model. Lam, Rothberg, and Wolf [26] discuss the importance of cache opti...

159 | Unifying Data and Control Transformations for Distributed Shared Memory
- Cierniak, Li
- 1995
Citation Context: ...) = m · j + i. We do not consider the vector-of-vector layout used for non-constant arrays in C, because these are not true multi-dimensional arrays. Following the terminology of Cierniak and Li [8], we refer to LR and LC as canonical layout functions. Figure 2(a) and (b) show these two layout functions. Canonical layouts do not always interact well with cache memories, because the layout functi...
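The column-major layout L_C(i, j) = m·j + i quoted in this excerpt can be contrasted with the Z-Morton recursive layout, which interleaves the bits of the row and column indices so that each quadrant (and sub-quadrant) of the matrix occupies a contiguous range of offsets. A small sketch; the bit-ordering convention below is one common choice, not necessarily the paper's:

```python
def l_col(m, i, j):
    # Canonical column-major layout: offset of element (i, j)
    # in an m-row matrix.
    return m * j + i

def z_morton(i, j):
    # Z-Morton offset: interleave the bits of i (row) and j (column),
    # with the row bit taken as the more significant of each pair.
    z = 0
    bit = 0
    while i or j:
        z |= (j & 1) << (2 * bit)
        z |= (i & 1) << (2 * bit + 1)
        i >>= 1
        j >>= 1
        bit += 1
    return z
```

The consequence the paper exploits: under `z_morton`, the top-left 2×2 quadrant of a 4×4 matrix maps to offsets 0–3 and the bottom-right quadrant to 12–15, so each recursive call in a quadrant algorithm works on a contiguous block of memory.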

153 | A parallel hashed oct-tree n-body algorithm
- Warren, Salmon
- 1993
Citation Context: ...s a means of delivering high and robust performance for parallel dense linear algebra. The use of quad- or oct-trees (or, in a dual interpretation, space-filling curves) is known in parallel computing [1, 20, 21, 33, 35, 38] for improving both load balance and locality. The computations thus parallelized or restructured are reasonably coarse-grained, thus making the overheads of maintaining and accessing the data structu...

103 | Automatic Data Partitioning on Distributed Memory Multiprocessors
- Gupta, Banerjee
- 1990
Citation Context: ...optimization of arrays. Representative work includes that of Mace [29] for vector machines; of various authors investigating automatic array alignment and distribution for distributed memory machines [7, 15, 23, 24]; and of Cierniak and Li [8] for DSM environments. The last paper also recognizes the importance of joint control and data optimization. 7 Conclusions. We have examined the combination of five recursiv...

90 |
Sur une courbe, qui remplit toute une aire plane
- Peano
Citation Context: ...s a small array tile. The above considerations have led authors to investigate a class of array layout functions variously described as being based either on quadtrees [12] or on space-filling curves [18, 32, 34]. Instances of this family are familiar in parallel computing under the names Morton ordering and Hilbert ordering. Despite the dilation effect described above, canonical layout functions have one maj...

76 | Auto-blocking matrix-multiplication or tracking blas3 performance from source code
- FRENS, WISE
- 1997
Citation Context: ...nd locality. The computations thus parallelized or restructured are reasonably coarse-grained, thus making the overheads of maintaining and accessing the data structures insignificant. Frens and Wise [12] champion the use of quadtrees to represent matrices and explore its use in matrix multiplication. While the authors presented some excellent ideas, we felt that they carried them to an extreme. Based...

60 | Dynamic partitioning of non-uniform structured workloads with space-filling curves
- Pilkington, Baden
- 1996
Citation Context: ...s a means of delivering high and robust performance for parallel dense linear algebra. The use of quad- or oct-trees (or, in a dual interpretation, space-filling curves) is known in parallel computing [1, 20, 21, 33, 35, 38] for improving both load balance and locality. The computations thus parallelized or restructured are reasonably coarse-grained, thus making the overheads of maintaining and accessing the data structu...

59 |
Space-Filling Curves: Their Generation and Their Application to Bandwidth Reduction
- Bially
- 1969
Citation Context: ...Figure 2(g) illustrates this layout. The S-function for this layout is computationally more complex than any of the others. The fastest method we know of is based on an informal description by Bially [2], which works by driving a finite state machine with pairs of bits from i and j, delivering two bits of S(i, j) at each step. 3.4 Summary. We have described five recursive layout functions in terms of s...
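Bially's finite-state-machine scheme described in this excerpt is one way to compute the Hilbert layout function. An equivalent and widely used textbook formulation (sketched below; it is not Bially's state tables, and the curve's orientation is a convention) likewise consumes one bit of each index per step, applying a quadrant rotation as it descends:

```python
def hilbert_s(n, x, y):
    """Map cell (x, y) of an n-by-n grid (n a power of two) to its
    position along a Hilbert curve, one bit of each coordinate per
    iteration — mirroring the FSM-style bit-pair scheme."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/reflect so the sub-curve in this quadrant has the
        # correct orientation for the next level of recursion.
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        s //= 2
    return d
```

The defining property, and the reason Hilbert ordering is attractive for locality, is that cells with consecutive positions along the curve are always adjacent in the grid.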

50 | Optimal Evaluation of Array Expressions on Massively Parallel Machines
- Chatterjee, Gilbert, et al.
- 1995
Citation Context: ...optimization of arrays. Representative work includes that of Mace [29] for vector machines; of various authors investigating automatic array alignment and distribution for distributed memory machines [7, 15, 23, 24]; and of Cierniak and Li [8] for DSM environments. The last paper also recognizes the importance of joint control and data optimization. 7 Conclusions. We have examined the combination of five recursiv...

41 | Hierarchical tiling for improved superscalar performance
- Carter, Ferrante, et al.
- 1995
Citation Context: ...ns. Parallel languages and compilers. The parallel compiler literature contains much work on iteration space tiling for gaining parallelism [41] and improving cache performance [5, 40]. Carter et al. [6] discuss hierarchical tiling schemes for a hierarchical shared memory model. Lam, Rothberg, and Wolf [26] discuss the importance of cache optimizations for blocked algorithms. A major conclusion of the...

41 |
Memory Storage Patterns in Parallel Processing
- Mace
- 1987
Citation Context: ...f computations) would fit within the HPF framework. A substantial body of work in the parallel computing literature deals with layout optimization of arrays. Representative work includes that of Mace [29] for vector machines; of various authors investigating automatic array alignment and distribution for distributed memory machines [7, 15, 23, 24]; and of Cierniak and Li [8] for DSM environments. The ...

38 | Tuning Strassen’s matrix multiplication for memory hierarchies
- Thottethodi, Chatterjee, et al.
- 1998
Citation Context: ...result in capacity misses). Since tiles are contiguous, there are no self-interference misses. This makes the performance of the leaf-level matrix multiplications almost insensitive to the tile size [37]. Our scheme is sensitive to the amount of padding, since it performs computations on the padded portions of the matrices. However, if we choose tile sizes from the range [Tmin, Tmax], the maximum r...

32 |
Über stetige Abbildung einer Linie auf ein Flächenstück. Mathematische Annalen 38, 459–460
- Hilbert
Citation Context: ...ilding them in the first place? • There are several variants of recursive orderings other than the U-ordering that Frens and Wise used. Some of these orderings, such as Gray-Morton [28] and Hilbert [18], are supposedly better for load balancing, albeit at the expense of greater addressing overhead. How would these variants compare in terms of complexity vs. performance improvement for fine-grained p...

32 | High Performance Fortran for highly irregular problems
- Hu, Johnsson, et al.
- 1997
Citation Context: ...s a means of delivering high and robust performance for parallel dense linear algebra. The use of quad- or oct-trees (or, in a dual interpretation, space-filling curves) is known in parallel computing [1, 20, 21, 33, 35, 38] for improving both load balance and locality. The computations thus parallelized or restructured are reasonably coarse-grained, thus making the overheads of maintaining and accessing the data structu...

28 | Balancing processor loads and exploiting data locality in n-body simulations
- Banicescu, Hummel
- 1995

27 |
The High Performance Fortran handbook. Scientific and engineering computation
- KOELBEL, LOVEMAN, et al.
- 1994
Citation Context: ...e written by the two processors computing those quadrants. This leads to false sharing [9]. In a message-passing parallel environment such as those used in implementations of High Performance Fortran [25], typical array distributions would again spread a matrix quadrant over many processors, thereby increasing communication costs. The dilation effect can compromise single-node memory system performanc...

27 | An empirical comparison of the Kendall Square Research KSR-1 and Stanford DASH multiprocessors
- Singh, Joe, et al.
- 1993

21 |
Digital Design
- Mano
- 2002
Citation Context: ...pects of the recursive layouts. For any non-negative integer i, let B(i) be the bit string corresponding to its standard binary encoding, and let G(i) be the bit string corresponding to its Gray code [30] encoding. Correspondingly, for any bit string s, let B⁻¹(s) be the non-negative integer i such that B(i) = s, and let G⁻¹(s) be the non-negative integer i such that G(i) = s. Also, if u...
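The maps G and G⁻¹ in this excerpt have simple bit-twiddling implementations, which is what keeps the address arithmetic of a Gray-code-based recursive layout cheap. A sketch of the standard reflected binary Gray code and its inverse:

```python
def gray(i):
    # G(i): standard reflected binary Gray code of a non-negative integer.
    return i ^ (i >> 1)

def gray_inv(g):
    # G^{-1}(g): recover i by XOR-folding successive right shifts of g.
    i = g
    while g:
        g >>= 1
        i ^= g
    return i
```

The property that matters for layout purposes is that consecutive integers map to codes differing in exactly one bit, so stepping to a neighboring index changes a single bit of the encoded address.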

20 |
Automatic data layout for distributed memory machines
- Kennedy, Kremer
- 1998
Citation Context: ...optimization of arrays. Representative work includes that of Mace [29] for vector machines; of various authors investigating automatic array alignment and distribution for distributed memory machines [7, 15, 23, 24]; and of Cierniak and Li [8] for DSM environments. The last paper also recognizes the importance of joint control and data optimization. 7 Conclusions. We have examined the combination of five recursiv...

16 | Load Balancing and Data Locality via Fractiling: An Experimental Study
- Hummel, Banicescu, et al.
- 1995

16 |
Data optimization: Allocation of arrays to reduce communication on SIMD machines
- Knobe, Lucas, Steele Jr.
- 1990

15 |
Optimizing raster storage: an examination of four alternatives
- Goodchild, Grandfield
- 1983
Citation Context: ...of the techniques have been tailored to specific application domains [1, 20, 21, 33, 35, 38]. They have also been applied for bandwidth reduction in information theory [2], for graphics applications [14, 27], and for database applications [22]. Most of these applications have far coarser granularity than our test codes. We have shown that the overheads of these layouts can be reduced enough to make them ...

10 |
Efficient procedures for using matrix algorithms
- Fischer, Probert
- 1974
Citation Context: ...on calls from eight to seven at the cost of 18 matrix additions/subtractions. This change reduces the operation count to O(n^(lg 7)). The algorithm proceeds as shown in Figure 1(b). Winograd's variant [11] of Strassen's algorithm uses seven recursive matrix multiplication calls and 15 matrix additions/subtractions. It is known [11] that this is the minimum number of multiplications and additions possib...
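The excerpt's operation counts follow from the recurrences directly: eight half-size calls yield n³ scalar multiplications, while Strassen's and Winograd's seven calls yield n^(lg 7) ≈ n^2.807. A tiny counting sketch (assuming recursion down to 1×1 and ignoring the additions) makes the gap concrete:

```python
def mults(n, calls):
    # Scalar multiplications performed by an n-by-n recursive multiply
    # that makes `calls` half-size recursive calls (n a power of two,
    # recursion carried all the way down to scalars).
    if n == 1:
        return 1
    return calls * mults(n // 2, calls)

# Standard quadrant recursion: 8 calls  -> n^3 multiplications.
# Strassen / Winograd variant: 7 calls  -> n^(lg 7) multiplications.
```

For n = 64, the standard recursion performs 8⁶ = 262144 scalar multiplications versus 7⁶ = 117649 for the seven-call variants, at the price of the extra pre- and post-additions the excerpt describes.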

9 | Techniques for improving the data locality of iterative methods - Stals, Rüde - 1997 |

6 | Report of the working group on storage I/O for large-scale computing - Gibson, Vitter, et al. - 1996 |

5 |
Graphics Data Bases Built on Peano Space-Filling Curves. EUROGRAPHICS'85
- LAURINI
- 1985
Citation Context: ...of the techniques have been tailored to specific application domains [1, 20, 21, 33, 35, 38]. They have also been applied for bandwidth reduction in information theory [2], for graphics applications [14, 27], and for database applications [22]. Most of these applications have far coarser granularity than our test codes. We have shown that the overheads of these layouts can be reduced enough to make them ...

3 |
Analysis of the clustering properties of the Hilbert space-filling curve
- Moon, Jagadish, et al.
- 1996
Citation Context: ...with several questions that we address in this paper. • Previous work using such layouts did not worry greatly about the overhead of address computations. The algorithms described in the literature [31] follow from the basic definitions and are not particularly optimized for performance. Are there fast algorithms, perhaps involving bit manipulation, for maintaining the "dope vectors" that would enab...

2 |
Recursion leads to automatic variable blocking for dense linear algebra algorithms
- Gustavson
- 1997
Citation Context: ...wer than ours; it is difficult to tell whether this difference in performance is due to architectural improvements, to assumptions made in the algorithmic details, or to coding differences. Gustavson [16] discusses the role of recursive control strategies in automatic variable blocking of dense linear algebra codes, and shows dramatic performance gains compared to implementations of the same routines ...

2 | Unifying data and control transformations for distributed shared-memory machines - Cierniak, Li - 1995

2 | Recursion leads to automatic variable blocking for dense linear algebra algorithms - Gustavson - 1997