## QR Factorization with Morton-Ordered Quadtree Matrices for Memory Re-use and Parallelism (2003)

Venue: | Proc. 2003 ACM Symp. on Principles and Practice of Parallel Programming |

Citations: | 11 (1 self) |

### BibTeX

```bibtex
@inproceedings{Frens03qrfactorization,
  author    = {Jeremy D. Frens and David S. Wise},
  title     = {QR Factorization with Morton-Ordered Quadtree Matrices for Memory Re-use and Parallelism},
  booktitle = {Proc. 2003 ACM Symp. on Principles and Practice of Parallel Programming},
  year      = {2003},
  pages     = {144--154}
}
```

### Abstract

Quadtree matrices using Morton-order storage provide natural blocking at every level of a memory hierarchy. Writing the natural recursive algorithms to take advantage of this blocking results in code that honors the memory hierarchy without the need for code transformation. Furthermore, the divide-and-conquer algorithm breaks problems down into independent computations, which can be dispatched in parallel for straightforward parallel processing. Proof of concept is given by an algorithm for QR factorization based on Givens rotations for quadtree matrices in Morton-order storage. The algorithms deliver positive results, competing with and even beating the LAPACK equivalent.
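To illustrate the storage scheme the abstract describes, here is a minimal sketch of Morton-order (Z-order) indexing: the index of element (row, col) interleaves the bits of the two coordinates, so every aligned 2x2 (and recursively every aligned power-of-two) quadrant occupies a contiguous index range. The function name and bit width are illustrative, not taken from the paper.

```python
def morton_index(row: int, col: int, bits: int = 16) -> int:
    """Interleave the bits of (row, col) into a Morton (Z-order) index.

    Row bit i lands in odd position 2*i + 1 and column bit i in even
    position 2*i, so the four quadrants of any aligned block map to
    four contiguous, equal-sized index ranges.
    """
    idx = 0
    for i in range(bits):
        idx |= ((row >> i) & 1) << (2 * i + 1)  # row bit i -> odd slot
        idx |= ((col >> i) & 1) << (2 * i)      # col bit i -> even slot
    return idx

# In a 4x4 matrix, the north-east 2x2 quadrant maps to indices 4..7:
ne_quadrant = {morton_index(r, c) for r in (0, 1) for c in (2, 3)}
```

Because every aligned sub-block is contiguous, a recursive algorithm that descends into quadrants touches contiguous memory at every level of recursion, which is the blocking property the abstract attributes to this layout.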

### Citations

8530 | Introduction to Algorithms - Cormen, Leiserson, et al. - 1990 |
Citation Context: ...ithout knowledge about its existence, let alone the particulars of each level of the hierarchy. In general (not restricted to divide-and-conquer algorithms), this phenomenon is called cache-oblivious [14]: the algorithm needs neither tuning nor specification of the memory system on which it is running. In contrast, tiling iterative algorithms for row-major matrices [20, Section 20.4.3] requires advanc...

1958 | Matrix Computations - Golub, Van Loan - 1996 |

955 | Advanced Compiler Design and Implementation - Muchnick - 1997 |

847 | Accuracy and Stability of Numerical Algorithms - Higham - 2002 |

570 | A Transformation System for Developing Recursive Programs - Burstall, Darlington - 1976 |
Citation Context: ...ftware pipelining. Optimizing compilers do this automatically for programmers. However, these same optimizing compilers have a poorer understanding of recursion, and they do not unfold the base cases [6, 21] of a recursive function. While this was known in earlier work [13], finding the right unfolding for the compiler to optimize into the best superscalar code has been the real key. Matrix-matrix multip...

372 | Automatically Tuned Linear Algebra Software - Whaley, Dongarra - 1998 |
Citation Context: ...r row-major matrices [20, Section 20.4.3] requires advance knowledge of the sizes of the levels of the memory hierarchy, although there is software that can account for their impact from experiments [4, 23]. Furthermore, divide-and-conquer algorithms have been advocated for parallelism [27]. The independent computations in such an algorithm can be executed in parallel without inter-process communicati...

351 | ScaLAPACK Users' Guide - Blackford, Choi, et al. - 1997 |
Citation Context: ...l lines. The Power Challenge does not come with a parallel implementation of dgeqrf(). Two solutions were attempted: linking LAPACK to a parallel BLAS and using a Power Challenge version of ScaLAPACK [5]. Both were unsuccessful. Linking to a parallel BLAS did not give enough parallelism; ScaLAPACK is intended for distributed systems, which did not work well with the shared memory on the Power Challeng...

235 | LogGP: Incorporating Long Messages into the LogP Model for Parallel Computation - Alexandrov, Ionescu, et al. - 1997 |
Citation Context: ...n parallel without inter-process communication [9, Section 1.3.1] [3]. The time for both accessing the memory and communicating across processes must be reduced for efficient high-performance computing [1, 10, 17]. While the results are encouraging, matrix multiplication is relatively simple and straightforward; other problems are not as well patterned, which could significantly affect the performance of the qu...

227 | Optimizing Matrix Multiply using PHiPAC: A Portable, High-Performance, ANSI C Coding Methodology - Bilmes, Asanovic, et al. - 1997 |
Citation Context: ...r row-major matrices [20, Section 20.4.3] requires advance knowledge of the sizes of the levels of the memory hierarchy, although there is software that can account for their impact from experiments [4, 23]. Furthermore, divide-and-conquer algorithms have been advocated for parallelism [27]. The independent computations in such an algorithm can be executed in parallel without inter-process communicati...

180 | The Art of Computer Programming: Fundamental Algorithms, volume 1. Addison-Wesley, 3rd edition - Knuth - 1997 |

111 | LAPACK: A Portable Linear Algebra Library for High-Performance Computers - Anderson, Bai, et al. - 1990 |
Citation Context: ...h these algorithms. One such problem is the QR factorization of a matrix [15, Section 5.2]. It is a common problem, with algorithms implemented for row- and column-major matrices in the LAPACK library [2]. This paper tackles QR factorization for quadtree matrices stored with Morton-order indexing [12]. This paper is organized in seven sections, the first being this introductory section. The second sec...

82 | LogP: A Practical Model of Parallel Computation - Culler, Karp, et al. - 1996 |
Citation Context: ...n parallel without inter-process communication [9, Section 1.3.1] [3]. The time for both accessing the memory and communicating across processes must be reduced for efficient high-performance computing [1, 10, 17]. While the results are encouraging, matrix multiplication is relatively simple and straightforward; other problems are not as well patterned, which could significantly affect the performance of the qu...

76 | Auto-blocking Matrix-Multiplication or Tracking BLAS3 Performance from Source Code - Frens, Wise - 1997 |

73 | Exact Analysis of the Cache Behavior of Nested Loops - Chatterjee, Parker, et al. - 2001 |
Citation Context: ...General terms: Performance; Algorithms. Additional key words and phrases: storage management, indexing, quadtrees, swapping, cache misses, paging. 1. INTRODUCTION Earlier work [13, 8, 12] has explored matrix-matrix multiplication using quadtree matrices stored in Morton order. These results suggest that a Morton-order indexing [25, p. 776] of a quadtree matrix effectively blocks the m...

50 | Applying Recursion to Serial and Parallel QR Factorization Leads to Better Performance - Elmroth, Gustavson |
Citation Context: ...t also be tuned. Either programmers or optimizing compilers must know or obtain this information, although it can be collected automatically (e.g., by PHiPAC [4] or ATLAS [23]). Elmroth and Gustavson [11] take a recursive approach to QR factorization on row-major matrices using Householder transformations with the same blocking effect, saving a block of column updates so that matrix-matrix multiplicat...
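The abstract's QR algorithm is built on Givens rotations, which zero one matrix entry at a time while preserving orthogonality. As a hedged sketch of the underlying operation (the function names and plain list-of-lists representation are illustrative, not the paper's C implementation):

```python
import math

def givens(a: float, b: float) -> tuple[float, float]:
    """Return (c, s) with c*a + s*b = hypot(a, b) and -s*a + c*b = 0."""
    if b == 0.0:
        return 1.0, 0.0
    r = math.hypot(a, b)
    return a / r, b / r

def rotate_rows(A: list[list[float]], i: int, j: int, k: int) -> None:
    """Apply a Givens rotation to rows i and j of A, zeroing A[j][k]."""
    c, s = givens(A[i][k], A[j][k])
    for col in range(len(A[0])):
        ai, aj = A[i][col], A[j][col]
        A[i][col] = c * ai + s * aj
        A[j][col] = -s * ai + c * aj
```

Sweeping such rotations over all subdiagonal entries produces the R factor; in the paper's scheme the quadtree decomposition presumably determines which blocks of rotations are independent and hence dispatchable in parallel.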

48 | Recursive array layout and fast parallel matrix multiplication - Chatterjee, Lebeck, et al. - 1999 |

26 | Ahnentafel Indexing into Morton-Ordered Arrays, or Matrix Locality for Free - Wise - 2001 |
Citation Context: ...xt. A 0-indexed root works well for all trees without indexing gaps. the corresponding Morton-order indices by only a constant $\sum_{i=0}^{l-1} 4^i = (4^l - 1)/3$, where $l$ is the number of the level (zero-based) [25]. This conversion makes these indexings easily interchangeable. 2.2 Padding and Decorations The recursive, two-dimensional bifurcation of a quadtree matrix suggests that the order of the matrix should...
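The constant in the context above, $\sum_{i=0}^{l-1} 4^i$, counts the nodes strictly above level $l$ in a complete quadtree, which is what separates the two indexings at that level. A quick sketch checking the closed form (the function name is illustrative):

```python
def level_offset(l: int) -> int:
    """Nodes strictly above level l (zero-based) in a complete quadtree:
    1 + 4 + 4^2 + ... + 4^(l-1)."""
    return sum(4 ** i for i in range(l))

# The closed form from the citation, (4^l - 1)/3, agrees with the sum:
assert all(level_offset(l) == (4 ** l - 1) // 3 for l in range(12))
```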

26 | Language Support for Morton-Order Matrices - Wise, Frens, et al. - 2001 |
Citation Context: ...must be done sequentially, although the multiplication steps can be dispatched immediately in parallel. 5. THE CODE Matrix-matrix multiplication is implemented using the algorithms by Frens and Wise [13, 26]. The functions f and e are implemented in C, as is the matrix multiplication algorithm. Drivers were written in C++. 5.1 Unfolding and Re-rolling the Base Case Iterative loops are routinely unrolled [...

9 | The Divide-and-Conquer Paradigm as a Basis for Parallel Language Design - Axford - 1992 |
Citation Context: ...divide-and-conquer algorithms have been advocated for parallelism [27]. The independent computations in such an algorithm can be executed in parallel without inter-process communication [9, Section 1.3.1] [3]. The time for both accessing the memory and communicating across processes must be reduced for efficient high-performance computing [1, 10, 17]. While the results are encouraging, matrix multiplicatio...

9 | Undulant Block Elimination and Integer-Preserving Matrix Inversion. Sci. Comput. Programming (to appear) - Wise - 1995 |

6 | Recursion Unrolling for Divide and Conquer Programs - Rugina, Rinard - 2000 |
Citation Context: ...ftware pipelining. Optimizing compilers do this automatically for programmers. However, these same optimizing compilers have a poorer understanding of recursion, and they do not unfold the base cases [6, 21] of a recursive function. While this was known in earlier work [13], finding the right unfolding for the compiler to optimize into the best superscalar code has been the real key. Matrix-matrix multip...

4 | Advances in Parallel Algorithms - Kronsjo, Shumsheruddin - 1992 |

3 | Matrix Factorization Using a Block-Recursive Structure and Block-Recursive Algorithms - Frens - 2002 |
Citation Context: ...General terms: Performance; Algorithms. Additional key words and phrases: storage management, indexing, quadtrees, swapping, cache misses, paging. 1. INTRODUCTION Earlier work [13, 8, 12] has explored matrix-matrix multiplication using quadtree matrices stored in Morton order. These results suggest that a Morton-order indexing [25, p. 776] of a quadtree matrix effectively blocks the m...

1 | Parallel Software Designs, chapter 1 - Cole - 1992 |

1 | LogGPS: A Parallel Computational Model for Synchronization Analysis - Ino, Fujimoto, et al. - 2001 |
Citation Context: ...n parallel without inter-process communication [9, Section 1.3.1] [3]. The time for both accessing the memory and communicating across processes must be reduced for efficient high-performance computing [1, 10, 17]. While the results are encouraging, matrix multiplication is relatively simple and straightforward; other problems are not as well patterned, which could significantly affect the performance of the qu...

1 | R8000 Microprocessor Chip Set - Silicon Graphics, Inc. - 1994 |