@MISC{Solar-lezama_cs267, author = {Armando Solar-Lezama and Mark Hoemmen}, title = {CS 267 HOMEWORK 1: MATRIX MULTIPLY}, year = {} }

Abstract

in the same order (column-major), but the multiply operation requires that entries of either A or B be loaded with stride M. (Without loss of generality, assume it is the A matrix.) Large strides use cache lines ineffectively, since (for sufficiently large M) each consecutive entry in a row of A occupies a separate cache line, of which only that one entry is used. This effectively reduces the total cache size and exacerbates the effects of aliasing (especially when M is not coprime with the number of cache lines). One possibility is to transform the A matrix into row-major order by making a transposed copy of A in a pre-allocated buffer. Transposing A allows matrix multiply using unidirectional unit-stride loads of both A and B, with the multiply arranged as a sequence of dot products. This takes full advantage of spatial locality at all cache levels, permits optimizations such as prefetching, and uses cache space optimally. However, transposing the matrix doubles the total number of memory accesses: an entire matrix must be loaded and stored again. Furthermore, if the transpose itself is done “naively,” that is, if the elements of A are simply read in row order and copied, the same problem of strided memory accesses returns. In fact, the problem is
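
The transposed-copy approach described above can be sketched in C as follows. This is a minimal illustration, not the authors' code: the function names `transpose_copy` and `matmul_transposed`, the restriction to square M x M matrices of doubles, and the choice to overwrite C (rather than accumulate into it) are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: copy the transpose of the column-major M x M
 * matrix A into the pre-allocated buffer At, so that row i of A is
 * stored contiguously in At[i*M .. i*M + M - 1].
 * This is the "naive" copy: writes to At are unit-stride, but reads
 * from A still have stride M. */
void transpose_copy(size_t M, const double *A, double *At)
{
    assert(A != At); /* the copy is out of place */
    for (size_t i = 0; i < M; ++i)
        for (size_t k = 0; k < M; ++k)
            At[k + i * M] = A[i + k * M];
}

/* Hypothetical kernel: C = A * B for column-major M x M matrices,
 * where At holds the transposed copy of A. Each entry C(i,j) is a
 * dot product of two contiguous length-M vectors, so both operands
 * are read with unit stride. (Assumption: C is overwritten.) */
void matmul_transposed(size_t M, const double *At, const double *B,
                       double *C)
{
    for (size_t j = 0; j < M; ++j) {
        const double *b_col = B + j * M;      /* column j of B */
        for (size_t i = 0; i < M; ++i) {
            const double *a_row = At + i * M; /* row i of A */
            double dot = 0.0;
            for (size_t k = 0; k < M; ++k)
                dot += a_row[k] * b_col[k];
            C[i + j * M] = dot;
        }
    }
}
```

A caller would allocate `At` once, call `transpose_copy(M, A, At)`, and then call `matmul_transposed(M, At, B, C)`. Note that the inner loop of `transpose_copy` still reads A with stride M, which is the strided-access cost of the naive transpose that the text points out.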