## The Uniform Memory Hierarchy Model of Computation (1992)

Venue: Algorithmica

Citations: 112 (9 self)

### BibTeX

@ARTICLE{Alpern92theuniform,

author = {Bowen Alpern and Larry Carter and Ephraim Feig and Ted Selker},

title = {The Uniform Memory Hierarchy Model of Computation},

journal = {Algorithmica},

year = {1992},

volume = {12},

pages = {12--2}

}

### Abstract

The Uniform Memory Hierarchy (UMH) model introduced in this paper captures performance-relevant aspects of the hierarchical nature of computer memory. It is used to quantify architectural requirements of several algorithms and to ratify the faster speeds achieved by tuned implementations that use improved data-movement strategies. A sequential computer's memory is modelled as a sequence $\langle M_0, M_1, \ldots \rangle$ of increasingly large memory modules. Computation takes place in $M_0$. Thus, $M_0$ might model a computer's central processor, while $M_1$ might be cache memory, $M_2$ main memory, and so on. For each module $M_U$, a bus $B_U$ connects it with the next larger module $M_{U+1}$. All buses may be active simultaneously. Data is transferred along a bus in fixed-sized blocks. The size of these blocks, the time required to transfer a block, and the number of blocks that fit in a module are larger for modules farther from the processor. The UMH model is parameterized by the rate at which the blocksizes i...
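
The structure described in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the paper's formal definition: the field names (`block_size`, `num_blocks`, `transfer_cycles`), the helper `geometric_hierarchy`, and its growth parameters are hypothetical, chosen because the abstract says only that each quantity is larger for modules farther from the processor. The two cycle estimates are crude bounds that follow from the statement that all buses may be active simultaneously.

```python
from dataclasses import dataclass
from math import ceil


@dataclass(frozen=True)
class Level:
    """One module M_U together with the bus B_U above it (field names are assumptions)."""
    block_size: int       # data items per block carried by bus B_U
    num_blocks: int       # number of such blocks that fit in module M_U
    transfer_cycles: int  # cycles for B_U to move one block to or from M_{U+1}

    @property
    def capacity(self) -> int:
        """Data items the module can hold."""
        return self.block_size * self.num_blocks


def geometric_hierarchy(levels, growth=4, base_blocks=2):
    """Hypothetical hierarchy: every parameter grows by `growth` per level."""
    return [
        Level(block_size=growth ** u,
              num_blocks=base_blocks * growth ** u,
              transfer_cycles=growth ** u)
        for u in range(levels)
    ]


def is_increasing(hierarchy):
    """True if all three parameters grow strictly with distance from M_0."""
    return all(
        lo.block_size < hi.block_size
        and lo.num_blocks < hi.num_blocks
        and lo.transfer_cycles < hi.transfer_cycles
        for lo, hi in zip(hierarchy, hierarchy[1:])
    )


def bus_cycles(level, items):
    """Cycles for one bus to move `items` data items in whole blocks."""
    return ceil(items / level.block_size) * level.transfer_cycles


def transfer_bounds(hierarchy, items):
    """Crude (overlapped, serial) cycle estimates for pushing `items` over every bus.

    Because all buses may be active at once, the busiest bus gives a lower
    bound; the sum is the cost if no two transfers ever overlapped.
    """
    per_bus = [bus_cycles(level, items) for level in hierarchy]
    return max(per_bus), sum(per_bus)


if __name__ == "__main__":
    h = geometric_hierarchy(3)
    print(is_increasing(h), transfer_bounds(h, 16))
```

The gap between the two numbers returned by `transfer_bounds` hints at why simultaneous bus activity matters: a program that keeps every bus busy can approach the first figure, while one that serializes its data movement pays something closer to the second.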

### Citations

2453 |
The Design and Analysis of Computer Algorithms
- Aho, Hopcroft, et al.
- 1974
Citation Context ...e. This paper presents a model of computation that captures these performance-relevant characteristics of computers. Big-O analysis on the traditional Random Access Machine (RAM) model of computation [AHU74] ignores the non-uniform cost of memory accesses. Section 2 illustrates the gap between traditional theory and practice on a naive matrix multiplication program. For the RAM model, where every memory ... |

1134 |
A bridging model for parallel computation
- Valiant
- 1990
Citation Context ... (i.e., with more than N processors), multiplication of $N \times N$ matrices will be communication-bound. The following theorem shows that a program closely related to one described by Valiant [V90] achieves communication-efficiency for transfer cost functions not proscribed by the theorem above. 32 The uniform parallel model is not realistic of today's computers. Typically, these have branching... |

829 | The program dependence graph and its use in optimization
- Ferrante, Ottenstein, et al.
- 1987
Citation Context ... algorithms in this paper are described informally, but they may be formally specified in a concurrent programming notation or as computational circuits, dataflow graphs, or program dependence graphs [FOW87]. The matrix multiplication solid of Figure 2 is an example of an algorithm; there are different algorithms for matrix multiplication that use fewer than $O(N^3)$ operations [AHU74]. 2. A program. A pr... |

743 | A set of level 3 basic linear algebra subprograms - Dongarra, Du Croz, et al. - 1990 |

422 |
LAPACK User's Guide
- Anderson, Bai, et al.
- 1992
Citation Context ...omputes C := C + A B, where A, B, and C are matrices. If A, B, and C are square $N \times N$ matrices, the assignment statement is iterated [Footnote 5: Linear algebra packages, such as ESSL [IBM86] and LAPACK [ABetc92], use this "update" form since it facilitates problem decomposition. Analysis of C := A B is similar.] NaiveMM(A[1:n,1:l], B[1:l,1:m], C[1:n,1:m]): real value: A, B; value result: C integer value: n, m... |

318 |
Computational Frameworks for the Fast Fourier Transform
- Van Loan
- 1992
Citation Context ...plex numbers, defined by $y_p = \sum_{q=0}^{N-1} \omega_N^{pq} x_q \quad (0 \le p < N)$, where $\omega_N$ is the N-th root of unity, $e^{-2\pi i/N}$. A Fast Fourier Transform (FFT) is a technique for efficiently computing a DFT [V-L92]. Many FFT algorithms make log N passes through the data. On $\mathrm{UMH}_{\alpha,\rho,cU}$, the model of interest, such techniques require moving $\Omega(N \log N)$ data items on the topmost bus, which takes \Omega... |

302 |
Advanced Compiler Optimizations for Supercomputers
- Padua, Wolfe
- 1986
Citation Context ...ystem/6000 model 530 computer, hereafter the RS/6000. [Footnote 3: Matrix multiplication is sufficiently simple that it may suffice to use a compiler with such optimizations as strip mining and loop interchange [PW86].] take place in $M_0$. If a program is written against the MH model, and the model's parameters reflect a particular computer, then the program can be translated to run efficiently on the computer. (Tr... |

245 | Supernode partitioning - Irigoin, Triolet - 1988 |

167 | I/O complexity: the red-blue pebbling game - Hong, Kung - 1981 |

130 | FFTs in external or hierarchical memory
- Bailey
- 1989
Citation Context ...ication-efficient program. This section will establish that the identity function (in fact, any linear function) is a threshold function for this FFT algorithm. After the standard "Four-Step" program [B90] is shown to be communication-bound for any linear transfer cost function, it will be rechoreographed to be communication-efficient. To simplify presentation, we will only consider the case where N = 2... |

129 |
A model for hierarchical memory
- Aggarwal, Chandra, et al.
- 1987
Citation Context ... modules and a single lower module. It might be possible to incorporate this feature into the UMH model. Our work is closely related to, and heavily influenced by, the Hierarchical Memory Model (HMM) [AACS87] and the Block Transfer model (BT) [ACS87], both of which have multiple levels. This section explores the relationship between the UMH model and the HMM and BT models. Each model is a family of machi... |

111 |
Hierarchical memory with block transfer
- Aggarwal, Chandra, et al.
- 1987
Citation Context ...nt of each column begins a level-U block.) Section 5 [Footnote 4: Simultaneous data transfers on the multiple buses overcome the $\Omega(N^2 \log \log N)$ lower bound for the corresponding Block Transfer model [ACS87].] also shows that no program for transposing a square matrix can have an asymptotic communication efficiency of 1. Section 6 shows how to rechoreograph the naive matrix multiplication program of Secti... |

69 |
Impact of hierarchical memory systems on linear algebra algorithm design
- Gallivan, Jalby, et al.
- 1988
Citation Context ... factor. 1 Overview Theoretical computer science does not address certain performance issues important for creating scientific software. Careful tuning can speed up a program by an order of magnitude [GJMS88]. These improvements follow from taking into account various aspects of the memory hierarchy of the target machine. This paper presents a model of computation that captures these performance-relevant ... |

62 |
Type Architectures, Shared Memory, and the Corollary of Modest Potential
- Snyder
- 1986
Citation Context ...have evolved to support the RAM illusion, we hope they will evolve towards the UMH and UPMH models. A third justification pertains to parallelism. There is a recognized need for a "type architecture" [S86] or "bridging model" [V90] that makes a closer connection between parallel algorithm design and actual multiprocessors. When looking for such a model, people often start with the assumption that the R... |

52 |
Optimal Disk I/O with Parallel Block Transfer
- Vitter, Shriver
- 1990
Citation Context ...s such as what "shape" subproblems are suitable for further subdividing, and how should the reception of a problem be coordinated with the dispatching of subproblems. The models of Vitter and Shriver [VS90] focus on an orthogonal aspect of some memories --- particularly disk storage --- that simultaneous data transfers may be possible between separate memory modules and a single lower module. It might b... |

38 |
Permuting information in idealized two-level storage
- Floyd
- 1972
Citation Context ...on on a range of processors. A module of the Parallel Memory Hierarchy (PMH) model can be connected to more than one module at the level beneath it in the hierarchy, [Footnote 29: Floyd's two-level memory model [F72] also requires $\Theta(N^2 \log N)$ time because the lower level of this model only holds a constant number (3) of size N blocks.] giving rise to a tree of modules with processors at the leaves. A va... |

37 | Blocking Linear Algebra Codes for Memory Hierarchies
- Carr, Kennedy
- 1989
Citation Context ...er technology (e.g., strip mining and loop interchange [PW86]) can automatically make these improvements to Program 1. On slightly more complicated problems, the capabilities of compilers are limited [CK89]. And for even more complex problems, there are improvements that we cannot expect a compiler to discover. The replacement of recursive transposes by a single bit-reversal permutation in the Fast Four... |

26 | Organizing matrices and matrix operations for paged memory systems - McKellar, Coffman - 1969 |

22 | Vector and parallel algorithms for Cholesky factorization - Agarwal, Gustavson - 1989 |

18 | The influence of memory hierarchy on algorithm organization: Programming FFTs on a vector multiprocessor. In The Characteristics of Parallel Algorithm - Gannon, Jalby - 1987 |

10 | Trade-offs between communication and space - Lam, Tiwari, et al. - 1992 |

9 |
Visualizing computer memory architectures
- Alpern, Carter, et al.
- 1990
Citation Context ...the previous module. $B_U$ copies a level-$U$ block atomically in $l_U$ cycles to or from level-$(U+1)$, overwriting the old contents. $\mathrm{MH}_\sigma$ can be depicted as a tower of modules with level 0 at the bottom [ACS90]. Two memory hierarchies are shown in Figure 3 (the second is a UMH, explained in the next section.) In the figure, the horizontal and vertical dimensions of the rectangle depicting $M_U$ are proportion... |

7 |
Using local memory to boost the performance of FFT algorithms on the CRAY-2 supercomputer
- Carlson
- 1990
Citation Context ...a two-level memory hierarchy. The fact that the UMH model has more than two levels has several consequences. First, it is more natural for asymptotic analysis of algorithms, since a single [Footnote 26: Carlson [C90a] reports improvement obtained by carefully exploiting the local memory attached to each processor of the Cray-2. Section 9 discusses modeling such computers.] machine can handle problems of all sizes. ... |

6 | RISC System/6000 hardware overview - Bakoglu, Whiteside - 1990 |

5 |
Matrix algebra programs for the UNIVAC
- Rutledge, Rubinstein
- 1951
Citation Context ...ltiplication with $O(N^3)$ running time on $\mathrm{UMH}_{6,\rho,\rho^U/4}$. That, together with lemma 4.1, shows that $\rho^U/4$ is a threshold function. Program 3, based on techniques going back at least thirty years [RR51], recursively dices the matrix multiplication solid of Section 2 so as to be communication efficient on $\mathrm{UMH}_{6,\rho,\rho^U/4}$. Theorem 6.1 Suppose that $N = \rho^W$ and that $N \times N$ matrices A, B, and C are a... |

3 | IO Complexity of Sorting and Related Problems, CACM - Aggarwal, Vitter - 1988 |

3 |
Fast permuting on disk arrays
- Cormen
- 1993
Citation Context ...are essentially the same as for transposition. A generalization of the class of rational permutations (also called "bit permute" permutations) is the class of bit permute with complement permutations [C92]. Under such a permutation, the i-th input element is permuted to an address determined by rearranging the bits of i and exclusive-oring the result with some constant c. Techniques of the previous pro... |

1 |
The RAM Model and the Performance Programmer
- Carter
- 1990
Citation Context ... a RAM. There is no question that the RAM paradigm makes achieving a moderate level of performance easier, but it is worth questioning whether the RAM model is helpful for attaining high performance [C90b]. Performance programmers spend days rewriting inner loops to trick their compiler's register allocator, analyzing how 2-way or 4-way associative caches will behave on certain programs, and learning t... |

1 | IBM RISC System/6000: A Business Perspective, Second Edition - Hoskins - 1992 |