## Improving Memory Hierarchy Performance for Irregular Applications Using Data and Computation Reorderings (2001)

Venue: International Journal of Parallel Programming

Citations: 93 (2 self)

### BibTeX

@ARTICLE{Mellor-crummey01improvingmemory,
    author = {John Mellor-Crummey and David Whalley and Ken Kennedy},
    title = {Improving Memory Hierarchy Performance for Irregular Applications Using Data and Computation Reorderings},
    journal = {International Journal of Parallel Programming},
    year = {2001},
    pages = {425--433}
}

### Abstract

The performance of irregular applications on modern computer systems is hurt by the wide gap between CPU and memory speeds because these applications typically underutilize multi-level memory hierarchies, which help hide this gap. This paper investigates using data and computation reorderings to improve memory hierarchy utilization for irregular applications. We evaluate the impact of reordering on data reuse at different levels in the memory hierarchy. We focus on coordinated data and computation reordering based on space-filling curves and we introduce a new architecture-independent multi-level blocking strategy for irregular applications. For two particle codes we studied, the most effective reorderings reduced overall execution time by a factor of two and four, respectively. Preliminary experience with a scatter benchmark derived from a large unstructured mesh application showed that careful data and computation ordering reduced primary cache misses by a factor of two compared to a random ordering.

### Citations

807 | CHARMM: A program for macromolecular energy, minimization, and dynamics calculations
- Brooks, Bruccoleri, et al.
- 1983
Citation Context ...a TLB entry. 5.1. The Moldyn Benchmark Moldyn is a synthetic benchmark for molecular dynamics simulation. The computational structure in moldyn is similar to the nonbonded force calculation in CHARMM [33], and closely resembles the structure represented in Figure 1 of the paper. An interaction list is constructed for all pairs of interactions that are within a specified cutoff radius. These interactio...

744 | The Art of Computer Programming, Volume 3: Sorting and Searching
- Knuth
- 1973
Citation Context ...ple, computation order is represented by an interaction list and we can block computation by sorting interactions by the block numbers of the particles they reference. Applying a lexicographical sort [32] to the interaction pairs using [block_of(p1), block_of(p2)] as the sorting key for pair [p1,p2] achieves a single level of blocking. To block for a multi-level memory hierarchy in a machine-independe...
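The single-level blocking described in this excerpt can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes particles are already numbered in their data order, so that a particle's block is its index divided by a block size, and the names `block_of`, `block_interactions`, and `block_size` are hypothetical.

```python
def block_of(particle: int, block_size: int) -> int:
    """Block number of a particle (assumption: blocks are contiguous index ranges)."""
    return particle // block_size


def block_interactions(pairs, block_size):
    """Sort interaction pairs lexicographically by (block_of(p1), block_of(p2)).

    Python's sort is stable, so pairs that fall in the same block pair
    keep their original relative order.
    """
    return sorted(
        pairs,
        key=lambda pr: (block_of(pr[0], block_size), block_of(pr[1], block_size)),
    )


if __name__ == "__main__":
    interactions = [(5, 1), (0, 7), (4, 0), (1, 2), (0, 3)]
    print(block_interactions(interactions, block_size=4))
```

After the sort, all interactions whose first particle lies in block 0 are processed before any whose first particle lies in block 1, which is what lets a block of particle data stay resident in cache while it is reused.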

738 | A data locality optimizing algorithm
- Wolf, Lam
- 1991
Citation Context ...ieving high performance on such systems requires tailoring the reference behavior of applications to better match the characteristics of a machine’s memory hierarchy. Techniques such as loop blocking [1, 2, 3, 4, 5, 6] and data prefetching [4, 7, 8] have significantly improved memory hierarchy utilization for regular applications. A limitation of these techniques is that they aren’t as effective for irregular appli...

534 | Computer Solution of Large Sparse Positive Definite Matrices
- George, Liu
- 1981
Citation Context ...as et al. [16] applied a breadth-first traversal strategy known as Reverse Cuthill-McKee to order elements in an unstructured mesh to improve locality. This reordering technique was developed by George [26] for a different purpose: bandwidth and profile minimization of sparse matrices. George’s strategy was a refinement of a breadth-first ordering technique developed by Cuthill and McKee [27]. The Cuthi...

531 | The cache performance and optimizations of blocked algorithms
- Lam, Rothberg, et al.
- 1991
Citation Context ...ieving high performance on such systems requires tailoring the reference behavior of applications to better match the characteristics of a machine’s memory hierarchy. Techniques such as loop blocking [1, 2, 3, 4, 5, 6] and data prefetching [4, 7, 8] have significantly improved memory hierarchy utilization for regular applications. A limitation of these techniques is that they aren’t as effective for irregular appli...

511 | Parallel Multilevel k-way Partition Scheme for Irregular Graphs
- Karypis, Kumar
- 1999
Citation Context ...rder (of the (src,dest) edge pairs), and Hilbert order. Comparing different data orderings with random is interesting because the parallel version of the CHAD code uses the ParMETIS graph partitioner [34] to partition the nodes and edges of the computational mesh among available processors. ParMETIS computes its partitionings in a hierarchical fashion and swaps nodes and edges between partitions. ...

478 | Design and Evaluation of A Compiler Algorithm for Prefetching
- Mowry, Lam, et al.
- 1992
Citation Context ...requires tailoring the reference behavior of applications to better match the characteristics of a machine’s memory hierarchy. Techniques such as loop blocking [1, 2, 3, 4, 5, 6] and data prefetching [4, 7, 8] have significantly improved memory hierarchy utilization for regular applications. A limitation of these techniques is that they aren’t as effective for irregular applications. Improving performance ...

306 | Improving Data Locality with Loop Transformations
- McKinley, Carr, et al.
- 1996
Citation Context ...s a standard approach to achieving stride-1 access in regular computations. This transformation has been specifically studied in the context of memory hierarchy improvement by a number of researchers [13, 14]. As described earlier, data reordering can be used to reduce bandwidth requirements of irregular applications. Ding and Kennedy [15] explored compiler and run-time support for a class of run-time dat...

306 | Applications of Spatial Data Structures
- Samet
- 1990
Citation Context ...ng curve by applying a sequence of bit-level logical operations to its d-dimensional coordinates. Space-filling curves, their properties, and the details of their construction are described elsewhere [17, 18]. A Hilbert space-filling curve is one type of space-filling curve. Figure 2 shows a fifth-order Hilbert curve in two dimensions. This curve has an important property: its recursive structure preserve...

300 | Space-filling curves
- Sagan
- 1994
Citation Context ...ng curve by applying a sequence of bit-level logical operations to its d-dimensional coordinates. Space-filling curves, their properties, and the details of their construction are described elsewhere [17, 18]. A Hilbert space-filling curve is one type of space-filling curve. Figure 2 shows a fifth-order Hilbert curve in two dimensions. This curve has an important property: its recursive structure preserve...

246 | Strategies for cache and local memory management by global program transformation
- Gannon, Jalby, et al.
- 1988
Citation Context ...ieving high performance on such systems requires tailoring the reference behavior of applications to better match the characteristics of a machine’s memory hierarchy. Techniques such as loop blocking [1, 2, 3, 4, 5, 6] and data prefetching [4, 7, 8] have significantly improved memory hierarchy utilization for regular applications. A limitation of these techniques is that they aren’t as effective for irregular appli...

212 | Improving Register Allocation for Subscripted Variables
- Callahan, Carr, et al.
- 1990

154 | A parallel hashed oct-tree N-body algorithm, Supercomputing
- Warren, Salmon
- 1993
Citation Context ...elated ordering techniques [19] have been used to partition data and computation among processors in parallel computer systems. They have been applied in problem domains that include n-body problems [20, 19], graph partitioning [21], and adaptive mesh refinement [22]. Ordering data elements by their position along a space-filling curve and assigning each processor a contiguous range of elements of equal ...

149 | Data-centric multilevel blocking
- Kodukula, Ahmed, et al.
- 1997
Citation Context ...ss of multi-level blocking techniques on dense linear algebra [11] and a paper by Kodukula et al. presents a data-centric blocking algorithm that can be effectively applied to multi-level hierarchies [12]. The principal strategy for improving bandwidth utilization for regular problems, aside from blocking for reuse, has been to transform the program to increase spatial locality. Loop interchange is...

129 | Software Methods for Improvement of Cache Performance
- Porterfield
- 1989

100 | On estimating and enhancing cache effectiveness
- Ferrante, Sarkar, et al.
- 1991

95 | The design and implementation of a parallel unstructured euler solver using software primitives
- Das, Mavriplis, et al.
- 1992
Citation Context ...omposed of tuples of particles or objects, they apply a grouping transformation to order tuples in the sequence to consider all interactions involving one object before moving to the next. Das et al. [16] applied this same computation reordering in an unstructured mesh application. Ding and Kennedy [15] did not specifically consider reordering for multi-level memory hierarchies although they proposed ...

83 | Cache-oblivious algorithms
- Prokop
- 1999
Citation Context ...ce or the average degree of nodes in an unstructured mesh. Over the last several years, recursive divide and conquer strategies have been advocated for blocking regular computations for machines with multi-level memory hierarchies in an architecture-independent fashion (24, 31). The rationale for this approach is that if the computation at a particular level of recursion doesn't fit into some level of the...

77 | Auto-blocking matrix multiplication, or tracking BLAS3 performance from source code
- Frens, Wise
- 1997
Citation Context ...atrix multiplication, Thottethodi et al. [23] explored ordering matrix elements by their position along a space-filling curve rather than typical row-major or column-major orderings, and Frens & Wise [24] proposed recursive matrix layouts based on quad trees. The hierarchical locality resulting from these recursively defined orderings is a good match for divide-and-conquer matrix algorithms. Several re...

74 | On partitioning dynamic adaptive grid hierarchies
- Parashar, Browne
- 1996
Citation Context ... and computation among processors in parallel computer systems. They have been applied in problem domains that include n-body problems [20, 19], graph partitioning [21], and adaptive mesh refinement [22]. Ordering data elements by their position along a space-filling curve and assigning each processor a contiguous range of elements of equal (possibly weighted) size is a fast partitioning technique th...
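The partitioning idea in this excerpt, giving each processor a contiguous range of curve-ordered elements with roughly equal total weight, might look like the sketch below. It is only an illustration under stated assumptions: the elements are already sorted by their position along the curve, `weights` is a non-empty list of non-negative per-element costs, and the function name `partition_by_weight` is hypothetical.

```python
from itertools import accumulate


def partition_by_weight(weights, num_parts):
    """Split curve-ordered elements into num_parts contiguous half-open ranges.

    A range is closed as soon as the running weight reaches that part's
    share of the total, so ranges have approximately equal weight.
    Assumes weights is non-empty and non-negative.
    """
    prefix = list(accumulate(weights))
    total = prefix[-1]
    bounds, start, part = [], 0, 1
    for idx, running in enumerate(prefix):
        # One heavy element can cross several thresholds; later parts are then empty.
        while part < num_parts and running * num_parts >= total * part:
            bounds.append((start, idx + 1))
            start = idx + 1
            part += 1
    bounds.append((start, len(weights)))
    return bounds


if __name__ == "__main__":
    print(partition_by_weight([1, 1, 1, 1, 1, 1], 3))
    print(partition_by_weight([4, 1, 1, 2], 2))
```

Because the split is a single pass over a prefix sum, its cost is linear in the number of elements once they have been ordered along the curve.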

67 | Automatic loop interchange
- Allen, Kennedy
- 1984
Citation Context ...s a standard approach to achieving stride-1 access in regular computations. This transformation has been specifically studied in the context of memory hierarchy improvement by a number of researchers [13, 14]. As described earlier, data reordering can be used to reduce bandwidth requirements of irregular applications. Ding and Kennedy [15] explored compiler and run-time support for a class of run-time dat...

67 | Load balancing and data locality in adaptive hierarchical N-body methods: Barnes-Hut, fast multipole, and radiosity
- Singh, Holt, et al.
- 1995
Citation Context ...ensions. popular because they are simple to compute: a point's position along the curve is determined by a bitwise interleaving of its coordinates. Space-filling curves or related ordering techniques (19) have been used to partition data and computation among processors in parallel computer systems. They have been applied in problem domains that include n-body problems, (19, 20) graph partitioning, (2...
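The bitwise interleaving mentioned in this excerpt can be illustrated with a short sketch. An ordering built this way is commonly called Morton order or Z-order. The function name `morton_key` and the `bits` parameter are hypothetical, and the code assumes non-negative integer coordinates that each fit in `bits` bits.

```python
def morton_key(coords, bits):
    """Interleave the low `bits` bits of each coordinate into one integer key.

    Bit b of coordinate number d lands at bit position b * len(coords) + d,
    so the first coordinate occupies the least significant slot of each group.
    """
    key = 0
    ndims = len(coords)
    for bit in range(bits):
        for dim, c in enumerate(coords):
            key |= ((c >> bit) & 1) << (bit * ndims + dim)
    return key


if __name__ == "__main__":
    points = [(1, 1), (0, 1), (1, 0), (0, 0)]
    # Sorting by the key places points in Z-order.
    print(sorted(points, key=lambda p: morton_key(p, bits=2)))
```

Sorting data elements by such a key tends to place spatially nearby points close together in memory, which is the locality property the reorderings rely on.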

51 | Localizing non-affine array references
- Mitchell, Carter, et al.
- 1999
Citation Context ...icles DO FOR j = i to number of blocks of particles DO process interactions between all interacting particle pairs with the first particle in block i and the second in block j. Mitchell, Carter and Ferrante [29] concurrently developed a related blocking technique for irregular references that they call bucket tiling. They improve the locality of a stream of accesses for a single non-affine reference by reor...

39 | Effective Cache Prefetching on Bus-Based Multiprocessors
- Tullsen, Eggers
- 1995
Citation Context ...requires tailoring the reference behavior of applications to better match the characteristics of a machine’s memory hierarchy. Techniques such as loop blocking [1, 2, 3, 4, 5, 6] and data prefetching [4, 7, 8] have significantly improved memory hierarchy utilization for regular applications. A limitation of these techniques is that they aren’t as effective for irregular applications. Improving performance ...

34 | Memory hierarchy management for iterative graph structures
- Al-Furaih, Ranka
Citation Context ...divide-and-conquer matrix algorithms. Several researchers have investigated strategies for improving memory hierarchy performance for algorithms on graphs and unstructured meshes. Al-Furaih and Ranka [25] used a simple breadth-first node numbering. Das et al. [16] applied a breadth-first traversal strategy known as Reverse Cuthill-McKee to order elements in an unstructured mesh to improve locality. This...

34 | An Algorithm for Profile and Wavefront Reduction of Sparse Matrices
- Sloan
- 1986
Citation Context ...ted graph and renumber graph nodes using a breadth-first traversal in which all unnumbered neighbors of a node x are added to a FIFO queue of nodes to be numbered by order of increasing degree. Sloan [28] developed a related but more sophisticated reordering strategy. First, he more carefully selects the first node in the ordering to yield orderings with narrower level structure. Then, at each step in...
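The breadth-first renumbering described in this excerpt can be sketched roughly as below. This is a simplified illustration, not the cited algorithms in full: it assumes a connected graph given as an adjacency dictionary, takes the start node as an argument rather than selecting it, and breaks degree ties by node id, which is an arbitrary choice made here. The function names are hypothetical.

```python
from collections import deque


def cuthill_mckee(adj, start):
    """Breadth-first numbering; unnumbered neighbors are queued by increasing degree."""
    order = []
    seen = {start}
    queue = deque([start])
    while queue:
        x = queue.popleft()
        order.append(x)
        # Tie-break on node id is an assumption made for determinism.
        neighbors = sorted(
            (n for n in adj[x] if n not in seen),
            key=lambda n: (len(adj[n]), n),
        )
        for n in neighbors:
            seen.add(n)
            queue.append(n)
    return order


def reverse_cuthill_mckee(adj, start):
    """Reverse Cuthill-McKee: the same numbering, reversed."""
    return cuthill_mckee(adj, start)[::-1]


if __name__ == "__main__":
    graph = {0: [1, 2], 1: [0, 3], 2: [0, 3, 4], 3: [1, 2], 4: [2]}
    print(cuthill_mckee(graph, start=4))
    print(reverse_cuthill_mckee(graph, start=4))
```

The resulting list gives the new position of each node. Sloan's refinements to start-node selection and per-step prioritization, mentioned in the excerpt, are not modeled here.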

32 | Improving cache performance of dynamic applications with computation and data layout transformations
- Ding, Kennedy
- 1999
Citation Context ...ext of memory hierarchy improvement by a number of researchers [13, 14]. As described earlier, data reordering can be used to reduce bandwidth requirements of irregular applications. Ding and Kennedy [15] explored compiler and run-time support for a class of run-time data reordering techniques. They examine an access sequence and use it to greedily reorder data aiming to increase spatial locality as t...

32 | Code transformations to improve memory parallelism
- Pai, Adve
- 1999
Citation Context ...rve based reorderings to improve the parallel efficiency of shared-memory and software distributed shared memory computations by improving data locality, which reduces communication and false sharing [36, 37]. Our experiences show that good data and computation orders can be achieved for irregular problems using dynamic reorderings, and that the gain in locality from using good data and computation orders...

25 | Automatic program transformations for virtual memory computers
- Abu-Sufah, Kuck, et al.
- 1979
Citation Context ...r. 2. Related Work Blocking for improving the performance of memory hierarchies has been a subject of research for the last few decades. Early papers focused on blocking to improve paging performance [9, 10], but recent work has focused more narrowly on improving cache performance [2, 5, 4, 6]. Techniques similar to blocking have also been effectively applied to improvement of reuse in registers [1]. Mos...

15 | The organization of matrices and matrix operations in a paged multiprogramming environment
- McKeller, Coffman
- 1969
Citation Context ...r. 2. Related Work Blocking for improving the performance of memory hierarchies has been a subject of research for the last few decades. Early papers focused on blocking to improve paging performance [9, 10], but recent work has focused more narrowly on improving cache performance [2, 5, 4, 6]. Techniques similar to blocking have also been effectively applied to improvement of reuse in registers [1]. Mos...

15 | Improving fine-grained irregular shared-memory benchmarks by data reordering. In: Supercomputing
- Hu, Cox, et al.
- 2000
Citation Context ...rve based reorderings to improve the parallel efficiency of shared-memory and software distributed shared memory computations by improving data locality, which reduces communication and false sharing [36, 37]. Our experiences show that good data and computation orders can be achieved for irregular problems using dynamic reorderings, and that the gain in locality from using good data and computation orders...

14 | Design and Evaluation of a Compiler Algorithm for
- Mowry, Lam, et al.
- 1992
Citation Context ...such systems requires tailoring the reference behavior of applications to better match the characteristics of a machine's memory hierarchy. Techniques such as loop blocking (1–6) and data prefetching (4,7,8) have significantly improved memory hierarchy utilization for regular applications. A limitation of these techniques is that they aren't as effective for irregular applications. Improving performance ...

10 | Reducing the Bandwidth of Sparse Symmetric
- Cuthill, McKee
- 1969
Citation Context ... by George [26] for a different purpose: bandwidth and profile minimization of sparse matrices. George’s strategy was a refinement of a breadth-first ordering technique developed by Cuthill and McKee [27]. The Cuthill-McKee and Reverse Cuthill-McKee orderings use an adjacency list representation of an undirected graph and renumber graph nodes using a breadth-first traversal in which all unnumbered nei...

8 | Architecture-independent locality-improving transformations of computational graphs embedded in k-dimensions
- Ou, Gunwani, et al.
- 1994
Citation Context ... been used to partition data and computation among processors in parallel computer systems. They have been applied in problem domains that include n-body problems [WaS93, SHT95], graph partitioning [OGR95], and adaptive mesh refinement [PaB96]. Ordering data elements by their position along a space-filling curve and assigning each processor a contiguous range of elements of equal (possibly weighted) si...

7 | Cache-oblivious algorithms. Master’s thesis, Department of Electrical Engineering and Computer Science at the Massachusetts Institute of Technology
- Prokop
- 1999
Citation Context ...st several years, recursive divide and conquer strategies have been advocated for blocking regular computations for machines with multi-level memory hierarchies in an architecture-independent fashion [31, 24]. The rationale for this approach is that if the computation at a particular level of recursion doesn’t fit into some level of the memory hierarchy, the computation at some deeper level of recursion w...

3 | Load Balancing and Data Locality
- Singh, Holt, et al.
- 1995
Citation Context ...s are popular because they are simple to compute: a point’s position along the curve is determined by a bitwise interleaving of its coordinates. Space-filling curves or related ordering techniques [19] have been used to partition data and computation among processors in parallel computer systems. They have been applied in problem domains that include n-body problems [20, 19], graph partitioning [2...

2 | Architecture-Independent Locality-Improving Transformations of Computational Graphs Embedded
- Ou, Gunwani, et al.
- 1995
Citation Context ...9] have been used to partition data and computation among processors in parallel computer systems. They have been applied in problem domains that include n-body problems [20, 19], graph partitioning [21], and adaptive mesh refinement [22]. Ordering data elements by their position along a space-filling curve and assigning each processor a contiguous range of elements of equal (possibly weighted) size ...

2 | Tuning Strassen’s Matrix Multiplication Algorithm for Memory Efficiency
- Thottethodi, Chatterjee, et al.
- 1998
Citation Context ...han the other methods they studied. Several researchers have proposed using recursive data layouts for computation on dense matrices. To improve locality for matrix multiplication, Thottethodi et al. [23] explored ordering matrix elements by their position along a space-filling curve rather than typical row-major or column-major orderings, and Frens & Wise [24] proposed recursive matrix layouts based ...

1 | Personal Communication
- Robey
- 2000
Citation Context ...hierarchical fashion and swaps nodes and edges between partitions. After partitioning, the locality properties of the mesh pieces are believed to resemble those for meshes with random node orderings. (35) We compute Hilbert order for nodes by normalizing each node's X, Y, and Z coordinates, which are a triple of integer coordinates, each in the range [0..2^21]. We then convert this triple to a positi...