Results 1–10 of 25
Compile-time Composition of Run-time Data and Iteration Reorderings, 2003
Abstract

Cited by 54 (9 self)
Many important applications, such as those using sparse data structures, have memory reference patterns that are unknown at compile time. Prior work has developed run-time reorderings of data and computation that enhance locality in such applications.
Impact of modern memory subsystems on cache optimizations for stencil computations
MEMORY SYSTEM PERFORMANCE, 2005
Abstract

Cited by 39 (10 self)
In this work we investigate the impact of evolving memory system features, such as large on-chip caches, automatic prefetch, and the growing distance to main memory, on 3D stencil computations. These calculations form the basis for a wide range of scientific applications, from simple Jacobi iterations to complex multigrid and block-structured adaptive PDE solvers. First we develop a simple benchmark to evaluate the effectiveness of prefetching in cache-based memory systems. Next we present a small parameterized probe and validate its use as a proxy for general stencil computations on three modern microprocessors. We then derive an analytical memory cost model for quantifying cache-blocking behavior and demonstrate its effectiveness in predicting stencil-computation performance. Overall results demonstrate that recent trends in memory system organization have reduced the efficacy of traditional cache-blocking optimizations.
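The cache blocking the abstract evaluates can be illustrated with a minimal sketch (not the authors' benchmark code): a 7-point 3D Jacobi sweep whose i/j/k loops are tiled so each block of the grid stays cache-resident while it is processed. Both versions compute identical values; only the traversal order changes.

```python
# Minimal sketch of cache blocking for a 7-point 3D Jacobi stencil.
# Grid is a flat n*n*n list; boundary points are left unchanged.

def jacobi_step(u, n):
    """One unblocked sweep over the interior; returns a new grid."""
    idx = lambda i, j, k: (i * n + j) * n + k
    v = list(u)
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            for k in range(1, n - 1):
                v[idx(i, j, k)] = (u[idx(i-1,j,k)] + u[idx(i+1,j,k)] +
                                   u[idx(i,j-1,k)] + u[idx(i,j+1,k)] +
                                   u[idx(i,j,k-1)] + u[idx(i,j,k+1)]) / 6.0
    return v

def jacobi_step_blocked(u, n, B):
    """Same sweep with the loops tiled into B^3 blocks: identical
    results, better locality once the grid exceeds the cache."""
    idx = lambda i, j, k: (i * n + j) * n + k
    v = list(u)
    for ii in range(1, n - 1, B):
        for jj in range(1, n - 1, B):
            for kk in range(1, n - 1, B):
                for i in range(ii, min(ii + B, n - 1)):
                    for j in range(jj, min(jj + B, n - 1)):
                        for k in range(kk, min(kk + B, n - 1)):
                            v[idx(i, j, k)] = (u[idx(i-1,j,k)] + u[idx(i+1,j,k)] +
                                               u[idx(i,j-1,k)] + u[idx(i,j+1,k)] +
                                               u[idx(i,j,k-1)] + u[idx(i,j,k+1)]) / 6.0
    return v
```

Because each interior point is computed once from the read-only input grid, the blocked sweep is bitwise identical to the unblocked one.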
Sparse Tiling for Stationary Iterative Methods
INTERNATIONAL JOURNAL OF HIGH PERFORMANCE COMPUTING APPLICATIONS, 2004
Abstract

Cited by 25 (8 self)
In modern computers, a program’s data locality can affect performance significantly. This paper details full sparse tiling, a run-time reordering transformation that improves the data locality for stationary iterative methods such as Gauss–Seidel operating on sparse matrices. In scientific applications such as finite element analysis, these iterative methods dominate the execution time. Full sparse tiling chooses a permutation of the rows and columns of the sparse matrix, and then an order of execution that achieves better data locality. We prove that full sparse-tiled Gauss–Seidel generates a solution that is bitwise identical to traditional Gauss–Seidel on the permuted matrix. We also present measurements of the performance improvements and the overheads of full sparse tiling and of cache blocking for irregular grids, a related technique developed by Douglas et al.
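The kernel that full sparse tiling reorders is a Gauss–Seidel sweep over a sparse matrix. A hedged sketch (illustrative names, not the paper's code) of one sweep in compressed sparse row (CSR) form:

```python
# One Gauss-Seidel sweep on a CSR sparse matrix A, solving A x = b.
# vals/cols/rowptr are the standard CSR arrays; x is updated in place.

def gauss_seidel_sweep(vals, cols, rowptr, b, x):
    n = len(b)
    for i in range(n):
        s, diag = 0.0, 0.0
        for p in range(rowptr[i], rowptr[i + 1]):
            j = cols[p]
            if j == i:
                diag = vals[p]
            else:
                # rows j < i already hold their updated values,
                # which is what distinguishes Gauss-Seidel from Jacobi
                s += vals[p] * x[j]
        x[i] = (b[i] - s) / diag
    return x
```

The sequential dependence on already-updated entries is exactly why the access order matters: full sparse tiling picks a permutation and execution order so that these dependent accesses hit cache.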
Loop Chaining: A Programming Abstraction For Balancing Locality and Parallelism
Abstract

Cited by 5 (3 self)
There is a significant, established code base in the scientific computing community. Some of these codes have already been parallelized but are now encountering scalability issues due to poor data locality, inefficient data distributions, or load imbalance. In this work, we introduce a new abstraction called loop chaining in which a sequence of parallel and/or reduction loops that explicitly share data are grouped together into a chain. Once specified, a chain of loops can be viewed as a set of iterations under a partial ordering. This partial ordering is dictated by data dependencies that, as part of the abstraction, are exposed, thereby avoiding interprocedural program analysis. Thus a loop chain is a partially ordered set of iterations that makes scheduling and determining data distributions across loops possible for a compiler and/or runtime system. The flexibility of being able to schedule across loops enables better management of the data locality and parallelism tradeoff. In this paper, we define the loop chaining concept and present three case studies using loop chains in scientific codes: the sparse matrix Jacobi benchmark; a domain-specific library, OP2, used in full applications with unstructured grids; and a domain-specific library, Chombo, used in full applications with structured grids. Preliminary results for the Jacobi benchmark show that a loop-chain-enabled optimization, full sparse tiling, results in a speedup of as much as 2.68x over a parallelized, blocked implementation on a multicore system with 40 cores.
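The core idea, scheduling across loops once their shared data dependencies are exposed, can be sketched in miniature (this is an illustration of the concept, not the paper's API). Two loops share the array y; because the chain exposes that loop B reads y[i-1..i+1], a scheduler may execute both loops tile by tile, growing each tile of loop A by the dependency halo:

```python
# Toy loop chain: loop A produces y, loop B consumes a 3-point
# neighborhood of y. Chaining lets us tile ACROSS the two loops.

def run_separately(x):
    n = len(x)
    y = [2.0 * v for v in x]                             # loop A
    z = [y[i-1] + y[i] + y[i+1] for i in range(1, n-1)]  # loop B
    return z

def run_chained(x, tile=4):
    n = len(x)
    y = [None] * n
    z = [None] * (n - 2)
    for t in range(1, n - 1, tile):
        hi = min(t + tile, n - 1)
        # loop A on the tile grown by its halo [t-1, hi], so loop B's
        # exposed dependency y[i-1..i+1] is satisfied within the tile
        for i in range(t - 1, min(hi + 1, n)):
            y[i] = 2.0 * x[i]
        # loop B on the tile itself, while y is still hot in cache
        for i in range(t, hi):
            z[i - 1] = y[i-1] + y[i] + y[i+1]
    return z
```

Recomputing the one-element halo overlap is harmless here; a real scheduler such as full sparse tiling instead partially orders the iterations to avoid redundancy.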
Split Tiling for GPUs: Automatic Parallelization Using Trapezoidal Tiles, 2013
Abstract

Cited by 5 (3 self)
Tiling is a key technique to enhance data reuse. For computations structured as one sequential outer “time” loop enclosing a set of parallel inner loops, tiling only the parallel inner loops may not enable enough data reuse in the cache. Tiling the inner loops along with the outer time loop enhances data locality but may require other transformations, like loop skewing, that inhibit inter-tile parallelism. One approach to tiling that enhances data locality without inhibiting inter-tile parallelism is split tiling, where tiles are subdivided into a sequence of trapezoidal computation steps. In this paper, we develop an approach to generate split-tiled code for GPUs in the PPCG polyhedral code generator. We propose a generic algorithm to calculate index-set splitting that enables us to perform tiling for locality and synchronization avoidance, while simultaneously maintaining parallelism, without the need for skewing or redundant computations. Our algorithm performs split tiling for an arbitrary number of dimensions and without the need to construct any large integer linear program. The method and its implementation are evaluated on standard stencil kernels and compared with a state-of-the-art polyhedral compiler and with a domain-specific stencil compiler, both targeting CUDA GPUs.
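A 1-D sketch of the split-tiling shape (illustrative only; the paper's generator targets GPUs and arbitrary dimensions). Phase 1 computes "upright" trapezoids that shrink by one point per time step, so each tile touches only its own data and all tiles could run in parallel; phase 2 fills the gaps with inverted trapezoids that consume the phase-1 results, with no skewing and no redundant computation:

```python
# Split tiling of a 3-point stencil over T time steps, tile width W.
# Assumes W > 2*T so the trapezoids of adjacent tiles do not overlap.

def stencil(u, i):
    return (u[i-1] + u[i] + u[i+1]) / 3.0

def naive(u0, T):
    n = len(u0)
    cur = list(u0)
    for _ in range(T):
        prev, cur = cur, list(cur)
        for i in range(1, n - 1):
            cur[i] = stencil(prev, i)
    return cur

def split_tiled(u0, T, W):
    n = len(u0)
    U = [list(u0)] + [[None] * n for _ in range(T)]
    for k in range(1, T + 1):            # domain boundaries are fixed
        U[k][0], U[k][n-1] = u0[0], u0[n-1]
    # Phase 1: upright trapezoids, one per tile, shrinking per step;
    # independent of each other, hence parallel.
    for s in range(0, n, W):
        for k in range(1, T + 1):
            for i in range(max(1, s + k), min(n - 1, s + W - k)):
                U[k][i] = stencil(U[k-1], i)
    # Phase 2: inverted trapezoids around each tile boundary, growing
    # per step; also mutually independent when W > 2*T.
    for b in range(0, n + 1, W):
        for k in range(1, T + 1):
            for i in range(max(1, b - k), min(n - 1, b + k)):
                if U[k][i] is None:
                    U[k][i] = stencil(U[k-1], i)
    return U[T]
```

Every point is computed exactly once from the same inputs as the naive time loop, so the results match bitwise.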
The Zoltan and Isorropia Parallel Toolkits for Combinatorial Scientific Computing: Partitioning, Ordering and Coloring
Abstract

Cited by 4 (2 self)
Partitioning and load balancing are important problems in scientific computing that can be modeled as combinatorial problems using graphs or hypergraphs. The Zoltan toolkit was developed primarily for partitioning and load balancing to support dynamic parallel applications, but has expanded to support other problems in combinatorial scientific computing, including matrix ordering and graph coloring. Zoltan is based on abstract user interfaces and uses callback functions. To simplify the use and integration of Zoltan with other matrix-based frameworks, such as the ones in Trilinos, we developed Isorropia as a Trilinos package, which supports most of Zoltan’s features via a matrix-based interface. In addition to providing an easy-to-use matrix-based interface to Zoltan, Isorropia also serves as a platform for additional matrix algorithms. In this paper, we give an overview of the Zoltan and Isorropia toolkits, their design, capabilities and use. We also show how Zoltan and Isorropia enable large-scale, parallel scientific simulations, and describe current and future development in the next-generation package Zoltan2.
Executing Optimized Irregular Applications Using Task Graphs Within Existing Parallel Models
Abstract

Cited by 3 (2 self)
Many sparse or irregular scientific computations are memory bound and benefit from locality-improving optimizations such as blocking or tiling. These optimizations result in asynchronous parallelism that can be represented by arbitrary task graphs. Unfortunately, most popular parallel programming models, with the exception of Threading Building Blocks (TBB), do not directly execute arbitrary task graphs. In this paper, we compare the programming and execution of arbitrary task graphs qualitatively and quantitatively in TBB, the OpenMP doall model, the OpenMP 3.0 task model, and Cilk Plus. We present performance and scalability results for 8- and 40-core shared memory systems on a sparse matrix iterative solver and a molecular dynamics benchmark.
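What "directly executing an arbitrary task graph" means can be sketched with a small scheduler (a hypothetical helper, not TBB or OpenMP syntax): a task is submitted to a thread pool the moment all of its predecessors have finished, which is exactly the asynchronous parallelism that blocked or tiled sparse computations expose:

```python
# Execute an arbitrary (acyclic) task graph on a thread pool: each
# task runs as soon as every one of its predecessors has completed.
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def run_graph(tasks, deps, workers=4):
    """tasks: {name: fn(results_dict)}; deps: {name: [prereq names]}."""
    remaining = {t: set(deps.get(t, ())) for t in tasks}
    results, running = {}, {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while remaining or running:
            # submit every task whose predecessor set is now empty
            for t in [t for t, d in remaining.items() if not d]:
                del remaining[t]
                running[pool.submit(tasks[t], results)] = t
            # block until at least one running task finishes
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for fut in done:
                t = running.pop(fut)
                results[t] = fut.result()
                for d in remaining.values():
                    d.discard(t)
    return results
```

A doall model, by contrast, can only express the level-by-level schedule, inserting a barrier between levels even when cross-level tasks are independent.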
Set and Relation Manipulation for the Sparse Polyhedral Framework
Abstract

Cited by 3 (2 self)
The Sparse Polyhedral Framework (SPF) extends the Polyhedral Model by using the uninterpreted function call abstraction for the compile-time specification of run-time reordering transformations, such as loop and data reordering and sparse tiling approaches that schedule irregular sets of iterations across loops. The Polyhedral Model represents sets of iteration points in imperfectly nested loops with unions of polyhedra and represents loop transformations with affine functions applied to such polyhedral sets. Existing tools such as ISL, Cloog, and Omega manipulate polyhedral sets and affine functions; however, the ability to represent sets and functions where some of the constraints include uninterpreted function calls, such as those needed in the SPF, is nonexistent or severely restricted. This paper presents algorithms for manipulating sets and relations with uninterpreted function symbols to enable the Sparse Polyhedral Framework. The algorithms have been implemented in an open-source C++ library called IEGenLib (the Inspector/Executor Generator Library).
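A toy illustration of the uninterpreted-function idea (not IEGenLib's API): an iteration set whose constraints mention a function col(i) that is opaque at compile time. Symbolically, membership cannot be decided; once an inspector supplies a concrete col array at run time, it can:

```python
# An iteration set { [i] : 0 <= i < n and 0 <= col(i) < m } where
# col is an uninterpreted function: unknown at compile time, bound
# to a concrete index array by the run-time inspector.

class SparseSet:
    def __init__(self, n, m):
        self.n, self.m = n, m       # symbolic bounds, known statically

    def members(self, col):
        """Enumerate the set once 'col' (the runtime binding of the
        uninterpreted function) is available."""
        return [i for i in range(self.n) if 0 <= col[i] < self.m]
```

The compile-time half of SPF manipulates such sets and relations symbolically (composition, inversion, constraint simplification) without ever needing the values of col.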
Applications of algebraic multigrid to large-scale finite element analysis of whole bone micro-mechanics on the IBM SP
ACM/IEEE Proceedings of SC2003: High Performance Networking and Computing, 2003
Abstract

Cited by 3 (1 self)
Accurate micro-finite element analyses of whole bones require the solution of large sets of algebraic equations. Multigrid has proven to be an effective approach to the design of highly scalable linear solvers for solid mechanics problems. We present some of the first applications of scalable linear solvers, on massively parallel computers, to whole vertebral body structural analysis. We analyze the performance of our algebraic multigrid (AMG) methods on problems with over 237 million degrees of freedom on IBM SP parallel computers. We demonstrate excellent parallel scalability, both in the algorithms and the implementations, and analyze the nodal performance of the important AMG kernels on the IBM Power3 and Power4 architectures. Key words: multigrid, trabecular bone, human vertebral body, finite element method, massively parallel computing.
On the Scalability of Loop Tiling Techniques
Abstract

Cited by 2 (0 self)
The Polyhedral model has proven to be a valuable tool for improving memory locality and exploiting parallelism when optimizing dense array codes. This model is expressive enough to describe transformations of imperfectly nested loops and to capture a variety of program transformations, including many approaches to loop tiling. Tools such as the highly successful PLuTo automatic parallelizer have provided empirical confirmation of the success of polyhedral-based optimization, through experiments in which a number of benchmarks have been executed on machines with small- to medium-scale parallelism. In anticipation of ever higher degrees of parallelism, we have explored the impact of various loop tiling strategies on the asymptotic degree of available parallelism. In our analysis, we consider “weak scaling” as described by Gustafson, i.e., scaling in which the data set size grows linearly with the number of processors available. Some, but not all, of the approaches to tiling provide weak scaling. In particular, the tiling currently performed by PLuTo does not scale in this sense. In this article, we review approaches to loop tiling in the published literature, focusing on both scalability and implementation status. We find that fully scalable tilings are not available in general-purpose tools, and call upon the polyhedral compilation community to focus on questions of asymptotic scalability. Finally, we identify ongoing work that may resolve this issue.
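The weak-scaling regime the authors invoke is captured by Gustafson's law: if the serial fraction of the (scaled) workload is s, the speedup on N processors is S(N) = N - (N - 1) s, so speedup grows nearly linearly when the problem grows with the machine. A one-line sketch:

```python
# Gustafson's law: scaled speedup on N processors for a workload
# whose serial fraction (measured on the parallel system) is s.
def gustafson_speedup(serial_fraction, nprocs):
    return nprocs - (nprocs - 1) * serial_fraction
```

A tiling is "fully scalable" in the article's sense precisely when it keeps the effective serial fraction from growing as the grid (and processor count) grows; a wavefront-style startup that lengthens with the problem size does not.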