Compile-time Composition of Run-time Data and Iteration Reorderings
2003
Cited by 31 (1 self)
Many important applications, such as those using sparse data structures, have memory reference patterns that are unknown at compile-time. Prior work has developed runtime reorderings of data and computation that enhance locality in such applications.
Memory Characteristics of Iterative Methods
1999
Cited by 21 (9 self)
Conventional implementations of iterative numerical algorithms, especially multigrid methods, reach only a disappointingly small percentage of the theoretically available CPU performance when applied to representative large problems. One of the most important reasons for this phenomenon is that current DRAM technology cannot provide data fast enough to keep the CPU busy. Although the fundamentals of cache optimizations are quite simple, current compilers cannot optimize even elementary iterative schemes. In this paper, we analyze the memory and cache behavior of iterative methods with extensive profiling and describe program transformation techniques to improve the cache performance of two- and three-dimensional multigrid algorithms. 1 Introduction Multigrid methods [11, 5] are among the most attractive algorithms for the solution of large sparse systems of equations that arise in the solution of elliptic partial differential equations (PDEs). However, even simple multi...
An Overview of Cache Optimization Techniques and Cache-Aware Numerical Algorithms
Algorithms for Memory Hierarchies: Advanced Lectures, volume 2625 of Lecture Notes in Computer Science, 2003
Cited by 20 (4 self)
This paper focuses on optimization techniques for enhancing cache performance.
Streaming Multigrid for Gradient-Domain Operations on Large Images
Cited by 17 (3 self)
We introduce a new tool to solve the large linear systems arising from gradient-domain image processing. Specifically, we develop a streaming multigrid solver, which needs just two sequential passes over out-of-core data. This fast solution is enabled by a combination of three techniques: (1) use of second-order finite elements (rather than traditional finite differences) to reach sufficient accuracy in a single V-cycle, (2) temporally blocked relaxation, and (3) multi-level streaming to pipeline the restriction and prolongation phases into single streaming passes. A key contribution is the extension of the B-spline finite-element method to be compatible with the forward-difference gradient representation commonly used with images. Our streaming solver is also efficient for in-memory images, due to its fast convergence and excellent cache behavior. Remarkably, it can outperform spatially adaptive solvers that exploit application-specific knowledge. We demonstrate seamless stitching and tone-mapping of gigapixel images in about an hour on a notebook PC. Keywords: out-of-core multigrid solver, B-spline finite elements, Poisson equation, gigapixel images, multi-level streaming.
Sketching Stencils
Cited by 13 (2 self)
Performance of stencil computations can be significantly improved through smart implementations that improve memory locality, reuse computation, or parallelize the computation. Unfortunately, efficient implementations are hard to obtain because they often involve non-traditional transformations, which means that they cannot be produced by optimizing the reference stencil with a compiler. In fact, many stencils are produced by code generators that were tediously handcrafted. In this paper, we show how stencil implementations can be produced with sketching. Sketching is a software synthesis approach where the programmer develops a partial implementation (a sketch) and a separate specification of the desired functionality given by a reference (unoptimized) stencil. The synthesizer then completes the sketch to behave like the specification, filling in code fragments that are difficult to develop manually. Existing sketching systems work only for small finite programs, i.e., programs that can be represented as small Boolean circuits. In this paper, we develop a sketching synthesizer that works for stencil computations, a large class of programs that, unlike circuits, have unbounded inputs and outputs, as well as an unbounded number of computations. The key contribution is a reduction algorithm that turns a stencil into a circuit, allowing us to synthesize stencils using an existing sketching synthesizer.
Combining Performance Aspects of Irregular Gauss-Seidel via Sparse Tiling
In 15th Workshop on Languages and Compilers for Parallel Computing (LCPC), 2002
Cited by 12 (4 self)
Finite element problems are often solved using multigrid techniques. The most time-consuming part of multigrid is the iterative smoother, such as Gauss-Seidel. To improve performance, iterative smoothers can exploit parallelism, intra-iteration data reuse, and inter-iteration data reuse. Current methods for parallelizing Gauss-Seidel on irregular grids, such as multi-coloring and owner-computes based techniques, exploit parallelism and possibly intra-iteration data reuse but not inter-iteration data reuse. Sparse tiling techniques were developed to improve intra-iteration and inter-iteration data locality in iterative smoothers. This paper describes how sparse tiling can additionally provide parallelism. Our results show the effectiveness of Gauss-Seidel parallelized with sparse tiling techniques on shared memory machines, specifically compared to owner-computes based Gauss-Seidel methods. The latter employ only parallelism and intra-iteration locality. Our results support the premise that better performance occurs when all three performance aspects (parallelism, intra-iteration data locality, and inter-iteration data locality) are combined.
Adaptive hybrid FEM/FDM methods for inverse scattering problems
Department of Mathematics, Chalmers University of Technology & Goteborg University, 2002
Cited by 10 (8 self)
This thesis is devoted to adaptive hybrid finite element / finite difference methods for an inverse scattering problem for the time-dependent acoustic wave equation in 2D and 3D, where we seek to reconstruct an unknown sound velocity c(x) from measured wave-reflection data.
Rescheduling for Locality in Sparse Matrix Computations
Proceedings of the 2001 International Conference on Computational Science, Lecture Notes in Computer Science
Cited by 10 (5 self)
In modern computer architecture the use of memory hierarchies causes a program's data locality to directly affect performance. Data locality occurs when a piece of data is still in a cache upon reuse. For dense matrix computations, loop transformations can be used to improve data locality. However, sparse matrix computations have non-affine loop bounds and indirect memory references, which prohibit the use of compile-time loop transformations. This paper describes an algorithm to tile at runtime called serial sparse tiling. We test a runtime tiled version of sparse Gauss-Seidel on four different architectures, where it exhibits speedups of up to 2.7. The paper also gives a static model for determining tile size and outlines how overhead affects the overall speedup.
Optimization and profiling of the cache performance of parallel lattice Boltzmann codes in 2D and 3D
Parallel Processing Letters, 2003
Cited by 8 (1 self)
When designing and implementing highly efficient scientific applications for parallel computers such as clusters of workstations, it is essential to consider and optimize the single-CPU performance of the codes. For this purpose, it is particularly important that the codes respect the hierarchical memory designs that computer architects employ in order to hide the effects of the growing gap between CPU performance and main memory speed. In this article, we present techniques to enhance the single-CPU efficiency of lattice Boltzmann methods, which are commonly used in computational fluid dynamics. We show various performance results for both 2D and 3D codes in order to emphasize the effectiveness of our optimization techniques.

