Results 1–10 of 75
Compile-time Composition of Run-time Data and Iteration Reorderings
, 2003
Cited by 53 (9 self)
Many important applications, such as those using sparse data structures, have memory reference patterns that are unknown at compile time. Prior work has developed run-time reorderings of data and computation that enhance locality in such applications.
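The run-time data reorderings the abstract refers to can be illustrated with a first-touch (consecutive-packing style) permutation: the data array is repacked in the order an access trace first touches it, and the trace is rewritten to match. The function names and the toy trace below are illustrative sketches, not code from the paper:

```python
def first_touch_permutation(index_trace, n):
    """Map each data index to its position in first-touch order.
    Elements never touched keep their relative order at the end."""
    order, seen = [], set()
    for i in index_trace:
        if i not in seen:
            seen.add(i)
            order.append(i)
    order += [i for i in range(n) if i not in seen]
    return {old: new for new, old in enumerate(order)}

def reorder(data, index_trace):
    """Permute the data array and rewrite the trace so that the same
    values are visited, but at near-sequential memory locations."""
    new_pos = first_touch_permutation(index_trace, len(data))
    new_data = [None] * len(data)
    for old, new in new_pos.items():
        new_data[new] = data[old]
    return new_data, [new_pos[i] for i in index_trace]

data = [10.0, 20.0, 30.0, 40.0]
trace = [3, 1, 3, 0, 1]
new_data, new_trace = reorder(data, trace)
# Same values are visited, but the rewritten trace walks memory
# in first-touch order: 0, 1, 0, 2, 1
assert [data[i] for i in trace] == [new_data[j] for j in new_trace]
```

The permutation is computed at run time, once the access pattern is known, which is exactly why such transformations are out of reach for a purely static compiler.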
Sketching Stencils
Cited by 40 (6 self)
The performance of stencil computations can be significantly improved through smart implementations that improve memory locality, reuse computation, or parallelize the computation. Unfortunately, efficient implementations are hard to obtain because they often involve non-traditional transformations, which means that they cannot be produced by optimizing the reference stencil with a compiler. In fact, many stencils are produced by code generators that were tediously handcrafted. In this paper, we show how stencil implementations can be produced with sketching. Sketching is a software synthesis approach where the programmer develops a partial implementation (a sketch) and a separate specification of the desired functionality given by a reference (unoptimized) stencil. The synthesizer then completes the sketch to behave like the specification, filling in code fragments that are difficult to develop manually. Existing sketching systems work only for small finite programs, i.e., programs that can be represented as small Boolean circuits. In this paper, we develop a sketching synthesizer that works for stencil computations, a large class of programs that, unlike circuits, have unbounded inputs and outputs, as well as an unbounded number of computations. The key contribution is a reduction algorithm that turns a stencil into a circuit, allowing us to synthesize stencils using an existing sketching synthesizer.
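The specification role played by the reference stencil can be sketched in miniature: a naive 3-point stencil serves as the spec, and a hand-optimized variant stands in for a completed sketch. A real synthesizer searches for the hole's contents; here the two are simply checked for equivalence on sample inputs, as an equivalence oracle would. All names are illustrative:

```python
def reference(a):
    """Unoptimized 3-point averaging stencil: the specification."""
    out = a[:]
    for i in range(1, len(a) - 1):
        out[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0
    return out

def candidate(a):
    """An optimized variant that keeps a sliding window in scalars
    (register reuse), standing in for a completed sketch."""
    n = len(a)
    out = a[:]
    if n < 3:
        return out
    left, mid = a[0], a[1]
    for i in range(1, n - 1):
        right = a[i + 1]
        out[i] = (left + mid + right) / 3.0
        left, mid = mid, right
    return out

# The synthesizer's obligation, in toy form: candidate must behave
# like the reference on all inputs; we spot-check a few.
samples = [[1.0, 2.0, 3.0, 4.0, 5.0], [0.0, 3.0, 0.0], [7.5]]
assert all(candidate(s) == reference(s) for s in samples)
```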
Streaming Multigrid for Gradient-Domain Operations on Large Images
Cited by 36 (5 self)
We introduce a new tool to solve the large linear systems arising from gradient-domain image processing. Specifically, we develop a streaming multigrid solver, which needs just two sequential passes over out-of-core data. This fast solution is enabled by a combination of three techniques: (1) use of second-order finite elements (rather than traditional finite differences) to reach sufficient accuracy in a single V-cycle, (2) temporally blocked relaxation, and (3) multi-level streaming to pipeline the restriction and prolongation phases into single streaming passes. A key contribution is the extension of the B-spline finite-element method to be compatible with the forward-difference gradient representation commonly used with images. Our streaming solver is also efficient for in-memory images, due to its fast convergence and excellent cache behavior. Remarkably, it can outperform spatially adaptive solvers that exploit application-specific knowledge. We demonstrate seamless stitching and tone-mapping of gigapixel images in about an hour on a notebook PC.
Keywords: out-of-core multigrid solver, B-spline finite elements, Poisson equation, gigapixel images, multi-level streaming
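In one dimension, the forward-difference gradient representation mentioned in the abstract reduces to a prefix sum: the signal is recovered from its differences and one boundary value. The paper's solver handles the 2D least-squares version of this problem with multigrid; the sketch below only illustrates the representation itself:

```python
def forward_diff(f):
    """Forward-difference gradient: g[i] = f[i+1] - f[i]."""
    return [f[i + 1] - f[i] for i in range(len(f) - 1)]

def integrate(g, f0):
    """Recover f from its forward differences plus the boundary value f[0].
    With an exact 1D gradient this is just a prefix sum; for 2D images
    the gradient field is generally inconsistent, which is why a Poisson
    (least-squares) solve is needed instead."""
    f = [f0]
    for d in g:
        f.append(f[-1] + d)
    return f

f = [2.0, 5.0, 4.0, 4.5]
g = forward_diff(f)            # [3.0, -1.0, 0.5]
assert integrate(g, f[0]) == f
```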
An overview of cache optimization techniques and cache-aware numerical algorithms
 In Proceedings of the GI-Dagstuhl Forschungsseminar: Algorithms for Memory Hierarchies, volume 2625 of LNCS
, 2003
Cited by 34 (4 self)
In order to mitigate the impact of the growing gap between CPU speed and main memory performance, today's computer architectures implement hierarchical memory structures. The idea behind this approach is to hide both the low main memory bandwidth and the latency of main memory accesses which is …
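A representative technique from such overviews is loop blocking (tiling), which restructures a traversal so each tile's working set fits in cache. A minimal sketch using a blocked matrix transpose; the tile size and function name are illustrative:

```python
def transpose_tiled(a, tile=32):
    """Blocked (tiled) transpose of a list-of-lists matrix.  Each tile's
    reads and writes touch only a tile x tile working set, so with a
    cache-friendly tile size both the row-wise reads and column-wise
    writes stay resident while the tile is processed."""
    n, m = len(a), len(a[0])
    out = [[0] * n for _ in range(m)]
    for ii in range(0, n, tile):
        for jj in range(0, m, tile):
            for i in range(ii, min(ii + tile, n)):
                for j in range(jj, min(jj + tile, m)):
                    out[j][i] = a[i][j]
    return out

a = [[1, 2, 3], [4, 5, 6]]
assert transpose_tiled(a, tile=2) == [[1, 4], [2, 5], [3, 6]]
```

The result is identical to an untiled transpose; only the order of memory accesses changes, which is the defining property of this class of cache optimizations.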
Cache-efficient multigrid algorithms
 Int. J. High Perform. Comput. Appl.
Cited by 33 (0 self)
Multigrid is widely used as an efficient solver for sparse linear systems arising from the discretization of elliptic boundary value problems. Linear relaxation methods like Gauss-Seidel and Red-Black Gauss-Seidel form the principal computational component of multigrid, and thus affect its efficiency. In the context of multigrid, these iterative solvers are executed for a small number of iterations (2–8). We exploit this property of the algorithm to develop a cache-efficient multigrid, by focusing on improving the memory behavior of the linear relaxation methods. The efficiency in our cache-efficient linear relaxation algorithm comes from two sources: reducing the number of data cache and TLB misses, and reducing the number of memory references by keeping values register-resident. Experiments on five modern computing platforms show a performance improvement of 1.15–2.7 times over a standard implementation of the Full Multigrid V-Cycle.
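A minimal sketch of the Red-Black Gauss-Seidel relaxation discussed above, run for a few sweeps on a small 1D Poisson problem (the setup is illustrative, not the paper's benchmark):

```python
def red_black_sweep(u, f, h):
    """One Red-Black Gauss-Seidel sweep for the 1D Poisson problem
    -u'' = f with zero Dirichlet boundary values.  All odd ("red")
    interior points are updated first, then all even ("black") ones;
    points of one color do not depend on each other, which exposes
    parallelism and regular memory access within each half-sweep."""
    n = len(u)
    for start in (1, 2):                  # red pass, then black pass
        for i in range(start, n - 1, 2):
            u[i] = 0.5 * (u[i - 1] + u[i + 1] + h * h * f[i])
    return u

n = 9
h = 1.0 / (n - 1)
f = [1.0] * n
u = [0.0] * n
# Discrete solution of -u'' = 1 on this grid, for measuring the error
exact = [0.5 * h * h * i * (n - 1 - i) for i in range(n)]
err0 = max(abs(u[i] - exact[i]) for i in range(n))
for _ in range(4):    # multigrid runs only a few such sweeps per level
    red_black_sweep(u, f, h)
err = max(abs(u[i] - exact[i]) for i in range(n))
assert err < err0     # a handful of sweeps already shrinks the error
```

Because the same few grid lines are revisited sweep after sweep, fusing or blocking these sweeps (as the paper does) keeps them in cache instead of re-streaming the whole grid each time.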
Optimization and profiling of the cache performance of parallel lattice Boltzmann codes in 2D and 3D
 PARALLEL PROCESSING LETTERS
, 2003
Cited by 27 (6 self)
When designing and implementing highly efficient scientific applications for parallel computers such as clusters of workstations, it is essential to consider and optimize the single-CPU performance of the codes. For this purpose, it is particularly important that the codes respect the hierarchical memory designs that computer architects employ in order to hide the effects of the growing gap between CPU performance and main memory speed. In this article, we present techniques to enhance the single-CPU efficiency of lattice Boltzmann methods, which are commonly used in computational fluid dynamics. We show various performance results for both 2D and 3D codes in order to emphasize the effectiveness of our optimization techniques.
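One cache optimization commonly applied to lattice Boltzmann codes is fusing the streaming and collision phases into a single pass over the grid instead of two; whether this matches the paper's exact technique set is not claimed here. A toy D1Q2 model (two distributions on a periodic 1D lattice, BGK collision) is enough to show the fused loop:

```python
def lbm_step(fp, fm, omega=1.0):
    """One fused stream-and-collide step of a toy D1Q2 lattice Boltzmann
    model on a periodic 1D lattice.  fp holds right-moving distributions,
    fm left-moving ones.  Doing both phases in one loop means each cell's
    data is touched once per time step rather than twice."""
    n = len(fp)
    out_p = [0.0] * n
    out_m = [0.0] * n
    for i in range(n):
        p = fp[(i - 1) % n]          # streamed in from the left neighbor
        m = fm[(i + 1) % n]          # streamed in from the right neighbor
        rho = p + m                  # local density
        # BGK relaxation toward the local equilibrium (rho/2, rho/2)
        out_p[i] = p + omega * (0.5 * rho - p)
        out_m[i] = m + omega * (0.5 * rho - m)
    return out_p, out_m

fp, fm = [1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]
fp, fm = lbm_step(fp, fm)
assert abs(sum(fp) + sum(fm) - 2.0) < 1e-12   # mass is conserved
```

Real 2D/3D codes add many more lattice directions and layout choices (e.g. structure-of-arrays storage), but the fused traversal pattern is the same.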
Using GPUs to improve multigrid solver performance on a cluster
 J. OF COMPUTATIONAL SCIENCE AND ENGINEERING
, 2008
Cited by 27 (7 self)
This article explores the coupling of coarse- and fine-grained parallelism for Finite Element simulations based on efficient parallel multigrid solvers. The focus lies on both system performance and a minimally invasive integration of hardware acceleration into an existing software package, requiring no changes to application code. Because of their excellent price/performance ratio, we demonstrate the viability of our approach by using commodity graphics processors (GPUs) as efficient multigrid preconditioners. We address the issue of limited precision on GPUs by applying a mixed-precision, iterative refinement technique. Other restrictions are also handled by a close interplay between the GPU and CPU. From a software perspective, we integrate the GPU solvers into the existing MPI-based Finite Element package by implementing the same interfaces as the CPU solvers, so that for the application programmer they are easily interchangeable. Our results show that we do not compromise any software functionality and gain speedups of two and more for large problems. Equipped with this additional option of hardware acceleration, we compare different choices in increasing the performance of a conventional, commodity-based cluster by increasing the number …
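The mixed-precision iterative refinement technique mentioned in the abstract can be sketched in a few lines: the inner solve runs in (simulated) low precision, while residuals and the accumulated solution stay in double precision. The 2x2 system and the 10-bit rounding below are illustrative stand-ins for the paper's GPU single-precision multigrid solves:

```python
from math import frexp, ldexp

def to_low(x):
    """Simulate a low-precision device value by keeping ~10 mantissa bits
    (a stand-in for single-precision GPU arithmetic)."""
    m, e = frexp(x)
    return ldexp(round(m * 1024) / 1024, e)

def solve_low(a, b, c, d, r0, r1):
    """'Device' solve of [[a,b],[c,d]] x = r by Cramer's rule, with the
    result rounded to low precision."""
    det = a * d - b * c
    return to_low((d * r0 - b * r1) / det), to_low((a * r1 - c * r0) / det)

def refine(a, b, c, d, b0, b1, iters=5):
    """Mixed-precision iterative refinement: cheap low-precision inner
    solves, with residuals and accumulation kept in double precision."""
    x0 = x1 = 0.0
    for _ in range(iters):
        r0 = b0 - (a * x0 + b * x1)       # residual in high precision
        r1 = b1 - (c * x0 + d * x1)
        e0, e1 = solve_low(a, b, c, d, r0, r1)
        x0, x1 = x0 + e0, x1 + e1         # accumulate in high precision
    return x0, x1

# [[2,1],[1,3]] x = [3, 4.1] has the solution x = (0.98, 1.04); a single
# low-precision solve is only accurate to ~3 digits, but refinement
# recovers full double accuracy.
x0, x1 = refine(2.0, 1.0, 1.0, 3.0, 3.0, 4.1)
assert abs(x0 - 0.98) < 1e-9 and abs(x1 - 1.04) < 1e-9
```

Each refinement step gains roughly the low-precision number of digits, so a handful of cheap device solves matches one expensive high-precision solve.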
Combining Performance Aspects of Irregular Gauss-Seidel via Sparse Tiling
 in 15th Workshop on Languages and Compilers for Parallel Computing (LCPC)
, 2002
Cited by 25 (12 self)
Finite Element problems are often solved using multigrid techniques. The most time-consuming part of multigrid is the iterative smoother, such as Gauss-Seidel. To improve performance, iterative smoothers can exploit parallelism, intra-iteration data reuse, and inter-iteration data reuse. Current methods for parallelizing Gauss-Seidel on irregular grids, such as multicoloring and owner-computes based techniques, exploit parallelism and possibly intra-iteration data reuse but not inter-iteration data reuse. Sparse tiling techniques were developed to improve intra-iteration and inter-iteration data locality in iterative smoothers. This paper describes how sparse tiling can additionally provide parallelism. Our results show the effectiveness of Gauss-Seidel parallelized with sparse tiling techniques on shared-memory machines, specifically compared to owner-computes based Gauss-Seidel methods. The latter employ only parallelism and intra-iteration locality. Our results support the premise that better performance occurs when all three performance aspects (parallelism, intra-iteration, and inter-iteration data locality) are combined.
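For contrast with sparse tiling, the multicoloring baseline mentioned above can be sketched with a greedy graph coloring: unknowns of one color share no edge in the sparse matrix graph, so each color class can be relaxed in parallel within a Gauss-Seidel sweep. The adjacency list below is an illustrative toy mesh, not data from the paper:

```python
def greedy_color(adj):
    """Greedy multicoloring of an irregular grid's adjacency structure.
    Returns a dict mapping each vertex to the smallest color not used by
    an already-colored neighbor.  Same-color unknowns are independent,
    so a parallel Gauss-Seidel sweep processes one color class at a
    time.  (Sparse tiling goes further by also grouping work across
    smoother iterations for inter-iteration reuse.)"""
    color = {}
    for v in sorted(adj):
        used = {color[u] for u in adj[v] if u in color}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return color

# A small irregular mesh as an adjacency list
adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 4], 3: [1, 4], 4: [2, 3]}
color = greedy_color(adj)
# Proper coloring: no two neighbors share a color
assert all(color[v] != color[u] for v in adj for u in adj[v])
```

As the abstract notes, this exposes parallelism and some intra-iteration reuse, but jumping between color classes sacrifices the inter-iteration locality that sparse tiling recovers.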
Avoiding Communication in Sparse Matrix Computations
 In Proceedings of IPDPS
, 2008
"... 1 ..."
(Show Context)