Results 1–10 of 31
Streaming Multigrid for Gradient-Domain Operations on Large Images
Abstract

Cited by 43 (5 self)
We introduce a new tool to solve the large linear systems arising from gradient-domain image processing. Specifically, we develop a streaming multigrid solver, which needs just two sequential passes over out-of-core data. This fast solution is enabled by a combination of three techniques: (1) use of second-order finite elements (rather than traditional finite differences) to reach sufficient accuracy in a single V-cycle, (2) temporally blocked relaxation, and (3) multilevel streaming to pipeline the restriction and prolongation phases into single streaming passes. A key contribution is the extension of the B-spline finite-element method to be compatible with the forward-difference gradient representation commonly used with images. Our streaming solver is also efficient for in-memory images, due to its fast convergence and excellent cache behavior. Remarkably, it can outperform spatially adaptive solvers that exploit application-specific knowledge. We demonstrate seamless stitching and tone-mapping of gigapixel images in about an hour on a notebook PC. Keywords: out-of-core multigrid solver, B-spline finite elements, Poisson equation, gigapixel images, multilevel streaming.
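The relaxation, restriction, and prolongation phases that this abstract pipelines into streaming passes are the standard ingredients of a multigrid V-cycle. As a point of reference only (not the paper's B-spline finite-element solver), here is a minimal 1D Poisson V-cycle with Gauss-Seidel smoothing; all names and parameters are illustrative:

```python
import numpy as np

def relax(u, f, h, sweeps=3):
    """Gauss-Seidel sweeps for -u'' = f with u[0] = u[-1] = 0."""
    for _ in range(sweeps):
        for i in range(1, len(u) - 1):
            u[i] = 0.5 * (u[i - 1] + u[i + 1] + h * h * f[i])

def residual(u, f, h):
    r = np.zeros_like(u)
    r[1:-1] = f[1:-1] - (2 * u[1:-1] - u[:-2] - u[2:]) / (h * h)
    return r

def restrict_fw(r):
    """Full-weighting restriction from n = 2^k + 1 to (n + 1) // 2 points."""
    rc = np.zeros((len(r) + 1) // 2)
    rc[1:-1] = 0.25 * r[1:-2:2] + 0.5 * r[2:-1:2] + 0.25 * r[3::2]
    return rc

def prolong(ec):
    """Linear-interpolation prolongation back to the fine grid."""
    e = np.zeros(2 * len(ec) - 1)
    e[::2] = ec
    e[1::2] = 0.5 * (ec[:-1] + ec[1:])
    return e

def v_cycle(u, f, h):
    if len(u) <= 3:
        relax(u, f, h, sweeps=50)            # "exact" coarse-grid solve
        return u
    relax(u, f, h)                           # pre-smoothing (relaxation phase)
    rc = restrict_fw(residual(u, f, h))      # restriction phase
    ec = v_cycle(np.zeros_like(rc), rc, 2 * h)
    u += prolong(ec)                         # prolongation + coarse correction
    relax(u, f, h)                           # post-smoothing
    return u
```

The streaming solver's point is that, with second-order elements, one such cycle already suffices; the sketch above needs several cycles because it uses plain finite differences.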
Assembly of finite element methods on graphics processors
 International Journal for Numerical Methods in Engineering
Abstract

Cited by 20 (0 self)
Recently, graphics processing units (GPUs) have had great success in accelerating many numerical computations. We present their application to computations on unstructured meshes such as those in finite element methods. Multiple approaches to assembling and solving sparse linear systems with NVIDIA GPUs and the Compute Unified Device Architecture (CUDA) are presented and discussed. Multiple strategies for efficient use of global, shared, and local memory, methods to achieve memory coalescing, and optimal choice of parameters are introduced. We find that with appropriate preprocessing and arrangement of support data, the GPU coprocessor achieves speedups of 30 or more in comparison to a well-optimized serial implementation. We also find that the optimal assembly strategy depends on the order of polynomials used in the finite-element discretization.
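The element-by-element assembly loop that such papers map onto GPUs can be illustrated, in heavily simplified form, with 1D linear elements. This dense-numpy sketch shows only the generic FEM pattern, not the paper's CUDA strategies; the scatter-add in the inner loop is the step that needs colouring or atomics on a GPU:

```python
import numpy as np

def assemble_stiffness_1d(n_elems, length=1.0):
    """Assemble the global stiffness matrix for -u'' on [0, length]
    with n_elems linear elements, element by element."""
    h = length / n_elems
    k_local = (1.0 / h) * np.array([[1.0, -1.0],
                                    [-1.0, 1.0]])   # one element's local stiffness
    n_nodes = n_elems + 1
    K = np.zeros((n_nodes, n_nodes))
    for e in range(n_elems):                 # elements are independent: the GPU axis
        dofs = [e, e + 1]                    # global node numbers of element e
        for a in range(2):
            for b in range(2):
                K[dofs[a], dofs[b]] += k_local[a, b]   # scatter-add into global matrix
    return K
```

For higher polynomial orders the local matrices grow, which is why the abstract finds the optimal assembly strategy order-dependent.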
Exposing fine-grained parallelism in algebraic multigrid methods
, 2012
Abstract

Cited by 17 (0 self)
Algebraic multigrid methods for large, sparse linear systems are a necessity in many computational simulations, yet parallel algorithms for such solvers are generally decomposed into coarse-grained tasks suitable for distributed computers with traditional processing cores. However, accelerating multigrid on massively parallel throughput-oriented processors, such as the GPU, demands algorithms with abundant fine-grained parallelism. In this paper, we develop a parallel algebraic multigrid method which exposes substantial fine-grained parallelism in both the construction of the multigrid hierarchy and the cycling or solve stage. Our algorithms are expressed in terms of scalable parallel primitives that are efficiently implemented on the GPU. The resulting solver achieves an average speedup of 1.8× in the setup phase and 5.7× in the cycling phase when compared to a representative CPU implementation.
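As a rough illustration of the two stages named here (hierarchy construction and cycling), the toy below builds a one-shot aggregation hierarchy and runs a two-level cycle with damped Jacobi smoothing. It is a deliberately simplified stand-in, not the paper's method: real AMG aggregates by matrix connectivity and uses sparse parallel primitives, whereas this sketch aggregates fixed index blocks and uses dense numpy:

```python
import numpy as np

def two_level_amg(A, b, n_cycles=100, agg_size=2):
    """Toy two-level AMG: piecewise-constant prolongator over fixed
    aggregates, Galerkin coarse matrix, damped-Jacobi smoothing."""
    n = A.shape[0]
    n_coarse = (n + agg_size - 1) // agg_size
    P = np.zeros((n, n_coarse))
    P[np.arange(n), np.arange(n) // agg_size] = 1.0   # setup: tentative prolongator
    Ac = P.T @ A @ P                                   # setup: Galerkin product
    Dinv = 1.0 / np.diag(A)
    x = np.zeros(n)
    for _ in range(n_cycles):                          # cycling (solve) stage
        x += 0.7 * Dinv * (b - A @ x)                  # pre-smoothing
        r = b - A @ x
        x += P @ np.linalg.solve(Ac, P.T @ r)          # coarse-grid correction
        x += 0.7 * Dinv * (b - A @ x)                  # post-smoothing
    return x
```

Both the setup products and the cycle's matrix-vector work vectorise cleanly, which is the fine-grained structure the paper exploits on the GPU.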
GPU Acceleration of an Unmodified Parallel Finite Element Navier-Stokes Solver
 McIntire (Eds.), High Performance Computing & Simulation 2009
, 2009
Abstract

Cited by 14 (2 self)
We have previously suggested a minimally invasive approach to include hardware accelerators into an existing large-scale parallel finite element PDE solver toolkit, and implemented it in our software FEAST. Our concept has the important advantage that applications built on top of FEAST benefit from the acceleration immediately, without changes to application code. In this paper we explore the limitations of our approach by accelerating a Navier-Stokes solver. This nonlinear saddle point problem is much more involved than our previous tests, and does not exhibit an equally favourable acceleration potential: not all computational work is concentrated inside the linear solver. Nonetheless, we are able to achieve speedups of more than a factor of two on a small GPU-enhanced cluster. We conclude with a discussion of how our concept can be altered to further improve acceleration.
FEAST – Realisation of hardware-oriented Numerics for HPC simulations with Finite Elements, Concurrency and Computation: Practice and Experience 22 (6) (2010) 2247–2265, doi: 10.1002/cpe.1584
Abstract

Cited by 13 (5 self)
FEAST (Finite Element Analysis & Solutions Tools) is a Finite Element based solver toolkit for the simulation of PDE problems on parallel HPC systems which implements the concept of ‘hardware-oriented numerics’, a holistic approach aiming at optimal performance for modern numerics. In this paper, we describe this concept and the modular design which enables applications built on top of FEAST to execute efficiently, without any code modifications, on commodity-based clusters, the NEC SX-8 and GPU-accelerated clusters. We demonstrate good performance and weak and strong scalability for the prototypical Poisson problem and more challenging applications from solid mechanics and fluid dynamics.
Co-processor acceleration of an unmodified parallel solid mechanics code with FEASTGPU
 In the International Journal of Computational Science and Engineering
, 2008
Abstract

Cited by 13 (6 self)
Feast is a hardware-oriented, MPI-based Finite Element solver toolkit. With the extension FeastGPU the authors have previously demonstrated that significant speedups in the solution of the scalar Poisson problem can be achieved by the addition of GPUs as scientific coprocessors to a commodity-based cluster. In this paper we put the more general claim to the test: applications based on Feast, which so far ran only on CPUs, can be successfully accelerated on a coprocessor-enhanced cluster without any code modifications. The chosen solid mechanics code has higher accuracy requirements and a more diverse CPU/coprocessor interaction than the Poisson example, and is thus better suited to assess the practicability of our acceleration approach. We present accuracy experiments, a scalability test and acceleration results for different elastic objects under load. In particular, we demonstrate in detail that the single precision execution of the coprocessor does not affect the final accuracy. We establish how the local acceleration gains of factors 5.5 to 9.0 translate into 1.6- to 2.6-fold total speedup. Subsequent analysis reveals which measures will increase these factors further.
A Parallel Algebraic Multigrid Solver on Graphics Processing Units
Abstract

Cited by 13 (1 self)
The paper presents a multi-GPU implementation of the preconditioned conjugate gradient algorithm with an algebraic multigrid preconditioner (PCG-AMG) for an elliptic model problem on a 3D unstructured grid. An efficient parallel sparse matrix-vector multiplication scheme underlying the PCG-AMG algorithm is presented for the many-core GPU architecture. A performance comparison of the parallel solver shows that a single Nvidia Tesla C1060 GPU board delivers the performance of a sixteen-node Infiniband cluster and a multi-GPU configuration with eight GPUs is about 100 times faster than a typical server CPU core.
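The sparse matrix-vector product named here as the underlying kernel can be sketched generically for the common CSR (compressed sparse row) format. This is a plain reference version, not the paper's scheme; the row loop is the part a GPU implementation parallelises, typically one thread or warp per row:

```python
import numpy as np

def csr_matvec(indptr, indices, data, x):
    """y = A @ x for A stored in CSR form (indptr, indices, data)."""
    n = len(indptr) - 1
    y = np.zeros(n)
    for row in range(n):                 # rows are independent: the parallel axis
        start, end = indptr[row], indptr[row + 1]
        # dot product of the row's stored entries with the gathered x values
        y[row] = np.dot(data[start:end], x[indices[start:end]])
    return y
```

The irregular gather `x[indices[...]]` is what makes memory-access layout (coalescing, caching) the dominant performance concern on GPUs.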
Fast conjugate gradients with multiple GPUs
 In ICCS ’09: Proceedings of the 9th International Conference on Computational Science
, 2009
Abstract

Cited by 11 (0 self)
The limiting factor for efficiency of sparse linear solvers is the memory bandwidth. In this work, we utilize the GPU's high memory bandwidth to implement a sparse iterative solver for unstructured problems. We describe a fast Conjugate Gradient solver, which runs on multiple GPUs installed on a single mainboard. The solver achieves double precision accuracy with single precision GPUs, using a mixed precision iterative refinement algorithm. To achieve high computation speed, we propose a fast sparse matrix-vector multiplication algorithm, which is the core operation of iterative solvers. The proposed multiplication algorithm efficiently utilizes GPU resources via caching, coalesced memory accesses and load balance between running threads. Experiments on a wide range of matrices show that our matrix-vector multiplication algorithm achieves up to 9.9 Gflops on a single GeForce 8800 GTS card and the CG implementation achieves up to 22.6 Gflops with four GPUs.
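The mixed precision iterative refinement idea can be sketched independently of the GPU details: solve for a correction in single precision, accumulate solution and residual in double precision. The `inner_solver` argument below is a hypothetical stand-in for the paper's single-precision GPU CG; here it is faked with a float32 direct solve:

```python
import numpy as np

def mixed_precision_refinement(A, b, inner_solver, n_iter=10):
    """Iterative refinement: fast low-precision inner solves on the
    residual, with the correction accumulated in double precision."""
    x = np.zeros_like(b, dtype=np.float64)
    for _ in range(n_iter):
        r = b - A @ x                                  # residual in double precision
        d = inner_solver(A.astype(np.float32),
                         r.astype(np.float32))         # cheap float32 inner solve
        x += d.astype(np.float64)                      # accumulate in double
    return x
```

For a well-conditioned system each refinement step shrinks the error by roughly the single-precision unit roundoff, so a handful of cheap inner solves reaches full double-precision accuracy.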
A full-depth amalgamated parallel 3D geometric multigrid solver for GPU clusters
 In 49th AIAA Aerospace Science Meeting
, 2011
Abstract

Cited by 6 (0 self)
Numerical computations of incompressible flow equations with pressure-based algorithms necessitate the solution of an elliptic Poisson equation, for which multigrid methods are known to be very efficient. In our previous work we presented a dual-level (MPI-CUDA) parallel implementation of the Navier-Stokes equations to simulate buoyancy-driven incompressible fluid flows on GPU clusters with simple iterative methods while focusing on the scalability of the overall solver. In the present study we describe the implementation and performance of a multigrid method to solve the pressure Poisson equation within our MPI-CUDA parallel incompressible flow solver. Various design decisions and algorithmic choices for multigrid methods are explored in light of NVIDIA's recent Fermi architecture. We discuss how unique aspects of an MPI-CUDA implementation for GPU clusters are related to the software choices made to implement the multigrid method. We propose a new coarse grid solution method of embedded multigrid with amalgamation and show that the parallel implementation retains the numerical efficiency of the multigrid method. Performance measurements on the NCSA Lincoln and TACC Longhorn clusters are presented for up to 64 GPUs.
UCHPC – UnConventional High Performance Computing for Finite Element Simulations
Abstract

Cited by 5 (4 self)
Processor technology is still dramatically advancing and promises enormous improvements in processing data for the next decade. These improvements are driven by parallelisation and specialisation of resources, and ‘unconventional hardware’ like GPUs or the Cell processor can be seen as forerunners of this development. At the same time, much smaller advances are expected in moving data; this means that the efficiency of many simulation tools – particularly those based on Finite Elements, which often lead to huge but very sparse linear systems – is restricted by the cost of memory access. We explain our approach to combining efficient data structures and multigrid solver concepts, and discuss the influence of processor technology on numerical and algorithmic developments. Concepts of ‘hardware-oriented numerics’ are described and their numerical and computational characteristics are examined based on implementations in Feast, a high performance solver toolbox for Finite Elements which is able to exploit unconventional hardware components as ‘FEM coprocessors’, on sequential as well as on massively parallel computers. Finally, we demonstrate prototypically how these algorithmic and computational concepts can be applied to solid mechanics problems, and we present simulations on heterogeneous parallel computers with more than one billion unknowns.