Results 1  10
of
123
Linear Algebra Operators for GPU Implementation of Numerical Algorithms
 ACM Transactions on Graphics
, 2003
"... In this work, the emphasis is on the development of strategies to realize techniques of numerical computing on the graphics chip. In particular, the focus is on the acceleration of techniques for solving sets of algebraic equations as they occur in numerical simulation. We introduce a framework for ..."
Abstract

Cited by 317 (9 self)
 Add to MetaCart
In this work, the emphasis is on the development of strategies to realize techniques of numerical computing on the graphics chip. In particular, the focus is on the acceleration of techniques for solving sets of algebraic equations as they occur in numerical simulation. We introduce a framework for the implementation of linear algebra operators on programmable graphics processors (GPUs), thus providing the building blocks for the design of more complex numerical algorithms. In particular, we propose a stream model for arithmetic operations on vectors and matrices that exploits the intrinsic parallelism and efficient communication on modern GPUs. Besides performance gains due to improved numerical computations, graphics algorithms benefit from this model in that the transfer of computation results to the graphics processor for display is avoided. We demonstrate the effectiveness of our approach by implementing direct solvers for sparse matrices, and by applying these solvers to multidimensional finite difference equations, i.e. the 2D wave equation and the incompressible NavierStokes equations.
Sparse matrix solvers on the GPU: conjugate gradients and multigrid
 ACM Trans. Graph
, 2003
"... Permission to make digital/hard copy of part of all of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication, and its date appear, and notice is given ..."
Abstract

Cited by 289 (3 self)
 Add to MetaCart
Permission to make digital/hard copy of part of all of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication, and its date appear, and notice is given that copying is by permission
Scan Primitives for GPU Computing
 GRAPHICS HARDWARE 2007
, 2007
"... The scan primitives are powerful, generalpurpose dataparallel primitives that are building blocks for a broad range of applications. We describe GPU implementations of these primitives, specifically an efficient formulation and implementation of segmented scan, on NVIDIA GPUs using the CUDA API.Us ..."
Abstract

Cited by 170 (9 self)
 Add to MetaCart
The scan primitives are powerful, generalpurpose dataparallel primitives that are building blocks for a broad range of applications. We describe GPU implementations of these primitives, specifically an efficient formulation and implementation of segmented scan, on NVIDIA GPUs using the CUDA API.Using the scan primitives, we show novel GPU implementations of quicksort and sparse matrixvector multiply, and analyze the performance of the scan primitives, several sort algorithms that use the scan primitives, and a graphical shallowwater fluid simulation using the scan framework for a tridiagonal matrix solver.
Photon Mapping on Programmable Graphics Hardware
 GRAPHICS HARDWARE
, 2003
"... We present a modified photon mapping algorithm capable of running entirely on GPUs. Our implementation uses breadthfirst photon tracing to distribute photons using the GPU. The photons are stored in a gridbased photon map that is constructed directly on the graphics hardware using one of two met ..."
Abstract

Cited by 149 (4 self)
 Add to MetaCart
We present a modified photon mapping algorithm capable of running entirely on GPUs. Our implementation uses breadthfirst photon tracing to distribute photons using the GPU. The photons are stored in a gridbased photon map that is constructed directly on the graphics hardware using one of two methods: the first method is a multipass technique that uses fragment programs to directly sort the photons into a compact grid. The second method uses a single rendering pass combining a vertex program and the stencil buffer to route photons to their respective grid cells, producing an approximate photon map. We also present an efficient method for locating the nearest photons in the grid, which makes it possible to compute an estimate of the radiance at any surface location in the scene. Finally, we describe a breadthfirst stochastic ray tracer that uses the photon map to simulate full global illumination directly on the graphics hardware. Our implementation demonstrates that current graphics hardware is capable of fully simulating global illumination with progressive, interactive feedback to the user.
Mars: A MapReduce Framework on Graphics Processors
"... We design and implement Mars, a MapReduce framework, on graphics processors (GPUs). MapReduce is a distributed programming framework originally proposed by Google for the ease of development of web search applications on a large number of CPUs. Compared with commodity CPUs, GPUs have an order of mag ..."
Abstract

Cited by 129 (11 self)
 Add to MetaCart
(Show Context)
We design and implement Mars, a MapReduce framework, on graphics processors (GPUs). MapReduce is a distributed programming framework originally proposed by Google for the ease of development of web search applications on a large number of CPUs. Compared with commodity CPUs, GPUs have an order of magnitude higher computation power and memory bandwidth, but are harder to program since their architectures are designed as a specialpurpose coprocessor and their programming interfaces are typically for graphics applications. As the first attempt to harness GPU's power for MapReduce, we developed Mars on an NVIDIA G80 GPU, which contains hundreds of processors, and evaluated it in comparison with Phoenix, the stateoftheart MapReduce framework on multicore processors. Mars hides the programming complexity of the GPU behind the simple and familiar MapReduce interface. It is up to 16 times faster than its CPUbased counterpart for six common web applications on a quadcore machine. Additionally, we integrated Mars with Phoenix to perform coprocessing between the GPU and the CPU for further performance improvement. 1.
Fast Computation of Database Operations using Graphics Processors
"... We present new algorithms on commodity graphics processors to perform fast computation of several common database operations. Specifically, we consider operations such as conjunctive selections, aggregations, and semilinear queries, which are essential computational components of typical database, ..."
Abstract

Cited by 112 (12 self)
 Add to MetaCart
(Show Context)
We present new algorithms on commodity graphics processors to perform fast computation of several common database operations. Specifically, we consider operations such as conjunctive selections, aggregations, and semilinear queries, which are essential computational components of typical database, data warehousing, and data mining applications. While graphics processing units (GPUs) have been designed for fast display of geometric primitives, we utilize the inherent pipelining and parallelism, single instruction and multiple data (SIMD) capabilities, and vector processing functionality of GPUs, for evaluating boolean predicate combinations and semilinear queries on attributes and executing database operations efficiently. Our algorithms take into account some of the limitations of the programming model of current GPUs and perform no data rearrangements. Our algorithms have been implemented on a programmable GPU (e.g. NVIDIA’s GeForce FX 5900) and applied to databases consisting of up to a million records. We have compared their performance with an optimized implementation of CPUbased algorithms. Our experiments indicate that the graphics processor available on commodity computer systems is an effective coprocessor for performing database operations.
Understanding the Efficiency of GPU Algorithms for MatrixMatrix Multiplication
, 2004
"... Utilizing graphics hardware for general purpose numerical computations has become a topic of considerable interest. The implementation of streaming algorithms, typified by highly parallel computations with little reuse of input data, has been widely explored on GPUs. We relax the streaming model&a ..."
Abstract

Cited by 93 (1 self)
 Add to MetaCart
Utilizing graphics hardware for general purpose numerical computations has become a topic of considerable interest. The implementation of streaming algorithms, typified by highly parallel computations with little reuse of input data, has been widely explored on GPUs. We relax the streaming model's constraint on input reuse and perform an indepth analysis of dense matrixmatrix multiplication, which reuses each element of input matrices O(n) times. Its regular data access pattern and highly parallel computational requirements suggest matrixmatrix multiplication as an obvious candidate for efficient evaluation on GPUs but, surprisingly we find even nearoptimal GPU implementations are pronouncedly less efficient than current cacheaware CPU approaches. We find the key cause of this inefficiency is that the GPU can fetch less data and yet execute more arithmetic operations per clock than the CPU when both are operating out of their closest caches. The lack of high bandwidth access to cached data will impair the performance of GPU implementations of any computation featuring significant input reuse.
RealTime ConsensusBased Scene Reconstruction using Commodity Graphics Hardware
, 2002
"... that effectively combines a planesweeping algorithm with view synthesis for realtime, online 3D scene acquisition and view synthesis. Using realtime imagery from a few calibrated cameras, our method can generate new images from nearby viewpoints, estimate a dense depth map from the current viewp ..."
Abstract

Cited by 81 (7 self)
 Add to MetaCart
that effectively combines a planesweeping algorithm with view synthesis for realtime, online 3D scene acquisition and view synthesis. Using realtime imagery from a few calibrated cameras, our method can generate new images from nearby viewpoints, estimate a dense depth map from the current viewpoint, or create a textured triangular mesh. We can do each of these without any prior geometric information or requiring any user interaction, in real time and on line. The heart of our method is to use programmable Pixel Shader technology to square intensity differences between reference image pixels, and then to choose final colors (or depths) that correspond to the minimum difference, i.e. the most consistent color.
Lugpu: Efficient algorithms for solving dense linear systems on graphics hardware
 in SC ’05: Proceedings of the 2005 ACM/IEEE Conference on Supercomputing
, 2005
"... We present a novel algorithm to solve dense linear systems using graphics processors (GPUs). We reduce matrix decomposition and row operations to a series of rasterization problems on the GPU. These include new techniques for streaming index pairs, swapping rows and columns and parallelizing the com ..."
Abstract

Cited by 77 (2 self)
 Add to MetaCart
We present a novel algorithm to solve dense linear systems using graphics processors (GPUs). We reduce matrix decomposition and row operations to a series of rasterization problems on the GPU. These include new techniques for streaming index pairs, swapping rows and columns and parallelizing the computation to utilize multiple vertex and fragment processors. We also use appropriate data representations to match the rasterization order and cache technology of graphics processors. We have implemented our algorithm on different GPUs and compared the performance with optimized CPU implementations. In particular, our implementation on a NVIDIA GeForce 7800 GPU outperforms a CPUbased ATLAS implementation. Moreover, our results show that our algorithm is cache and bandwidth efficient and scales well with the number of fragment processors within the GPU and the core GPU clock rate. We use our algorithm for fluid flow simulation and demonstrate that the commodity GPU is a useful coprocessor for many scientific applications. 1
Relational joins on graphics processors
, 2007
"... We present our novel design and implementation of relational join algorithms for newgeneration graphics processing units (GPUs). The new features of such GPUs include support for writes to random memory locations, efficient interprocessor communication through fast shared memory, and a programming ..."
Abstract

Cited by 69 (12 self)
 Add to MetaCart
(Show Context)
We present our novel design and implementation of relational join algorithms for newgeneration graphics processing units (GPUs). The new features of such GPUs include support for writes to random memory locations, efficient interprocessor communication through fast shared memory, and a programming model for generalpurpose computing. Taking advantage of these new features, we design a set of dataparallel primitives such as scan, scatter and split, and use these primitives to implement indexed or nonindexed nestedloop, sortmerge and hash joins. Our algorithms utilize the high parallelism as well as the high memory bandwidth of the GPU and use parallel computation to effectively hide the memory latency. We have implemented our algorithms on a PC with an NVIDIA G80 GPU and an Intel P4 dualcore CPU. Our GPUbased algorithms are able to achieve 220 times higher performance than their CPUbased counterparts.