Results 1 - 10 of 131
Scan Primitives for GPU Computing
- Graphics Hardware, 2007
"... The scan primitives are powerful, general-purpose data-parallel primitives that are building blocks for a broad range of applications. We describe GPU implementations of these primitives, specifically an efficient formulation and implementation of segmented scan, on NVIDIA GPUs using the CUDA API.Us ..."
Cited by 170 (9 self)
Abstract: The scan primitives are powerful, general-purpose data-parallel primitives that are building blocks for a broad range of applications. We describe GPU implementations of these primitives, specifically an efficient formulation and implementation of segmented scan, on NVIDIA GPUs using the CUDA API. Using the scan primitives, we show novel GPU implementations of quicksort and sparse matrix-vector multiply, and analyze the performance of the scan primitives, several sort algorithms that use the scan primitives, and a graphical shallow-water fluid simulation using the scan framework for a tridiagonal matrix solver.
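
As a hedged illustration of the scan primitive these results build on, the CUDA sketch below runs a single-block Hillis-Steele inclusive scan. Kernel and variable names are mine; the paper's actual implementation is multi-block, work-efficient, and supports segmented scans.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Single-block Hillis-Steele inclusive scan (illustrative only; the paper's
    // formulation is work-efficient, multi-block, and handles segment flags).
    __global__ void inclusive_scan_block(const int* in, int* out, int n) {
        extern __shared__ int tmp[];
        int tid = threadIdx.x;
        if (tid < n) tmp[tid] = in[tid];
        __syncthreads();
        for (int offset = 1; offset < n; offset <<= 1) {
            // Read the partner value before anyone overwrites it, then add.
            int val = (tid >= offset && tid < n) ? tmp[tid - offset] : 0;
            __syncthreads();
            if (tid < n) tmp[tid] += val;
            __syncthreads();
        }
        if (tid < n) out[tid] = tmp[tid];
    }

    int main() {
        const int n = 8;
        int h_in[n] = {3, 1, 7, 0, 4, 1, 6, 3}, h_out[n];
        int *d_in, *d_out;
        cudaMalloc(&d_in, n * sizeof(int));
        cudaMalloc(&d_out, n * sizeof(int));
        cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);
        inclusive_scan_block<<<1, n, n * sizeof(int)>>>(d_in, d_out, n);
        cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
        for (int i = 0; i < n; ++i) printf("%d ", h_out[i]);  // 3 4 11 11 15 16 22 25
        printf("\n");
        cudaFree(d_in); cudaFree(d_out);
        return 0;
    }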
CUDA cuts: Fast graph cuts on the GPU
- Computer Vision and Pattern Recognition Workshops, IEEE Computer Society
"... Graph Cuts has become a powerful and popular optimization tool for energies defined over an MRF and has found applications in image segmentation, stereo vision, image restoration etc. The maxflow/mincut algorithm to compute graph cuts is computationally expensive. The best reported implementation of ..."
Cited by 65 (8 self)
Abstract: Graph cuts have become a powerful and popular optimization tool for energies defined over an MRF and have found applications in image segmentation, stereo vision, image restoration, etc. The max-flow/min-cut algorithm used to compute graph cuts is computationally expensive. The best reported implementation of it takes over 140 milliseconds even on images of size 640×480 for two labels and cannot be used for real-time applications. The commodity Graphics Processor Unit (GPU) has recently emerged as an economical and fast parallel co-processor. In this paper, we present an implementation of the push-relabel algorithm for graph cuts on the GPU. We show our results on some benchmark datasets and some synthetic images. We can perform over 25 graph cuts per second on 640×480 benchmark images and over 35 graph cuts per second on 1K×1K synthetic images on an NVIDIA GTX 280. The time for each complete graph cut is a few milliseconds when only a few edge weights change from the previous graph, as on dynamic graphs. The CUDA code, with a well-defined interface, can be downloaded from
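
For intuition only, here is one push step of push-relabel on a grid graph in CUDA, restricted to the right neighbor. Every name is invented; the paper's kernels additionally handle the remaining neighbors, relabeling, and convergence detection.

    #include <cuda_runtime.h>

    // One push step toward the right neighbor on a W×H grid (sketch).
    // A node with positive excess pushes along an admissible edge:
    // residual capacity > 0 and height(u) == height(v) + 1.
    __global__ void push_right(int W, int H, float* excess,
                               const int* height, float* cap_right) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= W - 1 || y >= H) return;
        int u = y * W + x, v = u + 1;
        if (excess[u] > 0.f && cap_right[u] > 0.f && height[u] == height[v] + 1) {
            float d = fminf(excess[u], cap_right[u]);
            atomicAdd(&excess[u], -d);  // atomics: neighbors may push into u concurrently
            atomicAdd(&excess[v],  d);
            cap_right[u] -= d;          // only this thread writes this edge's capacity
        }
    }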
A Fast Similarity Join Algorithm Using Graphics Processing Units
"... Abstract — A similarity join operation A ⋊⋉ɛ B takes two sets of points A, B and a value ɛ ∈ R, and outputs pairs of points p ∈ A, q ∈ B, such that the distance D(p, q) ≤ ɛ. Similarity joins find use in a variety of fields, such as clustering, text mining, and multimedia databases. A novel similari ..."
Cited by 36 (1 self)
Abstract: A similarity join operation A ⋈ε B takes two sets of points A, B and a value ε ∈ ℝ, and outputs pairs of points p ∈ A, q ∈ B such that the distance D(p, q) ≤ ε. Similarity joins find use in a variety of fields, such as clustering, text mining, and multimedia databases. A novel similarity join algorithm called LSS is presented that executes on a Graphics Processing Unit (GPU), exploiting its parallelism and high data throughput. As GPUs only allow simple data operations such as the sorting and searching of arrays, LSS uses these two operations to cast a similarity join as a GPU sort-and-search problem. It first creates, on the fly, a set of space-filling curves on one of its input datasets, using a parallel GPU sort routine. Next, LSS processes each point p of the other dataset in parallel. For each p, it searches an interval of one of the space-filling curves guaranteed to contain all the pairs in which p participates. Using extensive theoretical and experimental analysis, LSS is shown to offer a good balance between time and work efficiency. Experimental results demonstrate that LSS is suitable for similarity joins in large high-dimensional datasets, and that it performs well when compared against two existing prominent similarity join methods.
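
A minimal sketch of the sort-and-search pattern described above, using Thrust (bundled with CUDA): set A is encoded with an illustrative Morton (Z-order) key and sorted on the GPU, and a query interval is located by binary search. The morton2 encoding and the interval bounds are mine; LSS's actual curve construction and its containment guarantee are more involved.

    #include <thrust/device_vector.h>
    #include <thrust/host_vector.h>
    #include <thrust/sort.h>
    #include <thrust/binary_search.h>
    #include <cstdio>

    // Interleave the low 16 bits of x and y into a Z-order (Morton) key.
    unsigned int morton2(unsigned int x, unsigned int y) {
        unsigned int key = 0;
        for (int i = 0; i < 16; ++i)
            key |= (((x >> i) & 1u) << (2 * i)) | (((y >> i) & 1u) << (2 * i + 1));
        return key;
    }

    int main() {
        // Set A: quantized 2-D points, keyed along a space-filling curve.
        thrust::host_vector<unsigned int> h_keys;
        for (unsigned int x = 0; x < 64; x += 4)
            for (unsigned int y = 0; y < 64; y += 4)
                h_keys.push_back(morton2(x, y));
        thrust::device_vector<unsigned int> keys = h_keys;
        thrust::sort(keys.begin(), keys.end());  // parallel GPU sort

        // For one query point q from set B, binary-search a key interval around q
        // (illustrative bounds; LSS derives intervals that provably contain
        // every join partner of q).
        unsigned int lo_key = morton2(30, 30), hi_key = morton2(34, 34);
        long lo = thrust::lower_bound(keys.begin(), keys.end(), lo_key) - keys.begin();
        long hi = thrust::upper_bound(keys.begin(), keys.end(), hi_key) - keys.begin();
        printf("candidate interval: [%ld, %ld) of %zu keys\n", lo, hi, (size_t)keys.size());
        return 0;
    }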
Efficient Computation of Sum-products on GPUs Through Software-Managed Cache
- 2008
"... We present a technique for designing memory-bound algorithms with high data reuse on Graphics Processing Units (GPUs) equipped with close-to-ALU software-managed memory. The approach is based on the efficient use of this memory through the implementation of a software-managed cache. We also present ..."
Cited by 34 (1 self)
Abstract: We present a technique for designing memory-bound algorithms with high data reuse on Graphics Processing Units (GPUs) equipped with close-to-ALU software-managed memory. The approach is based on the efficient use of this memory through the implementation of a software-managed cache. We also present an analytical model for performance analysis of such algorithms. We apply this technique to the implementation of a GPU-based solver for the sum-product, or marginalize-a-product-of-functions (MPF), problem, which arises in a wide variety of real-life applications in artificial intelligence, statistics, image processing, and digital communications. Our motivation to accelerate MPF originated in the context of the analysis of genetic diseases, which in some cases requires years to complete on modern CPUs. Computing MPF is similar to computing the chain matrix product of multi-dimensional matrices, but is more difficult due to a complex data-dependent access pattern, high data reuse, and a low compute-to-memory access ratio. Our GPU-based MPF solver achieves up to 2700-fold speedup on random data and 270-fold on real-life genetic analysis datasets on an NVIDIA GeForce 8800 GTX GPU over an optimized CPU version on an Intel 2.4 GHz Core 2 with a 4 MB L2 cache.
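
The abstract notes that MPF resembles a chain product of multi-dimensional matrices, so the classic shared-memory tiled matrix multiply below serves as a hedged stand-in for the software-managed-cache idea: tiles are staged in on-chip memory for data reuse. It assumes n is a multiple of TILE and is not the paper's MPF kernel.

    #include <cuda_runtime.h>

    #define TILE 16

    // C = A * B with TILE×TILE blocks of A and B staged in shared memory,
    // the "software-managed cache": each element is loaded from DRAM once
    // per tile instead of once per multiply-add. Assumes n % TILE == 0.
    __global__ void matmul_tiled(const float* A, const float* B, float* C, int n) {
        __shared__ float As[TILE][TILE], Bs[TILE][TILE];
        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float acc = 0.f;
        for (int t = 0; t < n / TILE; ++t) {
            As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
            Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
            __syncthreads();                    // tile fully cached
            for (int k = 0; k < TILE; ++k)
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();                    // done before the next tile overwrites
        }
        C[row * n + col] = acc;
    }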
Real-time Reyes-style adaptive surface subdivision
- ACM Transactions on Graphics
"... Figure 1: Flat-shaded OpenGL renderings of Reyes-subdivided surfaces, showing eye-space grids generated during the Bound & Split loop. We present a GPU based implementation of Reyes-style adaptive surface subdivision, known in Reyes terminology as the Bound/Split and Dice stages. The performance ..."
Cited by 29 (2 self)
[Figure 1: Flat-shaded OpenGL renderings of Reyes-subdivided surfaces, showing eye-space grids generated during the Bound & Split loop.]
Abstract: We present a GPU-based implementation of Reyes-style adaptive surface subdivision, known in Reyes terminology as the Bound/Split and Dice stages. The performance of this task is important for the Reyes pipeline to map efficiently to graphics hardware, but its recursive nature and irregular, unbounded memory requirements present a challenge to an efficient implementation. Our solution begins by characterizing Reyes subdivision as a work queue with irregular computation, targeted to a massively parallel GPU. We propose efficient solutions to these general problems by casting our solution in terms of the fundamental primitives of prefix-sum and reduction, often encountered in parallel and GPGPU environments. Our results indicate that real-time Reyes subdivision can indeed be obtained on today's GPUs. We are able to subdivide a complex model to subpixel accuracy within 15 ms. Our measured performance is several times better than that of Pixar's RenderMan. Our implementation scales well with the input size and depth of subdivision. We also address concerns of memory size and bandwidth, and analyze the feasibility of conventional ideas on screen-space buckets.
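
A hedged sketch of one Bound/Split iteration expressed with the prefix-sum primitive the abstract mentions: each patch reports how many patches it emits (2 for split, 1 for pass-through), an exclusive scan turns those counts into write offsets, and a scatter kernel builds the next work queue. The Patch struct and split helpers are invented for illustration.

    #include <thrust/device_vector.h>
    #include <thrust/scan.h>

    struct Patch { float u0, u1, v0, v1; };  // parametric bounds (illustrative)

    __device__ Patch split_left(Patch p)  { p.u1 = 0.5f * (p.u0 + p.u1); return p; }
    __device__ Patch split_right(Patch p) { p.u0 = 0.5f * (p.u0 + p.u1); return p; }

    // Scatter each patch to offsets[i] in the next queue; a split patch
    // writes two halves, an accepted patch passes through to the Dice stage.
    __global__ void scatter_patches(const Patch* in, int n, const int* offsets,
                                    const int* counts, Patch* out) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        int base = offsets[i];
        if (counts[i] == 2) {
            out[base]     = split_left(in[i]);
            out[base + 1] = split_right(in[i]);
        } else {
            out[base] = in[i];
        }
    }

    int main() {
        // Toy queue of 4 patches; counts: 2 = split, 1 = small enough to dice.
        thrust::device_vector<Patch> in(4, Patch{0.f, 1.f, 0.f, 1.f});
        thrust::device_vector<int> counts(4);
        counts[0] = 2; counts[1] = 1; counts[2] = 2; counts[3] = 1;
        thrust::device_vector<int> offsets(4);
        thrust::exclusive_scan(counts.begin(), counts.end(), offsets.begin());
        thrust::device_vector<Patch> out(offsets[3] + counts[3]);  // 6 slots
        scatter_patches<<<1, 4>>>(thrust::raw_pointer_cast(in.data()), 4,
                                  thrust::raw_pointer_cast(offsets.data()),
                                  thrust::raw_pointer_cast(counts.data()),
                                  thrust::raw_pointer_cast(out.data()));
        cudaDeviceSynchronize();
        return 0;
    }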
Real-Time View-Dependent Rendering of Parametric Surfaces
"... Figure 1: We adaptively subdivide rational Bézier patches until a view-dependent error metric is satisfied. For a 1600x1200 image of the car model (right) we render 192k quads at 143 fps on a NVIDIA GTX 280 – including CUDA transfer overheads, texturing, Phong shading, and 16x multisampling. We prop ..."
Cited by 23 (3 self)
[Figure 1: We adaptively subdivide rational Bézier patches until a view-dependent error metric is satisfied. For a 1600×1200 image of the car model (right) we render 192k quads at 143 fps on an NVIDIA GTX 280, including CUDA transfer overheads, texturing, Phong shading, and 16x multisampling.]
Abstract: We propose a view-dependent adaptive subdivision algorithm for rendering parametric surfaces on parallel hardware. Our framework allows us to bound the screen-space error of a piecewise linear approximation. We naturally assign more primitives to curved areas while keeping quads large for flatter parts of the model, and we avoid cracks resulting from the polygonal approximation of non-uniform patch subdivision. The overall algorithm is simple, fits current GPUs extremely well, and is surprisingly fast while producing little to no artifacts.
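
As a small, hedged illustration of a view-dependent error metric, the CUDA function below tests one projected cubic Bézier segment for screen-space flatness: the segment is split when its interior control points deviate from the chord by more than a pixel tolerance. The test and its names are mine; the paper's metric for rational Bézier patches is more elaborate.

    #include <cstdio>
    #include <cmath>
    #include <cuda_runtime.h>

    // Screen-space flatness test for a projected cubic Bézier segment:
    // split when p1 or p2 strays more than tolPx pixels from the chord p0-p3.
    __host__ __device__ bool needs_split(float2 p0, float2 p1,
                                         float2 p2, float2 p3, float tolPx) {
        float2 d = make_float2(p3.x - p0.x, p3.y - p0.y);
        float len = sqrtf(d.x * d.x + d.y * d.y) + 1e-6f;
        float d1 = fabsf((p1.x - p0.x) * d.y - (p1.y - p0.y) * d.x) / len;
        float d2 = fabsf((p2.x - p0.x) * d.y - (p2.y - p0.y) * d.x) / len;
        return fmaxf(d1, d2) > tolPx;  // max deviation in pixels vs. tolerance
    }

    int main() {
        float2 p0 = make_float2(0.f, 0.f),   p1 = make_float2(30.f, 12.f),
               p2 = make_float2(70.f, -9.f), p3 = make_float2(100.f, 0.f);
        printf("split? %d\n", needs_split(p0, p1, p2, p3, 0.5f));  // prints 1
        return 0;
    }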
Using graphics devices in reverse: GPU-based Image Processing and Computer Vision
- IEEE International Conference on Multimedia and Expo, 2008
"... Graphics and vision are approximate inverses of each other: ordinarily Graphics Processing Units (GPUs) are used to convert “numbers into pictures ” (i.e. computer graphics). In this paper, we discus the use of GPUs in approximately the reverse way: to assist in “converting pictures into numbers” (i ..."
Cited by 19 (0 self)
Abstract: Graphics and vision are approximate inverses of each other: ordinarily, Graphics Processing Units (GPUs) are used to convert "numbers into pictures" (i.e. computer graphics). In this paper, we discuss the use of GPUs in approximately the reverse way: to assist in "converting pictures into numbers" (i.e. computer vision). For graphical operations, GPUs currently provide many hundreds of gigaflops of processing power. This paper discusses how this processing power is being harnessed for image processing and computer vision, thereby providing dramatic speedups on commodity, readily available graphics hardware. A brief review of algorithms mapped to the GPU by using the graphics API for vision is presented. The recent NVIDIA CUDA programming model is then introduced as a way of expressing program parallelism without the need for graphics expertise.
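
As a minimal example of "converting pictures into numbers" in CUDA (mine, not from the paper): a horizontal Sobel kernel that turns each pixel neighborhood into a gradient value, the kind of data-parallel vision operation the paper surveys.

    #include <cuda_runtime.h>

    // Horizontal Sobel gradient: one output number per 3×3 pixel neighborhood.
    __global__ void sobel_x(const unsigned char* in, float* out, int w, int h) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < 1 || y < 1 || x >= w - 1 || y >= h - 1) return;
        float g = -1.f * in[(y - 1) * w + x - 1] + 1.f * in[(y - 1) * w + x + 1]
                  -2.f * in[ y      * w + x - 1] + 2.f * in[ y      * w + x + 1]
                  -1.f * in[(y + 1) * w + x - 1] + 1.f * in[(y + 1) * w + x + 1];
        out[y * w + x] = g;
    }

    int main() {
        const int w = 64, h = 64;
        unsigned char* d_in; float* d_out;
        cudaMalloc(&d_in, w * h);
        cudaMalloc(&d_out, w * h * sizeof(float));
        cudaMemset(d_in, 0, w * h);  // a real program would upload an image here
        dim3 block(16, 16), grid((w + 15) / 16, (h + 15) / 16);
        sobel_x<<<grid, block>>>(d_in, d_out, w, h);
        cudaDeviceSynchronize();
        cudaFree(d_in); cudaFree(d_out);
        return 0;
    }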
Evaluating performance and portability of OpenCL programs
- Proc. Automatic Performance Tuning, 2010
"... Abstract. Recently, OpenCL, a new open programming standard for GPGPU programming, has become available in addition to CUDA. OpenCL can support various compute devices due to its higher abstraction pro-gramming framework. Since there is a semantic gap between OpenCL and compute devices, the OpenCL C ..."
Cited by 18 (0 self)
Abstract: Recently, OpenCL, a new open programming standard for GPGPU programming, has become available in addition to CUDA. OpenCL can support various compute devices thanks to its higher-abstraction programming framework. Since there is a semantic gap between OpenCL and compute devices, the OpenCL C compiler plays an important role in exploiting the potential of compute devices, and its capability should therefore be clarified. In this paper, the performance of CUDA and OpenCL programs is quantitatively evaluated. First, several CUDA and OpenCL programs performing almost the same computations are developed, and their performances are compared. Then, the main factors causing their performance differences are investigated. The evaluation results suggest that the performance of OpenCL programs is comparable with that of CUDA programs if the kernel codes are appropriately optimized, by hand or by compiler optimizations. This paper also discusses the differences between the NVIDIA and AMD OpenCL implementations by comparing the performance of their GPUs on the same programs. The comparison shows that the compiler options of the OpenCL C compiler and the execution configuration parameters have to be optimized for each GPU to obtain its best performance. Therefore, automatic parameter tuning is essential to enable a single OpenCL code to run efficiently on various GPUs.
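
The closing point about execution-configuration tuning can be made concrete with a small sketch (in CUDA rather than OpenCL, for consistency with the other examples here): the same SAXPY kernel is timed across block sizes with CUDA events, since the fastest value differs per GPU. All names are illustrative.

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void saxpy(int n, float a, const float* x, float* y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    int main() {
        const int n = 1 << 22;
        float *x, *y;
        cudaMalloc(&x, n * sizeof(float));
        cudaMalloc(&y, n * sizeof(float));
        cudaMemset(x, 0, n * sizeof(float));
        cudaMemset(y, 0, n * sizeof(float));
        cudaEvent_t t0, t1;
        cudaEventCreate(&t0); cudaEventCreate(&t1);
        // Sweep the block size: the fastest configuration is device-specific,
        // which is why per-GPU automatic tuning pays off.
        for (int block = 64; block <= 1024; block *= 2) {
            int grid = (n + block - 1) / block;
            cudaEventRecord(t0);
            saxpy<<<grid, block>>>(n, 2.0f, x, y);
            cudaEventRecord(t1);
            cudaEventSynchronize(t1);
            float ms; cudaEventElapsedTime(&ms, t0, t1);
            printf("block %4d: %.3f ms\n", block, ms);
        }
        cudaEventDestroy(t0); cudaEventDestroy(t1);
        cudaFree(x); cudaFree(y);
        return 0;
    }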
Hardware-efficient belief propagation
- Proc. CVPR, 2009
"... Abstract—Loopy belief propagation (BP) is an effective solution for assigning labels to the nodes of a graphical model such as the Markov random field (MRF), but it requires high memory, bandwidth, and computational costs. Furthermore, the iterative, pixel-wise, and sequential operations of BP make ..."
Cited by 14 (1 self)
Abstract: Loopy belief propagation (BP) is an effective solution for assigning labels to the nodes of a graphical model such as the Markov random field (MRF), but it requires high memory, bandwidth, and computational costs. Furthermore, the iterative, pixel-wise, and sequential operations of BP make it difficult to parallelize the computation. In this paper, we propose two techniques to address these issues. The first technique is a new message passing scheme, named tile-based belief propagation, that reduces the memory and bandwidth to a fraction of that of ordinary BP algorithms, without performance degradation, by splitting the MRF into many tiles and only storing the messages across neighboring tiles. The tile-wise processing also enables data reuse and pipelining, resulting in efficient hardware implementation. The second technique is an O(L) parallel message construction algorithm that exploits the properties of robust functions for parallelization. We apply these two techniques to a VLSI circuit for stereo matching that generates high-resolution disparity maps in near real time. We also implement the proposed schemes on a GPU, where they run four times faster than standard BP on the GPU.
Index Terms: Belief propagation, Markov random field, energy minimization, embedded systems, VLSI circuit design, general-purpose computation on GPU (GPGPU).
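
For the O(L) message construction, the sketch below shows the standard two-pass distance-transform trick for a truncated-linear smoothness cost (after Felzenszwalb and Huttenlocher), computing m[l] = min_k(h[k] + min(s*|l-k|, d)) in O(L) instead of O(L²). It is a plausible single-message illustration, not the paper's parallel construction.

    #include <cstdio>
    #include <cmath>

    // Two-pass O(L) min-sum message for a truncated-linear cost:
    //   m[l] = min_k ( h[k] + min(s*|l-k|, d) )
    __host__ __device__ void linear_message(const float* h, float* m,
                                            int L, float s, float d) {
        m[0] = h[0];
        for (int l = 1; l < L; ++l)                    // forward pass
            m[l] = fminf(h[l], m[l - 1] + s);
        for (int l = L - 2; l >= 0; --l)               // backward pass
            m[l] = fminf(m[l], m[l + 1] + s);
        float mn = m[0];                               // apply truncation d
        for (int l = 1; l < L; ++l) mn = fminf(mn, m[l]);
        for (int l = 0; l < L; ++l) m[l] = fminf(m[l], mn + d);
    }

    int main() {
        const int L = 8;
        float h[L] = {5, 3, 0, 2, 6, 7, 1, 4}, m[L];
        linear_message(h, m, L, 1.0f, 3.0f);
        for (int l = 0; l < L; ++l) printf("%.0f ", m[l]);  // 2 1 0 1 2 2 1 2
        printf("\n");
        return 0;
    }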