Efficient Stream Compaction on Wide SIMD Many-Core Architectures
Cited by 4 (0 self)
Stream compaction is a common parallel primitive used to remove unwanted elements in sparse data. This allows highly parallel algorithms to maintain performance over several processing steps and reduces overall memory usage. For wide SIMD many-core architectures, we present a novel stream compaction algorithm and explore several variations thereof. Our algorithm is designed to maximize concurrent execution, with minimal use of synchronization. Bandwidth and auxiliary storage requirements are reduced significantly, which allows for substantially better performance. We have tested our algorithms using CUDA on a PC with an NVIDIA GeForce GTX280 GPU. On this hardware, our reference implementation provides a 3× speedup over previously published algorithms.
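For readers unfamiliar with the primitive, the following is a minimal sequential sketch of the classic scan-then-scatter formulation of stream compaction. It is not the novel algorithm this abstract describes; the function names `exclusive_scan` and `compact` are illustrative assumptions, and on a GPU each loop iteration would correspond to one thread.

```python
from itertools import accumulate


def exclusive_scan(flags):
    """Exclusive prefix sum: result[i] is the sum of flags[0..i-1]."""
    if not flags:
        return []
    return [0] + list(accumulate(flags))[:-1]


def compact(items, keep):
    """Remove unwanted elements while preserving order (scan + scatter)."""
    # Step 1: mark each element with 1 (keep) or 0 (discard).
    flags = [1 if keep(x) else 0 for x in items]
    # Step 2: the scan gives every kept element its output position.
    offsets = exclusive_scan(flags)
    total = offsets[-1] + flags[-1] if flags else 0
    # Step 3: scatter. Iterations are independent, so this loop maps to
    # one thread per element in a data-parallel setting.
    out = [None] * total
    for i, x in enumerate(items):
        if flags[i]:
            out[offsets[i]] = x
    return out
```

For example, `compact([3, 0, 5, 0, 0, 7], lambda x: x != 0)` keeps the three non-zero entries in their original order.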
Parallel Generation of L-Systems
Cited by 1 (0 self)
This paper introduces a solution to compute L-systems on parallel architectures like GPUs and multi-core CPUs. Our solution can split the derivation of the L-system as well as the interpretation and geometry generation into thousands of threads running in parallel. We introduce a highly parallel algorithm for L-system evaluation that works on arbitrary L-systems, including parametric productions, context sensitive productions, stochastic production selection, and productions with side effects. Furthermore, we directly interpret the productions defined in plain-text, without requiring any compilation or transformation step (e.g., into shaders). Our algorithm is efficient in the sense that it requires no explicit inter-thread communication or atomic operations, and is thus completely lock free.
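As a rough illustration of how one derivation step can be split into independent per-symbol work, the sketch below handles only deterministic, context-free productions and uses a prefix sum to assign each symbol its write offset. This is an assumed, generic formulation rather than the paper's algorithm; `derive_step`, `derive`, and the algae rule set are hypothetical names and data chosen for the example.

```python
from itertools import accumulate


def derive_step(word, rules):
    """Apply one derivation step to every symbol of `word`."""
    # Each symbol looks up its successor independently (identity if no rule).
    successors = [rules.get(symbol, symbol) for symbol in word]
    lengths = [len(s) for s in successors]
    # An exclusive prefix sum over the lengths gives each symbol the position
    # where its successor starts, so writers never overlap.
    offsets = [0] + list(accumulate(lengths))[:-1] if lengths else []
    out = [""] * sum(lengths)
    for i, successor in enumerate(successors):  # one "thread" per symbol
        for j, ch in enumerate(successor):
            out[offsets[i] + j] = ch
    return "".join(out)


def derive(axiom, rules, steps):
    """Repeat derive_step `steps` times, starting from `axiom`."""
    word = axiom
    for _ in range(steps):
        word = derive_step(word, rules)
    return word


# Lindenmayer's algae system: A -> AB, B -> A.
ALGAE_RULES = {"A": "AB", "B": "A"}
```

Because every symbol writes to a disjoint output range, no atomic operations are needed in this formulation, which matches the lock-free property the abstract claims for its own method.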
Fast Ray Sorting and Breadth-First Packet Traversal for GPU Ray Tracing
We present a novel approach to ray tracing execution on commodity graphics hardware using CUDA. We decompose a standard ray tracing algorithm into several data-parallel stages that are mapped efficiently to the massively parallel architecture of modern GPUs. These stages include: ray sorting into coherent packets, creation of frustums for packets, breadth-first frustum traversal through a bounding volume hierarchy for the scene, and localized ray-primitive intersections. We utilize the well-known parallel primitives scan and segmented scan in order to process irregular data structures, to remove the need for a stack, and to minimize branch divergence in all stages. Our ray sorting stage is based on applying hash values to individual rays, ray stream compression, sorting and decompression. Our breadth-first BVH traversal is based on parallel frustum-bounding box intersection tests and a parallel scan at each BVH level. We demonstrate our algorithm with area light sources to get a soft shadow effect and show that our concept is reasonable for GPU implementation. For the same data sets and ray-primitive intersection routines our pipeline is ~3x faster than an optimized standard depth-first ray tracer implemented in one kernel. Categories and Subject Descriptors (according to ACM CCS): I.3.7 [Computer Graphics]: Three-Dimensional Graphics and Realism – Raytracing.
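The hash, compress, sort, decompress sequence mentioned for the ray sorting stage can be sketched sequentially as follows. The hash function here (direction-sign octant plus a quantized origin cell) and the names `ray_hash` and `sort_rays` are assumptions made for illustration; the abstract does not specify the actual hash or data layout.

```python
def ray_hash(origin, direction, cell=1.0):
    """Hypothetical coherence key: direction octant and quantized origin."""
    # One bit per axis whose direction component is negative.
    octant = sum(1 << axis for axis, d in enumerate(direction) if d < 0)
    cell_id = tuple(int(c // cell) for c in origin)
    return (octant, cell_id)


def sort_rays(rays, cell=1.0):
    """Return ray indices reordered so rays with equal keys are adjacent."""
    keys = [ray_hash(o, d, cell) for o, d in rays]
    # Compression: run-length encode consecutive rays that share a key,
    # storing [key, start index, run length].
    runs = []
    for i, key in enumerate(keys):
        if runs and runs[-1][0] == key:
            runs[-1][2] += 1
        else:
            runs.append([key, i, 1])
    # Sorting: order the (much shorter) list of runs by key. Python's sort
    # is stable, so runs with equal keys keep their original order.
    runs.sort(key=lambda run: run[0])
    # Decompression: expand each run back into individual ray indices.
    return [i for _, start, count in runs for i in range(start, start + count)]
```

Sorting the compressed runs instead of individual rays is the point of the compression step: coherent input produces few runs, so the sort touches far fewer elements.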
Fast indirect illumination using Layered Depth Images
, 2010
We present a novel hybrid rendering method for diffuse and glossy indirect illumination. A scene is rendered using standard rasterization on a GPU. In a shader, secondary ray queries are used to sample incident light and to compute indirect lighting. We observe that it is more important to cast many rays than to have precise results for each ray. Thus, we approximate secondary rays by intersecting them with precomputed layered depth images of the scene. We achieve interactive to real-time frame rates including indirect diffuse and glossy effects. Keywords: Real-time global illumination · Hybrid rendering
A Nearest Neighbor Data Structure for Graphics Hardware
Nearest neighbor search is a core computational task in database systems and throughout data analysis. It is also a major computational bottleneck, and hence an enormous body of research has been devoted to data structures and algorithms for accelerating the task. Recent advances in graphics hardware provide tantalizing speedups on a variety of tasks and suggest an alternate approach to the problem: simply run brute force search on a massively parallel system. In this paper we marry the approaches with a novel data structure that can effectively make use of parallel systems such as graphics cards. The architectural complexities of graphics hardware (the high degree of parallelism, the small amount of memory relative to instruction throughput, and the single instruction, multiple data design) present significant challenges for data structure design. Furthermore, the brute force approach applies perfectly to graphics hardware, leading one to question whether an intelligent algorithm or data structure can even hope to outperform this basic approach. Despite these challenges and misgivings, we demonstrate that our data structure, termed a Random Ball Cover, provides significant speedups over the GPU-based brute force approach.
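The brute force baseline that the abstract compares against can be sketched as a distance computation per database point followed by a minimum reduction. This shows only the baseline, not the Random Ball Cover itself; the name `brute_force_nn` and the use of squared Euclidean distance are assumptions for the example.

```python
def brute_force_nn(query, database):
    """Return (index, squared distance) of the point nearest to `query`."""

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Every distance is independent of the others, which is why this step
    # maps directly onto one thread per database point.
    dists = [sq_dist(query, point) for point in database]
    # Minimum reduction over all candidate distances.
    best = min(range(len(database)), key=dists.__getitem__)
    return best, dists[best]
```

The work is linear in the database size for every query, which is the cost a data structure has to beat to be worthwhile on this hardware.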
Work distribution methods on GPUs
Due to their high thread and data parallelism, commodity GPU architectures currently provide very high performance and general programmability. Many algorithms have been successfully ported to GPUs, but several limitations have prevented scalable implementations of many less easily parallelizable recursive and hierarchical algorithms. In this paper, we investigate general approaches for dynamic work distribution and balancing on GPUs to allow recursive algorithms such as hierarchy algorithms. We propose a new and simple method that instead employs only minimal synchronization between cores and explicit balancing, but is more suited to the properties of the architecture. We show an implementation of several applications on a current GPU and our results show that for applications with fine-grained parallelism it outperforms other currently used work distribution methods since it avoids limitations of GPU architectures and provides competitive performance on coarse-grained applications.
Parallel Algorithms for Real-time Motion Planning
, 2011
For decades, humans have dreamed of making cars that could drive themselves, so that travel would be less taxing, and the roads safer for everyone. Toward this goal, we have made strides in motion planning algorithms for autonomous cars, using a powerful new computing tool, the parallel graphics processing unit (GPU). We propose a novel five-dimensional search space formulation that includes both spatial and temporal dimensions, and respects the kinematic and dynamic constraints on a typical automobile. With this formulation, the search space grows linearly with the length of the path, compared to the exponential growth of other methods. We also propose a parallel search algorithm, using the GPU to tackle the curse of dimensionality directly and increase the number of plans that can be evaluated by an order of magnitude compared to a CPU implementation. With this larger capacity, we can evaluate a dense sampling of plans combining lateral swerves and accelerations that represent a range of effective responses to more on-road driving scenarios than have previously been addressed in the literature. We contribute a cost function that evaluates many aspects of each candidate
Real-time Rendering of Dynamic Scenes under All-frequency Lighting using Integral Spherical Gaussian
Figure 1: Real-time rendering of fully dynamic scenes with all-frequency lighting and highly specular BRDFs.
We propose an efficient rendering method for dynamic scenes under all-frequency environmental lighting. To render the surfaces of objects illuminated by distant environmental lighting, the triple product of the lighting, the visibility function and the BRDF is integrated at each shading point on the surfaces. Our method represents the environmental lighting and the BRDF with a linear combination of spherical Gaussians, replacing the integral of the triple product with the sum of the integrals of spherical Gaussians over the visible region of the hemisphere. We propose a new form of spherical Gaussian, the integral spherical Gaussian, that enables the fast and accurate integration of spherical Gaussians with varying sharpness over the visible region on the hemisphere. The integral spherical Gaussian simplifies the integration to a sum of four pre-integrated values, which are easily evaluated on-the-fly. With a combination of a set of spheres to approximate object geometries and the integral spherical Gaussian, our method can render object surfaces very efficiently. Our GPU implementation demonstrates real-time rendering of dynamic scenes with dynamic viewpoints, lighting, and BRDFs. Categories and Subject Descriptors (according to ACM CCS): I.3.7 [Computer Graphics]: Three-Dimensional Graphics and Realism – Color, shading, shadowing, and texture
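For context, a spherical Gaussian is commonly written as G(v) = μ · exp(λ(v·ξ − 1)) with lobe axis ξ, sharpness λ and amplitude μ. The sketch below evaluates that form and its closed-form integral over the entire sphere. It assumes this common parameterization applies and does not implement the integral spherical Gaussian over the visible region that the abstract proposes; the function names are illustrative.

```python
import math


def spherical_gaussian(v, axis, sharpness, amplitude=1.0):
    """Evaluate amplitude * exp(sharpness * (dot(v, axis) - 1)).

    `v` and `axis` are assumed to be unit-length 3-vectors.
    """
    cos_angle = sum(a * b for a, b in zip(v, axis))
    return amplitude * math.exp(sharpness * (cos_angle - 1.0))


def sg_integral_full_sphere(sharpness, amplitude=1.0):
    """Closed-form integral of the lobe over the whole unit sphere.

    Integrating over the azimuth gives 2*pi, and integrating
    exp(sharpness * (t - 1)) for t = cos(angle) in [-1, 1] gives
    (1 - exp(-2 * sharpness)) / sharpness.
    """
    return (2.0 * math.pi * amplitude / sharpness
            * (1.0 - math.exp(-2.0 * sharpness)))
```

Restricting the domain to a visible region removes this simple closed form, which is the problem the pre-integrated values described in the abstract are meant to address.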
Fast Sparse Level-Sets on Graphics Hardware
The level set method is one of the most popular techniques for capturing and tracking deformable interfaces. Although level sets have demonstrated great potential in visualization and computer graphics applications, such as surface editing and physically-based modeling, their use for interactive simulations has been limited due to the high computational demands involved. In this paper we address this computational challenge by leveraging the increased computing power of graphics processors, to achieve fast simulations based on level sets. Our efficient, sparse GPU level set method is substantially faster than other state-of-the-art, parallel approaches on both CPU and GPU hardware. We further investigate its performance through a method for surface reconstruction, based on GPU level sets. Our novel multi-resolution method for surface reconstruction from unorganized point clouds compares favorably with recent, existing techniques and other parallel implementations. Finally, we point out that both level set computations and rendering of level-set surfaces can be performed at interactive rates, even on large volumetric grids. Therefore, many applications based on level sets can benefit from our sparse level set method. Index Terms: Level set method, sparse representation, sorted tile list, surface reconstruction, octree.
Parallel Generation of Multiple L-Systems
This paper introduces a solution to compute L-systems on parallel architectures like GPUs and multi-core CPUs. Our solution can split the derivation of the L-system as well as the interpretation and geometry generation into thousands of threads running in parallel. We introduce a highly parallel algorithm for L-system evaluation that works on arbitrary L-systems, including parametric productions, context sensitive productions, stochastic production selection, and productions with side effects. This algorithm is further extended to allow evaluation of multiple independent L-systems in parallel. In contrast to previous work, we directly interpret the productions defined in plain-text, without requiring any compilation or transformation step (e.g., into shaders). Our algorithm is efficient in the sense that it requires no explicit inter-thread communication or atomic operations, and is thus completely lock free. Keywords: L-systems, graphics hardware, parallel processing, real-time rendering

