Fast bvh construction on gpus
- In Proc. Eurographics ’09, 2009
Cited by 14 (2 self)
We present two novel parallel algorithms for rapidly constructing bounding volume hierarchies on manycore GPUs. The first uses a linear ordering derived from spatial Morton codes to build hierarchies extremely quickly and with high parallel scalability. The second is a top-down approach that uses the surface area heuristic (SAH) to build hierarchies optimized for fast ray tracing. Both algorithms are combined into a hybrid algorithm that removes existing bottlenecks in GPU construction performance and scalability, leading to significantly decreased build time. The resulting hierarchies are close in quality to optimized SAH hierarchies, but the construction process is substantially faster, leading to a significant net benefit when both construction and traversal cost are accounted for. Our preliminary results show that current GPU architectures can compete with CPU implementations of hierarchy construction running on multicore systems. In practice, we can construct hierarchies of models with up to several million triangles and use them for fast ray tracing or other applications.
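As an illustration of the Morton-code ordering mentioned in the abstract above, here is a minimal sequential sketch in Python. It is not the paper's GPU implementation: the 10-bits-per-axis quantization, the assumption that centroids are already normalized to [0, 1], and the function names (`expand_bits`, `morton3d`, `order_by_morton`) are all choices made for this example.

```python
def expand_bits(v: int) -> int:
    """Spread the low 10 bits of v so that two zero bits follow each original bit."""
    v &= 0x3FF
    v = (v | (v << 16)) & 0x030000FF
    v = (v | (v << 8)) & 0x0300F00F
    v = (v | (v << 4)) & 0x030C30C3
    v = (v | (v << 2)) & 0x09249249
    return v


def morton3d(x: float, y: float, z: float) -> int:
    """30-bit Morton code of a point whose coordinates lie in [0, 1] (assumed)."""
    def quantize(c: float) -> int:
        # Map [0, 1] onto the integer grid 0..1023, clamping out-of-range input.
        return min(max(int(c * 1024.0), 0), 1023)

    # Interleave the bits as x9 y9 z9 x8 y8 z8 ... x0 y0 z0.
    return (
        (expand_bits(quantize(x)) << 2)
        | (expand_bits(quantize(y)) << 1)
        | expand_bits(quantize(z))
    )


def order_by_morton(centroids):
    """Return primitive indices sorted along the Morton (Z-order) curve."""
    return sorted(range(len(centroids)), key=lambda i: morton3d(*centroids[i]))
```

Sorting primitives by this key places spatially nearby centroids close together in the linear order, which is the property a Morton-based builder relies on; a GPU version would typically replace `sorted` with a parallel radix sort.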
The Scalable HeterOgeneous Computing (SHOC) benchmark suite
- In Proc. 3rd Workshop on General-Purpose Computation on Graphics Processing Units (GPGPU3), 2010
Cited by 9 (0 self)
Scalable heterogeneous computing systems, which are composed of a mix of compute devices, such as commodity multicore processors, graphics processors, reconfigurable processors, and others, are gaining attention as one approach to continuing performance improvement while managing the new challenge of energy efficiency. As these systems become more common, it is important to be able to compare and contrast architectural designs and programming systems in a fair and open forum. To this end, we have designed the Scalable HeterOgeneous Computing benchmark suite (SHOC). SHOC’s initial focus is on systems containing graphics processing units (GPUs) and multi-core processors, and on the new OpenCL programming standard. SHOC is a spectrum of programs that test the performance and stability of these scalable heterogeneous computing systems. At the lowest level, SHOC uses microbenchmarks to assess architectural features of the system. At higher levels, SHOC uses application kernels to determine system-wide performance including many system features such as intranode and internode communication among devices. SHOC includes benchmark implementations in both OpenCL and CUDA in order to provide a comparison of these programming models.
Efficient Implementation of Sorting on Multi-Core SIMD CPU Architecture
Cited by 8 (2 self)
Sorting a list of input numbers is one of the most fundamental problems in the field of computer science in general and high-throughput database applications in particular. Although literature abounds with various flavors of sorting algorithms, different architectures call for customized implementations to achieve faster sorting times. This paper presents an efficient implementation and detailed analysis of MergeSort on current CPU architectures. Our SIMD implementation with 128-bit SSE is 3.3X faster than the scalar version. In addition, our algorithm performs an efficient multiway merge, and is not constrained by the memory bandwidth. Our multi-threaded, SIMD implementation sorts 64 million floating point numbers in less than 0.5 seconds on a commodity 4-core Intel processor. This measured performance compares favorably with all previously published results. Additionally, the paper demonstrates performance scalability of the proposed sorting algorithm with respect to certain salient architectural features of modern chip multiprocessor (CMP) architectures, including SIMD width and core count. Based on our analytical models of various architectural configurations, we see excellent scalability of our implementation with SIMD width scaling up to 16X wider than the current SSE width of 128 bits, and CMP core count scaling well beyond 32 cores. Cycle-accurate simulation of Intel’s upcoming x86 many-core Larrabee architecture confirms scalability of our proposed algorithm.
Efficient Stream Compaction on Wide SIMD Many-Core Architectures
Cited by 4 (0 self)
Stream compaction is a common parallel primitive used to remove unwanted elements in sparse data. This allows highly parallel algorithms to maintain performance over several processing steps and reduces overall memory usage. For wide SIMD many-core architectures, we present a novel stream compaction algorithm and explore several variations thereof. Our algorithm is designed to maximize concurrent execution, with minimal use of synchronization. Bandwidth and auxiliary storage requirements are reduced significantly, which allows for substantially better performance. We have tested our algorithms using CUDA on a PC with an NVIDIA GeForce GTX280 GPU. On this hardware, our reference implementation provides a 3× speedup over previously published algorithms.
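For readers unfamiliar with the primitive, the following sketch shows the textbook scan-based formulation of stream compaction (flag, exclusive prefix sum, scatter). It is a sequential Python illustration of the baseline idea only, not the novel algorithm this abstract describes; the function name `compact` and the `keep` predicate are assumptions of the example.

```python
from itertools import accumulate


def compact(values, keep):
    """Remove unwanted elements while preserving the order of the survivors."""
    # Step 1: flag each element (1 = keep, 0 = discard).
    flags = [1 if keep(v) else 0 for v in values]

    # Step 2: exclusive prefix sum of the flags gives each survivor's output slot.
    inclusive = list(accumulate(flags))
    offsets = [0] + inclusive[:-1] if inclusive else []

    # Step 3: scatter. Every write targets a distinct slot, so on a GPU
    # this loop can run with one thread per element.
    out = [None] * (inclusive[-1] if inclusive else 0)
    for v, flag, slot in zip(values, flags, offsets):
        if flag:
            out[slot] = v
    return out
```

For example, `compact([3, 0, 0, 7, 0, 5], lambda v: v != 0)` keeps only the non-zero entries in their original order.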
Interactive Fluid-Particle Simulation using Translating Eulerian Grids
Cited by 3 (0 self)
We describe an interactive system featuring fluid-driven animation that responds to moving objects. Our system includes a GPU-accelerated Eulerian fluid solver that is suited for real-time use because it is unconditionally stable, takes constant calculation time per frame, and provides good visual fidelity. We dynamically translate the fluid simulation domain to track a user-controlled object. The fluid motion is visualized via its effects on particles which respond to the calculated fluid velocity field, but which are not constrained to stay within the bounds of the simulation domain. As particles leave the simulation domain, they seamlessly transition to purely particle-based motion, obscuring the point at which the fluid simulation ends. We additionally describe a hardware-accelerated volume rendering system that treats the particles as participating media and can render effects such as smoke, dust, or mist. Taken together, these components can be used to add fluid-driven effects to an interactive system without enforcing constraints on user motion, and without visual artifacts resulting from the finite extents of Eulerian fluid simulation methods.
High Performance Comparison-Based Sorting Algorithm on Many-Core GPUs
Cited by 2 (0 self)
Sorting is a kernel algorithm for a wide range of applications. In this paper, we present a new algorithm, GPU-Warpsort, to perform comparison-based parallel sort on Graphics Processing Units (GPUs). It mainly consists of a bitonic sort followed by a merge sort. Our algorithm achieves high performance by efficiently mapping the sorting tasks to GPU architectures. First, we take advantage of the synchronous execution of threads in a warp to eliminate the barriers in the bitonic sorting network. We also provide sufficient homogeneous parallel operations for all the threads within a warp to avoid branch divergence. Furthermore, we implement the merge sort efficiently by assigning each warp independent pairs of sequences to be merged and by exploiting fully coalesced global memory accesses to eliminate the bandwidth bottleneck. Our experimental results indicate that GPU-Warpsort works well on different kinds of input distributions, and that it achieves up to 30% higher performance than previous optimized comparison-based GPU sorting algorithms on input sequences with millions of elements.
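The bitonic sorting network that the abstract builds on can be sketched sequentially as follows. This is a generic textbook network written in Python, not GPU-Warpsort itself; the power-of-two length restriction and the name `bitonic_sort` are assumptions of the example. On a GPU, the innermost loop over `i` is the part that the threads of a warp would execute together.

```python
def bitonic_sort(data):
    """Sort a sequence whose length is a power of two with a bitonic network."""
    a = list(data)
    n = len(a)
    if n & (n - 1):
        raise ValueError("length must be a power of two")

    k = 2  # size of the bitonic sequences being merged in this stage
    while k <= n:
        j = k // 2  # distance between compared elements in this step
        while j >= 1:
            # All compare-exchange operations in one step are independent.
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    # Swap when the pair is out of order for its direction.
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a
```

Because the sequence of comparisons does not depend on the data, every thread performs the same operations in each step, which is why such networks avoid branch divergence.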
Fast Ray Sorting and Breadth-First Packet Traversal for GPU Ray Tracing
We present a novel approach to ray tracing execution on commodity graphics hardware using CUDA. We decompose a standard ray tracing algorithm into several data-parallel stages that are mapped efficiently to the massively parallel architecture of modern GPUs. These stages include: ray sorting into coherent packets, creation of frustums for packets, breadth-first frustum traversal through a bounding volume hierarchy for the scene, and localized ray-primitive intersections. We utilize the well-known parallel primitives scan and segmented scan in order to process irregular data structures, to remove the need for a stack, and to minimize branch divergence in all stages. Our ray sorting stage is based on applying hash values to individual rays, ray stream compression, sorting and decompression. Our breadth-first BVH traversal is based on parallel frustum-bounding box intersection tests and a parallel scan for each BVH level. We demonstrate our algorithm with area light sources to get a soft shadow effect and show that our concept is reasonable for GPU implementation. For the same data sets and ray-primitive intersection routines, our pipeline is ~3x faster than an optimized standard depth-first ray tracer implemented in one kernel. Categories and Subject Descriptors (according to ACM CCS): I.3.7 [Computer Graphics]: Three-Dimensional Graphics and Realism – Raytracing.
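The segmented scan primitive named in the abstract can be illustrated with a short sequential reference in Python. This only shows what the primitive computes; it says nothing about how the paper implements it on the GPU. The head-flag representation of segments and the name `segmented_inclusive_scan` are assumptions of this sketch.

```python
def segmented_inclusive_scan(values, head_flags):
    """Running sum that restarts wherever head_flags marks a segment start."""
    out = []
    running = 0
    for v, is_head in zip(values, head_flags):
        # A set head flag begins a new segment, so the sum restarts at v.
        running = v if is_head else running + v
        out.append(running)
    return out
```

With values `[1, 2, 3, 4, 5, 6]` and head flags `[1, 0, 0, 1, 0, 1]`, the three segments `[1, 2, 3]`, `[4, 5]` and `[6]` are scanned independently in a single pass, which is what makes the primitive useful for irregular structures such as variable-length per-packet lists.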
Gravity
- Clothing, meaty chunks, organic creatures
• Fluids (as well as simple particle systems)
  - Fluid emitting weapons, debris effects
  - Focus of this talk

Fluid feature requirements
• Simulate a fluid that…
  - Doesn’t pass through rigid objects
  - Conserves volume, so puddles and splashes form when objects are hit
  - Can move and be moved by rigid bodies: can push objects, and objects can float in bodies of water
  - Can flow anywhere in a large environment: not contained to a small box
• Also: multiple independent fluids per scene
  - Efficient parallel multi-fluid simulation

NVIDIA PhysX Fluid Demo
• Available for download on the web: http://www.nvidia.com/content/graphicsplus/us/download.asp

Particle-Based Fluids
• Particle systems are simple and fast
• Without particle-particle interactions
  - Can use for spray, splashing, leaves, debris, sparks, etc.
• With particle-particle interactions
  - Can use for volumetric fluid simulation

Simple Particle Systems
• Particles store mass, position, velocity, age, etc.
• Integrate: $\frac{d}{dt} x_i = v_i$, $\frac{d}{dt} v_i = f_i / m_i$
• Generated by emitters, deleted when age > lifetime

Particle-Particle Interaction
• Fluid simulation with particles requires inter-particle forces
• $O(n^2)$ potential computations for $n$ particles!
• Reduce to linear complexity $O(n)$ by defining interaction cutoff distance $h$

Spatial Hashing
• Fill particles into grid with spacing $h$
• Only search potential neighbors in adjacent cells
• Map cells $[i,j,k]$ into 1D array via hash function $h(i,j,k)$
  - [Teschner03]

Navier-Stokes Equations ...
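The integration and spatial hashing steps listed in these slides can be sketched as follows. This is a hedged illustration in Python, not code from the talk: the slides do not say which integration scheme is used, so the semi-implicit Euler step is an assumption, as are all function names and the `table_size` parameter. The three large primes are the constants commonly quoted for the [Teschner03] hash.

```python
import math
from collections import defaultdict


def euler_step(x, v, f, m, dt):
    """One semi-implicit Euler step of dx/dt = v, dv/dt = f/m (assumed scheme)."""
    v_new = tuple(vi + dt * fi / m for vi, fi in zip(v, f))
    x_new = tuple(xi + dt * vi for xi, vi in zip(x, v_new))
    return x_new, v_new


def cell_of(p, h):
    """Integer grid cell [i, j, k] containing point p for grid spacing h."""
    return tuple(math.floor(c / h) for c in p)


def hash_cell(i, j, k, table_size):
    """Map a cell to a slot of a 1D table (XOR of prime-scaled coordinates)."""
    return ((i * 73856093) ^ (j * 19349663) ^ (k * 83492791)) % table_size


def build_table(points, h, table_size):
    """Fill particle indices into hash buckets keyed by their grid cell."""
    table = defaultdict(list)
    for idx, p in enumerate(points):
        table[hash_cell(*cell_of(p, h), table_size)].append(idx)
    return table


def neighbors(points, table, q, h, table_size):
    """Indices of particles within the cutoff distance h of query point q."""
    ci, cj, ck = cell_of(q, h)
    # Only the 27 cells around q can hold particles closer than h.
    buckets = {
        hash_cell(ci + di, cj + dj, ck + dk, table_size)
        for di in (-1, 0, 1)
        for dj in (-1, 0, 1)
        for dk in (-1, 0, 1)
    }
    found = set()
    for b in buckets:
        for idx in table.get(b, ()):
            # Distinct cells may share a bucket, so the distance test is required.
            if math.dist(points[idx], q) <= h:
                found.add(idx)
    return sorted(found)
```

Each query inspects a fixed number of buckets regardless of the total particle count, which is how the cutoff distance reduces the all-pairs cost to roughly linear for evenly distributed particles.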

