Scan Primitives for GPU Computing
- GRAPHICS HARDWARE 2007
, 2007
Cited by 70 (4 self)
The scan primitives are powerful, general-purpose data-parallel primitives that are building blocks for a broad range of applications. We describe GPU implementations of these primitives, specifically an efficient formulation and implementation of segmented scan, on NVIDIA GPUs using the CUDA API. Using the scan primitives, we show novel GPU implementations of quicksort and sparse matrix-vector multiply, and analyze the performance of the scan primitives, several sort algorithms that use the scan primitives, and a graphical shallow-water fluid simulation using the scan framework for a tridiagonal matrix solver.
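As a rough illustration of what a segmented scan computes, the sequential Python sketch below restarts a running sum at every segment head. It is a reference for the operation only, not the paper's CUDA formulation; the function name and the head-flag representation (1 marks the first element of a segment) are assumptions made for this example.

```python
from typing import List


def segmented_inclusive_scan(values: List[int], head_flags: List[int]) -> List[int]:
    """Inclusive sum scan that restarts wherever head_flags[i] is 1.

    Sequential reference only; the flag encoding is an assumption of this sketch.
    """
    if len(values) != len(head_flags):
        raise ValueError("values and head_flags must have the same length")
    out = []
    running = 0
    for value, flag in zip(values, head_flags):
        # A set flag starts a new segment, so the running sum is reset.
        running = value if flag else running + value
        out.append(running)
    return out


# Three segments: [1, 2, 3], [4, 5], [6]
print(segmented_inclusive_scan([1, 2, 3, 4, 5, 6], [1, 0, 0, 1, 0, 1]))
```

The last element of each segment holds that segment's total, which is the property a scan-based sparse matrix-vector multiply can use to sum the products of each row.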
Real-Time View-Dependent Rendering of Parametric Surfaces
Cited by 12 (2 self)
Figure 1: We adaptively subdivide rational Bézier patches until a view-dependent error metric is satisfied. For a 1600x1200 image of the car model (right) we render 192k quads at 143 fps on an NVIDIA GTX 280, including CUDA transfer overheads, texturing, Phong shading, and 16x multisampling. We propose a view-dependent adaptive subdivision algorithm for rendering parametric surfaces on parallel hardware. Our framework allows us to bound the screen space error of a piecewise linear approximation. We naturally assign more primitives to curved areas while keeping quads large for flatter parts of the model, and we avoid cracks resulting from the polygonal approximation of nonuniform patch subdivision. The overall algorithm is simple, fits current GPUs extremely well, and is surprisingly fast while producing few or no artifacts.
A Fast Similarity Join Algorithm Using Graphics Processing Units
Cited by 11 (0 self)
A similarity join operation A ⋈_ε B takes two sets of points A, B and a value ε ∈ ℝ, and outputs pairs of points p ∈ A, q ∈ B, such that the distance D(p, q) ≤ ε. Similarity joins find use in a variety of fields, such as clustering, text mining, and multimedia databases. A novel similarity join algorithm called LSS is presented that executes on a Graphics Processing Unit (GPU), exploiting its parallelism and high data throughput. As GPUs only allow simple data operations such as the sorting and searching of arrays, LSS uses these two operations to cast a similarity join operation as a GPU sort-and-search problem. It first creates, on the fly, a set of space-filling curves on one of its input datasets, using a parallel GPU sort routine. Next, LSS processes each point p of the other dataset in parallel. For each p, it searches an interval of one of the space-filling curves guaranteed to contain all the pairs in which p participates. Using extensive theoretical and experimental analysis, LSS is shown to offer a good balance between time and work efficiency. Experimental results demonstrate that LSS is suitable for similarity joins in large high-dimensional datasets, and that it performs well when compared against two existing prominent similarity join methods.
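The sort-and-search idea can be sketched in one dimension, where a plain sorted array stands in for a space-filling curve. This is not LSS itself: the 1D restriction, the function name, and the use of absolute difference as D(p, q) are all assumptions chosen to keep the example short and exact.

```python
from bisect import bisect_left, bisect_right
from typing import List, Tuple


def similarity_join_1d(a: List[int], b: List[int], eps: int) -> List[Tuple[int, int]]:
    """Return all (p, q) with p in a, q in b and |p - q| <= eps.

    1D stand-in for a sort-and-search join; sorting replaces the
    space-filling curve used for higher-dimensional data.
    """
    b_sorted = sorted(b)  # the "sort" step, done once for one input set
    pairs = []
    for p in a:  # each p is independent, so this loop could run in parallel
        lo = bisect_left(b_sorted, p - eps)   # first q >= p - eps
        hi = bisect_right(b_sorted, p + eps)  # one past the last q <= p + eps
        pairs.extend((p, q) for q in b_sorted[lo:hi])
    return pairs


print(similarity_join_1d([10, 50], [5, 14, 40, 90], eps=5))
```

In one dimension the searched interval contains exactly the matching points; in higher dimensions an interval on a curve is only a candidate set, which is why the abstract speaks of an interval "guaranteed to contain" the pairs.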
GPU-assisted surface reconstruction on locally-uniform samples
- In: Meshing Roundtable
, 2008
Cited by 5 (0 self)
In point-based graphics, surfaces are represented by point clouds without explicit connectivity. If the distribution of the points can be carefully controlled, surface reconstruction becomes a much easier problem. We present a simple, completely local surface reconstruction algorithm for input point distributions that are locally uniform. The locality of the computation lets us handle large point sets using parallel and out-of-core methods. The algorithm can be implemented robustly with floating-point arithmetic. We demonstrate the simplicity, efficiency, and numerical stability of our algorithm with an out-of-core and parallel implementation using graphics hardware.
Exploiting inter-thread temporal locality for chip multithreading
- In IPDPS
, 2010
Cited by 3 (3 self)
Multi-core organizations increasingly support multiple threads per core. Threads on a core usually share a single first-level data cache, so thread schedulers must try to minimize cache contention among threads. While this has been studied for concurrent threads with disjoint working sets, the problem has not been addressed for multi-threaded data-parallel workloads in which threads can be scheduled or constructed to improve inter-thread cache sharing. This paper proposes the symbiotic affinity scheduling (SAS) algorithm in which work is first partitioned according to the number of cores (i.e., the number of caches), and these partitions are then subdivided and scheduled among each core's available thread contexts so that threads sharing a core operate on neighboring elements to maximize cache locality. We demonstrate this concept with a series of data-parallel benchmarks. Simulations on M5 achieve an average speedup of 1.69× and 36% energy savings over conventional scheduling techniques that are oblivious to whether threads share a cache. Even compared to an approach that extends oblivious scheduling to ensure that the sum of the threads' working sets fits in the cache, symbiotic affinity scheduling is able to exploit greater temporal locality and provide 30% performance gains on average. Symbiosis also outperforms adaptive contention reduction techniques by 17%. Keywords: chip multithreading; data locality; fine-grained parallelism; data parallelism; task scheduling
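The two-level partitioning described above might be sketched as follows. This is one possible reading, not the paper's scheduler: the function name is hypothetical, and interleaving elements round-robin among a core's threads is an assumption about what "neighboring elements" means in practice.

```python
from typing import Dict, List, Tuple


def affinity_assign(n_items: int, n_cores: int, threads_per_core: int) -> Dict[Tuple[int, int], List[int]]:
    """Map (core, thread) to the item indices that thread would process.

    Level 1: each core gets one contiguous block (one block per cache).
    Level 2: the block is interleaved among that core's threads, so
    threads sharing a cache touch adjacent items (an assumed policy).
    """
    assignment = {}
    base, extra = divmod(n_items, n_cores)
    start = 0
    for core in range(n_cores):
        size = base + (1 if core < extra else 0)  # spread any remainder
        block = range(start, start + size)
        for thread in range(threads_per_core):
            assignment[(core, thread)] = list(block[thread::threads_per_core])
        start += size
    return assignment


for key, items in affinity_assign(8, 2, 2).items():
    print(key, items)
```

A cache-oblivious scheduler would instead hand each thread its own distant contiguous chunk, so two threads on one core would compete for the shared first-level cache rather than reuse each other's lines.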
Predictive Simulation of HPC Applications
- THE IEEE 23RD INTERNATIONAL CONFERENCE ON ADVANCED INFORMATION NETWORKING AND APPLICATIONS (AINA-09)
, 2009
Cited by 2 (1 self)
The architectures which support modern supercomputing machinery are as diverse today as at any point during the last twenty years. The variety of processor core arrangements and threading strategies, and the arrival of heterogeneous computation nodes, are driving modern-day solutions to petaflop speeds. The increasing complexity of such systems, as well as of codes written to take advantage of the new computational abilities, poses significant frustrations for existing techniques which aim to model and analyse the performance of such hardware and software. In this paper we demonstrate the use of post-execution analysis on trace-based profiles to support the construction of simulation-based models. This involves combining the runtime capture of call-graph information with computational timings, which in turn allows representative models of code behaviour to be extracted. The main advantage of this technique is that it largely automates performance model development, a burden associated with existing techniques. We demonstrate the capabilities of our approach using both the NAS Parallel Benchmark suite and a real-world supercomputing benchmark developed by the United Kingdom Atomic Weapons Establishment. The resulting models, developed in less than two hours per code, have a good degree of predictive accuracy. We also show how one of these models can be used to explore the performance of the code on over 16,000 cores, demonstrating the scalability of our solution.
Query-Driven Visualization of Time-Varying Adaptive Mesh Refinement Data
Cited by 2 (1 self)
The visualization and analysis of AMR-based simulations are integral to the process of obtaining new insight in scientific research. We present a new method for performing query-driven visualization and analysis on AMR data, with specific emphasis on time-varying AMR data. Our work introduces a new method that directly addresses the dynamic spatial and temporal properties of AMR grids that challenge many existing visualization techniques. Further, we present the first implementation of query-driven visualization on the GPU that uses a GPU-based indexing structure to both answer queries and efficiently utilize GPU memory. We apply our method to two different science domains to demonstrate its broad applicability. Index Terms: AMR, Query-Driven Visualization, Multitemporal Visualization
High-dimensional Planning on the GPU
Cited by 2 (0 self)
Optimal heuristic searches such as A* search are commonly used for low-dimensional planning such as 2D path finding. These algorithms, however, typically do not scale well to high-dimensional planning problems such as motion planning for robotic arms, computing motion
Clinical Evaluation of GPU-Based Cone Beam Computed Tomography.
Cited by 1 (0 self)
The use of cone beam computed tomography (CBCT) is growing in the clinical arena due to its ability to provide 3-D information during interventions, its high diagnostic quality (sub-millimeter resolution), and its short scanning times (60 seconds). In many situations, the short scanning time of CBCT is followed by a time-consuming 3-D reconstruction. The standard reconstruction algorithm for CBCT data is the filtered backprojection, which for a volume of size 256³ takes up to 25 minutes on a standard system. Recent developments in the area of Graphics Processing Units (GPUs) make it possible to have access to high performance computing solutions at a low cost, allowing for use in applications to many scientific problems. We have implemented an algorithm for 3-D reconstruction of CBCT data using the Compute Unified Device Architecture (CUDA) provided by NVIDIA (NVIDIA Corp., Santa Clara, California), which was executed on an NVIDIA GeForce 8800GT. Our implementation results in improved reconstruction times from on the order of minutes, and perhaps hours, to a matter of seconds, while also giving the clinician the ability to view 3-D volumetric data at higher resolutions. We evaluated our implementation on ten clinical data sets and one phantom data set to observe differences that can occur between CPU and GPU based reconstructions. By using our approach, the computation time for 256³ is reduced from 25 minutes on the CPU to 4.8 seconds on the GPU. The GPU reconstruction time for 512³ is 11.3 seconds, and 1024³ is 61.4 seconds.
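The reported timings can be put on a common footing with a little arithmetic. The sketch below uses only the figures quoted in the abstract; the helper names are made up for this example, and treating "voxels reconstructed per second" as the comparison metric is an assumption, not something the abstract states.

```python
def speedup(cpu_seconds: float, gpu_seconds: float) -> float:
    """Ratio of CPU time to GPU time for the same volume."""
    return cpu_seconds / gpu_seconds


def voxels_per_second(side: int, seconds: float) -> float:
    """Throughput for a cubic volume with the given side length."""
    return side ** 3 / seconds


# Figures quoted in the abstract: 25 minutes on the CPU for 256^3,
# and GPU times for three volume sizes.
cpu_256_seconds = 25 * 60
gpu_seconds = {256: 4.8, 512: 11.3, 1024: 61.4}

print(f"256^3 speedup: {speedup(cpu_256_seconds, gpu_seconds[256]):.1f}x")
for side, seconds in gpu_seconds.items():
    rate = voxels_per_second(side, seconds)
    print(f"{side}^3: {rate / 1e6:.2f} million voxels/s")
```

For the 256³ case this works out to a speedup of roughly 312×. No CPU times are given for the two larger volumes, so no speedup can be derived for them from the abstract alone.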

