CiteSeerX
Revisiting sorting for GPGPU stream architectures. (2010)

by D Merrill, A Grimshaw

Results 1 - 10 of 26

CudaDMA: Optimizing GPU Memory Bandwidth via Warp Specialization ∗

by Michael Bauer, Henry Cook, Brucek Khailany
Abstract - Cited by 25 (2 self)
As the computational power of GPUs continues to scale with Moore’s Law, an increasing number of applications are becoming limited by memory bandwidth. We propose an approach for programming GPUs with tightly-coupled specialized DMA warps for performing memory transfers between on-chip and off-chip memories. Separate DMA warps improve memory bandwidth utilization by better exploiting available memory-level parallelism and by leveraging efficient inter-warp producer-consumer synchronization mechanisms. DMA warps also improve programmer productivity by decoupling the need for thread array shapes to match data layout. To illustrate the benefits of this approach, we present an extensible API, CudaDMA, that encapsulates synchronization and common sequential and strided data transfer patterns. Using CudaDMA, we demonstrate speedup of up to 1.37x on representative synthetic microbenchmarks, and 1.15x-3.2x on several kernels from scientific applications written in CUDA running on NVIDIA Fermi GPUs.
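CudaDMA's real API is CUDA-specific, but the producer-consumer pattern it encapsulates can be illustrated with a hypothetical CPU-side simulation: dedicated "DMA" workers stage fixed-size tiles into a bounded buffer (standing in for double-buffered shared memory) while "compute" workers consume them, so transfer and compute overlap. All names and sizes below are illustrative, not the library's.

```python
import threading
import queue

def dma_worker(src, staged):
    # Stage fixed-size tiles (stand-in for shared-memory buffers).
    TILE = 4
    for i in range(0, len(src), TILE):
        staged.put(src[i:i + TILE])   # "transfer" a tile on-chip
    staged.put(None)                  # signal end of stream

def compute_worker(staged, out):
    while True:
        tile = staged.get()
        if tile is None:
            break
        out.append(sum(tile))         # consume the staged tile

data = list(range(16))
staged = queue.Queue(maxsize=2)       # bounded queue = double buffering
out = []
t1 = threading.Thread(target=dma_worker, args=(data, staged))
t2 = threading.Thread(target=compute_worker, args=(staged, out))
t1.start(); t2.start(); t1.join(); t2.join()
print(out)  # [6, 22, 38, 54] - one sum per staged tile
```

The bounded queue plays the role of the inter-warp synchronization mechanism: the producer blocks when both buffers are full, just as a DMA warp waits on a named barrier.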

Citation Context

...barrier functionality is coarse-grained, but does allow waiting on multiple named barriers. Warp specialization has previously been proposed for efficient implementations of sorting algorithms on GPUs [16]. CudaDMA encapsulates this technique in order to make it more generally available to a range of application workloads. Virtualized warps were proposed by [15] as a way to deal with different tasks at...

Exposing fine-grained parallelism in algebraic multigrid methods

by Nathan Bell, Steven Dalton, Luke N. Olson , 2012
Abstract - Cited by 17 (0 self)
Algebraic multigrid methods for large, sparse linear systems are a necessity in many computational simulations, yet parallel algorithms for such solvers are generally decomposed into coarse-grained tasks suitable for distributed computers with traditional processing cores. However, accelerating multigrid on massively parallel throughput-oriented processors, such as the GPU, demands algorithms with abundant fine-grained parallelism. In this paper, we develop a parallel algebraic multigrid method which exposes substantial fine-grained parallelism in both the construction of the multigrid hierarchy and the cycling or solve stage. Our algorithms are expressed in terms of scalable parallel primitives that are efficiently implemented on the GPU. The resulting solver achieves an average speedup of 1.8× in the setup phase and 5.7× in the cycling phase when compared to a representative CPU implementation.
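As a rough sketch (not the authors' code) of what "expressed in terms of scalable parallel primitives" means, a sparse matrix-vector product y = A·x over CSR storage can be phrased as a gather, an elementwise multiply, and a segmented reduction — each a flat data-parallel primitive:

```python
# Minimal CSR SpMV phrased as three data-parallel primitives:
# gather, elementwise multiply, segmented reduction.
def spmv_csr(row_ptr, col_idx, vals, x):
    # 1. gather: pull x values addressed by the column indices
    gathered = [x[j] for j in col_idx]
    # 2. elementwise multiply with the stored nonzeros
    products = [v * g for v, g in zip(vals, gathered)]
    # 3. segmented reduction: sum products within each row segment
    return [sum(products[row_ptr[r]:row_ptr[r + 1]])
            for r in range(len(row_ptr) - 1)]

# 2x2 example: A = [[1, 2], [0, 3]], x = [1, 1]
y = spmv_csr([0, 2, 3], [0, 1, 1], [1.0, 2.0, 3.0], [1.0, 1.0])
print(y)  # [3.0, 3.0]
```

On a GPU each step maps to a bulk-parallel kernel, which is why the efficiency of the underlying primitives dominates solver performance.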

Citation Context

... in linear algebra. Given the broad scope of their usage, special emphasis has been placed on the performance of primitives and very highly-optimized implementations are readily available for the GPU [33, 28, 29]. The efficiency of our solver, and hence the underlying parallel primitives, is demonstrated in Section 5. Our AMG solver is implemented almost exclusively with the parallel primitives provided by th...

Kernel Weaver: Automatically Fusing Database Primitives for Efficient GPU Computation

by Haicheng Wu, Gregory Diamos, Srihari Cadambi, Sudhakar Yalamanchili - MICRO , 2012
Abstract - Cited by 12 (2 self)
Data warehousing applications represent an emerging application arena that requires the processing of relational queries and computations over massive amounts of data. Modern general purpose GPUs are high bandwidth architectures that potentially offer substantial improvements in throughput for these applications. However, there are significant challenges that arise due to the overheads of data movement through the memory hierarchy and between the GPU and host CPU. This paper proposes data movement optimizations to address these challenges. Inspired in part by loop fusion optimizations in the scientific computing community, we propose kernel fusion as a basis for data movement optimizations. Kernel fusion fuses the code bodies of two GPU kernels to i) reduce data footprint to cut down data movement throughout GPU and CPU memory hierarchy, and ii) enlarge compiler optimization scope. We classify producer-consumer dependences between compute kernels into three types, i) fine-grained thread-to-thread dependences, ii) medium-grained thread block dependences, and iii) coarse-grained kernel dependences. Based on this classification, we propose a compiler framework, Kernel Weaver, that can automatically fuse relational algebra operators thereby eliminating redundant data movement. The experiments on NVIDIA Fermi platforms demonstrate that kernel fusion achieves 2.89x speedup in GPU computation and a 2.35x speedup in PCIe transfer time on average across the microbenchmarks tested. We present key insights, lessons learned, measurements from our compiler implementation, and opportunities for further improvements.
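The core idea of kernel fusion can be sketched independently of the compiler machinery (names here are hypothetical, not Kernel Weaver's API): two relational operators run back-to-back either as separate passes with a materialized intermediate, or as one fused pass with no intermediate buffer.

```python
# Unfused: two "kernels" with a materialized intermediate result.
def select_then_project(rows):
    selected = [r for r in rows if r[0] > 10]   # kernel 1: selection
    return [r[1] * 2 for r in selected]         # kernel 2: projection

# Fused: one pass, no intermediate buffer - smaller data footprint
# and a larger scope for downstream compiler optimization.
def fused(rows):
    return [r[1] * 2 for r in rows if r[0] > 10]

rows = [(5, 1), (20, 2), (15, 3)]
assert select_then_project(rows) == fused(rows)
print(fused(rows))  # [4, 6]
```

The memory saved by eliminating `selected` is the analogue of the reduced GPU/CPU data movement the paper measures.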

Citation Context

...log language can be executed on GPUs. 3.1. Kernel Representation Kernel fusion is based on the multi-stage formulation of algorithms for the RA operators. Multi-stage algorithms are common to sorting [31], pattern matching [39], algebraic multi-grid solvers [5], or compression [17]. This formulation is popular for GPU algorithms in particular since it enables one to separate the structured components ...

VoxelPipe: A Programmable Pipeline for 3D Voxelization

by Jacopo Pantaleoni - HPG , 2011
Abstract - Cited by 7 (0 self)
Figure 1: A rendering of the Stanford Dragon voxelized at a resolution of 512³, with a fragment shader encoding the surface normal packed in 16 bits. Max-blending has been used to deterministically select a single per-voxel normal, later used for lighting computations in the final rendering pass.

Abstract: We present a highly flexible and efficient software pipeline for programmable triangle voxelization. The pipeline, entirely written in CUDA, supports both fully conservative and thin voxelizations, multiple boolean, floating point, vector-typed render targets, user-defined vertex and fragment shaders, and a bucketing mode which can be used to generate 3D A-buffers containing the entire list of fragments belonging to each voxel. For maximum efficiency, voxelization is implemented as a sort-middle tile-based rasterizer, while the A-buffer mode, essentially performing 3D binning of triangles over uniform grids, uses a sort-last pipeline. Despite its major flexibility, the performance of our tile-based rasterizer is always competitive with and sometimes more than an order of magnitude superior to that of state-of-the-art binary voxelizers, whereas our bucketing system is up to 4 times faster than previous implementations. In both cases the results have been achieved through the use of careful load-balancing and high performance sorting primitives.

Citation Context

...terial properties. The design tradeoffs of such a system are very similar to those found in making a standard rasterizer with 2D output, and we tried to explore these extensively. Given current hardware capabilities, we found chunking (sort-middle) pipelines to be the most efficient solution for blending-based rasterization, whereas for A-buffer generation we found that the best approach is a feedforward (sort-last) pipeline in which the input triangles are first batched by size and orientation. To sort triangles into tiles and fragments into voxels, we relied on efficient sorting primitives [Merrill and Grimshaw 2010], avoiding all inter-thread communication that would have been necessary using queues. We have also introduced improved algorithms for triangle/voxel overlap testing, and new algorithms for careful load-balancing at all levels of the computing hierarchy. The resulting system, implemented in CUDA, is always competitive in performance and sometimes greatly superior (up to 28 times) to state-of-the-art binary voxelizers [Schwarz and Seidel 2010], despite its major flexibility, whereas our bucketing solution is up to 4 times faster than previous implementations of triangle binning algorithms used ...

Efficient Parallel Merge Sort for Fixed and Variable Length Keys

by Andrew Davidson, Michael Garland, David Tarjan, John D. Owens
Abstract - Cited by 5 (0 self)
We design a high-performance parallel merge sort for highly parallel systems. Our merge sort is designed to use more register communication (not shared memory), and does not suffer from the oversegmentation of previous comparison-based sorts. Using these techniques we are able to achieve a sorting rate of 250 MKeys/sec, which is about 2.5 times faster than Thrust merge sort performance, and 70% faster than non-stable state-of-the-art GPU merge sorts. Building on this sorting algorithm, we develop a scheme for sorting variable-length key/value pairs, with a special emphasis on string keys. Sorting non-uniform, unaligned data such as strings is a fundamental step in a variety of algorithms, yet it has received comparatively little attention. To our knowledge, our system is the first published description of an efficient string sort for GPUs. We are able to sort strings at a rate of 70 MStrings/s on one dataset and up to 1.25 GB/s on another dataset using a GTX 580.
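A key ingredient of parallel merge sorts like this one is splitting a merge into independent, equally sized pieces so workers need not communicate. A minimal sketch (not the paper's CUDA code) of that partitioning idea: each "thread" binary-searches for its split point in the two sorted inputs, then merges only its slice.

```python
# Find (i, j) with i + j == k such that merging a[:i] and b[:j]
# yields exactly the first k output elements (co-rank search).
def merge_partition(a, b, k):
    lo, hi = max(0, k - len(b)), min(k, len(a))
    while lo < hi:
        i = (lo + hi) // 2
        if a[i] <= b[k - i - 1]:
            lo = i + 1
        else:
            hi = i
    return lo, k - lo

def parallel_merge(a, b, workers=4):
    n = len(a) + len(b)
    cuts = [merge_partition(a, b, (n * w) // workers)
            for w in range(workers + 1)]
    out = []
    # Each slice is independent; "workers" could run concurrently.
    for (i0, j0), (i1, j1) in zip(cuts, cuts[1:]):
        out.extend(sorted(a[i0:i1] + b[j0:j1]))
    return out

print(parallel_merge([1, 3, 5, 7], [2, 4, 6, 8]))
# [1, 2, 3, 4, 5, 6, 7, 8]
```

Because the partition pins down exactly which output positions each slice owns, concatenating the independently merged slices yields the fully sorted output.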

Citation Context

... these architectures. For fixed key lengths where direct manipulation of keys is allowed, radix sort on the GPU has proven to be very efficient, with recent implementations achieving over 1 GKeys/sec [13]. However, for long or variable-length keys (such as strings), radix sort is not as appealing an approach: the cost of radix sort scales with key length. Rather, comparison-based sorts such as merge s...
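The key-length scaling noted above follows directly from how LSD radix sort works: one stable pass per digit. A plain illustrative version (not Merrill and Grimshaw's implementation) makes the pass count explicit:

```python
# LSD radix sort for 32-bit keys: one stable counting/bucket pass per
# digit, so cost grows linearly with key length - the reason the text
# gives for preferring comparison sorts on long or variable-length keys.
def radix_sort32(keys, bits_per_pass=8):
    mask = (1 << bits_per_pass) - 1
    for shift in range(0, 32, bits_per_pass):   # 4 passes for 32-bit keys
        buckets = [[] for _ in range(1 << bits_per_pass)]
        for k in keys:                          # stable scatter by digit
            buckets[(k >> shift) & mask].append(k)
        keys = [k for b in buckets for k in b]  # concatenate buckets
    return keys

print(radix_sort32([170, 45, 75, 90, 2, 802, 24, 66]))
# [2, 24, 45, 66, 75, 90, 170, 802]
```

Doubling the key width doubles the number of passes; a comparison sort's pass count instead depends only on the number of keys.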

Parallel lossless data compression on the gpu

by Ritesh A. Patel, Andrew Davidson, Yao Zhang, John D. Owens, Jason Mak - In Innovative Parallel Computing , 2012
Abstract - Cited by 4 (1 self)
We present parallel algorithms and implementations of a bzip2-like lossless data compression scheme for GPU architectures. Our approach parallelizes three main stages in the bzip2 compression pipeline: Burrows-Wheeler transform (BWT), move-to-front transform (MTF), and Huffman coding. In particular, we utilize a two-level hierarchical sort for BWT, design a novel scan-based parallel MTF algorithm, and implement a parallel reduction scheme to build the Huffman tree. For each algorithm, we perform detailed performance analysis, discuss its strengths and weaknesses, and suggest future directions for improvements. Overall, our GPU implementation is dominated by BWT performance and is 2.78× slower than bzip2, with BWT and MTF-Huffman respectively 2.89× and 1.34× slower on average.
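For reference, a minimal serial move-to-front transform shows what the MTF stage computes (the paper's contribution is a scan-based parallel formulation of this, which is not reproduced here):

```python
# Serial MTF: emit each symbol's rank in a table ordered by recency,
# then move that symbol to the front. Recently seen symbols get small
# codes, which is what makes the BWT output compress well afterwards.
def mtf_encode(data, alphabet):
    table = list(alphabet)
    out = []
    for c in data:
        i = table.index(c)             # rank of symbol in current table
        out.append(i)
        table.insert(0, table.pop(i))  # move symbol to front
    return out

print(mtf_encode("banana", "abn"))  # [1, 1, 2, 1, 1, 1]
```

The runs of small values in the output are what the subsequent Huffman stage exploits.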

Citation Context

...ers of any size. Bottlenecks. The string sort in our BWT stage is the major bottleneck of our compression pipeline. Sorting algorithms on the GPU have been a popular topic of research in recent years [8, 11, 15]. The fastest known GPU-based radix sort by Merrill and Grimshaw [11] sorts key-value pairs at a rate of 3.3 GB/s (GTX 480). String sorting, however, is to the best of our knowledge a new topic on the...

Parallel calculation of the median and order statistics . . .

by Gleb Beliakov , 2011
Abstract - Cited by 3 (0 self)
Abstract not found

Applicability of GPU Computing for Efficient Merge in In-Memory Databases

by Jens Krueger, Martin Grund, Ingo Jaeckel, Alexander Zeier, Hasso Plattner
Abstract - Cited by 1 (0 self)
Column oriented in-memory databases typically use dictionary compression to reduce the overall storage space and allow fast lookup and comparison. However, there is a high performance cost for updates since the dictionary, used for compression, has to be recreated each time records are created, updated or deleted. This has to be taken into account for TPC-C like workloads with around 45% of all queries being transactional modifications. A technique called differential updates can be used to allow faster modifications. In addition to the main storage, the database then maintains a delta storage to accommodate modifying queries. During the merge process, the modifications of the delta are merged with the main storage in parallel to the normal operation of the database. Current hardware and software trends suggest that this problem can be tackled by massively parallelizing the merge process. One approach to massive parallelism is GPUs, which offer orders of magnitude more cores than modern CPUs. Therefore, we analyze the feasibility of a parallel GPU merge implementation and its potential speedup. We found that the maximum potential merge speedup is limited since only two of its four stages are likely to benefit from parallelization. We present a parallel dictionary slice merge algorithm as well as an alternative parallel merge algorithm for GPUs that achieves up to 40% more throughput than its CPU implementation. In addition, we propose a parallel duplicate removal algorithm that achieves up to 27 times the throughput of the CPU implementation.
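Duplicate removal on a sorted dictionary run is a natural fit for the flag/scan/scatter style of data-parallel algorithm; the sketch below illustrates that general pattern (it is not the paper's algorithm): mark first occurrences, prefix-sum the flags to compute output positions, then scatter the survivors.

```python
# Scan-based dedup of a sorted run: every step is a bulk-parallel
# primitive (map, exclusive scan, scatter) on a GPU.
def dedup_sorted(vals):
    flags = [1 if i == 0 or vals[i] != vals[i - 1] else 0
             for i in range(len(vals))]    # 1 = first occurrence, keep
    pos, total = [], 0
    for f in flags:                        # exclusive prefix sum
        pos.append(total)
        total += f
    out = [None] * total
    for i, f in enumerate(flags):          # scatter survivors
        if f:
            out[pos[i]] = vals[i]
    return out

print(dedup_sorted([1, 1, 2, 4, 4, 4, 7]))  # [1, 2, 4, 7]
```

Because every element's flag and output position depend only on its immediate neighbor and the scan result, all three steps parallelize without inter-thread coordination.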

Citation Context

Publication | Author | Device | Rate
2009 [23]   | NVIDIA | GTX 280 | 200
2010 [25]   | Intel  | Knights Ferry vs. GTX 280 | 560 vs. 176
2010 [24]   | Intel  | Core i7 vs. GTX 280 | 250 vs. 200
2009 [13]   | University | Tesla C1060 | 300
2010 [14]   | University | GTX 285 | 550
2011 [15]   | University | GTX 480 | 1005

Table 2: Comparison of different reported sort implementations on CPUs and GPUs. The rates describe the number of random 32-bit keys sorted pe...

Raytracing dynamic scenes on the GPU using grids

by Sashidhar Guntury, P. J. Narayanan - IEEE Transactions on Visualization and Computer Graphics
Abstract - Cited by 1 (0 self)
Abstract: Raytracing dynamic scenes at interactive rates has received a lot of attention recently. We present a few strategies for high performance raytracing on a commodity GPU. The construction of grids needs sorting, which is fast on today’s GPUs. The grid is thus the acceleration structure of choice for dynamic scenes as per-frame rebuilding is required. We advocate the use of appropriate data structures for each stage of raytracing, resulting in multiple structure building per frame. A perspective grid built for the camera achieves perfect coherence for primary rays. A perspective grid built with respect to each light source provides the best performance for shadow rays. Spherical grids handle lights positioned inside the model space and handle spotlights. Uniform grids are best for reflection and refraction rays with little coherence. We propose an Enforced Coherence method to bring coherence to them by rearranging the ray to voxel mapping using sorting. This gives the best performance on GPUs with only user-managed caches. We also propose a simple, Independent Voxel Walk method, which performs best by taking advantage of the L1 and L2 caches on recent GPUs. We achieve over 10 fps of total rendering on the Conference model with one light source and one reflection bounce, while rebuilding the data structure for each stage. Ideas presented here are likely to give high performance on the future GPUs as well as other manycore architectures.
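The Enforced Coherence idea — rearranging the ray-to-voxel mapping by sorting — can be sketched in a few lines (data layout and names here are illustrative, not the paper's): sorting the mapping by voxel makes rays that traverse the same voxel contiguous, so they can be processed together with coherent memory access.

```python
# Sort the ray-to-voxel mapping so rays hitting the same voxel become
# contiguous, then group them - the sort replaces per-voxel queues.
ray_voxel = [(0, 5), (1, 2), (2, 5), (3, 2), (4, 9)]  # (ray_id, voxel_id)
by_voxel = sorted(ray_voxel, key=lambda rv: rv[1])

groups = {}
for ray, voxel in by_voxel:
    groups.setdefault(voxel, []).append(ray)

print(groups)  # {2: [1, 3], 5: [0, 2], 9: [4]}
```

On hardware with only user-managed caches, this regrouping is what recovers memory coherence for incoherent secondary rays.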

Citation Context

...other processes each ray independently using a thread, avoiding the sorting to enforce coherence. Enforcing coherence wins on older cache-less GPUs. The performance can improve as sorting gets faster [28]. The moderate amounts of L1 and L2 caches available on the latest GPUs tip the balance in favor of the second method. The performance of our grid-based approach deteriorates as the number of bounces ...

Sorting Large Multifield Records on a GPU

by Shibdas Bandyopadhyay, Sartaj Sahni
Abstract
We extend the fastest comparison-based (sample sort) and non-comparison-based (radix sort) number sorting algorithms on a GPU to sort large multifield records. Two extensions, direct (the entire record is moved whenever its key is to be moved) and indirect ((key, index) pairs are sorted using the direct extension and then records are ordered according to the obtained index permutation), are discussed. Our results show that for the ByField layout, the direct extension of the radix sort algorithm GRS [1] is the fastest for 32-bit keys when records have at least 12 fields; otherwise, the direct extension of the radix sort algorithm SRTS [13] is the fastest. For the Hybrid layout, the indirect extension of SRTS is the fastest.
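The two extensions above can be sketched in plain Python (this is the idea, not the GPU code): the direct variant moves whole records during the sort, while the indirect variant sorts lightweight (key, index) pairs and permutes the records once at the end.

```python
records = [(30, "r3"), (10, "r1"), (20, "r2")]  # (key, payload fields...)

# Direct extension: the sort moves entire records whenever keys move.
direct = sorted(records, key=lambda r: r[0])

# Indirect extension: sort small (key, index) pairs, then gather the
# records according to the resulting index permutation.
pairs = sorted((r[0], i) for i, r in enumerate(records))
indirect = [records[i] for _, i in pairs]

assert direct == indirect
print(indirect)  # [(10, 'r1'), (20, 'r2'), (30, 'r3')]
```

The tradeoff the paper measures follows from this structure: direct pays record-sized data movement on every swap, indirect pays a single extra gather pass at the end.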