Results 11 - 20 of 130
A Nearest Neighbor Data Structure for Graphics Hardware
"... Nearest neighbor search is a core computational task in database systems and throughout data analysis. It is also a major computational bottleneck, and hence an enormous body of research has been devoted to data structures and algorithms for accelerating the task. Recent advances in graphics hardwar ..."
Abstract
-
Cited by 11 (1 self)
- Add to MetaCart
(Show Context)
Nearest neighbor search is a core computational task in database systems and throughout data analysis. It is also a major computational bottleneck, and hence an enormous body of research has been devoted to data structures and algorithms for accelerating the task. Recent advances in graphics hardware provide tantalizing speedups on a variety of tasks and suggest an alternate approach to the problem: simply run brute force search on a massively parallel system. In this paper we marry the approaches with a novel data structure that can effectively make use of parallel systems such as graphics cards. The architectural complexities of graphics hardware (the high degree of parallelism, the small amount of memory relative to instruction throughput, and the single instruction, multiple data design) present significant challenges for data structure design. Furthermore, the brute force approach suits graphics hardware so well that it is natural to question whether an intelligent algorithm or data structure can even hope to outperform this basic approach. Despite these challenges and misgivings, we demonstrate that our data structure, termed a Random Ball Cover, provides significant speedups over the GPU-based brute force approach.
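For reference, the GPU brute-force baseline mentioned in this abstract can be sketched as a single CUDA kernel in which one thread scans the entire database for one query. The kernel and parameter names below are illustrative, not taken from the paper.

```
#include <cfloat>

// Brute-force 1-NN baseline: one thread per query scans the whole database.
// db: n points of dimension dim, queries: m points, nnIndex: output index per query.
__global__ void bruteForceNN(const float* db, int n,
                             const float* queries, int m,
                             int dim, int* nnIndex)
{
    int q = blockIdx.x * blockDim.x + threadIdx.x;
    if (q >= m) return;

    float bestDist = FLT_MAX;
    int bestIdx = -1;
    for (int i = 0; i < n; ++i) {
        float d = 0.0f;
        for (int k = 0; k < dim; ++k) {
            float diff = db[i * dim + k] - queries[q * dim + k];
            d += diff * diff;                     // squared Euclidean distance
        }
        if (d < bestDist) { bestDist = d; bestIdx = i; }
    }
    nnIndex[q] = bestIdx;
}
```

Roughly speaking, the Random Ball Cover restricts each query to a small set of random representatives and the points assigned to them, so only a fraction of the database is scanned while the inner loop keeps the same dense, brute-force-friendly structure.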
A Memory Access Model for Highly-threaded Many-core Architectures
- ICPADS 2012
, 2012
"... Many-core architectures are excellent in hiding memory-access latency by low-overhead context switching among a large number of threads. The speedup of algorithms carried out on these machines depends on how well the latency is hidden. If the number of threads were infinite, then theoretically thes ..."
Abstract
-
Cited by 10 (1 self)
- Add to MetaCart
(Show Context)
Many-core architectures are excellent at hiding memory-access latency through low-overhead context switching among a large number of threads. The speedup of algorithms run on these machines depends on how well the latency is hidden. If the number of threads were infinite, these machines should in theory deliver the performance predicted by a PRAM analysis of the programs. However, the number of allowable threads per processor is not infinite. In this paper, we introduce the Threaded Many-core Memory (TMM) model, which is meant to capture the important characteristics of these highly-threaded, many-core machines. Since we model several important machine parameters, we expect analysis under this model to give more fine-grained performance predictions than PRAM analysis. We analyze four algorithms for the classic all-pairs shortest paths problem under this model. We find that even when two algorithms have the same PRAM performance, our model predicts different performance for some settings of the machine parameters. For example, for dense graphs, the Floyd-Warshall algorithm and Johnson's algorithm have the same performance in the PRAM model. However, our model predicts different performance for large enough memory-access latency, and validates the intuition that the Floyd-Warshall algorithm performs better on these machines.
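For context, Floyd-Warshall maps naturally onto a GPU as a sequence of n dependent relaxation steps, each fully data-parallel over the n x n distance matrix. The sketch below is a generic CUDA formulation of that structure under those assumptions; it is not code from the paper, and the names are ours.

```
// One Floyd-Warshall relaxation step for a fixed intermediate vertex k.
// dist is an n x n distance matrix stored in row-major order.
__global__ void fwRelax(float* dist, int n, int k)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // column index
    int i = blockIdx.y * blockDim.y + threadIdx.y;  // row index
    if (i >= n || j >= n) return;

    float viaK = dist[i * n + k] + dist[k * n + j];
    if (viaK < dist[i * n + j])
        dist[i * n + j] = viaK;
}

// Host side: the k-loop is inherently sequential, each step is one kernel launch.
// for (int k = 0; k < n; ++k)
//     fwRelax<<<grid, block>>>(d_dist, n, k);
```

The n sequential kernel launches expose exactly the kind of memory-latency-bound, massively threaded workload that a model like TMM is designed to analyze more finely than PRAM.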
Simulation of tracked vehicles on granular terrain leveraging
- University of Wisconsin-Madison: Madison
"... i ..."
(Show Context)
High Performance Comparison-Based Sorting Algorithm on Many-Core GPUs
"... Abstract- Sorting is a kernel algorithm for a wide range of applications. In this paper, we present a new algorithm, GPU-Warpsort, to perform comparison-based parallel sort on Graphics Processing Units (GPUs). It mainly consists of a bitonic sort followed by a merge sort. Our algorithm achieves high ..."
Abstract
-
Cited by 8 (0 self)
- Add to MetaCart
(Show Context)
Sorting is a kernel algorithm for a wide range of applications. In this paper, we present a new algorithm, GPU-Warpsort, to perform comparison-based parallel sorting on Graphics Processing Units (GPUs). It consists mainly of a bitonic sort followed by a merge sort. Our algorithm achieves high performance by efficiently mapping the sorting tasks to GPU architectures. First, we take advantage of the synchronous execution of threads in a warp to eliminate the barriers in the bitonic sorting network. We also provide sufficient homogeneous parallel operations for all threads within a warp to avoid branch divergence. Furthermore, we implement the merge sort efficiently by assigning each warp independent pairs of sequences to merge and by exploiting fully coalesced global memory accesses to eliminate the bandwidth bottleneck. Our experimental results indicate that GPU-Warpsort works well on different kinds of input distributions and achieves up to 30% higher performance than previously optimized comparison-based GPU sorting algorithms on input sequences with millions of elements.
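The warp-level idea can be illustrated with a small sketch: because the 32 threads of a warp execute in lockstep, a 32-element bitonic network can run with no __syncthreads() at all. This illustration keeps one key per lane and uses CUDA's later __shfl_xor_sync intrinsic for the exchanges; it is a generic sketch of barrier-free warp-level bitonic sorting, not the paper's implementation, which operates on larger per-warp tiles.

```
// Sort the 32 keys held one per lane in a warp, ascending by lane id.
// Warp-synchronous: no barriers, only register exchanges via shuffles.
__device__ unsigned warpBitonicSort(unsigned key)
{
    const unsigned lane = threadIdx.x & 31;
    for (int k = 2; k <= 32; k <<= 1) {            // bitonic block size
        for (int j = k >> 1; j > 0; j >>= 1) {     // compare-exchange distance
            unsigned other = __shfl_xor_sync(0xffffffffu, key, j);
            bool up      = (lane & k) == 0;        // sort direction of this block
            bool lower   = (lane & j) == 0;        // this lane is the lower element of the pair
            bool takeMin = (lower == up);          // lower+ascending or upper+descending keeps the min
            key = takeMin ? min(key, other) : max(key, other);
        }
    }
    return key;                                    // lane i now holds the i-th smallest key
}
```

Every lane follows the same control flow, which is exactly the homogeneous, divergence-free pattern the abstract describes.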
GPU-Based Simulation of Spiking Neural Networks with Real-Time Performance & High Accuracy
- IJCNN
"... Abstract — A novel GPU-based simulation of spiking neural networks is implemented as a hybrid system using Parker-Sochacki numerical integration method with adaptive order. Full single-precision floating-point accuracy for all model variables is achieved. The implementation is validated with exact m ..."
Abstract
-
Cited by 8 (0 self)
- Add to MetaCart
(Show Context)
A novel GPU-based simulation of spiking neural networks is implemented as a hybrid system using the Parker-Sochacki numerical integration method with adaptive order. Full single-precision floating-point accuracy is achieved for all model variables. The implementation is validated by exact matching of all neuron potential traces from the GPU-based simulation against those of a reference CPU-based simulation. A network of 4096 Izhikevich neurons simulated on an NVIDIA GTX260 device achieves real-time performance, a speedup of 9x over the same simulation executed on a 2.6-GHz Opteron 285.
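For orientation, the Izhikevich neuron model used in the network above consists of two coupled ODEs plus a spike-reset rule. The sketch below advances it with a plain forward-Euler step, one thread per neuron; the paper itself uses the higher-accuracy Parker-Sochacki method with adaptive order, which this illustration does not attempt to reproduce, and all names here are ours.

```
// Forward-Euler update of Izhikevich neurons, one thread per neuron.
// v: membrane potential (mV), u: recovery variable, I: input current,
// a, b, c, d: per-neuron model parameters, dt: time step in ms.
__global__ void izhikevichStep(float* v, float* u, const float* I,
                               const float* a, const float* b,
                               const float* c, const float* d,
                               int numNeurons, float dt)
{
    int n = blockIdx.x * blockDim.x + threadIdx.x;
    if (n >= numNeurons) return;

    float vn = v[n], un = u[n];
    float dv = 0.04f * vn * vn + 5.0f * vn + 140.0f - un + I[n];
    float du = a[n] * (b[n] * vn - un);
    vn += dt * dv;
    un += dt * du;

    if (vn >= 30.0f) {        // spike: reset membrane potential, bump recovery variable
        vn = c[n];
        un += d[n];
    }
    v[n] = vn;
    u[n] = un;
}
```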
Parallel Quadtree Coding of Large-Scale Raster Geospatial Data on Multicore
"... Global remote sensing and large-scale environmental modeling have generated huge amounts of raster geospatial data. While the inherent data parallelism of large-scale raster geospatial data allows straightforward coarse-grained parallelization at the chunk level on CPUs, it is largely unclear how to ..."
Abstract
-
Cited by 8 (8 self)
- Add to MetaCart
(Show Context)
Global remote sensing and large-scale environmental modeling have generated huge amounts of raster geospatial data. While the inherent data parallelism of large-scale raster geospatial data allows straightforward coarse-grained parallelization at the chunk level on CPUs, it is largely unclear how to effectively exploit such data parallelism on massively parallel General Purpose Graphics Processing Units (GPGPUs), which require fine-grained parallelization. In this study, we have developed an efficient spatial data structure called the BQ-Tree to code raster geospatial data by exploiting the uniform distribution of quadrants of the bitmaps at each bitplane of a raster. In addition to utilizing chunk-level coarse-grained parallelism on both multicore CPUs and GPGPUs, we have developed two fine-grained parallelization schemes and four implementations of them using different system- and application-level optimization strategies. Experiments show that the best GPGPU implementation can decode a BQ-Tree-encoded 16-bit NASA MODIS geospatial raster with 22,658 x 15,586 cells in 190 milliseconds, i.e., 1.86 billion cells per second, on an Nvidia C2050 GPU card. This is a 6X speedup over the best dual quad-core CPU implementation and a 239-290X speedup over a baseline single-thread CPU implementation.
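The core observation, that bitplane quadrants of real rasters are frequently all zeros or all ones and can therefore be coded with a single symbol, can be illustrated with a simple kernel in which one thread classifies one quadrant of one bitplane. The layout and names below are illustrative assumptions, far simpler than the actual BQ-Tree encoder.

```
// Classify q x q quadrants of one bitplane: 0 = all zeros, 1 = all ones,
// 2 = mixed (would require further subdivision in a quadtree encoder).
// bitplane holds one 0/1 byte per cell of a width x height raster.
__global__ void classifyQuadrants(const unsigned char* bitplane,
                                  int width, int height, int q,
                                  unsigned char* quadClass)
{
    int qx = blockIdx.x * blockDim.x + threadIdx.x;
    int qy = blockIdx.y * blockDim.y + threadIdx.y;
    int nqx = width / q, nqy = height / q;
    if (qx >= nqx || qy >= nqy) return;

    unsigned char first = bitplane[(qy * q) * width + (qx * q)];
    bool uniform = true;
    for (int y = 0; y < q && uniform; ++y)
        for (int x = 0; x < q; ++x)
            if (bitplane[(qy * q + y) * width + (qx * q + x)] != first) {
                uniform = false;
                break;
            }
    quadClass[qy * nqx + qx] = uniform ? first : (unsigned char)2;
}
```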
Obsidian: GPU Kernel Programming in Haskell
, 2011
"... Graphics Processing Units (GPUs) are evolving into powerful general purpose computing platforms. At first, GPU performance was driven by the requirements of 3D graphics computer games. To fit this workload, a GPU is a many-core processor suitable for the data-parallel programming paradigm. Today, GP ..."
Abstract
-
Cited by 7 (0 self)
- Add to MetaCart
Graphics Processing Units (GPUs) are evolving into powerful general-purpose computing platforms. At first, GPU performance was driven by the requirements of 3D computer games. To fit this workload, a GPU is a many-core processor suited to the data-parallel programming paradigm. Today, GPUs come with hundreds of processing elements and a theoretical single-precision floating-point performance in the teraflop range. Because of the computing power of modern GPUs, programmers are increasingly interested in using them for non-graphics applications. This desire has given rise to the research field that studies General Purpose Computations on GPUs (GPGPU). The manufacturers of GPUs are acknowledging this trend and are tailoring their GPUs to meet the desires both of those playing games and of the GPGPU community. CUDA is NVIDIA's toolset for GPGPU programming on its GPUs, and a big improvement for the GPGPU programmer compared to what was available before. In the early days, the GPGPU programmer was forced to express the algorithm being implemented as a computer graphics computation. CUDA provides a C compiler and a set
State-of-the-art in heterogeneous computing
, 2010
"... Node level heterogeneous architectures have become attractive during the last decade for several reasons: compared to traditional symmetric CPUs, they offer high peak performance and are energy and/or cost efficient. With the increase of fine-grained parallelism in high-performance computing, as wel ..."
Abstract
-
Cited by 7 (0 self)
- Add to MetaCart
(Show Context)
Node-level heterogeneous architectures have become attractive during the last decade for several reasons: compared to traditional symmetric CPUs, they offer high peak performance and are energy- and/or cost-efficient. With the increase of fine-grained parallelism in high-performance computing, as well as the introduction of parallelism in workstations, there is an acute need for a good overview and understanding of these architectures. We give an overview of the state of the art in heterogeneous computing, focusing on three commonly found architectures: the Cell Broadband Engine Architecture, graphics processing units (GPUs), and field-programmable gate arrays (FPGAs). We present a review of hardware and available software tools, and an overview of state-of-the-art techniques and algorithms. Furthermore, we present a qualitative and quantitative comparison of the architectures, and give our view on the future of heterogeneous computing.
GPU Merge Path - A GPU Merging Algorithm
"... Graphics Processing Units (GPUs) have become ideal candidates for the development of fine-grain parallel algorithms as the number of processing elements per GPU increases. In addition to the increase in cores per system, new memory hierarchies and increased bandwidth have been developed that allow f ..."
Abstract
-
Cited by 6 (0 self)
- Add to MetaCart
Graphics Processing Units (GPUs) have become ideal candidates for the development of fine-grained parallel algorithms as the number of processing elements per GPU increases. In addition to the increase in cores per system, new memory hierarchies and increased bandwidth allow for significant performance improvement when computation uses certain memory access patterns. Merging two sorted arrays is a useful primitive and a basic building block for numerous applications such as joining database queries, merging adjacency lists in graphs, and set intersection. An efficient parallel merging algorithm partitions the sorted input arrays into sets of non-overlapping sub-arrays that can be merged independently on multiple cores. For optimal performance, the partitioning should be done in parallel and should divide the input arrays such that each core receives an equal amount of data to merge. In this paper, we present an algorithm that partitions the workload equally amongst the GPU Streaming Multiprocessors (SMs). Following this, we show how each SM performs a parallel merge and how to divide the work so that all of the GPU's Streaming Processors (SPs) are utilized. All stages of this algorithm are parallel, and the new algorithm makes good use of the GPU memory hierarchy. This approach achieves average speedups of 20X and 50X over a sequential merge on the x86 platform for integer and floating-point data, respectively. Our implementation is 10X faster than the fast parallel merge supplied in the CUDA Thrust library.
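The equal-sized, non-overlapping partitioning described above amounts to a binary search along a cross-diagonal of the conceptual merge grid: for a given output position, it determines how many elements come from A and how many from B, so each SM can merge its slice independently. A minimal sketch of that diagonal search is below; the function and variable names are ours, not the paper's.

```
// Merge Path partition search: given sorted arrays A (lenA) and B (lenB),
// return how many elements of A fall into the first `diag` elements of the
// merged output. The remaining diag - result elements come from B.
__device__ int mergePathSearch(const int* A, int lenA,
                               const int* B, int lenB, int diag)
{
    int lo = max(0, diag - lenB);
    int hi = min(diag, lenA);
    while (lo < hi) {
        int mid = (lo + hi) >> 1;
        // Take more of A while A[mid] is still smaller than the B element
        // sitting across the diagonal; otherwise take more of B.
        if (A[mid] < B[diag - 1 - mid])
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

// Partition p of P starts at diag = p * (lenA + lenB) / P; each partition is
// then merged sequentially or in parallel without touching its neighbors.
```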
Efficient Parallel Merge Sort for Fixed and Variable Length Keys
"... We design a high-performance parallel merge sort for highly parallel systems. Our merge sort is designed to use more register communication (not shared memory), and does not suffer from oversegmentation as opposed to previous comparison based sorts. Using these techniques we are able to achieve a so ..."
Abstract
-
Cited by 5 (0 self)
- Add to MetaCart
(Show Context)
We design a high-performance parallel merge sort for highly parallel systems. Our merge sort is designed to use register communication rather than shared memory, and, unlike previous comparison-based sorts, it does not suffer from over-segmentation. Using these techniques we achieve a sorting rate of 250 MKeys/sec, which is about 2.5 times faster than the Thrust merge sort and 70% faster than non-stable state-of-the-art GPU merge sorts. Building on this sorting algorithm, we develop a scheme for sorting variable-length key/value pairs, with a special emphasis on string keys. Sorting non-uniform, unaligned data such as strings is a fundamental step in a variety of algorithms, yet it has received comparatively little attention. To our knowledge, our system is the first published description of an efficient string sort for GPUs. We sort strings at a rate of 70 MStrings/s on one dataset and at up to 1.25 GB/s on another dataset using a GTX 580.
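A common way to fit variable-length string keys into a fixed-width GPU sort, and roughly the spirit of the scheme above, is to sort on a few leading characters packed into an integer and resolve ties by comparing the remaining characters in a later pass. The sketch below packs a 4-byte prefix; it illustrates that general trick under our own naming and is not the paper's implementation.

```
#include <cstdint>

// Pack the first four bytes of a NUL-terminated string into a big-endian
// 32-bit key, so that unsigned comparison of the keys matches lexicographic
// order of the prefixes. Equal keys (shared prefixes) must be disambiguated
// by comparing the remaining characters in a tie-breaking pass.
__host__ __device__ inline uint32_t packPrefix(const char* s)
{
    uint32_t key = 0;
    for (int i = 0; i < 4; ++i) {
        unsigned char c = (unsigned char)s[i];
        key = (key << 8) | c;
        if (c == 0) {                    // string shorter than 4 chars: pad with zeros
            key <<= 8 * (3 - i);
            break;
        }
    }
    return key;
}
```

Sorting these fixed-width keys with any fast GPU sort orders most strings correctly in one pass; only runs of identical prefixes need the slower variable-length comparison.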