Results 1-10 of 142
Scalable GPU graph traversal
In 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP’12, 2012
Cited by 64 (1 self)
Breadth-first search (BFS) is a core primitive for graph traversal and a basis for many higher-level graph analysis algorithms. It is also representative of a class of parallel computations whose memory accesses and work distribution are both irregular and data-dependent. Recent work has demonstrated the plausibility of GPU sparse graph traversal, but has tended to focus on asymptotically inefficient algorithms that perform poorly on graphs with nontrivial diameter. We present a BFS parallelization focused on fine-grained task management constructed from efficient prefix sum that achieves an asymptotically optimal O(V+E) work complexity. Our implementation delivers excellent performance on diverse graphs, achieving traversal rates in excess of 3.3 billion and 8.3 billion traversed edges per second using single- and quad-GPU configurations, respectively. This level of performance is several times faster than state-of-the-art implementations on both CPU and GPU platforms.
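As a rough illustration of how a prefix sum can drive frontier expansion with O(V+E) total work, here is a minimal sequential sketch in Python. It is not the paper's implementation: the CSR layout and the names `row_offsets`, `column_indices`, `exclusive_scan`, and `bfs_levels` are assumptions made for this example, and the loops that would run in parallel on a GPU are written serially.

```python
from itertools import accumulate


def exclusive_scan(values):
    """Exclusive prefix sum: out[i] = sum(values[:i]). Also returns the total."""
    if not values:
        return [], 0
    inclusive = list(accumulate(values))
    return [0] + inclusive[:-1], inclusive[-1]


def bfs_levels(row_offsets, column_indices, source):
    """Level-synchronous BFS over a CSR graph (hypothetical layout).

    Returns the BFS depth of every vertex, or -1 if it is unreachable.
    """
    num_vertices = len(row_offsets) - 1
    depth = [-1] * num_vertices
    depth[source] = 0
    frontier = [source]
    level = 0
    while frontier:
        # Each frontier vertex reports how many neighbors it will emit.
        degrees = [row_offsets[v + 1] - row_offsets[v] for v in frontier]
        # The scan assigns every vertex a private, non-overlapping output
        # range, so the gather below needs no locks or atomics.
        offsets, total = exclusive_scan(degrees)
        edge_frontier = [0] * total
        for v, base in zip(frontier, offsets):
            start = row_offsets[v]
            for k in range(row_offsets[v + 1] - start):
                edge_frontier[base + k] = column_indices[start + k]
        # Contraction: keep only vertices seen for the first time.
        level += 1
        next_frontier = []
        for u in edge_frontier:
            if depth[u] == -1:
                depth[u] = level
                next_frontier.append(u)
        frontier = next_frontier
    return depth


# Small example graph: 0 -> 1, 0 -> 2, 1 -> 3, 2 -> 3, 3 -> 4; vertex 5 is isolated.
example_row_offsets = [0, 2, 3, 4, 5, 5, 5]
example_column_indices = [1, 2, 3, 3, 4]
example_depths = bfs_levels(example_row_offsets, example_column_indices, 0)
```

Each vertex enters the frontier at most once and each edge is expanded at most once, which is where the linear work bound comes from in this sketch.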
Copperhead: Compiling an embedded data parallel language
In Principles and Practice of Parallel Programming, PPoPP’11, 2011
Cited by 62 (4 self)
Modern parallel microprocessors deliver high performance on applications that expose substantial fine-grained data parallelism. Although data parallelism is widely available in many computations, implementing data parallel algorithms in low-level languages is often an unnecessarily difficult task. The characteristics of parallel microprocessors and the limitations of current programming methodologies motivate our design of Copperhead, a high-level data parallel language embedded in Python. The Copperhead programmer describes parallel computations via composition of familiar data parallel primitives supporting both flat and nested data parallel computation on arrays of data. Copperhead programs are expressed in a subset of the widely used Python programming language and interoperate with standard Python modules, including libraries for numeric computation, data visualization, and analysis. In this paper, we discuss the language, compiler, and runtime features that enable Copperhead to efficiently execute data parallel code. We define the restricted subset of Python which Copperhead supports and introduce the program analysis techniques necessary for compiling Copperhead code into efficient low-level implementations. We also outline the runtime support by which Copperhead programs interoperate with standard Python modules. We demonstrate the effectiveness of our techniques with several examples targeting the CUDA platform for parallel programming on GPUs. Copperhead code is concise, on average requiring 3.6 times fewer lines of code than CUDA, and the compiler generates efficient code, yielding 45-100% of the performance of handcrafted, well-optimized CUDA code.
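To make the distinction between flat and nested data parallelism concrete, the following sketch uses plain Python only. It does not import or reproduce Copperhead's API, which the abstract does not show; the function names `axpy` and `sparse_matvec` and the row-of-pairs sparse layout are assumptions chosen for illustration.

```python
def axpy(a, x, y):
    """Flat data parallelism: one independent operation per element."""
    return [a * xi + yi for xi, yi in zip(x, y)]


def sparse_matvec(rows, x):
    """Nested data parallelism: an outer map over rows, with an inner
    reduction over each row's (column, value) pairs.

    `rows` is a hypothetical layout: a list of rows, each a list of
    (column_index, value) tuples.
    """
    return [sum(value * x[col] for col, value in row) for row in rows]


flat_result = axpy(2, [1, 2], [3, 4])
nested_result = sparse_matvec([[(0, 1.0), (1, 2.0)], [], [(1, 3.0)]], [10.0, 1.0])
```

Both functions are written as compositions of map and reduce over sequences, which is the style of program the abstract describes as a target for compilation to parallel hardware.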
Multicore bundle adjustment
In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011
Cited by 61 (4 self)
We present the design and implementation of new inexact Newton-type Bundle Adjustment algorithms that exploit hardware parallelism for efficiently solving large-scale 3D scene reconstruction problems. We explore the use of multicore CPUs as well as multicore GPUs for this purpose. We show that overcoming the severe memory and bandwidth limitations of current-generation GPUs not only leads to more space-efficient algorithms, but also to surprising savings in runtime. Our CPU-based system is up to ten times and our GPU-based system is up to thirty times faster than the current state-of-the-art methods [1], while maintaining comparable convergence behavior. The code and additional results are available at
A Quantitative Performance Analysis Model for GPU Architectures
In HPCA, 2011
Cited by 57 (2 self)
We develop a microbenchmark-based performance model for NVIDIA GeForce 200-series GPUs. Our model identifies GPU program bottlenecks and quantitatively analyzes performance, and thus allows programmers and architects to predict the benefits of potential program optimizations and architectural improvements. In particular, we use a microbenchmark-based approach to develop a throughput model for three major components of GPU execution time: the instruction pipeline, shared memory access, and global memory access. Because our model is based on the GPU’s native instruction set, we can predict performance with a 5–15% error. To demonstrate the usefulness of the model, we analyze three representative real-world and already highly optimized programs: dense matrix multiply, tridiagonal systems solver, and sparse matrix vector multiply. The model gives us a detailed quantitative analysis of performance, allowing us to understand the configuration of the fastest dense matrix multiply implementation and to optimize the tridiagonal solver and sparse matrix vector multiply by 60% and 18%, respectively. Furthermore, applying our model to these codes allows us to suggest architectural improvements on hardware resource allocation, avoiding bank conflicts, block scheduling, and memory transaction granularity.
Efficient, high-quality image contour detection
In IEEE International Conference on Computer Vision, 2009
Cited by 32 (1 self)
Image contour detection is fundamental to many image analysis applications, including image segmentation, object recognition and classification. However, highly accurate image contour detection algorithms are also very computationally intensive, which limits their applicability, even for offline batch processing. In this work, we examine efficient parallel algorithms for performing image contour detection, with particular attention paid to local image analysis as well as the generalized eigensolver used in Normalized Cuts. Combining these algorithms into a contour detector, along with careful implementation on highly parallel, commodity processors from Nvidia, our contour detector provides uncompromised contour accuracy, with an F-metric of 0.70 on the Berkeley Segmentation Dataset. Runtime is reduced from 4 minutes to 1.8 seconds. The efficiency gains we realize enable high-quality image contour detection on much larger images than previously practical, and the algorithms we propose are applicable to several image segmentation approaches. Efficient, scalable, yet highly accurate image contour detection will facilitate increased performance in many computer vision applications.
Mint: Realizing CUDA Performance in 3D Stencil Methods with Annotated C
In Proceedings of the 25th International Conference on Supercomputing (ICS’11), 2011
Cited by 23 (1 self)
We present Mint, a programming model that enables the non-expert to enjoy the performance benefits of hand-coded CUDA without becoming entangled in the details. Mint targets stencil methods, which are an important class of scientific applications. We have implemented the Mint programming model with a source-to-source translator that generates optimized CUDA C from traditional C source. The translator relies on annotations to guide translation at a high level. The set of pragmas is small, and the model is compact and simple. Yet, Mint is able to deliver performance competitive with painstakingly hand-optimized CUDA. We show that, for a set of widely used stencil kernels, Mint realized 80% of the performance obtained from aggressively optimized CUDA on the 200-series NVIDIA GPUs. Our optimizations target three-dimensional kernels, which present a daunting array of optimizations.
EigenCFA: Accelerating Flow Analysis with GPUs
Cited by 20 (4 self)
We describe, implement and benchmark EigenCFA, an algorithm for accelerating higher-order control-flow analysis (specifically, 0CFA) with a GPU. Ultimately, our program transformations, reductions and optimizations achieve a factor of 72 speedup over an optimized CPU implementation. We began our investigation with the view that GPUs accelerate high-arithmetic, data-parallel computations with a poor tolerance for branching. Taking that perspective to its limit, we reduced Shivers’s abstract-interpretive 0CFA to an algorithm synthesized from linear-algebra operations. Central to this reduction were “abstract” domains as vectors and matrices. A straightforward (dense-matrix) implementation of EigenCFA performed slower than a fast CPU implementation. Ultimately, sparse-matrix data structures and operations turned out to be the critical accelerants. Because control-flow graphs are sparse in practice (up to 96% empty), our control-flow matrices are also sparse, giving the sparse matrix operations an overwhelming space and speed advantage. We also achieved speedups by carefully permitting data races. The monotonicity of 0CFA makes it sound to perform analysis operations in parallel, possibly using stale or even partially updated data.
Exposing fine-grained parallelism in algebraic multigrid methods
2012
Cited by 17 (0 self)
Algebraic multigrid methods for large, sparse linear systems are a necessity in many computational simulations, yet parallel algorithms for such solvers are generally decomposed into coarse-grained tasks suitable for distributed computers with traditional processing cores. However, accelerating multigrid on massively parallel throughput-oriented processors, such as the GPU, demands algorithms with abundant fine-grained parallelism. In this paper, we develop a parallel algebraic multigrid method which exposes substantial fine-grained parallelism in both the construction of the multigrid hierarchy and the cycling or solve stage. Our algorithms are expressed in terms of scalable parallel primitives that are efficiently implemented on the GPU. The resulting solver achieves an average speedup of 1.8× in the setup phase and 5.7× in the cycling phase when compared to a representative CPU implementation.
On the Limits of GPU Acceleration
Cited by 15 (0 self)
This paper throws a small “wet blanket” on the hot topic of GPGPU acceleration, based on experience analyzing and tuning both multithreaded CPU and GPU implementations of three computations in scientific computing. These computations, namely (a) iterative sparse linear solvers, (b) sparse Cholesky factorization, and (c) the fast multipole method, exhibit complex behavior and vary in computational intensity and memory reference irregularity. In each case, algorithmic analysis and prior work might lead us to conclude that an idealized GPU can deliver better performance, but we find that for at least equal-effort CPU tuning and consideration of realistic workloads and calling contexts, we can with two modern quad-core CPU sockets roughly match one or two
Accelerating nearest neighbor search on manycore systems
2011
Cited by 14 (0 self)
We develop methods for accelerating metric similarity search that are effective on modern hardware. Our algorithms factor into easily parallelizable components, making them simple to deploy and efficient on multicore CPUs and GPUs. Despite the simple structure of our algorithms, their search performance is provably sublinear in the size of the database, with a factor dependent only on its intrinsic dimensionality. We demonstrate that our methods provide substantial speedups on a range of datasets and hardware platforms. In particular, we present results on a 48-core server machine, on graphics hardware, and on a multicore desktop. Keywords: similarity search; metric spaces; parallel algorithms; manycore.