Results 1-10 of 107
Implementation of a Portable Nested Data-Parallel Language
Journal of Parallel and Distributed Computing, 1994
Cited by 182 (27 self)
This paper gives an overview of the implementation of Nesl, a portable nested data-parallel language. This language and its implementation are the first to fully support nested data structures as well as nested data-parallel function calls. These features allow the concise description of parallel algorithms on irregular data, such as sparse matrices and graphs. In addition, they maintain the advantages of data-parallel languages: a simple programming model and portability. The current Nesl implementation is based on an intermediate language called Vcode and a library of vector routines called Cvl. It runs on the Connection Machine CM-2, the Cray Y-MP C90, and serial machines. We compare initial benchmark results of Nesl with those of machine-specific code on these machines for three algorithms: least-squares line-fitting, median finding, and a sparse-matrix vector product. These results show that Nesl's performance is competitive with that of machine-specific codes for regular dense da...
Special Purpose Parallel Computing
Lectures on Parallel Computation, 1993
Cited by 77 (5 self)
A vast amount of work has been done in recent years on the design, analysis, implementation and verification of special purpose parallel computing systems. This paper presents a survey of various aspects of this work. A long, but by no means complete, bibliography is given.

1. Introduction

Turing [365] demonstrated that, in principle, a single general purpose sequential machine could be designed which would be capable of efficiently performing any computation which could be performed by a special purpose sequential machine. The importance of this universality result for subsequent practical developments in computing cannot be overstated. It showed that, for a given computational problem, the additional efficiency advantages which could be gained by designing a special purpose sequential machine for that problem would not be great. Around 1944, von Neumann produced a proposal [66, 389] for a general purpose stored-program sequential computer which captured the fundamental principles of...
Understanding the Efficiency of Ray Traversal on GPUs
Cited by 70 (4 self)
We discuss the mapping of elementary ray tracing operations (acceleration structure traversal and primitive intersection) onto wide SIMD/SIMT machines. Our focus is on NVIDIA GPUs, but some of the observations should be valid for other wide machines as well. While several fast GPU tracing methods have been published, very little is actually understood about their performance. Nobody knows whether the methods are anywhere near the theoretically obtainable limits, and if not, what might be causing the discrepancy. We study this question by comparing the measurements against a simulator that tells the upper bound of performance for a given kernel. We observe that previously known methods are a factor of 1.5–2.5X off from theoretical optimum, and most of the gap is not explained by memory bandwidth, but rather by previously unidentified inefficiencies in hardware work distribution. We then propose a simple solution that significantly narrows the gap between simulation and measurement. This results in the fastest GPU ray tracer to date. We provide results for primary, ambient occlusion and diffuse interreflection rays.
SUIF Explorer: an interactive and interprocedural parallelizer
1999
Cited by 66 (5 self)
The SUIF Explorer is an interactive parallelization tool that is more effective than previous systems in minimizing the number of lines of code that require programmer assistance. First, the interprocedural analyses in the SUIF system are successful in parallelizing many coarse-grain loops, thus minimizing the number of spurious dependences requiring attention. Second, the system uses dynamic execution analyzers to identify those important loops that are likely to be parallelizable. Third, the SUIF Explorer is the first to apply program slicing to aid programmers in interactive parallelization. The system guides the programmer in the parallelization process using a set of sophisticated visualization techniques. This paper demonstrates the effectiveness of the SUIF Explorer with three case studies. The programmer was able to speed up all three programs by examining only a small fraction of the program and privatizing a few variables.

1. Introduction

Exploiting coarse-grain parallelism i...
Efficient sparse matrix-vector multiplication on CUDA
2008
Cited by 50 (1 self)
The massive parallelism of graphics processing units (GPUs) offers tremendous performance in many high-performance computing applications. While dense linear algebra readily maps to such platforms, harnessing this potential for sparse matrix computations presents additional challenges. Given its role in iterative methods for solving sparse linear systems and eigenvalue problems, sparse matrix-vector multiplication (SpMV) is of singular importance in sparse linear algebra. In this paper we discuss data structures and algorithms for SpMV that are efficiently implemented on the CUDA platform for the fine-grained parallel architecture of the GPU. Given the memory-bound nature of SpMV, we emphasize memory bandwidth efficiency and compact storage formats. We consider a broad spectrum of sparse matrices, from those that are well-structured and regular to highly irregular matrices with large imbalances in the distribution of nonzeros per matrix row. We develop methods to exploit several common forms of matrix structure while offering alternatives which accommodate greater irregularity. On structured, grid-based matrices we achieve performance of 36 GFLOP/s in single precision and 16 GFLOP/s in double precision on a GeForce GTX 280 GPU. For unstructured finite-element matrices, we observe performance in excess of 15 GFLOP/s and 10 GFLOP/s in single and double precision respectively. These results compare favorably to prior state-of-the-art studies of SpMV methods on conventional multicore processors. Our double precision SpMV performance is generally two and a half times that of a Cell BE with 8 SPEs and more than ten times greater than that of a quad-core Intel Clovertown system.
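To make the idea of a compact storage format concrete, here is a minimal serial sketch of SpMV over the compressed sparse row (CSR) layout. CSR is assumed here only as a representative format, since the abstract does not name the formats it studies; the function name `csr_spmv` and the array names are illustrative, and this is not the paper's CUDA implementation.

```python
def csr_spmv(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in CSR form (illustrative helper).

    row_ptr[r] .. row_ptr[r + 1] delimits the nonzeros of row r inside
    col_idx (column of each nonzero) and values (value of each nonzero).
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for row in range(n_rows):
        acc = 0.0
        # Only the stored nonzeros of this row are visited.
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


# Example matrix (3 x 3, four nonzeros):
# [[10, 0, 0],
#  [ 0, 0, 2],
#  [ 3, 4, 0]]
row_ptr = [0, 1, 2, 4]
col_idx = [0, 2, 0, 1]
values = [10.0, 2.0, 3.0, 4.0]
x = [1.0, 2.0, 3.0]
y = csr_spmv(row_ptr, col_idx, values, x)
```

Each row's inner loop is independent of the others, which is what makes the row a natural unit of parallel work; rows with very different nonzero counts then lead to the load imbalance the abstract mentions.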
Relational joins on graphics processors
2007
Cited by 43 (6 self)
We present our novel design and implementation of relational join algorithms for new-generation graphics processing units (GPUs). The new features of such GPUs include support for writes to random memory locations, efficient interprocessor communication through fast shared memory, and a programming model for general-purpose computing. Taking advantage of these new features, we design a set of data-parallel primitives such as scan, scatter and split, and use these primitives to implement indexed or non-indexed nested-loop, sort-merge and hash joins. Our algorithms utilize the high parallelism as well as the high memory bandwidth of the GPU and use parallel computation to effectively hide the memory latency. We have implemented our algorithms on a PC with an NVIDIA G80 GPU and an Intel P4 dual-core CPU. Our GPU-based algorithms are able to achieve 2-20 times higher performance than their CPU-based counterparts.
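The relationship between the three primitives named above can be sketched as follows: a split (stable partition) is built from an exclusive scan, which computes each element's output position, followed by a scatter, which writes elements to those positions. This is a serial illustration under the assumption of a two-way split on a 0/1 flag; the function names are hypothetical and the paper's GPU primitives may be defined differently.

```python
def exclusive_scan(xs):
    """Exclusive prefix sum: out[i] is the sum of xs[0..i-1]."""
    out, total = [], 0
    for x in xs:
        out.append(total)
        total += x
    return out


def scatter(values, positions):
    """Write values[i] to out[positions[i]]; positions must be a permutation."""
    out = [None] * len(values)
    for v, p in zip(values, positions):
        out[p] = v
    return out


def split(values, flags):
    """Stable two-way partition: flag-0 elements first, then flag-1 elements."""
    zeros = [1 - f for f in flags]
    zero_pos = exclusive_scan(zeros)   # position among the flag-0 elements
    one_pos = exclusive_scan(flags)    # position among the flag-1 elements
    n_zeros = sum(zeros)
    positions = [
        zero_pos[i] if flags[i] == 0 else n_zeros + one_pos[i]
        for i in range(len(values))
    ]
    return scatter(values, positions)


keys = [5, 2, 7, 4, 1, 6]
flags = [k & 1 for k in keys]          # split on the lowest bit (even vs. odd)
result = split(keys, flags)
```

In this formulation every element's destination depends only on scan results, so the position computation and the scatter can each be carried out for all elements at once on a data-parallel machine.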
Implementing decision trees and forests on a GPU
In Proceedings of the 10th European Conference on Computer Vision, 2008
Cited by 31 (2 self)
We describe a method for implementing the evaluation and training of decision trees and forests entirely on a GPU, and show how this method can be used in the context of object recognition. Our strategy for evaluation involves mapping the data structure describing a decision forest to a 2D texture array. We navigate through the forest for each point of the input data in parallel using an efficient, non-branching pixel shader. For training, we compute the responses of the training data to a set of candidate features, and scatter the responses into a suitable histogram using a vertex shader. The histograms thus computed can be used in conjunction with a broad range of tree learning algorithms. We demonstrate results for object recognition which are identical to those obtained on a CPU, obtained in about 1% of the time. To our knowledge, this is the first time a method has been proposed which is capable of evaluating or training decision trees on a GPU. Our method leverages the full parallelism of the GPU. Although we use features common to computer vision to demonstrate object recognition, our framework can accommodate other kinds of features for more general utility within computer science.
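The evaluation strategy described above, a tree stored as a flat array and traversed without branching, can be illustrated with a small sketch. The row layout, the self-referencing leaves, and the names `TREE`, `MAX_DEPTH` and `evaluate` are assumptions made for this example; the paper's actual texture layout and shader are not given in the abstract.

```python
# Each row: (feature_index, threshold, left_child, right_child, leaf_value).
# Leaves point to themselves, so running a fixed number of steps is harmless.
TREE = [
    (0, 0.5, 1, 2, -1),   # node 0: test x[0] > 0.5
    (1, 0.3, 3, 4, -1),   # node 1: test x[1] > 0.3
    (0, 0.0, 2, 2, 2),    # node 2: leaf, class 2
    (0, 0.0, 3, 3, 0),    # node 3: leaf, class 0
    (0, 0.0, 4, 4, 1),    # node 4: leaf, class 1
]
MAX_DEPTH = 2


def evaluate(tree, x, max_depth):
    """Walk the flattened tree for a fixed number of steps, without if/else."""
    node = 0
    for _ in range(max_depth):
        feature, threshold, left, right, _ = tree[node]
        go_right = int(x[feature] > threshold)    # 0 or 1
        node = left + go_right * (right - left)   # arithmetic select
    return tree[node][4]


points = [[0.2, 0.9], [0.2, 0.1], [0.8, 0.0]]
predictions = [evaluate(TREE, p, MAX_DEPTH) for p in points]
```

Because every input point executes the same fixed sequence of lookups and arithmetic, all points can be processed in lockstep, which is the property a non-branching shader relies on.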
Automated Dynamic Analysis of CUDA Programs
Cited by 28 (3 self)
Recent increases in the programmability and performance of GPUs have led to a surge of interest in utilizing them for general-purpose computations. Tools such as NVIDIA’s Cuda allow programmers to use a C-like language to code algorithms for execution on the GPU. Unfortunately, parallel programs are prone to subtle correctness and performance bugs, and Cuda tool support for solving these remains a work in progress. As a first step towards addressing these problems, we present an automated analysis technique for finding two specific classes of bugs in Cuda programs: race conditions, which impact program correctness, and shared memory bank conflicts, which impact program performance. Our technique automatically instruments a program in two ways: to keep track of the memory locations accessed by different threads, and to use this data to determine whether bugs exist in the program. The instrumented source code can be run directly in Cuda’s device emulation mode, and any potential errors discovered will be automatically reported to the user. This automated analysis can help programmers find and solve subtle bugs in programs that are too complex to analyze manually. Although these issues are explored in the context of Cuda programs, similar issues will arise in any sufficiently “many-core” architecture.
Practical Parallel Algorithms for Dynamic Data Redistribution, Median Finding, and Selection (Extended Abstract)
1996
Cited by 25 (10 self)
David A. Bader, Joseph JáJá. Institute for Advanced Computer Studies, and Department of Electrical Engineering, University of Maryland, College Park, MD 20742. Email: {dbader, joseph}@umiacs.umd.edu

Abstract. A common statistical problem is that of finding the median element in a set of data. This paper presents a fast and portable parallel algorithm for finding the median given a set of elements distributed across a parallel machine. In fact, our algorithm solves the general selection problem that requires the determination of the element of rank i, for an arbitrarily given integer i. Practical algorithms needed by our selection algorithm for the dynamic redistribution of data are also discussed. Our general framework is a distributed memory programming model enhanced by a set of communication primitives. We use efficient techniques for distributing, coalescing, and load balancing data as well as efficient combinations of task and data parallelism. The algorithms have been coded in Split-C and run on a variety of platforms, including the Thinking Machines CM-5, IBM SP-1 and SP-2, Cray Research T3D, Meiko Scientific CS-2, Intel Paragon, and workstation clusters. Our experimental results illustrate the scalability and efficiency of our algorithms across different platforms and improve upon all the related experimental results known to the authors.
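For readers unfamiliar with the selection problem stated above (find the element of rank i, with the median as a special case), the following is a minimal serial sketch using randomized partitioning. It is a swapped-in textbook technique, not the paper's parallel algorithm; the names `select` and `median`, the 1-based rank convention, and the choice of the lower median for even-sized inputs are assumptions of this example.

```python
import random


def select(data, rank):
    """Return the element of the given 1-based rank (rank 1 is the smallest)."""
    if not 1 <= rank <= len(data):
        raise ValueError("rank out of range")
    items = list(data)
    k = rank - 1                       # 0-based rank within the current items
    while True:
        pivot = random.choice(items)
        lower = [v for v in items if v < pivot]
        equal = [v for v in items if v == pivot]
        if k < len(lower):
            items = lower              # answer lies among the smaller elements
        elif k < len(lower) + len(equal):
            return pivot               # answer is the pivot value itself
        else:
            k -= len(lower) + len(equal)
            items = [v for v in items if v > pivot]


def median(data):
    """Lower median: the element of rank ceil(n / 2)."""
    return select(data, (len(data) + 1) // 2)
```

Usage: `median([7, 1, 5, 3, 9])` returns `5`, and `select([4, 4, 2, 8], 4)` returns `8`. In a distributed setting, the partitioning step is where the elements that remain under consideration can become unevenly spread across processors, which is why the abstract pairs selection with data redistribution and load balancing.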
Nepal - Nested Data-Parallelism in Haskell
In Euro-Par ’01, 2001
Cited by 24 (3 self)
This paper discusses an extension of Haskell by support for nested data-parallel programming in the style of the special-purpose language Nesl. More precisely, the extension consists of a parallel array type, array comprehensions, and a set of primitive parallel array operations. This extension brings a hitherto unsupported style of parallel programming to Haskell. Moreover, nested data parallelism should receive wider attention when available in a standardised language like Haskell. This paper outlines the language extension and demonstrates its usefulness with two case studies.