Results 1–10 of 85
Implementation of a Portable Nested Data-Parallel Language
 Journal of Parallel and Distributed Computing
, 1994
"... This paper gives an overview of the implementation of Nesl, a portable nested dataparallel language. This language and its implementation are the first to fully support nested data structures as well as nested dataparallel function calls. These features allow the concise description of parallel alg ..."
Abstract

Cited by 177 (26 self)
 Add to MetaCart
This paper gives an overview of the implementation of Nesl, a portable nested data-parallel language. This language and its implementation are the first to fully support nested data structures as well as nested data-parallel function calls. These features allow the concise description of parallel algorithms on irregular data, such as sparse matrices and graphs. In addition, they maintain the advantages of data-parallel languages: a simple programming model and portability. The current Nesl implementation is based on an intermediate language called Vcode and a library of vector routines called Cvl. It runs on the Connection Machine CM-2, the Cray Y-MP C90, and serial machines. We compare initial benchmark results of Nesl with those of machine-specific code on these machines for three algorithms: least-squares line fitting, median finding, and a sparse matrix-vector product. These results show that Nesl's performance is competitive with that of machine-specific codes for regular dense da...
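The nested-data-structure support described above rests on representing a nested sequence as one flat data vector plus a vector of segment lengths. A minimal Python sketch of that representation (the helper name `segmented_sum` is ours; NESL itself compiles such operations down to VCODE/CVL vector primitives):

```python
# Sketch of the segmented representation behind nested data-parallelism.
def segmented_sum(values, seg_lengths):
    """Sum each segment of a flat vector; the segments encode one level
    of nesting. In NESL this would execute as a single data-parallel step."""
    sums, i = [], 0
    for n in seg_lengths:
        sums.append(sum(values[i:i + n]))
        i += n
    return sums

# The nested sequence [[1, 2], [3, 4, 5], []] is stored flat with lengths:
flat = [1, 2, 3, 4, 5]
lens = [2, 3, 0]
assert segmented_sum(flat, lens) == [3, 12, 0]
```

The same flat-plus-segments layout is what lets irregular structures such as sparse-matrix rows of differing lengths run under a regular data-parallel model.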
Special Purpose Parallel Computing
 Lectures on Parallel Computation
, 1993
"... A vast amount of work has been done in recent years on the design, analysis, implementation and verification of special purpose parallel computing systems. This paper presents a survey of various aspects of this work. A long, but by no means complete, bibliography is given. 1. Introduction Turing ..."
Abstract

Cited by 77 (5 self)
 Add to MetaCart
A vast amount of work has been done in recent years on the design, analysis, implementation and verification of special purpose parallel computing systems. This paper presents a survey of various aspects of this work. A long, but by no means complete, bibliography is given. 1. Introduction Turing [365] demonstrated that, in principle, a single general purpose sequential machine could be designed which would be capable of efficiently performing any computation which could be performed by a special purpose sequential machine. The importance of this universality result for subsequent practical developments in computing cannot be overstated. It showed that, for a given computational problem, the additional efficiency advantages which could be gained by designing a special purpose sequential machine for that problem would not be great. Around 1944, von Neumann produced a proposal [66, 389] for a general purpose stored-program sequential computer which captured the fundamental principles of...
SUIF Explorer: an interactive and interprocedural parallelizer
, 1999
"... The SUIF Explorer is an interactive parallelization tool that is more effective than previous systems in minimizing the number of lines of code that require programmer assistance. First, the interprocedural analyses in the SUIF system is successful in parallelizing many coarsegrain loops, thus mini ..."
Abstract

Cited by 63 (5 self)
 Add to MetaCart
The SUIF Explorer is an interactive parallelization tool that is more effective than previous systems in minimizing the number of lines of code that require programmer assistance. First, the interprocedural analyses in the SUIF system are successful in parallelizing many coarse-grain loops, thus minimizing the number of spurious dependences requiring attention. Second, the system uses dynamic execution analyzers to identify those important loops that are likely to be parallelizable. Third, the SUIF Explorer is the first to apply program slicing to aid programmers in interactive parallelization. The system guides the programmer in the parallelization process using a set of sophisticated visualization techniques. This paper demonstrates the effectiveness of the SUIF Explorer with three case studies. The programmer was able to speed up all three programs by examining only a small fraction of the program and privatizing a few variables.
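The dynamic execution analyzers mentioned above boil down to recording, for each loop iteration, which locations are read and written, then flagging cross-iteration overlaps. A simplified Python sketch of that check (the names and trace representation are ours, not SUIF's):

```python
def loop_carried_dependences(iteration_accesses):
    """iteration_accesses: one (reads, writes) pair of location sets per
    loop iteration, in execution order. Returns the locations involved in
    a cross-iteration dependence (written earlier, touched later)."""
    written = set()
    deps = set()
    for reads, writes in iteration_accesses:
        deps |= (reads | writes) & written   # RAW / WAW across iterations
        written |= writes
    return deps

# a[i] = a[i-1] + 1 reads the previous iteration's write: not parallelizable.
serial = [({"a0"}, {"a1"}), ({"a1"}, {"a2"})]
# a[i] = b[i]: iterations touch disjoint locations, so the loop is parallel.
parallel = [({"b0"}, {"a0"}), ({"b1"}, {"a1"})]
assert loop_carried_dependences(serial) == {"a1"}
assert loop_carried_dependences(parallel) == set()
```

A loop whose observed trace has no carried dependences is only *likely* parallelizable, which is why the system surfaces such loops to the programmer rather than transforming them automatically.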
Understanding the Efficiency of Ray Traversal on GPUs
"... We discuss the mapping of elementary ray tracing operations— acceleration structure traversal and primitive intersection—onto wide SIMD/SIMT machines. Our focus is on NVIDIA GPUs, but some of the observations should be valid for other wide machines as well. While several fast GPU tracing methods hav ..."
Abstract

Cited by 59 (4 self)
 Add to MetaCart
We discuss the mapping of elementary ray tracing operations—acceleration structure traversal and primitive intersection—onto wide SIMD/SIMT machines. Our focus is on NVIDIA GPUs, but some of the observations should be valid for other wide machines as well. While several fast GPU tracing methods have been published, very little is actually understood about their performance. Nobody knows whether the methods are anywhere near the theoretically obtainable limits, and if not, what might be causing the discrepancy. We study this question by comparing the measurements against a simulator that tells the upper bound of performance for a given kernel. We observe that previously known methods are a factor of 1.5–2.5X off from theoretical optimum, and most of the gap is not explained by memory bandwidth, but rather by previously unidentified inefficiencies in hardware work distribution. We then propose a simple solution that significantly narrows the gap between simulation and measurement. This results in the fastest GPU ray tracer to date. We provide results for primary, ambient occlusion and diffuse interreflection rays.
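One inefficiency relevant to this analysis is SIMT divergence: a warp executes in lockstep, so a divergent traversal loop runs for as many iterations as its longest-running thread needs. A toy Python model of that utilization loss (our own simplification, not the paper's simulator):

```python
def simd_utilization(iter_counts, width=32):
    """Fraction of issued SIMD lane-slots doing useful work when thread i
    of each warp runs a divergent loop for iter_counts[i] iterations.
    Lockstep execution: each warp issues max(counts) iterations."""
    warps = [iter_counts[i:i + width] for i in range(0, len(iter_counts), width)]
    useful = sum(sum(w) for w in warps)
    issued = sum(max(w) * len(w) for w in warps)
    return useful / issued

# Uniform work keeps every lane busy; one long ray idles the other lanes.
assert simd_utilization([4, 4, 4, 4], width=4) == 1.0
assert simd_utilization([1, 1, 1, 8], width=4) == 11 / 32
```

Comparing measured throughput against this kind of upper bound is what lets one attribute the remaining gap to other causes, such as work distribution.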
Efficient sparse matrix-vector multiplication on CUDA
, 2008
"... The massive parallelism of graphics processing units (GPUs) offers tremendous performance in many highperformance computing applications. While dense linear algebra readily maps to such platforms, harnessing this potential for sparse matrix computations presents additional challenges. Given its rol ..."
Abstract

Cited by 39 (1 self)
 Add to MetaCart
The massive parallelism of graphics processing units (GPUs) offers tremendous performance in many high-performance computing applications. While dense linear algebra readily maps to such platforms, harnessing this potential for sparse matrix computations presents additional challenges. Given its role in iterative methods for solving sparse linear systems and eigenvalue problems, sparse matrix-vector multiplication (SpMV) is of singular importance in sparse linear algebra. In this paper we discuss data structures and algorithms for SpMV that are efficiently implemented on the CUDA platform for the fine-grained parallel architecture of the GPU. Given the memory-bound nature of SpMV, we emphasize memory bandwidth efficiency and compact storage formats. We consider a broad spectrum of sparse matrices, from those that are well-structured and regular to highly irregular matrices with large imbalances in the distribution of nonzeros per matrix row. We develop methods to exploit several common forms of matrix structure while offering alternatives which accommodate greater irregularity. On structured, grid-based matrices we achieve performance of 36 GFLOP/s in single precision and 16 GFLOP/s in double precision on a GeForce GTX 280 GPU. For unstructured finite-element matrices, we observe performance in excess of 15 GFLOP/s and 10 GFLOP/s in single and double precision respectively. These results compare favorably to prior state-of-the-art studies of SpMV methods on conventional multicore processors. Our double precision SpMV performance is generally two and a half times that of a Cell BE with 8 SPEs and more than ten times greater than that of a quad-core Intel Clovertown system.
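Among the compact storage formats such work evaluates is compressed sparse row (CSR), which keeps only the nonzeros plus their column indices and per-row extents. A reference Python sketch of the CSR kernel (the CUDA variants parallelize the outer row loop across threads or warps):

```python
def spmv_csr(row_ptr, col_idx, vals, x):
    """y = A @ x for A stored in CSR form: vals holds the nonzeros row by
    row, col_idx their column positions, and row_ptr[r]:row_ptr[r+1] is the
    slice of nonzeros belonging to row r."""
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for j in range(row_ptr[r], row_ptr[r + 1]):
            acc += vals[j] * x[col_idx[j]]
        y.append(acc)
    return y

# A = [[10, 0], [0, 20]] stored compactly: only the two nonzeros are kept.
assert spmv_csr([0, 1, 2], [0, 1], [10.0, 20.0], [1.0, 2.0]) == [10.0, 40.0]
```

The format's weakness on GPUs is visible here: rows with very different nonzero counts give the parallelized outer loop unbalanced work, which is what alternative formats like ELL and the hybrid format address.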
Relational joins on graphics processors
, 2007
"... We present our novel design and implementation of relational join algorithms for newgeneration graphics processing units (GPUs). The new features of such GPUs include support for writes to random memory locations, efficient interprocessor communication through fast shared memory, and a programming ..."
Abstract

Cited by 35 (5 self)
 Add to MetaCart
We present our novel design and implementation of relational join algorithms for new-generation graphics processing units (GPUs). The new features of such GPUs include support for writes to random memory locations, efficient inter-processor communication through fast shared memory, and a programming model for general-purpose computing. Taking advantage of these new features, we design a set of data-parallel primitives such as scan, scatter and split, and use these primitives to implement indexed or non-indexed nested-loop, sort-merge and hash joins. Our algorithms utilize the high parallelism as well as the high memory bandwidth of the GPU and use parallel computation to effectively hide the memory latency. We have implemented our algorithms on a PC with an NVIDIA G80 GPU and an Intel P4 dual-core CPU. Our GPU-based algorithms are able to achieve 2–20 times higher performance than their CPU-based counterparts.
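The split primitive mentioned above can be built from a histogram, an exclusive scan, and a scatter. A sequential Python sketch of that composition (the function names are ours; on the GPU each of the three steps runs data-parallel, which is what makes split usable inside hash joins):

```python
def exclusive_scan(xs):
    """Exclusive prefix sum: out[i] = sum(xs[:i])."""
    out, acc = [], 0
    for x in xs:
        out.append(acc)
        acc += x
    return out

def split(keys, part):
    """Scatter keys[i] into partition part[i], keeping each partition
    contiguous: histogram the partition ids, scan the counts into start
    offsets, then scatter each key to its partition's next free slot."""
    nparts = max(part) + 1
    counts = [0] * nparts
    for p in part:
        counts[p] += 1
    offsets = exclusive_scan(counts)
    out = [None] * len(keys)
    cursor = offsets[:]              # next free slot per partition
    for k, p in zip(keys, part):
        out[cursor[p]] = k
        cursor[p] += 1
    return out

# Partition ids could come from a hash function in a hash join.
assert split([5, 2, 7, 4], [1, 0, 1, 0]) == [2, 4, 5, 7]
```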
Practical Parallel Algorithms for Dynamic Data Redistribution, Median Finding, and Selection (Extended Abstract)
, 1996
"... David A. Bader* Joseph jjfit Institute for Advanced Computer Studies, and Department of Electrical Engineering, University of Maryland, College Park, MD 20742 Email: {dbader, j oseph}umiacs. umd. edu Abstract A common statistical problem is that of finding the median element in a set of data ..."
Abstract

Cited by 25 (10 self)
 Add to MetaCart
David A. Bader* and Joseph JáJá, Institute for Advanced Computer Studies, and Department of Electrical Engineering, University of Maryland, College Park, MD 20742. Email: {dbader, joseph}@umiacs.umd.edu. A common statistical problem is that of finding the median element in a set of data. This paper presents a fast and portable parallel algorithm for finding the median given a set of elements distributed across a parallel machine. In fact, our algorithm solves the general selection problem that requires the determination of the element of rank i, for an arbitrarily given integer i. Practical algorithms needed by our selection algorithm for the dynamic redistribution of data are also discussed. Our general framework is a distributed memory programming model enhanced by a set of communication primitives. We use efficient techniques for distributing, coalescing, and load balancing data as well as efficient combinations of task and data parallelism. The algorithms have been coded in SPLIT-C and run on a variety of platforms, including the Thinking Machines CM-5, IBM SP-1 and SP-2, Cray Research T3D, Meiko Scientific CS-2, Intel Paragon, and workstation clusters. Our experimental results illustrate the scalability and efficiency of our algorithms across different platforms and improve upon all the related experimental results known to the authors.
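A selection algorithm of this family repeatedly picks a pivot from the processors' local medians and discards the elements whose rank range cannot contain i. A simplified sequential Python model of that narrowing (our own reconstruction of the general median-of-medians idea, not the paper's SPLIT-C code):

```python
def select(parts, i):
    """Rank-i element (0-based, over all data) of data spread across
    'parts', one list per processor. Pivot = median of the local medians;
    recurse only into the side whose rank range contains i."""
    medians = [sorted(p)[len(p) // 2] for p in parts if p]
    pivot = sorted(medians)[len(medians) // 2]
    lo = [x for p in parts for x in p if x < pivot]
    eq = sum(1 for p in parts for x in p if x == pivot)
    if i < len(lo):
        return select([lo], i)
    if i < len(lo) + eq:
        return pivot
    hi = [x for p in parts for x in p if x > pivot]
    return select([hi], i - len(lo) - eq)

# 1..9 spread unevenly across three "processors":
data = [[9, 1, 5], [2, 8], [7, 3, 6, 4]]
assert select(data, 4) == 5          # the median
assert select(data, 0) == 1          # the minimum
```

On a real machine each candidate-counting step is a reduction across processors, and the dynamic-redistribution routines keep the surviving candidates balanced between rounds.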
The Queue-Read Queue-Write PRAM Model: Accounting for Contention in Parallel Algorithms
 Proc. 5th ACM-SIAM Symp. on Discrete Algorithms
, 1997
"... Abstract. This paper introduces the queueread queuewrite (qrqw) parallel random access machine (pram) model, which permits concurrent reading and writing to sharedmemory locations, but at a cost proportional to the number of readers/writers to any one memory location in a given step. Prior to thi ..."
Abstract

Cited by 23 (10 self)
 Add to MetaCart
This paper introduces the queue-read queue-write (QRQW) parallel random access machine (PRAM) model, which permits concurrent reading and writing to shared-memory locations, but at a cost proportional to the number of readers/writers to any one memory location in a given step. Prior to this work there were no formal complexity models that accounted for the contention to memory locations, despite its large impact on the performance of parallel programs. The QRQW PRAM model reflects the contention properties of most commercially available parallel machines more accurately than either the well-studied CRCW PRAM or EREW PRAM models: the CRCW model does not adequately penalize algorithms with high contention to shared-memory locations, while the EREW model is too strict in its insistence on zero contention at each step. The QRQW PRAM is strictly more powerful than the EREW PRAM. This paper shows a separation of log n between the two models, and presents faster and more efficient QRQW algorithms for several basic problems, such as linear compaction, leader election, and processor allocation. Furthermore, we present a work-preserving emulation of the QRQW PRAM with only logarithmic slowdown on Valiant's BSP model, and hence on hypercube-type non-combining networks, even when latency, synchronization, and memory granularity overheads are taken into account. This matches the best-known emulation result for the EREW PRAM, and considerably improves upon the best-known efficient emulation for the CRCW PRAM on such networks. Finally, the paper presents several lower bound results for this model, including lower bounds on the time required for broadcasting and for leader election.
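The QRQW cost rule is simple to state: one step costs the maximum number of processors queued on any single cell, whereas EREW forbids such queues outright and CRCW charges nothing for them. A small Python illustration of the three cost rules (our own toy formulation of the models):

```python
from collections import Counter

def step_time(accesses, model="qrqw"):
    """Cost of one PRAM step, given the memory cell each processor
    touches. EREW: contention must be 1; CRCW: concurrency is free;
    QRQW: each cell's queue drains one access per time unit."""
    contention = max(Counter(accesses).values())
    if model == "erew":
        assert contention == 1, "EREW forbids concurrent access to a cell"
        return 1
    if model == "crcw":
        return 1
    return contention

# 8 processors all reading one cell: free on CRCW, cost 8 on QRQW,
# and simply illegal on EREW.
broadcast = ["x"] * 8
assert step_time(broadcast, "crcw") == 1
assert step_time(broadcast, "qrqw") == 8
assert step_time(["a", "b", "c"], "erew") == 1
```

This is why QRQW algorithm design favors low-contention patterns (e.g. spreading a broadcast over a tree of cells) without demanding the zero contention EREW requires.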
Interactive depth of field using simulated diffusion
, 2006
"... Figure 1: Top: Pinhole camera image from an upcoming feature film. Bottom: Sample results of our depthoffield algorithm based on simulated diffusion. We generate these results from a single color and depth value per pixel, and the above images render at 23–25 frames per second. The method is desig ..."
Abstract

Cited by 23 (2 self)
 Add to MetaCart
Figure 1: Top: Pinhole camera image from an upcoming feature film. Bottom: Sample results of our depth-of-field algorithm based on simulated diffusion. We generate these results from a single color and depth value per pixel, and the above images render at 23–25 frames per second. The method is designed to produce film-preview quality at interactive rates on a GPU. Fast preview should allow greater artistic control of depth-of-field effects. Accurate computation of depth-of-field effects in computer graphics rendering is generally very time consuming, creating a problematic workflow for film authoring. The computation is particularly challenging because it depends on large-scale spatially-varying filtering that must accurately respect complex boundaries. A variety of real-time algorithms have been proposed for games, but the compromises required to achieve the necessary frame rates have made them unsuitable for film. Here we introduce an approximate depth-of-field computation that is good enough for film preview, yet can be computed interactively on a GPU. The computation creates depth-of-field blurs by simulating the heat equation for a nonuniform medium. Our alternating direction implicit solution gives rise to separable spatially varying recursive filters that can compute large-kernel convolutions in constant time per pixel while respecting the boundaries between in-focus and out-of-focus objects. Recursive filters have traditionally been viewed as problematic for GPUs, but using the well-established method of cyclic reduction of tridiagonal systems, we are able to vectorize the computation and achieve interactive frame rates. Keywords: Alternating Direction Implicit Methods, GPU, Tridiagonal Matrices, Cyclic Reduction.
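The heart of such a method is an implicit heat-equation step per scanline, which reduces to solving a tridiagonal linear system. A Python sketch of one 1D step using the sequential Thomas algorithm (the paper instead vectorizes via cyclic reduction; here `beta` is a single constant, while depth of field needs it to vary per pixel with the circle of confusion):

```python
def thomas(a, b, c, d):
    """Solve a tridiagonal system: a is the sub-diagonal (a[0] unused),
    b the main diagonal, c the super-diagonal (c[-1] unused), d the RHS."""
    n = len(b)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

def diffuse_row(pixels, beta):
    """One implicit diffusion step, (I - beta * Laplacian) x = pixels,
    with Neumann boundaries. Larger beta means more blur."""
    n = len(pixels)
    a = [0.0] + [-beta] * (n - 1)
    c = [-beta] * (n - 1) + [0.0]
    b = [1.0 + 2.0 * beta] * n
    b[0] = b[-1] = 1.0 + beta
    return thomas(a, b, c, pixels)

row = [0.0, 0.0, 255.0, 0.0, 0.0]
blurred = diffuse_row(row, beta=1.0)
assert abs(sum(blurred) - 255.0) < 1e-9   # diffusion conserves intensity
assert blurred[2] < 255.0 and blurred[1] > 0.0
```

Because the solve is a recurrence (each `cp[i]` depends on `cp[i-1]`), it does not map naturally onto SIMD lanes, which is exactly the problem cyclic reduction addresses on the GPU.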
Automated Dynamic Analysis of CUDA Programs
"... Recent increases in the programmability and performance of GPUs have led to a surge of interest in utilizing them for generalpurpose computations. Tools such as NVIDIA’s Cuda allow programmers to use a Clike language to code algorithms for execution on the GPU. Unfortunately, parallel programs are ..."
Abstract

Cited by 23 (3 self)
 Add to MetaCart
Recent increases in the programmability and performance of GPUs have led to a surge of interest in utilizing them for general-purpose computations. Tools such as NVIDIA’s CUDA allow programmers to use a C-like language to code algorithms for execution on the GPU. Unfortunately, parallel programs are prone to subtle correctness and performance bugs, and CUDA tool support for solving these remains a work in progress. As a first step towards addressing these problems, we present an automated analysis technique for finding two specific classes of bugs in CUDA programs: race conditions, which impact program correctness, and shared memory bank conflicts, which impact program performance. Our technique automatically instruments a program in two ways: to keep track of the memory locations accessed by different threads, and to use this data to determine whether bugs exist in the program. The instrumented source code can be run directly in CUDA’s device emulation mode, and any potential errors discovered will be automatically reported to the user. This automated analysis can help programmers find and solve subtle bugs in programs that are too complex to analyze manually. Although these issues are explored in the context of CUDA programs, similar issues will arise in any sufficiently “many-core” architecture.
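Both bug classes can be checked from a trace of per-thread memory accesses, which is what the instrumentation collects. A toy Python version of the two checks (a simplification of the paper's approach; 16 shared-memory banks matches the G80-era hardware this work targets):

```python
from collections import Counter, defaultdict

def bank_conflict_degree(word_addresses, banks=16):
    """Max number of accesses landing on the same shared-memory bank in
    one half-warp (bank = word address mod number of banks). Degree 1
    means conflict-free; degree k means the access is serialized k-way."""
    return max(Counter(a % banks for a in word_addresses).values())

def find_races(accesses):
    """accesses: (thread_id, location, is_write) tuples from one block
    with no intervening barrier. A race = two different threads touch the
    same location and at least one of the accesses is a write."""
    by_loc = defaultdict(list)
    for tid, loc, is_write in accesses:
        by_loc[loc].append((tid, is_write))
    return {loc for loc, accs in by_loc.items()
            if len({t for t, _ in accs}) > 1 and any(w for _, w in accs)}

# Stride-1 access is conflict-free; stride-16 serializes all 16 threads.
assert bank_conflict_degree(range(16)) == 1
assert bank_conflict_degree(range(0, 256, 16)) == 16
# A write/read pair on the same location from two threads is a race.
assert find_races([(0, "s[0]", True), (1, "s[0]", False)]) == {"s[0]"}
assert find_races([(0, "s[0]", True), (1, "s[1]", True)]) == set()
```

Running such checks inside device emulation, as the paper does, trades speed for the ability to observe every access without hardware support.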