Results 1–10 of 37
Scan Primitives for GPU Computing
 Graphics Hardware 2007
, 2007
Abstract

Cited by 170 (9 self)
The scan primitives are powerful, general-purpose data-parallel primitives that are building blocks for a broad range of applications. We describe GPU implementations of these primitives, specifically an efficient formulation and implementation of segmented scan, on NVIDIA GPUs using the CUDA API. Using the scan primitives, we show novel GPU implementations of quicksort and sparse matrix-vector multiply, and analyze the performance of the scan primitives, several sort algorithms that use the scan primitives, and a graphical shallow-water fluid simulation using the scan framework for a tridiagonal matrix solver.
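As a reader's note on semantics: a minimal sequential sketch of the scan and segmented-scan operations the abstract refers to (reference semantics only, not the parallel CUDA formulation the paper describes; the function names are illustrative):

```python
# Sequential reference semantics for (segmented) scan; illustrative only.
from itertools import accumulate
import operator

def inclusive_scan(xs, op=operator.add):
    """Inclusive scan: out[i] = xs[0] op xs[1] op ... op xs[i]."""
    return list(accumulate(xs, op))

def segmented_scan(xs, flags, op=operator.add):
    """Segmented inclusive scan: a set flag marks the head of a new
    segment, so the running value resets there instead of carrying over."""
    out, running = [], None
    for x, f in zip(xs, flags):
        running = x if (f or running is None) else op(running, x)
        out.append(running)
    return out
```

For example, `segmented_scan([1, 2, 3, 4, 5], [1, 0, 1, 0, 0])` yields `[1, 3, 3, 7, 12]`: the scan restarts at the third element because its flag opens a new segment.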
Routing, merging, and sorting on parallel models of computation
 in “Proc. 14th Annual ACM Sympos. on Theory of Comput.”
, 1982
Abstract

Cited by 112 (3 self)
A variety of models have been proposed for the study of synchronous parallel computation. These models are reviewed and some prototype problems are studied further. Two classes of models are recognized, fixed connection networks and models based on a shared memory. Routing and sorting are prototype problems for the networks; in particular, they provide the basis for simulating the more powerful shared memory models. It is shown that a simple but important class of deterministic strategies (oblivious routing) is necessarily inefficient with respect to worst case analysis. Routing can be viewed as a special case of sorting, and the existence of an O(log n) sorting algorithm for some n processor fixed connection network has only recently been established by Ajtai, Komlós, and Szemerédi (“15th ACM Sympos. on Theory of Comput.,” Boston, Mass., 1983, pp. 1–9). If the more powerful class of shared memory models is considered then it is possible to simply achieve an O(log n log log n) sort via Valiant’s parallel merging algorithm, which it is shown can be implemented on certain models. Within a spectrum of shared memory models, it is shown that log log n is asymptotically optimal for n processors to merge two sorted lists containing n elements. © 1985 Academic Press, Inc.
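A small illustration of the "oblivious" strategies the abstract analyzes: odd-even transposition sort, a classic sorting network whose compare-exchange schedule is fixed in advance and never depends on the data (illustrative, not taken from the paper; its n rounds on n keys hint at why oblivious strategies can be comparatively inefficient next to O(log n) adaptive methods):

```python
def odd_even_transposition_sort(xs):
    """Oblivious sort: the comparator schedule below is data-independent.
    n alternating odd/even phases of disjoint compare-exchanges suffice;
    each phase could run its comparators in parallel on a linear array."""
    a = list(xs)
    n = len(a)
    for rnd in range(n):
        start = rnd % 2                  # alternate even and odd phases
        for i in range(start, n - 1, 2):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a
```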
Basic Techniques for the Efficient Coordination of Very Large Numbers of Cooperating Sequential Processors
, 1981
Abstract

Cited by 93 (2 self)
In this paper we implement several basic operating system primitives by using a "replace-add" operation, which can supersede the standard "test and set", and which appears to be a universal primitive for efficiently coordinating large numbers of independently acting sequential processors. We also present a hardware implementation of replace-add that permits multiple replace-adds to be processed nearly as efficiently as loads and stores. Moreover, the crucial special case of concurrent replace-adds updating the same variable is handled particularly well: If every PE simultaneously addresses a replace-add at the same variable, all these requests are satisfied in the time required to process just one request.
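A sketch of the replace-add primitive's semantics (atomically add a delta and return the updated value), emulated here with a lock; the paper's contribution is a combining hardware implementation, which this does not model. The class name is illustrative:

```python
# Lock-based emulation of replace-add semantics; illustrative only.
# Real replace-add is a single atomic (possibly combining) hardware op.
import threading

class ReplaceAdd:
    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def replace_add(self, delta):
        """Atomically add delta and return the *updated* value
        (unlike fetch-and-add, which returns the prior value)."""
        with self._lock:
            self._value += delta
            return self._value
```

A typical coordination use is self-scheduling: each processor claims a unique work index with `counter.replace_add(1) - 1`, and no two processors ever receive the same index.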
Solution of Partial Differential Equations on Vector Computers
 Proc. 1977 Army Numerical Analysis and Computers Conference
, 1977
Abstract

Cited by 64 (0 self)
In this paper we review the present status of numerical methods for partial differential equations on vector and parallel computers. A discussion of the relevant aspects of these computers and a brief review of their development is included, with particular attention paid to those characteristics that influence algorithm selection. Both direct and iterative methods are given for elliptic equations as well as explicit and implicit methods for initial-boundary value problems. The intent is to point out attractive methods as well as areas where this class of computer architecture cannot be fully utilized because of either hardware restrictions or the lack of adequate algorithms. A brief discussion of application areas utilizing these computers is included.
Parallel Algorithms with Processor Failures and Delays
, 1995
Abstract

Cited by 54 (12 self)
We study efficient deterministic parallel algorithms on two models: restartable fail-stop CRCW PRAMs and asynchronous PRAMs. In the first model, synchronous processors are subject to arbitrary stop failures and restarts determined by an online adversary and involving loss of private but not shared memory; the complexity measures are completed work (where processors are charged for completed fixed-size update cycles) and overhead ratio (completed work amortized over necessary work and failures). In the second model, the result of the computation is a serialization of the actions of the processors determined by an online adversary; the complexity measure is total work (number of steps taken by all processors). Despite their differences the two models share key algorithmic techniques. We present new algorithms for the Write-All problem (in which P processors write ones into an array of size N) for the two models. These algorithms can be used to implement a simulation strategy for any N ...
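A toy simulation of the Write-All problem the abstract defines, with an adversary randomly stopping processors in each cycle; this naive round-robin strategy is for illustration only and is not one of the paper's algorithms:

```python
# Toy Write-All simulation: P processors must write 1 into every cell of a
# shared array of size N despite stop failures. Illustrative strategy only.
import random

def write_all(P, N, fail_prob=0.3, seed=0):
    rng = random.Random(seed)
    shared = [0] * N              # shared memory survives failures
    completed_work = 0            # charged per completed update cycle
    round_no = 0
    while not all(shared):
        for pid in range(P):
            if rng.random() < fail_prob:
                continue          # adversary stops this processor this cycle
            shared[(pid + round_no) % N] = 1   # rotate targets each round
            completed_work += 1
        round_no += 1
    return completed_work
```

Since some writes land on already-written cells, completed work exceeds the N writes strictly necessary; the ratio of the two is the flavor of overhead the paper's measures capture.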
A comparison of shared and nonshared memory models of parallel computation
 Proceedings of the IEEE
, 1991
Abstract

Cited by 23 (4 self)
Four algorithms are analyzed in the shared and nonshared (distributed) memory models of parallel computation. The analysis shows that the shared memory model predicts optimality for algorithms and programming styles that cannot be realized on any physical parallel computers. Programs based on these techniques are inferior to programs written in the nonshared memory model. The “unit” cost charged for a reference to shared memory is argued to be the source of the shared memory model’s inaccuracy. The implications of these observations are discussed.
Evolving Information Processing Organizations
 Tournament Selection and the Effects of Noise, Complex Systems 9
, 1995
Abstract

Cited by 12 (0 self)
The organization of information processing resources is a central question in economic, organizational, and computational theory. Recent work by Radner (1992) and others has developed a simple theoretical framework and some useful formal mathematical results about the behavior of such systems. Here, we follow a complementary computational approach that allows us to pursue questions concerning the impact of coordination and various exogenous conditions facing the organization. We find that organizations demonstrate "order for free," that is, given a simple structural framework and a set of standard operating procedures, even randomly generated organizations imply well-defined patterns of behavior. Using a genetic algorithm, we also show that simple evolutionary processes allow organizations to "learn" better structures. I am grateful to Jody Lutz and, especially, Hollis Schuler for research assistance, and to Dean Behrens, Kathleen Carley, and Robyn Dawes for useful comments. I would ...
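A minimal sketch of a genetic algorithm with tournament selection (per the paper's subtitle), run here on a toy OneMax fitness rather than on organizational structures; every parameter value and name is illustrative:

```python
# Minimal GA with tournament selection on OneMax; illustrative only.
import random

def evolve(bits=20, pop_size=30, gens=60, k=3, mut=0.02, seed=1):
    rng = random.Random(seed)
    fitness = lambda g: sum(g)                       # OneMax: count of ones
    pop = [[rng.randint(0, 1) for _ in range(bits)] for _ in range(pop_size)]
    for _ in range(gens):
        def tournament():
            # pick k random individuals, keep the fittest
            return max(rng.sample(pop, k), key=fitness)
        nxt = []
        for _ in range(pop_size):
            a, b = tournament(), tournament()
            cut = rng.randrange(1, bits)             # one-point crossover
            child = a[:cut] + b[cut:]
            child = [1 - g if rng.random() < mut else g for g in child]
            nxt.append(child)
        pop = nxt
    return max(map(fitness, pop))                    # best final fitness
```

Raising the tournament size k increases selection pressure; the mutation rate `mut` plays the role of the noise whose effects the paper examines.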
Empirical Study Design in the Area of High Performance Computing (HPC)
 in Proceedings of International Symposium on Empirical Software Engineering, Noosa Heads
Abstract

Cited by 10 (4 self)
The development of High-Performance Computing (HPC) programs is crucial to progress in many fields of scientific endeavor. We have run initial studies of the productivity of HPC developers and of techniques for improving that productivity, which have not previously been the subject of significant study. Because of key differences between development for HPC and for more conventional software engineering applications, this work has required the tailoring of experimental designs and protocols. A major contribution of our work is to begin to quantify the code development process in a specialized area that has previously not been extensively studied. Specifically, we present an analysis of the domain of High-Performance Computing for the aspects that would impact experimental design; show how those aspects are reflected in experimental design for this specific area; and demonstrate how we are using such experimental designs to build up a body of knowledge specific to the domain. Results to date build confidence in our approach by showing that there are no significant differences across studies comparing subjects with similar experience tackling similar problems, while there are significant differences in performance and effort among the different parallel models applied.
Cycles in Networks
, 1993
Abstract

Cited by 6 (0 self)
We study the presence of cycles and long paths in graphs that have been proposed as interconnection networks for parallel architectures. The study surveys and complements known results.

1 Introduction

This paper is devoted to studying embeddings of the simplest possible guest graphs, the path P_N and the cycle C_N, in graphs that have been proposed as interconnection networks for parallel architectures. In addition to their intrinsic interest, in terms of the development of algorithms on parallel architectures, these two guest graphs are important because of the fact that many structurally richer graphs can be constructed from paths and cycles by various product constructions. A few of the results we present are original; several appear in the literature and are duly cited; many belong to the folklore of the field. Indeed this paper is motivated by a desire to find a single repository for this important, yet scattered material. Before proceeding further, we define formally the ...
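One concrete instance of the cycles-in-networks theme: the n-bit reflected Gray code traverses a Hamiltonian cycle of the hypercube Q_n, one of the standard interconnection networks such surveys cover (illustrative example, not taken from the paper):

```python
# Hamiltonian cycle in the hypercube Q_n via the reflected Gray code.
def gray_cycle(n):
    """Return all 2**n hypercube vertices in an order where consecutive
    vertices (cyclically, including last back to first) differ in exactly
    one bit, i.e. are adjacent in Q_n."""
    return [i ^ (i >> 1) for i in range(2 ** n)]
```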